  • Principal Component Analysis for Binary Variables

    Dear All,

    I have a panel dataset; the dependent variable (Y) is continuous and all my explanatory variables are binary, taking the value 1 or 0. I have followed a specialized literature that uses the machine-learning-based least absolute shrinkage and selection operator (LASSO) to identify the set of dummy explanatory variables that have a non-negligible impact on Y. However, the set of chosen dummy explanatory variables is still large, around 34. Because of high multicollinearity among the dummy variables and the risk of overfitting, it is inadvisable to include all 34 relevant variables additively in the model.

    Having said that, I want to use Principal Component Analysis (PCA) to combine these multiple factors (dummies) into one single factor, and I was looking for which PCA method works best for such a dataset with binary variables. There are various alternatives for combining multiple variables into a single factor, such as pca, polychoricpca, tetrachoric, multiple correspondence analysis (mca), and factor analysis (factor). I am not sure which method is most suitable for the given data.

    I shall be thankful for any suggestions or recommendations.


    Thanks and regards,
    (Ridwan)



  • #2
    It all depends on the goal of your analysis: do you want to understand or predict?
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      As far as I know, pca uses Pearson correlations to compute factor scores, while polychoricpca uses polychoric correlations. Given that your variables are dummies, Pearson correlations are not the correct way to go about it, since they give you correlations on continuous variables. Polychoric correlation takes the discrete nature of your variables into account, so it is a better fit. If all your dummies are binary, polychoric reduces to tetrachoric. If you are considering creating an index from a set of binary variables, there are other methods too, such as Anderson's GLS weighting index, which you can find at https://are.berkeley.edu/~mlanderson...on%202008a.pdf.
      I would suggest you look at the literature in your field and see how it has approached this problem given the type of data you have; different fields prioritize such approaches differently. Coming simply to your question, polychoricpca seems a better choice than ordinary pca, but you should take this advice with a pinch of salt.
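
      A minimal sketch of that route in Stata might look like this (d1-d18 and the observation count are placeholders, not taken from the original question):

      Code:
      * sketch: PCA built on the tetrachoric correlation matrix of the dummies
      tetrachoric d1-d18                // pairwise tetrachoric correlations
      matrix R = r(Rho)                 // correlation matrix left behind in r(Rho)
      pcamat R, n(1000) components(4)   // replace 1000 with your number of observations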

      Comment


      • #4
        It's hard (really hard) to say in advance what will work better. The justification, calculation and interpretation of tetrachoric correlation seem oversold and over-elaborate to me. An orthodox PCA of binary variables -- whether based on the correlation matrix or the covariance matrix -- seems easier all round. As (0, 1) binary variables have in general different variances, that is an area of choice.
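
        As a sketch, both flavours are one line each in Stata (d1-d18 again stands in for the dummies):

        Code:
        * PCA of the correlation matrix: every dummy rescaled to unit variance
        pca d1-d18
        * PCA of the covariance matrix: each dummy keeps its own variance p(1 - p)
        pca d1-d18, covariance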

        On a note of terminology, PCA, factor analysis and multiple correspondence analysis are usually regarded as separate techniques. It's not necessary and not helpful, in my view, to regard them as flavours of PCA in some unduly broad sense.

        Comment


        • #5
          Here are some relevant threads; I hope they are useful:
          Index Analysis using Stata PCA and MCA command
          polychoric and tetrachoric commands (factor analysis on binary variables)
          Re: st: Interpreting Polychoric PCA results in STATA 11

          Comment


          • #6

            #3 edited for typos

            Pearson correlations are not the correct way to go about it, since they give you correlations on continuous variables.
            This is some kind of mix of myth, dogma and irrelevance. Pearson correlation is based on variances and covariance. But the variance of a binary variable is perfectly well defined, as any account of Bernoulli distributions explains. And the definition of covariance is just an extension of the same idea. There is absolutely nothing in the definition that presupposes continuous variables. So, there is nothing incorrect about using correlation directly for binary variables. How useful it might be is unpredictable.

            Any binary variable that is always 0 or always 1 can't be used in correlation, but this is on all fours with the fact that any variable at all that is constant can't be used either, because its variance is zero.
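
            A quick simulated sketch of the point (made-up data, nothing from the original question):

            Code:
            * correlation of binary variables is perfectly well defined
            clear
            set obs 1000
            set seed 2803
            gen byte x = runiform() < 0.3
            gen byte y = runiform() < 0.2 + 0.4*x   // y depends on x by construction
            summarize x                             // mean is p; s.d. is about sqrt(p(1 - p))
            correlate x y                           // this Pearson correlation is just the phi coefficient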

            What I guess is being alluded to is a fantasy that observed 0, 1 values arise from a continuous underlying or latent variable, for example, one which is really normal, in which case there is a recipe for estimating the underlying correlation. But this recipe is voluntary, not compulsory. Researchers will vary in how plausible or relevant they think such a fantasy is to their data and goals. This is an old and sometimes angry debate that goes back a century and more, if not further, say to Karl Pearson and G.U. Yule. In anachronistic terms, Pearson seems never to have accepted the idea that categorical data of any kind deserved their own special methods. His bias was (crudely) to regard categorical data as degraded continuous data.

            I don't object, naturally, to anyone using polychoric or tetrachoric correlation if they think it fits their set-up. A minimum requirement for any study worth publishing would, however, seem to be some kind of comparison between different ways of getting component scores, or some equivalent.



            Comment


            • #7
              Thank you Maarten Buis, Rajdeep Chaudhuri and Chen Samulsion for responding to my query. I really appreciate it.

              Thank you Nick Cox for the explanation that helps to understand it better.

              An orthodox PCA of binary variables -- whether based on the correlation matrix or the covariance matrix -- seems easier all round.
              Following your discussion in #4 and #6, I will go with ordinary PCA as recommended. However, I am curious to explore polychoricpca just for comparison. Unfortunately, I cannot find polychoricpca either on SSC or at any other external link.

              Code:
              search polychoricpca
              net describe polychoric, from(https://staskolenikov.net/stata/)
              ssc install polychoricpca
              ssc describe p
              None of these commands locates or installs polychoricpca.

              (1) Can anyone help me install polychoricpca or post the .ado file here?

              (2) Using ordinary pca, the first four components explain 80% of the variance. I ran the following code to combine the dummies into a single factor using the first four components, assigning equal weight to each component.

              Code:
              pca d1-d18, components(4)
              estat loadings
              screeplot, xlabel (1(1)18)
              
              predict pc1 pc2 pc3 pc4
              gen pca_factor = (pc1 + pc2 + pc3 + pc4)/4
              Alternatively, I calculate the variance explained by each component k as Variance_k = Eigenvalue_k/18, for k = 1, 2, 3, 4,
              where Eigenvalue_k denotes the eigenvalue of component k. I then combine the variables using the first four components, but with the variance explained by each component as the weight, rather than the equal weights used above.

              Code:
              gen pca_factor = (Variance_1*pc1 +Variance_2*pc2 + Variance_3*pc3 +Variance_4*pc4)/4
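              For completeness, a sketch of the same weighting pulled from the stored results rather than typed by hand (e(Ev) is the row vector of eigenvalues that pca leaves behind; the new variable name is arbitrary):

              Code:
              * sketch: same arithmetic as above, with Variance_k = Ev[1,k]/18, then averaged over 4 components
              matrix Ev = e(Ev)
              gen double pca_factor2 = (Ev[1,1]*pc1 + Ev[1,2]*pc2 + Ev[1,3]*pc3 + Ev[1,4]*pc4)/(18*4)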
              My question is: am I doing (2) correctly, and which pca_factor is preferred, the one based on equal weights or the one based on variance weights?


              Thanks and regards,
              (Ridwan)

              Comment


              • #8
                My understanding is that PC1 is the best single summary of your data. I am curious why people think that an average of more components than 1 can be any better. You'd then be mushing together components that by construction are uncorrelated. Weighting by eigenvalues gives another kind of mush. I don't recollect either procedure being recommended in any text. I do detect this idea as a myth passed around in some literature, which is not surprising.

                I'd welcome authoritative references explaining why this is a good idea, meaning not just an empirical paper where someone did this, but say a text on PCA.

                polychoricpca has never been on SSC to my knowledge.

                Code:
                net from https://staskolenikov.net/stata 
                net install polychoric
                should work. The points are that describing a package doesn't install it, and that polychoricpca is not the name of the package but of a command within it. ssc install installs packages from SSC only.

                Comment


                • #9
                  Run the command below and click the install link shown in blue; after installing, type help polychoricpca in the Command window.

                  Code:
                  . net describe polychoric, from(https://staskolenikov.net/stata/)
                  
                  -----------------------------------------------------------------------------------------------------------------
                  package polychoric from https://staskolenikov.net/stata
                  -----------------------------------------------------------------------------------------------------------------
                  
                  TITLE
                        polychoric -- The polychoric correlation package
                  
                  DESCRIPTION/AUTHOR(S)
                        
                        Author: Stas Kolenikov, [email protected]
                        
                        This package provides routines to estimate
                        the polychoric, tetrachoric, polyserial and biserial
                        correlations and use them in principal component analysis.
                        Current version: 1.4
                  
                  INSTALLATION FILES                           (type net install polychoric)
                        polychoric.ado
                        polychoricpca.ado
                        polych_ll.ado
                        polyser_ll.ado
                        polychoric.hlp
                        polychoricpca.hlp
                  -----------------------------------------------------------------------------------------------------------------

                  Comment


                  • #10
                    Thanks Nick Cox and Chen Samulsion

                    My understanding is that PC1 is the best single summary of your data. I am curious why people think that an average of more components than 1 can be any better.
                    In reference to the quoted text, I am attaching my results.

                    [Attached image 4.PNG: PCA output table showing the variance explained by each component]


                    PC1 explains 46% of the total variance and PC2 explains 22%, so the cumulative variance of the first two components is 68%; that was my reason for using both PC1 and PC2 (the first two components, if not four) to create a composite summary of the variables in the dataset.
                    Code:
                    gen pca_factor = (0.461*pc1 + 0.224*pc2)/2
                    gen pca_factor = (pc1 + pc2)/2


                    I understand that PC1 is uncorrelated with PC2 by construction. Combining the two components using either equal weights or the proportion of variance explained as weights (as above) was based on my own intuition, and I may be wrong. I am not aware of any literature that recommends this approach. Based on the output (results table) above, if you have any further suggestions, I shall be very thankful.

                    The code/links posted above to install the polychoric package still do not work in my case and produce an error message.

                    Code:
                    net from https://staskolenikov.net/stata
                    net install polychoric
                    net describe polychoric, from(https://staskolenikov.net/stata/)
                    Thank you

                    Comment


                    • #11
                      Why not just average your binary variables? One answer is that averaging different answers to quite different questions makes no obvious sense. At least you know that some of your variables are correlated.

                      Now you need to explain why averaging PCs is a good idea, not to me or to us necessarily on Statalist, but in principle to whatever examiners, reviewers, and so on are going to evaluate your work.

                      (Averaging and adding are naturally equivalent here as far as using them in regression is concerned.)

                      The existence of papers doing this is not convincing. At some level we all work with details we don't fully understand and papers get published that should not have been. I am currently writing about a procedure (nothing to do with PCA) which most textbooks don't explain correctly. Those I have seen just seem to copy from other equally unreliable textbooks or from confident but confused internet sources.

                      I can't and don't rule out using more than one PC as separate predictors.
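
                      As a sketch, that just means entering the scores jointly in whatever regression you run (the fixed-effects panel specification and the variable names below are only guesses at your set-up):

                      Code:
                      * sketch: component scores as separate predictors, not averaged
                      * panelvar and timevar are placeholders for your own identifiers
                      xtset panelvar timevar
                      xtreg y pc1 pc2 pc3 pc4, fe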

                      Sorry, but the only advice on local IT problems is to ask locally. We can't see any details of your IT set-up -- whether you are linked to the internet directly or through a university, other employer, or otherwise -- so we are completely ignorant of why certain links won't work for you. In any case you don't show us the error message to comment on.

                      Comment


                      • #12
                        Thank You Nick Cox for your help,

                        I can't and don't rule out using more than one PC as separate predictors
                        I think it is better to use the influential PCs as separate predictors in the model, rather than combining or averaging them into a single measure.

                        In any case you don't show us the error message to comment on.
                        I am linked to the internet directly, not through a university or otherwise. The error message I get when trying to install the polychoric package is

                        sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPath
                        > BuilderException: unable to find valid certification path to requested target

                        https://staskolenikov.net/stata/ either
                        1) is not a valid URL, or
                        2) could not be contacted, or
                        3) is not a Stata download site (has no stata.toc file).

                        r(5100);

                        Thanks and regards,

                        Comment


                        • #13
                          I recommend contacting your IT provider. 1) 2) 3) are wrong, but your Internet connection is being fastidious.

                          Comment


                          • #14
                            Thank you very much Nick Cox . I will contact my IT provider.

                            Regards,
                            (Ridwan)

                            Comment


                            • #15
                              Nick makes excellent points here. You probably want to figure out how many dimensions your group of variables has, rather than averaging the 2 PCs. You may find a scree plot to be useful.
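
                              A sketch of that check, reusing the dummies from the earlier posts (paran is a community-contributed command for Horn's parallel analysis, so treat those lines as optional):

                              Code:
                              * sketch: how many dimensions do the dummies really have?
                              pca d1-d18
                              screeplot, yline(1)       // scree plot with the eigenvalue = 1 reference line
                              * ssc install paran       // optional second opinion: parallel analysis
                              * paran d1-d18, graph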


                              Comment
