  • Principal component analysis

    I would like to know in detail the steps to construct an index from principal component analysis. I have been able to settle on the components (there are two components whose eigenvalues are greater than 1). However, how do I assign weights to each component?

  • #2
    What you are proposing to do is unusual. Ordinarily, when we do principal components analysis on a set of variables, we want to use either all or just some of the components as they are in our subsequent work. Using all of them creates orthogonal variables out of variables that are intercorrelated. Using some of them gives you orthogonal variables and also reduces the number of degrees of freedom being used for the model. Combining the components by doing some kind of weighted average is an unusual thing to do, and basically defeats the usual purpose of doing principal components analysis in the first place. By combining two components in that way, you will actually be discarding some of the information that's available from each of the two interesting components separately.

    Why are you thinking of doing this? If you are trying to create a single index from multiple variables and you want to do it via PCA, typically you just select one component for that purpose (usually the first).
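
    As a minimal sketch of the single-component approach described above: the example below assumes Stata's shipped auto dataset and an arbitrary choice of four variables as stand-ins for the real indicators, and the name index1 is made up for illustration.

    ```stata
    * Illustration only: the auto data and these four variables are stand-ins
    sysuse auto, clear

    * PCA on the correlation matrix (Stata's default)
    pca price mpg weight length

    * Naming a single new variable scores only the first component
    predict index1, score

    * The score is centered, so its mean is approximately zero
    summarize index1
    ```

    Whether the first component is a sensible index still depends on how much of the variance it captures and on whether its loadings can be interpreted.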

    • #3
      Thank you for your reply.

      I have been reading up a bit on multivariate analysis and found that factor analysis is used to create indices. I was under the impression that, to create an index using PCA, all components whose eigenvalues are greater than 1 should be used to construct the index.

      It would be very kind of you if you could suggest any reference that explains the theory behind constructing indices using PCA. The references I have found explain only the theory of PCA, not the construction of indices from the principal components.

      • #4
        I'd back up and tell us more about the statistical goal(s). With PCA if there is a need to use just a single index based entirely on the PC results that summarizes all the variability best, then that is PC1, pretty much by definition. That's a very big if. But much depends on the project, about which you are saying nothing. And who or what implies that you must or should use a single index? That's often primitive thinking quantitatively.

        Although writers on PCA tend to be very positive about it, my experience is that most of the time PCA doesn't add much to what is evident directly from looking at scatter plots and correlations. I might use PCA results to help me select which predictors I use, but I wouldn't use PCs themselves. It is often better to base that also on scientific thinking about what the variables are.

        It's been a while since I read it but

        Wallis, J. R. 1968. Factor Analysis in Hydrology—An Agnostic View. Water Resources Research 4(3): 521–527. doi:10.1029/WR004i003p00521

        is one of many cautionary papers here.

        • #5
          Thank you for the responses on this. I'm also working on constructing an index using PCA, and I am wondering if the way to select the first component is to subsequently type predict pc1, score. What if I want to select the first 5 components? A colleague said it might be good to choose the first 5 because in my case they add up to explaining about 60% of the variation (the first component alone explains about 30%).

          • #6
            Stephanie: How many variables go into the PCA? If it's 50 or 500, choosing 5 PCs might be defended as parsimonious.

            If it's 10 or 20, I would rather try to use regression (or whatever else lies downstream of your PCA) directly to choose which predictors make most sense.

            Much depends on your goals and on your data. If 5 variables are very tightly correlated, then using the first PC as summary could be a good idea, but personally I'd prefer a reduction in simpler form, e.g. a straight average of (possibly standardized) variables. If it's a case of throwing a ragbag of variables into a pot (mixed metaphor, but there you go), it's often hard to see how (social?) science is going to move forward like this.

            Put this socially: The aim of a project is not just to do a good job with your data; it's usually to provide a story that is interesting or useful to others. It's often hard to read a story with enthusiasm in which someone else's data are reduced to mushed-together PCs which then dominate the analysis. Even my PCs on similar variables are likely to be harder to think about than the original variables, even for me; what about others?

            Naturally I realise that there's a legion of books and papers dedicated to technology for identifying latent variables from manifest data, which is another side of a long debate here.
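
            The straight average of standardized variables mentioned above could be sketched as follows. This is an illustration under assumptions: the auto dataset and the three tightly correlated size variables stand in for the real data, and the names z_* and simple_index are invented here.

            ```stata
            * Illustration only: three tightly correlated variables from the auto data
            sysuse auto, clear

            * Standardize each variable to mean 0 and SD 1
            foreach v of varlist weight length displacement {
                egen z_`v' = std(`v')
            }

            * The index is the row mean of the standardized variables
            egen simple_index = rowmean(z_weight z_length z_displacement)

            * For comparison, score the first principal component of the same variables
            pca weight length displacement
            predict pc1, score
            correlate simple_index pc1
            ```

            When the variables are this tightly correlated, the two summaries should be very highly correlated in absolute value, and the simple average is easier to explain to readers.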

            • #7
              If you want to score the first 5 components it's
              Code:
              predict pc1 pc2 pc3 pc4 pc5, score
              That said, do ponder Nick Cox's wisdom in #4. While I probably would characterize a wider range of circumstances as being suitable for using the components than he suggests there, I do agree with him that if you don't just pick a single component for inclusion in your model, it is usually better to let the components guide you in choosing which among the original variables to include in the model than to use several of the components themselves.

              Added: Crossed with #6.
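
              For context, a slightly fuller sketch of scoring several components, assuming the auto dataset and eight of its variables as stand-ins (the choice of 5 components simply mirrors the question in #5):

              ```stata
              * Illustration only: eight variables from the auto data
              sysuse auto, clear

              * Retain the first 5 components
              pca price mpg headroom trunk weight length turn displacement, components(5)

              * e(f) is the number of retained components and
              * e(rho) the fraction of variance they explain
              display "retained components: " e(f)
              display "fraction of variance explained: " e(rho)

              * Score the retained components
              predict pc1 pc2 pc3 pc4 pc5, score

              * The scores are mutually uncorrelated by construction
              correlate pc1 pc2 pc3 pc4 pc5
              ```

              The fraction reported in e(rho) is the cumulative figure a colleague would quote when arguing for a given number of components.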

              • #8
                Thank you very much to both of you. This is very helpful. I have 16 variables for a wealth index and many are tightly correlated, so I'll play around with what you suggested.

                • #9
                  Nick Cox, I also want to construct an index (a stock market index, more precisely) using N stock returns. Let's say that I have a final component and its weighted score. What approach can I apply now, if there is one at all? I need the logarithmic returns of this index, and hence it cannot take negative values. I think some sort of rescaling is needed (I also want a base value of 100 or 1,000), but I'm stuck here. I hope to find an answer. Thank you.
                  Last edited by Nicu Sprincean; 25 Jan 2018, 08:28.

                  • #10
                    @Friedrich Mises

                    I don't have the experience or expertise with finance data that is needed for a good answer; in any case my modal answer on PCA is agnostic about whether it's a good idea at all in most applications.

                    However, it seems most unlikely to me that arbitrary rescaling of a PC will ensure the behaviour you want in a way that is defensible.

                    What's wrong with a weighted average, which naively I suppose most stock market indexes to be?

                    • #11
                      Nick Cox, thank you for your response. I have already tried to replicate the methodology of a known index (STOXX), which is free-float-weighted, but I ran into a problem: for some of my stocks there is a big difference between the adjusted and unadjusted price, and I get a negative divisor. That's why I want to apply PCA. Thank you anyway.

                      • #12
                        Originally posted by Clyde Schechter
                        If you want to score the first 5 components it's
                        Code:
                        predict pc1 pc2 pc3 pc4 pc5, score
                        That said, do ponder Nick Cox's wisdom in #4. While I probably would characterize a wider range of circumstances as being suitable for using the components than he suggests there, I do agree with him that if you don't just pick a single component for inclusion in your model, it is usually better to let the components guide you in choosing which among the original variables to include in the model than to use several of the components themselves.

                        Added: Crossed with #6.
                        Hi Clyde,

                        While running PCA to generate an index, I found 4 components with eigenvalues greater than 1. How can I use these 4 components to generate single weights to the variables that I am using?

                        Thanks in advance.

                        Santosh

                        • #13
                          How can I use these 4 components to generate single weights to the variables that I am using?
                          I do not know what you mean by "single weights to the variables."

                          If you want to find the coefficients that -predict- uses to calculate the component scores, they are left behind in r(scoef) after -predict- runs.
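
                          A minimal sketch of retrieving those coefficients, assuming the auto dataset and four of its variables as stand-ins; the matrix name W and the variable check1 are invented for illustration. The loop rebuilds the first score by hand from standardized variables, on the assumption that the PCA was run on the correlation matrix (the default).

                          ```stata
                          * Illustration only: the auto data stand in for the real indicators
                          sysuse auto, clear
                          pca price mpg weight length
                          predict pc1 pc2, score

                          * Copy r(scoef) at once: later r-class commands overwrite r()
                          matrix W = r(scoef)
                          matrix list W

                          * Rebuild the first score as a weighted sum of standardized variables
                          generate double check1 = 0
                          local i = 0
                          foreach v of varlist price mpg weight length {
                              local ++i
                              quietly summarize `v'
                              quietly replace check1 = check1 + W[`i', 1] * (`v' - r(mean)) / r(sd)
                          }

                          * check1 should reproduce pc1 up to rounding
                          correlate pc1 check1
                          ```

                          Each column of the matrix holds the weights that one component places on the variables, so the first column is the set of weights behind an index built from the first component alone.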

                          • #14
                            Originally posted by Clyde Schechter
                            I do not know what you mean by "single weights to the variables."

                            If you want to find the coefficients that -predict- uses to calculate the component scores, they are left behind in r(scoef) after -predict- runs.
                            Clyde Schechter, thank you for the reply. I was confused about how multiple components (with eigenvalue > 1) should be handled when calculating a composite index. I am trying to calculate a household-level flood vulnerability index using some 30 indicators. Literatures mention using component 1 score as a weight for variables but my pc1 score is explaining only 11% variation. So I am supposed to include more components to make my index credible. Under this scenario, how can I incorporate more components to assign credible weights to the variables that I am using?

                            My current index calculation method is:

                            1. pc x1 x2.......x30 (*The variables have been normalized before running PCA)

                            Household level vulnerability index = (pc1 of x1 * x1) + (pc1 of x2 * x2) +................ + (pc1 of x30 * x30)

                            • #15
                              I'm sorry, but now I'm even more confused. You are using terminology and notation in ways that are, well, unconventional, and I do not know what you mean by them. I have no idea of what "pc1 of x1 * x1" means. I don't know what "pc x1 x2.......x30 (*The variables have been normalized before running PCA)" means. I do not understand what "Literatures mention using component 1 score as a weight for variables" means.

                              If you want to show how you are calculating something, show the actual Stata code you are using--that will be clear!
