  • Principal Component Analysis and index construction with variables 0-1

    This is my first post. I am an undergraduate student working on my thesis.

    I am working on the construction of an index based on three variables that take values between 0 and 1. These variables are themselves means of other variables that take values between 0 and 1. I think my variables are highly correlated, so I am using Principal Component Analysis (PCA) to obtain a specification for my index. My first question: am I right? Can I use PCA for index construction when the variables take values between 0 and 1 and are correlated? (Q1) Secondly, searching the literature, I found that several papers use this methodology for index construction, though not necessarily with 0-1 variables. What I understand from them is that the component loadings are used as weights in a formula that combines the variables, and that formula is the index. That is, if the loadings are 0.3, 0.4, and 0.7, my index would be:

    I=0.3var1 + 0.4var2 + 0.7var3

    Is my understanding correct? Is that how PCA is used? (Q2)

    But other researchers weight the variables by the proportion of variance explained, which sums to 1, instead of by the component loadings. Which strategy is more commonly used? Does it depend on my data, specification, goal, etc.? (Q3)

    Thank You.

  • #2
    Q1. You can do PCA. Whether it's a good idea is a different matter. You can find literatures in which this is a popular pastime and literatures in which it's regarded as pointless pseudoscience.

    For just three variables, I would use them all as predictors directly in a regression-type model and then look at the results. They're already averages of other variables. Don't mush them together. Or if they are really highly correlated, choose one.

    Q2. I don't think I understand what you're getting at here. But to calculate a single index after PCA in Stata, you'd just use the predict command. It gives you a weighted average of your original variables, along the lines of your equation.

    Q3. I find it hard to say what is popular in my own field because I don't claim to read literature systematically in it. I think at a minimum you'd need to declare your field to get any comments on that.

    I'd always, always recommend looking at the correlation matrix and a scatter plot matrix.
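As a rough illustration of the correlation check recommended above, the following Python sketch computes a Pearson correlation matrix for three 0-1 variables. The data and the names var1, var2, var3 are invented for illustration; in Stata the corresponding tools would be the correlate and graph matrix commands.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Square matrix of pairwise Pearson correlations."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical 0-1 variables (assumption: not data from this thread).
var1 = [0.10, 0.35, 0.50, 0.65, 0.90]
var2 = [0.20, 0.30, 0.55, 0.60, 0.85]
var3 = [0.80, 0.40, 0.45, 0.70, 0.15]

corr = correlation_matrix([var1, var2, var3])
for row in corr:
    print("  ".join(f"{r: .3f}" for r in row))
```

If two of the three variables turn out to be almost perfectly correlated, that supports the suggestion to keep just one of them rather than building a composite.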

    • #3
      Hi Nick,

      Does the predict command multiply the loadings by the respective variables' values? I can't work out how the predicted values are calculated.

      Thanks!

      • #4
        In reply to Nick Cox (#2):
        Hi,

        I have the same question as the previous post in this thread.

        Does predict multiply the loadings by the variables, or do I need to do that separately to arrive at the index?

        Additional question: do I need to standardize my variables for PCA?

        • #5
          Hi Alina,
          As Nick mentioned, predict automatically multiplies the loadings by the variables to obtain the requested number of components.
          If you type "predict c1", only the first component will be created. If you type "predict c1 c2 c3", the first three components will be created. Of course, it would be a good idea to do this by hand once, so that you are confident you know how Stata does what it does (instead of applying it blindly).
          Regarding the second question: it depends. By default, pca works from the correlation matrix, which amounts to standardizing all variables "internally" and computing the loadings and scores from the standardized data. You can request otherwise with the covariance option.
          As asked earlier in the thread: can you use PCA when the variables are themselves indices between 0 and 1? Yes, you can, but perhaps you shouldn't. As a descriptive technique PCA does not strictly require normality, but it is best suited to continuous, roughly normal variables, and inference on the components assumes multivariate normality. You can still apply it to other types of data; just be aware of those assumptions. As you mentioned, you can also use multiple correspondence analysis (MCA), which may handle categorical variables better, but I am not sure it handles rank variables well.
          HTH
          PS. MCA does allow weights, but only integer weights (see the help file).
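Doing it by hand once, as suggested above, might look like the following Python sketch. It assumes the default correlation-based PCA: each variable is standardized with its sample standard deviation, the leading eigenvector of the correlation matrix is found by power iteration (chosen here for brevity, not necessarily what Stata does internally), and each score is the loading-weighted sum of the standardized values. The data are hypothetical, and fixing the sign so that the first loading is positive is an assumption, since the sign of a component is arbitrary.

```python
import math


def standardize(x):
    """Center on the mean and divide by the sample (n - 1) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / (n - 1))
    return [(a - mean) / sd for a in x]


def first_component(columns, iterations=500):
    """Return (loadings, eigenvalue, scores) for the first principal component."""
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the original variables.
    corr = [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]
    # Power iteration: repeatedly multiply by corr and rescale to unit length.
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    if v[0] < 0:  # assumed sign convention: first loading positive
        v = [-a for a in v]
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(k) for j in range(k))
    # Score for each observation: loadings times standardized values, summed.
    scores = [sum(v[j] * z[j][i] for j in range(k)) for i in range(n)]
    return v, eigenvalue, scores


# Hypothetical 0-1 variables (assumption: not data from this thread).
var1 = [0.10, 0.35, 0.50, 0.65, 0.90]
var2 = [0.20, 0.30, 0.55, 0.60, 0.85]
var3 = [0.80, 0.40, 0.45, 0.70, 0.15]

loadings, eigenvalue, scores = first_component([var1, var2, var3])
print("loadings:", [round(a, 4) for a in loadings])
print("share of variance explained:", round(eigenvalue / 3, 4))
print("scores:", [round(s, 4) for s in scores])
```

Two checks are worth making against software output: the squared loadings should sum to 1, and the sample variance of the scores should equal the first eigenvalue.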

          • #6
            In reply to FernandoRios (#5):
            Thank you very much for your response.

            Unfortunately my weights are continuous.

            So it seems I've two options for my categorical data: PCA with aweight or MCA with no weight. Would you please kindly share your thoughts regarding which is the better one?
            Last edited by Alina Faruk; 29 Jun 2019, 06:11.

            • #7
              I would try both options: PCA with weights, and MCA with rounded weights (mca x1 x2 x3 [fw=round(weight)]).
              They should give you results that are pretty much the same, so you can use one as a robustness check on the other.
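Before relying on rounded weights, it may help to see how much rounding changes a simple weighted statistic. The Python sketch below is only an illustration with invented values and weights: it compares a weighted mean under the continuous weights, under weights rounded to integers, and under weights multiplied by a hypothetical scale factor of 10 before rounding. Note that rescaling frequency weights inflates the apparent number of observations, so it is at most a device for point estimates.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def round_weights(weights, scale=1.0):
    """Round scaled weights to the nearest integer (halves round up)."""
    return [math.floor(w * scale + 0.5) for w in weights]


# Hypothetical data and continuous weights (assumption: not from this thread).
x = [0.2, 0.4, 0.5, 0.7, 0.9]
weight = [0.3, 1.6, 2.4, 3.7, 1.2]

exact = weighted_mean(x, weight)
rounded = round_weights(weight)              # weights below 0.5 become 0
rescaled = round_weights(weight, scale=10)   # keeps one decimal of precision
dropped = sum(1 for w in rounded if w == 0)

print("continuous weights:", round(exact, 4))
print("rounded weights:   ", round(weighted_mean(x, rounded), 4))
print("rescaled weights:  ", round(weighted_mean(x, rescaled), 4))
print("observations dropped by rounding:", dropped)
```

If many weights are below 0.5, plain rounding silently removes those observations, which is one reason the two approaches could disagree.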

              • #8
                In reply to FernandoRios (#7):
                Thank you once again for your valuable input. I shall proceed accordingly.

                • #9
                  In reply to FernandoRios (#7):
                  Dear Mr Fernando,

                  Here is what I got after PCA (wealth_index) & MCA (wealth_score):

                  . codebook wealth_index

                  wealth_index                          Scores for component 1

                            type:  numeric (float)
                           range:  [-2.9482613,9.4468021]    units:  1.000e-11
                   unique values:  8,686                 missing .:  0/156,987
                            mean:   .534504
                        std. dev:   2.58956
                     percentiles:       10%       25%       50%       75%       90%
                                    -2.2407  -1.59959   -.23172   2.29043   4.42641

                  . codebook wealth_score

                  wealth_score                          rowscore (dim=1; standard norm.)

                            type:  numeric (float)
                           range:  [-1.2089345,3.8855598]    units:  1.000e-11
                   unique values:  8,686                 missing .:  0/156,987
                            mean:   .219402
                        std. dev:   1.06443
                     percentiles:       10%       25%       50%       75%       90%
                                   -.922417  -.658399  -.094667   .939474    1.8211

                  It seems the MCA scores are roughly a scalar multiple of the PCA ones? Moreover, component 1 from PCA explained 17.53% of the variation, whereas dimension 1 in MCA explained 70%.

                  So are the results fine here?
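The impression of a scalar multiple can be checked directly from the two codebook summaries above. The Python sketch below simply divides each PCA statistic by the matching MCA statistic, using the numbers reported in this post; the dictionary names are made up for illustration. A near-constant ratio is consistent with the two scores being a rescaling of one another, though it does not by itself show that either index is a good one.

```python
# Summary statistics copied from the codebook output in this post.
pca_stats = {
    "min": -2.9482613, "max": 9.4468021, "mean": 0.534504, "sd": 2.58956,
    "p10": -2.2407, "p25": -1.59959, "p50": -0.23172,
    "p75": 2.29043, "p90": 4.42641,
}
mca_stats = {
    "min": -1.2089345, "max": 3.8855598, "mean": 0.219402, "sd": 1.06443,
    "p10": -0.922417, "p25": -0.658399, "p50": -0.094667,
    "p75": 0.939474, "p90": 1.8211,
}

# Ratio of each PCA statistic to the corresponding MCA statistic.
ratios = {name: pca_stats[name] / mca_stats[name] for name in pca_stats}
for name, ratio in ratios.items():
    print(f"{name:>4}: {ratio:.4f}")

spread = max(ratios.values()) - min(ratios.values())
print(f"spread of ratios: {spread:.4f}")
```

Because the mean scales by about the same factor as the standard deviation and the percentiles, the relationship looks like a pure rescaling with no shift, which is what one would hope for if the two methods rank households similarly.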

                  • #10
                    Hello, I need help. I have two variables, psychological wellbeing-happiness (4 categories) and satisfaction (10 categories), and I am trying to transform them so that they can be readily interpreted in an RCT (treatment variable treat, coded 0/1). Any help, please.

                    • #11
                      #10 is a repeat of a post elsewhere. Please don't do this. It's a waste of everybody's time, starting with yours.

                      • #12
                        Hello Stata users,
                        I'm a student working on my thesis, and I have to construct a food security indicator based on principal component analysis in Stata.
                        Food security has 4 dimensions: availability, access, utilization, and stability.
                        Therefore, I chose an indicator for each dimension of food security:
                        - To measure availability, I chose the variable food availability, which takes into account the availability of food in sufficient quantity and of appropriate quality.
                        - To measure access, I chose gross domestic product per capita (in purchasing power equivalent).
                        - To measure utilization, I chose the variable people using at least basic health services.
                        - To measure dietary stability, I chose the variable variability of food production per capita.
                        I have two questions:
                        1) Should I choose only one variable for each dimension, or would it be better, if possible, to choose two or more variables for each dimension of food security before doing the principal component analysis?
                        2) How is principal component analysis done in Stata, i.e. which commands should be used and what is the process?
                        Please help me.

                        Yours sincerely

                        Translated with www.DeepL.com/Translator (free version)
