Principal Component Analysis and index construction with variables 0-1

Facundo Arganaraz

Join Date: Feb 2018

Posts: 1
#1

07 Feb 2018, 15:15

This is my first post. I am an undergraduate student working on my thesis.

I am working on the construction of an index based on three variables that take values between 0 and 1. I should say that these variables are themselves means of other variables that take values between 0 and 1. I think that my variables are highly correlated, so I am using Principal Component Analysis (PCA) to obtain a specification for my index. My first question is: am I right? Can I use PCA for index construction when the variables take values between 0 and 1 and are correlated? (Q1) Secondly, searching the literature, I found that different papers use this methodology for index construction (though not necessarily with 0-1 variables), and what I understand from them is that the components are used as weights that combine the variables in a formula, which is the index. That is, if the components are 0.3, 0.4, and 0.7, my index would be:

I = 0.3*var1 + 0.4*var2 + 0.7*var3

Is my understanding correct? Is this the way to use PCA? (Q2)

But other researchers use the proportions of variance explained (which sum to 1) to weight the variables, instead of the components. Which strategy is more commonly used? Does it depend on my data, specification, goal, etc.? (Q3)

Thank You.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#2

08 Feb 2018, 02:11

Q1. You can do PCA. Whether it's a good idea is a different matter. You can find literatures in which this is a popular pastime and literatures in which it's regarded as pointless pseudoscience.

For just three variables, I would use them all as predictors directly in a regression-type model and then look at the results. They're already averages of other variables. Don't mush them together. Or if they are really highly correlated, choose one.

Q2. I don't think I understand what you're getting at here. But to calculate a single index after PCA in Stata, you'd just use the predict command. It gives you a weighted sum of your original variables (standardized, by default), along the lines of your equation.

Q3. I find it hard to say what is popular in my own field because I don't claim to read literature systematically in it. I think at a minimum you'd need to declare your field to get any comments on that.

I'd always, always recommend looking at the correlation matrix and a scatter plot matrix.
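
A minimal sketch of that first look, assuming the three 0-1 variables are stored as x1, x2 and x3 (hypothetical names, not taken from the original post):

```stata
* x1, x2, x3 are hypothetical names for the three 0-1 variables
correlate x1 x2 x3              // pairwise correlation matrix
graph matrix x1 x2 x3, half     // scatter plot matrix, lower triangle only
```

If the correlations turn out to be modest, or the scatter plots show clumping at 0 or 1, that is worth knowing before any PCA is attempted.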
Timo Ar

Join Date: Feb 2019

Posts: 1
#3

26 Feb 2019, 10:22

Hi Nick,

Does the predict command multiply the loadings by the respective variables' values? I can't work out how the predicted values are calculated.

Thanks!
Alina Faruk

Join Date: Oct 2018

Posts: 96
#4

29 Jun 2019, 00:42

Hi,

I also have the same question as the last post in this thread.

Does predict multiply the loadings by the variables, or do I need to do that separately to arrive at the index?

Additional question: do I need to standardize my variables for PCA?
FernandoRios

Join Date: Apr 2014

Posts: 2429
#5

29 Jun 2019, 05:02

Hi Alina,
As Nick mentioned, predict automatically multiplies the loadings by the variables to produce as many components as you request.
If you type "predict c1", only the first component will be created. If you type "predict c1 c2 c3", the first three components will be created. Of course, it would be a good idea to do this by hand once, so that you are confident you know what Stata is doing (instead of applying it blindly).
Regarding the second question: it depends. By default, pca works from the correlation matrix, which amounts to "internally" standardizing all variables, so the loadings and component scores are based on standardized data. You can request otherwise with the covariance option.
This was perhaps asked previously: can you use PCA when the variables are themselves indices between 0 and 1? Yes, you can, but perhaps you shouldn't. PCA is designed for continuous variables, and inference about its results (for example, standard errors for the loadings) assumes multivariate normality. You can still apply it descriptively to other types of data; just be aware that those assumptions may not hold. As you posted elsewhere, you can also use multiple correspondence analysis (MCA), which may handle categorical values better, although I am not sure it handles ranked (ordinal) variables well.
HTH
PS: mca does allow weights, but only if the weights are integers (see the help file).
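
The "by hand" check could look something like the sketch below. It assumes three hypothetical variables x1, x2 and x3, no weights, and the default correlation-based PCA; the names pc1, pc1_hand and z_* are made up for illustration.

```stata
* x1, x2, x3 are hypothetical variable names
pca x1 x2 x3                     // default: PCA of the correlation matrix
predict pc1, score               // first component score as computed by Stata

matrix L = e(L)                  // eigenvectors (loadings): rows = variables, columns = components

* standardize each variable over the estimation sample
foreach v of varlist x1 x2 x3 {
    egen double z_`v' = std(`v') if e(sample)
}

* by hand: first-component loadings times the standardized values
generate double pc1_hand = L[1,1]*z_x1 + L[2,1]*z_x2 + L[3,1]*z_x3

* the two versions should agree up to rounding error
compare pc1 pc1_hand
```

The point of the exercise is that the score is a sum of loading times standardized value, not loading times raw value, which is why the standardization question matters.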
Alina Faruk

Join Date: Oct 2018

Posts: 96
#6

29 Jun 2019, 05:09

Thank you very much for your response.

Unfortunately my weights are continuous.

So it seems I've two options for my categorical data: PCA with aweight or MCA with no weight. Would you please kindly share your thoughts regarding which is the better one?

Last edited by Alina Faruk; 29 Jun 2019, 05:11.
FernandoRios

Join Date: Apr 2014

Posts: 2429
#7

29 Jun 2019, 05:40

I would try both options: PCA with weights, and MCA with rounded weights (mca x1 x2 x3 [fw=round(weight)]).
They should give you results that are pretty much the same, so you can use one as a robustness check on the other.
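
Spelled out, the two runs might look like this. It is only a sketch: x1, x2, x3 and weight are placeholder names, wt_int is a made-up helper variable, and the names wealth_index and wealth_score are just labels for the two sets of scores.

```stata
* x1, x2, x3 (categorical) and weight (continuous) are placeholder names

* Option 1: PCA with the continuous weights as analytic weights
pca x1 x2 x3 [aweight = weight]
predict wealth_index, score          // score for the first component

* Option 2: MCA with weights rounded to integers (frequency weights must be integers)
generate long wt_int = round(weight)
mca x1 x2 x3 [fweight = wt_int]
predict wealth_score                 // by default, row scores for the first dimension
```

One thing to watch for: any weight below 0.5 rounds to zero, and observations with a frequency weight of zero do not contribute to the MCA, so the two analyses may not use exactly the same sample.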
Alina Faruk

Join Date: Oct 2018

Posts: 96
#8

29 Jun 2019, 06:01

Thank you once again for your valuable input. I shall proceed accordingly.
Alina Faruk

Join Date: Oct 2018

Posts: 96
#9

29 Jun 2019, 20:52

Dear Mr Fernando,

Here is what I got after PCA (wealth_index) & MCA (wealth_score):

. codebook wealth_index

wealth_index                                       Scores for component 1

                  type:  numeric (float)

                 range:  [-2.9482613,9.4468021]       units:  1.000e-11
         unique values:  8,686                    missing .:  0/156,987

                  mean:   .534504
              std. dev:   2.58956

           percentiles:       10%       25%       50%       75%       90%
                          -2.2407  -1.59959   -.23172   2.29043   4.42641

. codebook wealth_score

wealth_score                                 rowscore (dim=1; standard norm.)

                  type:  numeric (float)

                 range:  [-1.2089345,3.8855598]       units:  1.000e-11
         unique values:  8,686                    missing .:  0/156,987

                  mean:   .219402
              std. dev:   1.06443

           percentiles:       10%       25%       50%       75%       90%
                         -.922417  -.658399  -.094667   .939474    1.8211

It seems the MCA scores are essentially a rescaled version of the PCA ones? Moreover, component 1 from the PCA explained 17.53% of the variation, whereas dimension 1 in the MCA explained 70%.

So are the results fine here?
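
The summary statistics above are at least consistent with a rescaling: the ratios of the two means, standard deviations and range endpoints all come out at roughly 2.43 to 2.44. A direct check, sketched here using the two score variables already created, would be:

```stata
* wealth_index (PCA) and wealth_score (MCA) are the variables shown in the codebook output
correlate wealth_index wealth_score     // a correlation near 1 points to a linear rescaling
scatter wealth_score wealth_index       // points should lie close to a straight line
```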
Guest
#10

04 Mar 2020, 01:55

Hello, I need help. I have two variables, psychological wellbeing/happiness (4 categories) and satisfaction (10 categories), and I am trying to transform them so that they can be readily interpreted in an RCT (treatment variable treat, coded 0/1). Any help, please.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#11

04 Mar 2020, 02:11

#10 is a repeat of a post elsewhere. Please don't do this. It's a waste of everybody's time, starting with yours.
Etudiant Thiam

Join Date: Mar 2020

Posts: 5
#12

08 Mar 2020, 09:19

Hello Stata users,
I'm a student working on my thesis, and I have to construct a food security indicator using principal component analysis in Stata.
Food security has 4 dimensions: availability, access, utilization, and stability.
I therefore chose one indicator for each dimension of food security:
- To measure availability, I chose the variable food availability, which takes into account the availability of food in sufficient quantity and of appropriate quality.
- To measure access, I chose gross domestic product per capita (in purchasing power equivalent).
- To measure utilization, I chose the variable people using at least basic health services.
- To measure stability, I chose the variable variability of food production per capita.
I have two questions:
1) Should I choose only one variable for each dimension, or would it be better to choose, if possible, two or more variables for each dimension of food security for the principal component analysis?
2) How is principal component analysis done in Stata, i.e. which commands should be used and what is the process?
Please help me.

Yours sincerely
