  • Principal Component Analysis for Binary Variables

    Dear All,

    I have a panel dataset; the dependent variable (Y) is continuous and all my explanatory variables are binary, taking the value 1 or 0. I have followed a specialized literature that uses the machine-learning-based least absolute shrinkage and selection operator (LASSO) to identify the set of dummy explanatory variables that have a non-negligible impact on Y. However, the set of chosen dummy explanatory variables is still large, around 34. Because of high multicollinearity among the dummy variables and the risk of overfitting, it is inadvisable to include all 34 relevant variables additively in the model.

    Having said that, I want to use Principal Component Analysis (PCA) to combine these multiple factors (dummies) into one single factor, and I was looking for which PCA method works best for such a dataset with binary variables. There are various alternatives for combining multiple variables into a single factor, such as pca, polychoricpca, tetrachoric, multiple correspondence analysis (mca), and factor analysis (factor). I am not sure which method is most suitable for the given data.

    I shall be thankful for any suggestions or recommendations.


    Thanks and regards,
    (Ridwan)



  • #2
    It all depends on the goal of your analysis: do you want to understand or predict?
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      As far as I know, pca uses Pearson correlations to compute factor scores, while polychoricpca uses polychoric correlations. Given that your variables are dummies, Pearson correlations are not the correct way to go about it, since they give you correlations on continuous variables. Polychoric correlation takes the discrete nature of your variables into account, so it is a better fit. If all your dummies are binary, polychoric reduces to tetrachoric. If you are considering creating an index from a set of binary variables, there are other methods too, such as Anderson's GLS weighting index, which you can find at https://are.berkeley.edu/~mlanderson...on%202008a.pdf.
      I would suggest you look at the literature in your field and see how it has approached this problem given the type of data you have; different fields prioritize such approaches differently. Coming simply to your question, polychoricpca seems a better choice than ordinary pca, but you should take this advice with a pinch of salt.
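
      A minimal sketch of that route in Stata might look like this (d1-d18 and the observation count are placeholders, not taken from the original question):

      Code:
      * sketch: PCA built on the tetrachoric correlation matrix of the dummies
      tetrachoric d1-d18                // pairwise tetrachoric correlations
      matrix R = r(Rho)                 // correlation matrix left behind in r(Rho)
      pcamat R, n(1000) components(4)   // replace 1000 with your number of observations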

      Comment


      • #4
        It's hard (really hard) to say in advance what will work better. The justification, calculation and interpretation of tetrachoric correlation seem oversold and over-elaborate to me. An orthodox PCA of binary variables -- whether based on the correlation matrix or the covariance matrix -- seems easier all round. As (0, 1) binary variables have in general different variances, that is an area of choice.
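
        As a sketch, both flavours are one line each in Stata (d1-d18 again stands in for the dummies):

        Code:
        * PCA of the correlation matrix: every dummy rescaled to unit variance
        pca d1-d18
        * PCA of the covariance matrix: each dummy keeps its own variance p(1 - p)
        pca d1-d18, covariance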

        On a note of terminology, PCA, factor analysis and multiple correspondence analysis are usually regarded as separate techniques. It's not necessary and not helpful, in my view, to regard them as flavours of PCA in some unduly broad sense.

        Comment


        • #5
          Here are some relevant threads; I hope they are useful:
          Index Analysis using Stata PCA and MCA command
          polychoric and tetrachoric commands (factor analysis on binary variables)
          Re: st: Interpreting Polychoric PCA results in STATA 11

          Comment


          • #6

            #3 edited for typos

            Pearson correlations are not the correct way to go about it, since they give you correlations on continuous variables.
            This is some kind of mix of myth, dogma and irrelevance. Pearson correlation is based on variances and covariance. But the variance of a binary variable is perfectly well defined, as any account of Bernoulli distributions explains. And the definition of covariance is just an extension of the same idea. There is absolutely nothing in the definition that presupposes continuous variables. So, there is nothing incorrect about using correlation directly for binary variables. How useful it might be is unpredictable.

            Any binary variable that is always 0 or always 1 can't be used in correlation, but this is on all fours with the fact that any variable at all that is constant can't be used either, because its variance is zero.
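
            A quick simulated sketch of the point (made-up data, nothing from the original question):

            Code:
            * correlation of binary variables is perfectly well defined
            clear
            set obs 1000
            set seed 2803
            gen byte x = runiform() < 0.3
            gen byte y = runiform() < 0.2 + 0.4*x   // y depends on x by construction
            summarize x                             // mean is p; s.d. is about sqrt(p(1 - p))
            correlate x y                           // this Pearson correlation is just the phi coefficient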

            What I guess is being alluded to is a fantasy that observed 0, 1 values arise from a continuous underlying or latent variable, for example, one which is really normal, in which case there is a recipe for estimating the underlying correlation. But this recipe is voluntary, not compulsory. Researchers will vary in how plausible or relevant they think such a fantasy is to their data and goals. This is an old and sometimes angry debate that goes back a century and more, if not further, say to Karl Pearson and G.U. Yule. In anachronistic terms, Pearson seems never to have accepted the idea that categorical data of any kind deserved their own special methods. His bias was (crudely) to regard categorical data as degraded continuous data.

            I don't object, naturally, to anyone using polychoric or tetrachoric correlation if they think it fits their set-up. A minimum requirement for any study worth publishing would, however, seem to be some kind of comparison between different ways of getting component scores, or some equivalent.



            Comment


            • #7
              Thank you Maarten Buis, Rajdeep Chaudhuri and Chen Samulsion for responding to my query. I really appreciate it.

              Thank you Nick Cox for the explanation that helps to understand it better.

              An orthodox PCA of binary variables -- whether based on the correlation matrix or the covariance matrix -- seems easier all round.
              Following your discussion in #4 and #6, I will go with ordinary PCA as recommended. However, I am curious to explore polychoricpca just for comparison. Unfortunately, I cannot find polychoricpca either on SSC or at any other external link.

              Code:
              search polychoricpca
              net describe polychoric, from(https://staskolenikov.net/stata/)
              ssc install polychoricpca
              ssc describe p
              None of these commands locates or installs polychoricpca.

              (1) Can anyone help me install polychoricpca or post the .ado file here?

              (2) Using ordinary pca, the first four components explain 80% of the variance. I ran the following code to combine the dummies into a single factor using the first four components, assigning equal weight to each component.

              Code:
              pca d1-d18, components(4)
              estat loadings
              screeplot, xlabel (1(1)18)
              
              predict pc1 pc2 pc3 pc4
              gen pca_factor = (pc1 + pc2 + pc3 + pc4)/4
              Alternatively, I calculate the variance explained by each component k as Variance_k = Eigenvalue_k/18, for k = 1, 2, 3, 4,
              where Eigenvalue_k denotes the eigenvalue of component k. I then combine the variables using the first four components, but with the variance explained by each component as the weight, rather than the equal weights used above.

              Code:
              gen pca_factor = (Variance_1*pc1 +Variance_2*pc2 + Variance_3*pc3 +Variance_4*pc4)/4
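              For completeness, a sketch of the same weighting pulled from the stored results rather than typed by hand (e(Ev) is the row vector of eigenvalues that pca leaves behind; the new variable name is arbitrary):

              Code:
              * sketch: same arithmetic as above, with Variance_k = Ev[1,k]/18, then averaged over 4 components
              matrix Ev = e(Ev)
              gen double pca_factor2 = (Ev[1,1]*pc1 + Ev[1,2]*pc2 + Ev[1,3]*pc3 + Ev[1,4]*pc4)/(18*4)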
              My question is: am I doing (2) correctly, and which pca_factor is preferred, the one based on equal weights or the one based on variance weights?


              Thanks and regards,
              (Ridwan)

              Comment


              • #8
                My understanding is that PC1 is the best single summary of your data. I am curious why people think that an average of more components than 1 can be any better. You'd then be mushing together components that by construction are uncorrelated. Weighting by eigenvalues gives another kind of mush. I don't recollect either procedure being recommended in any text. I do detect this idea as a myth passed around in some literature, which is not surprising.

                I'd welcome authoritative references explaining why this is a good idea, meaning not just an empirical paper where someone did this, but say a text on PCA.

                polychoricpca has never been on SSC to my knowledge.

                Code:
                net from https://staskolenikov.net/stata 
                net install polychoric
                should work. The points are that describing a package doesn't install it, and that polychoricpca is not the name of the package but of a command within it. ssc install installs packages from SSC only.

                Comment


                • #9
                  Run the command below and click the install link shown in blue; after installing, type help polychoricpca in the Command window.

                  Code:
                  . net describe polychoric, from(https://staskolenikov.net/stata/)
                  
                  -----------------------------------------------------------------------------------------------------------------
                  package polychoric from https://staskolenikov.net/stata
                  -----------------------------------------------------------------------------------------------------------------
                  
                  TITLE
                        polychoric -- The polychoric correlation package
                  
                  DESCRIPTION/AUTHOR(S)
                        
                        Author: Stas Kolenikov, [email protected]
                        
                        This package provides routines to estimate
                        the polychoric, tetrachoric, polyserial and biserial
                        correlations and use them in principal component analysis.
                        Current version: 1.4
                  
                  INSTALLATION FILES                           (type net install polychoric)
                        polychoric.ado
                        polychoricpca.ado
                        polych_ll.ado
                        polyser_ll.ado
                        polychoric.hlp
                        polychoricpca.hlp
                  -----------------------------------------------------------------------------------------------------------------

                  Comment


                  • #10
                    Thanks Nick Cox and Chen Samulsion

                    My understanding is that PC1 is the best single summary of your data. I am curious why people think that an average of more components than 1 can be any better.
                    In reference to the quoted text, I am attaching my results.

                    [Attached image 4.PNG: PCA output table showing the variance explained by each component]


                    PC1 explains 46% of the total variance and PC2 explains 22%, so the cumulative variance of the first two components is 68%; that was my reason for using both PC1 and PC2 (the first two components, if not four) to create a composite summary of the variables in the dataset.
                    Code:
                    gen pca_factor = (0.461*pc1 + 0.224*pc2)/2
                    gen pca_factor = (pc1 + pc2)/2


                    I understand that PC1 is uncorrelated with PC2 by construction. Combining the two components using either equal weights or the proportion of variance explained as weights (as above) was based on my own intuition, and I may be wrong. I am not aware of any literature that recommends this approach. Based on the output (results table) above, if you have any further suggestions, I shall be very thankful.

                    The code/links posted above to install the polychoric package still do not work in my case and produce an error message.

                    Code:
                    net from https://staskolenikov.net/stata
                    net install polychoric
                    net describe polychoric, from(https://staskolenikov.net/stata/)
                    Thank you

                    Comment


                    • #11
                      Why not just average your binary variables? One answer is that averaging different answers to quite different questions makes no obvious sense. At least you know that some of your variables are correlated.

                      Now you need to explain why averaging PCs is a good idea, not to me or to us necessarily on Statalist, but in principle to whatever examiners, reviewers, and so on are going to evaluate your work.

                      (Averaging and adding are naturally equivalent here as far as using them in regression is concerned.)

                      The existence of papers doing this is not convincing. At some level we all work with details we don't fully understand and papers get published that should not have been. I am currently writing about a procedure (nothing to do with PCA) which most textbooks don't explain correctly. Those I have seen just seem to copy from other equally unreliable textbooks or from confident but confused internet sources.

                      I can't and don't rule out using more than one PC as separate predictors.
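
                      As a sketch, that just means entering the scores jointly in whatever regression you run (the fixed-effects panel specification and the variable names below are only guesses at your set-up):

                      Code:
                      * sketch: component scores as separate predictors, not averaged
                      * panelvar and timevar are placeholders for your own identifiers
                      xtset panelvar timevar
                      xtreg y pc1 pc2 pc3 pc4, fe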

                      Sorry, but the only advice on local IT problems is to ask locally. We can't see any details of your IT set-up -- whether you are linked to the internet directly or through a university, other employer, or otherwise -- so we are completely ignorant of why certain links won't work for you. In any case you don't show us the error message to comment on.

                      Comment


                      • #12
                        Thank You Nick Cox for your help,

                        I can't and don't rule out using more than one PC as separate predictors
                        I think it is better to use the influential PCs as separate predictors in the model, rather than combining or averaging them into a single measure.

                        In any case you don't show us the error message to comment on.
                        I am linked to the internet directly, not through a university or otherwise. The error message I get when trying to install the polychoric package is

                        sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPath
                        > BuilderException: unable to find valid certification path to requested target

                        https://staskolenikov.net/stata/ either
                        1) is not a valid URL, or
                        2) could not be contacted, or
                        3) is not a Stata download site (has no stata.toc file).

                        r(5100);

                        Thanks and regards,

                        Comment


                        • #13
                          I recommend contacting your IT provider. 1) 2) 3) are wrong, but your Internet connection is being fastidious.

                          Comment


                          • #14
                            Thank you very much Nick Cox . I will contact my IT provider.

                            Regards,
                            (Ridwan)

                            Comment


                            • #15
                              Nick makes excellent points here. You probably want to figure out how many dimensions your group of variables has, rather than averaging the 2 PCs. You may find a scree plot to be useful.
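
                              A sketch of that check, reusing the dummies from the earlier posts (paran is a community-contributed command for Horn's parallel analysis, so treat those lines as optional):

                              Code:
                              * sketch: how many dimensions do the dummies really have?
                              pca d1-d18
                              screeplot, yline(1)       // scree plot with the eigenvalue = 1 reference line
                              * ssc install paran       // optional second opinion: parallel analysis
                              * paran d1-d18, graph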


                              Comment
