  • Index Analysis using Stata PCA and MCA command

    Dear Stata users,

    I am constructing several types of indices using the PCA and MCA commands in Stata, based on various types of survey data (e.g. continuous and/or categorical). The general understanding is that when the variables are continuous we should use Principal Component Analysis (PCA), and when they are categorical, e.g. binary (0/1), we should use Multiple Correspondence Analysis (MCA) to develop an index through an induction method. My questions are as follows:
    • In PCA, what is the use of rotation (i.e. “rotate”) before “predict”, and why do the results differ with and without rotate? Why do we tend not to rotate in the case of MCA?
    • For an Asset index where respondents reported YES/NO on owning each of 20 different items (e.g. radio/television/truck etc.), with no TOTAL NUMBER reported, which method is more appropriate, PCA or MCA? [I am asking this because the commodities listed might not be indifferent like perception-based choices].
    Sincerely,
    Azreen Karim.

  • #2
    PCA takes the data defined by a set of variables and extracts from it "components." These components are linear combinations of those variables that are orthogonal to each other and jointly account for all of the variance in the original variables. The first component has the largest variance of them all, and the rest appear in descending order by variance.

    So, if you wish, the first principal component is the linear combination that extracts the maximum possible amount of variance from the original set of variables.

    Generally speaking, people use PCA when they want to reduce the dimensionality of their data. That is, rather than using some fairly large number of variables in their model, they wish to summarize that data by a smaller number of variables derived from it but still carrying much of the information in the original variables. One way to do this is to do PCA and then retain some, but not all of the components. Typically the first two or three components will, together, account for a big proportion of the variance of the original variables, making them a reasonable trimmed-down approximation of the original data's information.

    But sometimes it doesn't work out that way. Sometimes PCA gives you a bunch of components each of which has a large loading for only one, or maybe two, variables and omits all the rest, so that you end up needing to use nearly all of the components in order to capture an adequate amount of the original variables' variance. That would defeat the usual purpose of doing PCA. Rotation is the solution to this problem. Rotation finds new linear combinations that are still orthogonal (unless you choose an oblique rotation, such as promax). In particular, varimax, promax, and quartimax rotations look specifically for linear combinations that have many variables loading heavily on a component, with the rest loading close to zero. That way, using the first, or first few, rotated components will get you more of the information contained in the original variables than you could get from the unrotated components.
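    On the narrower question of why "predict" gives different results with and without "rotate": after rotate, predict works from the rotated loadings unless told otherwise. A minimal sketch, assuming the auto data and two retained components purely for illustration (the variable names upc*, rpc*, and npc* are made up here):

```stata
* Sketch only: the dataset and components(2) are illustrative choices
sysuse auto, clear
pca length weight trunk headroom displacement, components(2)

predict upc1 upc2, score             // scores from the unrotated components

rotate, varimax                      // orthogonal rotation of the 2 retained components
predict rpc1 rpc2, score             // predict now uses the rotated loadings

predict npc1 npc2, score norotated   // ask for the unrotated scores again

correlate upc1 rpc1 npc1             // upc1 and npc1 agree; rpc1 generally differs
```

    So if an index built from predict changes once rotate has been run, that is expected behaviour rather than an error; the norotated option recovers the earlier scores.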

    I cannot say anything with regard to MCA as I am not at all familiar with that technique.


    • #3
      Thank you so much, Clyde, for such a detailed and helpful response on PCA. As you said, PCA is mostly used to reduce the dimensionality of data. In creating an Asset index where each household answered Yes/No on whether it owns each of 20 different items, would you suggest using PCA?

      Kind regards,
      Azreen.


      • #4
        Well, this is one approach. But PCA is based on Pearson correlations, which may not be the best way to evaluate associations among dichotomous variables. There is a community-contributed command, polychoric, written by Stas Kolenikov, which calculates a polychoric correlation matrix instead. That package also includes a command, -polychoricpca-, which feeds that matrix into principal components analysis. So you might want to use that. To get these commands, launch Stata and run -search polychoricpca-. Then click on the blue link.

        On the other hand, the conceptual model behind polychoric correlations is that the binary responses really represent a dichotomization of a latent continuous variable. When the question is about whether they do or don't own some item, the notion that there is an underlying latent continuous variable behind this is rather a stretch. So maybe ordinary PCA is fine here.
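        For the ordinary-PCA route, a minimal sketch of building an asset index from 0/1 ownership items follows. The data are simulated, and the names item1-item5, wealth, asset_index, and asset_q are hypothetical; the simulated wealth variable is only there to make the items correlated and is not a claim about real ownership data.

```stata
* Sketch with simulated data; all variable names are hypothetical
clear
set seed 2020
set obs 500
gen double wealth = rnormal()                    // simulation device only
forvalues j = 1/5 {
    gen byte item`j' = (wealth + rnormal() > 0)  // 1 = owns item, 0 = does not
}

pca item1-item5, components(1)                   // Pearson-correlation PCA on the 0/1 items
predict asset_index, score                       // first-component score used as the index
xtile asset_q = asset_index, nquantiles(5)       // quintile groups of the index
tabulate asset_q
```

        With only a few binary items the score takes few distinct values, so the quintile groups can be uneven; with 20 items this is less of a concern.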
        Last edited by Clyde Schechter; 11 Apr 2020, 14:39.


        • #5
          Excellent! Thanks a lot, Clyde for such a wonderful clarification on use of PCA in the context of dichotomous variables.

          I would greatly appreciate it if anyone could say whether I have taken the right approach in using Multiple Correspondence Analysis (MCA) for the following Perception Index, where the choices are all binary (0/1).

          mca opt1 opt2 opt3 opt4 opt5
          predict pi

          Thanks & regards,
          Azreen.


          • #6
            Hello everyone!
            I have a follow-up question to this. I am running MCA with categorical variables and wanted to know whether there is an option to obtain a promax rotation for MCA in Stata. Does anyone know? I couldn't figure it out from the Stata mca manual entry or the postestimation commands.
            Thank you!


            • #7
              Dear Clyde Schechter

              I am using Stata's pca command
              Code:
              pca d1-d18
              It produces two tables of output. Table 1 gives the Eigenvalue, Difference, Proportion, and Cumulative columns, and Table 2 gives the component loadings. Based on Table 1, I want to draw a plot that shows the proportion of variance explained by each component and the cumulative proportion. I understand that screeplot, which plots the eigenvalues, conveys the same information, but for now I want a plot that looks like the one attached below:

              [Attached image: screeplot.PNG]

              Please help me with the Stata code; I shall be very thankful.

              Thanks and regards,

              (Ridwan)


              • #8
                Part of this is getting the eigenvalues into a variable, after which there are some simple calculations.

                To get two different y axis scales, see

                Code:
                help axis choice options
                I am not a fan, and so suggest showing the same information differently in two complementary graphs.

                There is still a challenge in getting graphs to align -- and in my case to avoid 0.0 as a label, which I dislike.

                nicelabels is from the Stata Journal.

                See also pcacoefsave from SSC. It's not needed here, but may be useful for nearby problems.


                https://www.statalist.org/forums/for...-for-pca-users

                Code:
                sysuse auto, clear
                
                pca length weight trunk headroom displacement
                
                local J = colsof(e(Ev))
                
                gen eigenvalue = e(Ev)[1, _n]
                
                gen PC = _n in 1/`J'
                
                gen proportion = eigenvalue/`J'
                
                gen cumulative = sum(eigenvalue) / `J'
                
                nicelabels proportion, local(yla)
                
                local yla : subinstr local yla "0 " " "
                
                twoway bar proportion PC, fcolor(stc1*0.1) barw(0.8)  yla(`yla' 0 "0", format(%02.1f) labcolor(stc1)) ///
                || scatter proportion PC, ms(none) mla(proportion) mlabformat(%03.2f) mlabsize(medium) mlabpos(12) mlabc(stc1) ///
                ytitle(Proportion of variance, color(stc1)) name(G1, replace) xsc(off) legend(off)
                
                twoway connected cumulative PC , lc(stc2) yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) labcolor(stc2)) mcolor(stc2) ///
                ytitle(Cumulative proportion, color(stc2)) name(G2, replace) xla(1/`J') mla(cumulative) mlabpos(12) ///
                mlabformat(%03.2f) mlabsize(medium) mlabcolor(stc2)
                
                graph combine G1 G2, col(1) xcommon
                [Attached image: eigenvalue.png]

                Please note also Stata FAQ Advice #18.
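                A note on the division by J in the code above: it relies on pca analysing the correlation matrix (the default), in which case the eigenvalues sum to the number of variables. A minimal sketch of the more general calculation, dividing by the sum of the eigenvalues instead, follows; it copies e(Ev) to a matrix first, which is an arrangement chosen here so that it does not depend on subscripting e(Ev) directly.

```stata
* Sketch only: proportions computed from the eigenvalue total rather than J
sysuse auto, clear
pca length weight trunk headroom displacement

matrix ev = e(Ev)'                        // 5 x 1 column vector of eigenvalues
gen double eigenvalue = ev[_n, 1]         // observations beyond row 5 are missing

summarize eigenvalue, meanonly
display r(sum)                            // 5 here, because the default PCA uses correlations

gen double proportion = eigenvalue / r(sum)
gen double cumulative = sum(proportion) if eigenvalue < .

list eigenvalue proportion cumulative in 1/5
```

                With the covariance option of pca the total would no longer equal the number of variables, and only the r(sum) version would remain correct.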
                Last edited by Nick Cox; 05 Apr 2025, 03:27.


                • #9
                  There is always this version.

                  Code:
                  twoway bar proportion PC, fcolor(stc1*0.1) barw(0.8)  yaxis(1 2) ///
                  yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) labcolor(stc1) axis(1)) ///
                  || scatter proportion PC, ms(none) mla(proportion) mlabformat(%03.2f) mlabsize(medsmall) mlabpos(12) mlabc(stc1) ///
                  ytitle(Proportion of variance, color(stc1) axis(1)) legend(off) ///
                  || connected cumulative PC , lc(stc2) yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) axis(2) labcolor(stc2)) mcolor(stc2) ///
                  ytitle(Cumulative proportion, color(stc2) axis(2)) xla(1/`J') mla(cumulative) mlabpos(12) ///
                  mlabformat(%03.2f) mlabsize(medsmall) mlabcolor(stc2)
                  [Attached image: eigenvalue2.png]


                  • #10
                    Thank you Nick Cox

                    I tried executing your code, but got an error message at the following line:
                    Code:
                    gen eigenvalue = e(Ev)[1, _n]
                    invalid syntax
                    r(198)


                    • #11
                      Are you using Stata 18? Recall our longstanding convention that posters are expected to be using the latest version of Stata unless they explain otherwise.

                      Show us what is visible given

                      Code:
                      eret li
                      after your pca call.
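                      A minimal sketch of the checks being asked for, using the auto data only as a stand-in for the real variables. The error reported in #10 came from Stata 14.2, which suggests that older releases do not accept a subscript applied directly to e(Ev) inside an expression; listing the matrix itself should work regardless.

```stata
* Sketch only: report the running version and what pca leaves behind
sysuse auto, clear
quietly pca length weight trunk headroom displacement

display c(stata_version)      // version of the Stata that is running
ereturn list                  // scalars, macros, and matrices stored in e()
matrix list e(Ev)             // 1 x 5 row vector of eigenvalues
```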
                      Last edited by Nick Cox; 05 Apr 2025, 05:11.


                      • #12
                        Dear Nick Cox . I am using Stata 14.2.

                        I have managed to generate PC, eigenvalue, proportion, and cumulative using an indirect method.

                        Code:
                        sysuse auto, clear
                        
                        pca length weight trunk headroom displacement
                        matrix define ev = e(Ev)'
                        clear
                        svmat ev, names(col)
                        rename r1 eigenvalue
                        
                        gen PC = _n
                        
                        summarize eigenvalue, meanonly
                        scalar total_var = r(sum)
                        gen proportion = eigenvalue / total_var
                        gen cumulative = sum(proportion)
                        This generates all the variables necessary to create a plot.

                        But I am still unable to create the nice-looking twoway plot you showed above. With the code given, how can I generate the same plot?

                        Thanks


                        • #13
                          Thanks for the detail.

                          You don't say why you are unable to create the graph, but presumably you got an error message, which would have been informative.

                          It's the same problem, at least in part. Colours stc1 and stc2 were not defined in Stata 14.2, so you need colours that you can use.

                          Also, your code does not define local macro J.

                          This should work in Stata 14.2, which I can no longer access.

                          Code:
                          sysuse auto, clear
                          
                          pca length weight trunk headroom displacement
                          matrix define ev = e(Ev)'
                          clear
                          svmat ev, names(col)
                          rename r1 eigenvalue
                          
                          gen PC = _n
                          
                          summarize eigenvalue, meanonly
                          scalar total_var = r(sum)
                          gen proportion = eigenvalue / total_var
                          gen cumulative = sum(proportion)
                          
                          nicelabels proportion, local(yla)
                          
                          local yla : subinstr local yla "0 " " "
                          
                          local J = _N 
                          
                          twoway bar proportion PC, fcolor(blue*0.1) barw(0.8)  yla(`yla' 0 "0", format(%02.1f) labcolor(blue)) ///
                          || scatter proportion PC, ms(none) mla(proportion) mlabformat(%03.2f) mlabsize(medium) mlabpos(12) mlabc(blue) ///
                          ytitle(Proportion of variance, color(blue)) name(G1, replace) xsc(off) legend(off)
                          
                          twoway connected cumulative PC , lc(red) yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) labcolor(red)) mcolor(red) ///
                          ytitle(Cumulative proportion, color(red)) name(G2, replace) xla(1/`J') mla(cumulative) mlabpos(12) ///
                          mlabformat(%03.2f) mlabsize(medium) mlabcolor(red)
                          
                          graph combine G1 G2, col(1) xcommon

                          This should also work in your Stata and is perhaps a little more direct.


                          Code:
                          sysuse auto, clear
                          
                          pca length weight trunk headroom displacement
                          matrix define ev = e(Ev)'
                          
                          gen eigenvalue = ev[_n, 1]
                          
                          local J = rowsof(ev)
                          
                          gen PC = _n if eigenvalue < . 
                          
                          summarize eigenvalue, meanonly
                          gen proportion = eigenvalue / r(sum)
                          gen cumulative = sum(proportion)
                          
                          nicelabels proportion, local(yla)
                          
                          local yla : subinstr local yla "0 " " "
                          
                          twoway bar proportion PC, fcolor(blue*0.1) barw(0.8)  yla(`yla' 0 "0", format(%02.1f) labcolor(blue)) ///
                          || scatter proportion PC, ms(none) mla(proportion) mlabformat(%03.2f) mlabsize(medium) mlabpos(12) mlabc(blue) ///
                          ytitle(Proportion of variance, color(blue)) name(G1, replace) xsc(off) legend(off)
                          
                          twoway connected cumulative PC , lc(red) yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) labcolor(red)) mcolor(red) ///
                          ytitle(Cumulative proportion, color(red)) name(G2, replace) xla(1/`J') mla(cumulative) mlabpos(12) ///
                          mlabformat(%03.2f) mlabsize(medium) mlabcolor(red)
                          
                          graph combine G1 G2, col(1) xcommon


                          • #14
                            Thank you, Nick Cox, for sending me the code. I made a minor tweak to your code and was able to create the required plot. Here is what I ran:

                            Code:
                            sysuse auto, clear
                            
                            pca length weight trunk headroom displacement
                            matrix define ev = e(Ev)'
                            clear
                            svmat ev, names(col)
                            rename r1 eigenvalue
                            
                            gen PC = _n
                            
                            summarize eigenvalue, meanonly
                            scalar total_var = r(sum)
                            gen proportion = eigenvalue / total_var
                            gen cumulative = sum(proportion)
                            
                            gen str5 prop_lbl = string(proportion, "%03.2f")
                            gen str5 cum_lbl = string(cumulative, "%03.2f")
                            
                            nicelabels proportion, local(yla)
                            local yla : subinstr local yla "0 " " "
                            
                            count
                            local J = r(N)
                            
                            local xlabels ""
                            
                            forvalues i = 1/`J' {
                               local xlabels `xlabels' `i'
                               
                            }
                            
                            twoway bar proportion PC, fcolor(blue*0.1) barw(0.8) yaxis(1 2) ///
                            yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) labcolor(blue) axis(1)) ///
                            || scatter proportion PC, ms(none) mla(prop_lbl) mlabsize(medium) mlabpos(12) mlabc(blue) ///
                            ytitle(Proportion of variance, color(blue) axis(1)) ///
                            || connected cumulative PC, lc(red) yla(0.2(0.2)0.8 0 "0" 1 "1", format(%02.1f) axis(2) labcolor(red)) mcolor(red) ///
                            ytitle(Cumulative proportion, color(red) axis(2)) xla(`xlabels') ///
                            mla(cum_lbl) mlabpos(12) mlabsize(medium) mlabcolor(red)
                            Thank you again for the help.

                            Regards,
                            (Ridwan)


                            • #15
                              You're welcome.

                              But I can't see that the roundabout use of svmat is any improvement over my code. The use of a scalar to hold the total and a loop to create the x axis labels are not needed either.
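                              To illustrate the point about the loop: a numlist such as 1/5 already expands to the same list of labels that the loop in #14 builds, which is why xla(1/`J') can be typed directly. A minimal sketch, with J set to 5 only for illustration:

```stata
* Sketch only: the loop and the numlist produce the same label list
local J = 5

local xlabels ""
forvalues i = 1/`J' {
    local xlabels `xlabels' `i'
}

numlist "1/`J'"                           // expands the numlist into r(numlist)
display "`xlabels'"
display "`r(numlist)'"
assert "`xlabels'" == "`r(numlist)'"
```

                              Similarly, r(sum) can be used directly in the generate statement after summarize, meanonly, as in the second block of #13, so the scalar is optional.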

                              Are you saying that my code doesn't work in 14.2?
