  • Principal component analysis + plot explained variance

    Hi,

    I would like to display the percentage of variance explained on the right-hand side of the graph and keep the eigenvalues on the left-hand side.

    Code:
    use http://www.stata-press.com/data/r13/audiometric
    pca l* r*, comp(4)
    screeplot, ci mean
    I'd also like to change the title of the graph to "Principal Component". How can I do that? Thanks.

  • #2
    Code:
    help title_options
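
As a minimal sketch of what that help entry covers, assuming the question is about the overall graph title: screeplot accepts the standard twoway title options, so title() can be added to the command from #1. If the intent was instead to rename the x-axis, xtitle() would be the option to try.

```stata
* same data and model as in #1
use http://www.stata-press.com/data/r13/audiometric, clear
pca l* r*, comp(4)
* title() replaces the default graph title
screeplot, ci mean title("Principal Component")
```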

    • #3
      While I would appreciate being able to change the title (a minor concern), my main issue is accessing the variance so that I can create a graph with two y-axes (the title of my post). Can anyone help with that? Thanks.

      • #4
        Typing ereturn list will show what is stored after you run pca. I'm not quite clear what you mean by "the variance", but you will find that matrix e(Psi) contains the estimates of unexplained variance, and from there you should be able to get what you want. If you want to plot the proportion of variance associated with each component, the eigenvalues are stored in e(Ev), and you can divide each eigenvalue by their sum to recover the proportion of variance associated with it.
        Richard T. Campbell
        Emeritus Professor of Biostatistics and Sociology
        University of Illinois at Chicago
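
The recovery step described in #4 can be sketched as follows. This is only an illustration on the auto data, assuming pca is run without comp() so that e(Ev) holds every eigenvalue; the names Ev, T, total and prop are made up for the example.

```stata
sysuse auto, clear
pca price mpg headroom trunk weight length gear_ratio
* inspect what pca stores, including e(Ev) and e(Psi)
ereturn list
* e(Ev) is a row vector of eigenvalues
matrix Ev = e(Ev)
local p = colsof(Ev)
* sum the eigenvalues by multiplying with a column of ones
matrix T = Ev * J(`p', 1, 1)
local total = T[1,1]
* proportion of variance associated with each component
matrix prop = Ev / `total'
matrix list prop, format(%6.4f)
```

Because pca analyses the correlation matrix by default, the sum of the eigenvalues equals the number of variables, so dividing by `p' would give the same proportions here.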

        • #5
          Dick, thanks a lot for your help!

          I am trying to plot the fraction of variance explained by the nth principal component, that is, the nth largest eigenvalue of the correlation matrix divided by the number of components.

          In other words, I'd like to have the percentage of explained variance on the left-hand y-axis and the eigenvalues on the right-hand y-axis. Do you know how to do that? Thanks!

          • #6
            Here is an example with some techniques to produce the graph. I use the cumulative percent, which seems more relevant to me, but you can change it to the actual percent:
            Code:
            clear
            sysuse auto
            pca price mpg headroom trunk weight length gear_ratio
            * copy the eigenvalues (a row vector) from e(Ev) into a variable
            mat eigenvalues = e(Ev)
            gen eigenvalues = eigenvalues[1,_n]
            * total of the eigenvalues, then the running (cumulative) percent
            egen pct=total(eigenvalues) if !mi(eigenvalues)
            replace pct=sum((eigenvalues/pct)*100) if !mi(eigenvalues)
            g component=_n if !mi(eigenvalues)
            * eigenvalues on the left axis, cumulative percent on the right axis
            twoway line eigenvalues component, sort ytitle(Eigenvalues) ///
            yline(1, lwidth(medium) lcolor(red)) ///
            || line pct component, sort yaxis(2) ytitle(Cumulative percent explained variance, axis(2)) ///
            xlabel(1/7) xtitle(Number of components) legend(off) ///
            title(Scree plot of eigenvalues after pca)

            • #7
              See also http://www.statalist.org/forums/foru...-for-pca-users

              • #8
                The question keeps changing: eigenvalues on left? eigenvalues on right? But no matter: here is some technique. Packages from SSC are required and installed as needed, but this will fail if you have a firewall preventing that. The code wires in the number of PCs used, here all of them.

                Code:
                set scheme s1color
                use http://www.stata-press.com/data/r10/audiometric
                pca l* r*
                
                capture ssc inst pcacoefsave
                pcacoefsave using audiopcaresults
                u audiopcaresults
                describe
                
                Contains data from audiopcaresults.dta
                  obs:            64                          
                 vars:             8                          10 Jul 2016 11:20
                 size:         1,728 (99.9% of memory free)
                ---------------------------------------------------------------------------------------------
                              storage  display     value
                variable name   type   format      label      variable label
                ---------------------------------------------------------------------------------------------
                varname         byte   %8.0g       names      variable
                varlabel        byte   %18.0g      labels     variable
                PC              byte   %8.0g                  
                corr            float  %9.0g                  correlation
                loading         float  %9.0g                  coefficient
                eigenvalue      float  %9.0g                  
                mean            float  %9.0g                  
                SD              float  %9.0g                  standard deviation
                ---------------------------------------------------------------------------------------------
                Sorted by:  
                
                capture ssc inst mylabels
                mylabels 0(10)50, myscale(8 *@/100) local(yla)
                
                twoway connected eigenvalue PC, sort ///
                yaxis(1 2)                           ///
                yla(`yla', ang(h) axis(1)) ytitle(% variance, axis(1))  ///
                yla(, ang(h) axis(2)) xla(1/8)
                [Figure: scree.png]
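
For readers who cannot install the SSC packages, the two-axis idea in #8 can be approximated with built-in commands. This is a sketch, not Nick's code: it assumes pca on the correlation matrix (the default) with all components kept, so that percent variance is 100 times the eigenvalue divided by the number of variables. The variables eigenvalue and PC are built by hand from e(Ev), and the locals p, pos and pctlab are illustrative names. Here the eigenvalues are labelled on the left axis and percent variance on the right, as requested in #1.

```stata
use http://www.stata-press.com/data/r13/audiometric, clear
pca l* r*
* put the eigenvalues into a variable, one per observation
matrix Ev = e(Ev)
local p = colsof(Ev)
gen eigenvalue = Ev[1, _n]
gen PC = _n if !missing(eigenvalue)
* axis-2 labels: round percents placed at the matching eigenvalue positions
local pctlab
forvalues pct = 0(10)50 {
    local pos = `pct' * `p' / 100
    local pctlab `pctlab' `pos' "`pct'"
}
* one series, two y-axes: eigenvalues on the left, % variance on the right
twoway connected eigenvalue PC, sort yaxis(1 2)                        ///
    ylabel(, angle(h) axis(1)) ytitle("Eigenvalue", axis(1))           ///
    ylabel(`pctlab', angle(h) axis(2)) ytitle("% variance", axis(2))   ///
    xlabel(1/`p') xtitle("Principal component")                        ///
    title("Principal Component")
```

Only the labels on the second axis change; both axes still describe the same plotted eigenvalues, which is why a single series is enough.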

                • #9
                  Thanks a lot to both of you, that really helps. In Nick's code, is it possible to keep the horizontal line and show the value at which it intersects the "% variance" axis (the "mean" option in my initial code)? Thanks!

                  • #10
                    I have figured out the "mean" line that was available in Oded's code:
                    yline(1, lwidth(medium) lcolor(red)) ///
                    How do I find the "% variance" value at which the two lines intersect, though? Thanks.

                    • #11
                      Sorry, no idea what that means. What two lines, for example? I see only one on my graph.

                      • #12
                        There is not much point in moving the horizontal line in #6 to the intersection point, since the usual criterion for choosing the number of components in pca is to retain those with eigenvalues above 1, which is where the reference line sits.

                        • #13
                          Nick: I am referring to the line I had in my #1 post, by including the option "mean".

                          Oded: It does have a meaning. The horizontal line at eigenvalue = 1 corresponds to some XX% of variance: the percentage that each component would explain if all the variables (l* r*) were uncorrelated with each other. I'd like to know this value and display it on the graph if possible. Thanks!

                          • #14
                            The mean eigenvalue is just 1, so yline(1) suffices.

                            • #15
                              Yes, I know that the eigenvalue is one. What is the exact corresponding % variance explained? Judging from your graph, it should be around 12.5-13%. How can I get the exact value?

                              By the way, this is not a strange request: I am trying to replicate the results of a paper from a top finance journal (page 7):
                              http://users.cla.umn.edu/~jianfeng/Anomalies_JFE_12.pdf
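
One way to pin down the number asked about here: when pca analyses the correlation matrix, the eigenvalues sum to the number of variables p, so an eigenvalue of 1 corresponds to 100/p percent of the total variance. With eight variables, as the xla(1/8) in #8 suggests for l* r*, that is exactly 12.5%. The sketch below assumes pca is run without comp() so that e(Ev) holds all p eigenvalues; the local names p and pctmean are illustrative.

```stata
use http://www.stata-press.com/data/r13/audiometric, clear
pca l* r*
* number of eigenvalues = number of variables analysed
matrix Ev = e(Ev)
local p = colsof(Ev)
* percent of total variance matching an eigenvalue of 1
local pctmean : display %4.1f 100/`p'
display "An eigenvalue of 1 corresponds to `pctmean'% of the variance"
* show the mean line and report the value in a note under the graph
screeplot, mean title("Principal Component") ///
    note("Mean eigenvalue (1) = `pctmean'% of total variance")
```

Reporting the value in note() avoids having to place text at exact graph coordinates; text() would be an alternative if the label needs to sit next to the line.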
