  • Principal component analysis

    Hello, I am using PCA to create a composite measure from a set of five variables. I have two questions:
    1. Does it matter that the variables I put into the PCA are not interval or ratio? All of them are ordinal, on a scale from 1 to 5.
    2. After running the PCA, I obtained two components that capture most of the variance in the data, and I added the two new variables to the data using the following syntax:

    Code:
    predict pc1 pc2, score
    However, I have to run a mixed model using the newly created composite measure (that is, the component scores pc1 and pc2), and I only need one outcome. Which one should I use? Can I further aggregate the two components into one, or what else should I do to obtain a single outcome for the mixed model?

    Thanks

  • #2
    Your last question really depends on theoretical rather than statistical considerations. Were all of the items intended to be generated by a single latent construct? If so, you might have 2 principal components for several reasons. Perhaps some items are poor measures of the main construct and should be excluded from the composite measure. Sometimes method bias in a questionnaire will result in positively worded items loading on one factor while negatively worded items load on a 2nd. Perhaps the 2 principal components that you found represent distinct sub-concepts that are themselves highly associated (e.g., in terms of a 2nd-order factor model, there is a more general latent construct that explains the sub-components). But you've certainly not presented anything here that would allow any of us to say whether you should use pc1 or pc2 as the single outcome in a mixed model.

    I'm generally not a fan of using factor scores. While they are "optimally weighted" based on the principal components analysis of your current sample, applying the same technique in a replication sample would likely produce different weights, optimal only for that sample. At least in behavioral medicine and psychology, folks often use existing measures that are scored using a known and published protocol. Using factor scores generated for each individual sample basically means that the exact same measure is not being used in the different studies. I'd generally use the PCA to guide item selection and then simply create summated indexes.

    As to the use of PCA with ordered categorical variables, I'd say you're probably OK if the distribution of the items is unimodal and there are no strong floor or ceiling effects, where lots of observations are clustered at the minimum or maximum values. It would be better to use techniques appropriate for ordered categorical variables.

    I also wonder about the choice of PCA rather than exploratory factor analysis. FA is generally recommended if the observed indicators are conceptualized as reflecting a common latent construct.
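
    A minimal sketch of these checks, assuming five hypothetical items named item1 through item5 (substitute your own variable names). Note that polychoric is a user-written command rather than part of official Stata, so it has to be installed first; search polychoric should locate it. Treat this as a starting point, not a recipe.

    Code:
    * Tabulate each ordinal item to look for floor or ceiling effects
    tab1 item1-item5

    * Exploratory factor analysis (principal factors) instead of PCA
    factor item1-item5, pf
    rotate, promax

    * PCA on a polychoric correlation matrix (requires user-written polychoric)
    * Count complete cases first, because polychoric overwrites r()
    quietly count if !missing(item1, item2, item3, item4, item5)
    local n = r(N)
    polychoric item1-item5
    matrix R = r(R)
    pcamat R, n(`n') components(2)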


    • #3
      Hi Brad, thanks so much for your detailed explanation. Below is the syntax I used.

      Code:
      pca ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW, comp(2) blanks(.3)
      
      Principal components/correlation                  Number of obs    =     50582
                                                        Number of comp.  =         2
                                                        Trace            =         5
          Rotation: (unrotated = principal)             Rho              =    0.6477
      
          --------------------------------------------------------------------------
             Component |   Eigenvalue   Difference         Proportion   Cumulative
          -------------+------------------------------------------------------------
                 Comp1 |      1.94192      .645316             0.3884       0.3884
                 Comp2 |       1.2966      .493381             0.2593       0.6477
                 Comp3 |      .803219      .205737             0.1606       0.8083
                 Comp4 |      .597482        .2367             0.1195       0.9278
                 Comp5 |      .360782            .             0.0722       1.0000
          --------------------------------------------------------------------------
      
      Principal components (eigenvectors)  (blanks are abs(loading)<.3)
      
          ------------------------------------------------
              Variable |    Comp1     Comp2 | Unexplained 
          -------------+--------------------+-------------
              ES14_NEW |             0.6469 |       .2988 
                  PP10 |             0.6421 |       .3052 
              PC26_NEW |   0.4123           |       .6531 
              PC21_NEW |   0.5840           |       .2406 
              PC20_NEW |   0.5699           |       .2638 
          ------------------------------------------------

      Code:
       rotate, varimax blanks(.3)
      
      Principal components/correlation                  Number of obs    =     50582
                                                        Number of comp.  =         2
                                                        Trace            =         5
          Rotation: orthogonal varimax (Kaiser off)     Rho              =    0.6477
      
          --------------------------------------------------------------------------
             Component |     Variance   Difference         Proportion   Cumulative
          -------------+------------------------------------------------------------
                 Comp1 |      1.83192      .425326             0.3664       0.3664
                 Comp2 |       1.4066            .             0.2813       0.6477
          --------------------------------------------------------------------------
      
      Rotated components  (blanks are abs(loading)<.3)
      
          ------------------------------------------------
              Variable |    Comp1     Comp2 | Unexplained 
          -------------+--------------------+-------------
              ES14_NEW |             0.7072 |       .2988 
                  PP10 |             0.7034 |       .3052 
              PC26_NEW |   0.4225           |       .6531 
              PC21_NEW |   0.6449           |       .2406 
              PC20_NEW |   0.6368           |       .2638 
          ------------------------------------------------
      
      Component rotation matrix
      
          ----------------------------------
                       |    Comp1     Comp2 
          -------------+--------------------
                 Comp1 |   0.9108    0.4129 
                 Comp2 |  -0.4129    0.9108 
          ----------------------------------
      Below is how I obtained the two component scores.
      Code:
       estat loadings
      
      Principal component loadings 
          component normalization: sum of squares(column) = 1
      
          ----------------------------------
                       |    Comp1     Comp2 
          -------------+--------------------
              ES14_NEW |    .2858     .6469 
                  PP10 |    .2873     .6421 
              PC26_NEW |    .4123    -.1138 
              PC21_NEW |     .584    -.2737 
              PC20_NEW |    .5699    -.2852 
          ----------------------------------
      
      . predict pc1 pc2, score
      
      Scoring coefficients for orthogonal varimax rotation
          sum of squares(column-loading) = 1
      
          ----------------------------------
              Variable |    Comp1     Comp2 
          -------------+--------------------
              ES14_NEW |  -0.0068    0.7072 
                  PP10 |  -0.0035    0.7034 
              PC26_NEW |   0.4225    0.0665 
              PC21_NEW |   0.6449   -0.0082 
              PC20_NEW |   0.6368   -0.0245 
          ----------------------------------
      It seems that, after rotation, component 1 loads mainly on three of the five variables (PC26_NEW, PC21_NEW, PC20_NEW) and component 2 loads mainly on the other two (ES14_NEW, PP10). In this case, do you think I should just focus on the three variables associated with component 1 OR the two variables associated with component 2 and then, as you said, sum them up to get a new measure?


      • #4
        Without knowing anything about what you're doing conceptually, there's no way to advise. The eigenvalue > 1 criterion is generally regarded as extracting too many factors. There's no single rule, always regarded as correct, for determining how many factors to retain. You could install the minap program from SSC, which uses Velicer's minimum average partial technique. You could also use Horn's parallel analysis (ssc install paran), plot the eigenvalues, or run FA using ML and examine BIC values. But ultimately, effective use of these kinds of data reduction techniques depends on theoretical considerations and substantive interpretability.

        What would the loadings look like if you extracted only a single principal component? Also, factors with fewer than 3 items are generally considered poorly defined. You can't simply rely on the default eigenvalue > 1 criterion for determining how many factors to retain and interpret. But I'd again emphasize that you need to use your substantive knowledge of what the observed variables are intended to represent to guide how you use the results.

        I'd also add that there are situations in which traditional psychometric techniques are completely inappropriate: specifically, when the empirical variables are conceptualized as causes of, rather than effects of, a more general construct. Take stress, for example. Getting divorced causes stress, but so does getting married. Having a baby causes stress, and so does losing a loved one. There's no reason to think that the life events that cause stress would be associated with one another. Perhaps some or all of your observed measures are causal rather than effect indicators.
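
        A rough sketch of how these checks might be run on the five items from your pca call. paran and minap are user-written commands from SSC; I'm assuming each simply takes a variable list, so check help paran and help minap for the exact syntax and options. Each ssc install line only needs to be run once.

        Code:
        * Loadings when only a single principal component is extracted
        pca ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW, components(1)

        * Scree plot of the eigenvalues from the most recent pca
        screeplot

        * Horn's parallel analysis (user-written)
        ssc install paran
        paran ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW

        * Velicer's minimum average partial (user-written)
        ssc install minap
        minap ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW

        * Maximum likelihood factor analysis, then AIC/BIC by number of factors
        factor ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW, ml factors(2)
        estat factors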


        • #5
          Hi Brad, conceptually all I want to do is create a composite measure of social skills (the target outcome I am interested in) based on a set of related variables. The dataset has many variables representing different skill sets, but I only care about the social skill ones. The five variables above are, in my view, all related to social skills. For instance, PP10 represents the number of friends, and PC26 represents the ability to let others know what you need. Since they are all related to social skills, I was wondering whether there is a way to take all of them into consideration to create a composite variable. It was suggested that PCA could be applied in this situation, and that's why I am using it. If PCA is not the ideal method, what would you suggest I do? Thanks.


          • #6
            Again, without seeing all of the items and knowing something about the literature, it's very hard to give more than suggestive advice. Based on what you've indicated, I'm not at all sure there is any reason to think the specific behaviors representing social skills would necessarily be associated. Do you think there is a latent trait that causes the behaviors, or do the specific behaviors improve social skills? I'm not sure. If the latter, there is no necessary reason to think that traditional psychometric techniques would be of use in determining whether to construct an index. Mechanically, you can construct a summated index very easily using egen with rowtotal(), or alpha with the generate() option.
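
            For instance, a minimal sketch using the five items from your pca call. The new variable names social_sum, social_mean, and social_scale are just placeholders of mine. Whether all five items belong in one index is still the substantive question above.

            Code:
            * Summated index; rowtotal() treats missing values as 0
            egen social_sum = rowtotal(ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW)

            * Alternative: mean of the non-missing items in each row
            egen social_mean = rowmean(ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW)

            * Cronbach's alpha with item-level statistics; saves the scale as social_scale
            alpha ES14_NEW PP10 PC26_NEW PC21_NEW PC20_NEW, item generate(social_scale)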
