
  • PCA with correlated variables

    I'm analyzing data from around 10 survey questions focused on regulatory issues. I've noticed these questions are highly correlated (of course since they are all about regulation), and I'm concerned about the implications of simply summing the responses to create an index. My worry is that this approach might exaggerate differences between responses, especially since they're ordinal. For instance, the perceived difference between firms rated 4 and 5 on a regulation scale could be artificially inflated once you simply add them across all questions.

    I have two main questions:
    1. Is my concern about the potential for distortion by summing responses justified?
    2. Assuming my concern is valid, would Principal Component Analysis (PCA) be an appropriate method to address this issue? I've come across advice suggesting the removal of highly correlated variables, but I'm inclined to think that in this context, retaining them is necessary.
    I'd greatly appreciate any insights or recommendations. Thank you!

  • #2
    It depends on (the validity of) your assumptions and on what you want to achieve. My two cents:

    If the 10 variables (items) can be treated as indicators of the latent construct "regulation" (or are all intended to measure aspects of "regulation"), such that the responses to the single items can be assumed to be influenced by this latent (unobserved) construct, then you can use either confirmatory or exploratory factor analysis (FA; the latter is often confused with principal component analysis) to verify or explore whether the 10 items load on one factor (or perhaps on several correlated factors). In that case comparatively high correlations between the items are to be expected. For more details on factor analysis I would recommend: Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift's electric factor analysis machine. Understanding Statistics, 2(1), 13–43. https://doi.org/10.1207/S15328031US0201_02 (see also https://quantpsy.org/pubs/preacher_maccallum_2003.pdf).
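
    For the exploratory route, a minimal Python sketch (everything here is invented for illustration: the simulated data, and scikit-learn's FactorAnalysis as a stand-in for dedicated FA software):

    Code:
        import numpy as np
        from sklearn.decomposition import FactorAnalysis

        # Hypothetical data: 200 firms answering 10 correlated items on a 1-5 scale,
        # all driven by one latent "regulation" score
        rng = np.random.default_rng(0)
        latent = rng.normal(size=(200, 1))
        X = np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=(200, 10))), 1, 5)

        fa = FactorAnalysis(n_components=1).fit(X)  # one hypothesized factor
        print(fa.components_)                       # loadings of the 10 items on that factor

    High, roughly uniform loadings would support treating the items as indicators of a single scale.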

    However, some argue that ordinal variables (such as Likert items) should not be treated as (quasi-)continuous and that item response theory (IRT) is therefore more appropriate. But conclusions about the quality of scales based on either FA or IRT (or on Mokken scale analysis in the case of multidimensionality) are generally the same, especially if you start the FA (exploratory or confirmatory) from a matrix of polychoric correlations; see for example: Kappenburg-ten Holt, J. (2014). A comparison between factor analysis and item response theory modeling in scale analysis (University of Groningen). University of Groningen, Groningen, NL. Retrieved from https://research.rug.nl/files/130804...mw_TenHolt.pdf; Cho, E. (2023). Interchangeability between factor analysis, logistic IRT, and normal ogive IRT. Frontiers in Psychology, 14. https://doi.org/10.3389/fpsyg.2023.1267219; or the brief discussion at StackExchange.
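
    Once a polychoric correlation matrix has been estimated (with specialized software; standard Python libraries do not ship a routine for it), its eigenvalues already hint at the number of factors. A sketch with an invented matrix:

    Code:
        import numpy as np

        # Invented polychoric correlation matrix for 4 of the items; in practice,
        # estimate it with specialized software (e.g. R's psych::polychoric)
        R = np.array([[1.00, 0.70, 0.65, 0.60],
                      [0.70, 1.00, 0.68, 0.62],
                      [0.65, 0.68, 1.00, 0.66],
                      [0.60, 0.62, 0.66, 1.00]])

        eigenvalues = np.linalg.eigvalsh(R)[::-1]  # descending order
        print(eigenvalues)                         # one dominant eigenvalue suggests a single factor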

    But if your concern is "only" multicollinearity and you want to reduce dimensionality, PCA can help to identify variables that load on the same component, so that you can decide to drop some of them or to combine them into one common score.
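
    A quick check along those lines, again with invented data as in the FA sketch above:

    Code:
        import numpy as np
        from sklearn.decomposition import PCA

        # Same hypothetical response matrix as in the FA sketch
        rng = np.random.default_rng(0)
        latent = rng.normal(size=(200, 1))
        X = np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=(200, 10))), 1, 5)

        pca = PCA().fit(X)
        print(pca.explained_variance_ratio_)  # a dominant first component signals redundant items
        print(pca.components_[0])             # items weighting heavily on the same component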


    Last edited by Dirk Enzmann; 05 Apr 2024, 21:30.



    • #3
      Dirk Enzmann Thank you so much for your comment. Is my concern correct that simply adding up the survey questions might actually exaggerate differences in regulation?



      • #4
        I can't see why differences in ratings of objects (firms) would be exaggerated if you sum the items (variables) used for rating them: the sum divided by the number of ratings is the mean of the ratings, and differences between means should be smaller than differences between single items, even if the items are highly correlated. Each rating is associated with measurement error, and averaging the items should reduce that error if the items are intended to measure the same latent construct.

        But it may be that I am wrong here because I don't understand your data: Please give us an example of your data (see FAQ #12).
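
        To illustrate the error-reduction argument with an invented simulation (independent measurement errors are assumed; correlated errors would shrink the benefit):

        Code:
            import numpy as np

            # Invented example: 10 items, each equal to the firm's latent score plus
            # independent measurement error with unit variance
            rng = np.random.default_rng(1)
            true_score = rng.normal(size=5000)
            items = true_score[:, None] + rng.normal(size=(5000, 10))

            print(np.var(items[:, 0] - true_score))         # single item: error variance around 1.0
            print(np.var(items.mean(axis=1) - true_score))  # mean of 10 items: around 0.1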



        • #5
          Dirk Enzmann Oh, never mind... I think I was confused. This is what I meant: let's say the true regulation level of firm A and firm B is only 1 level apart (whatever that means). So on a scale of 5, firm A has 4 and firm B has 5. However, there are 10 survey questions about regulation, and obviously firm B will give a higher number than firm A throughout the whole survey. In the end, even if their regulation levels aren't that different, they might look further apart if we simply add the answers up. But I think once we average them, it won't be a problem. In addition, even a simple additive model might be fine, because the relative difference won't change.
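
          In numbers (invented, with a one-point gap on every item):

          Code:
              # Hypothetical ratings: firm B is one point above firm A on each of 10 items
              firm_A = [4] * 10
              firm_B = [5] * 10

              print(sum(firm_B) - sum(firm_A))            # 10: the one-point gap scaled by 10 items
              print(sum(firm_B) / 10 - sum(firm_A) / 10)  # 1.0: averaging keeps the original scale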



          • #6
            As far as I understand your reasoning, the issue is the meaningfulness of the units of measurement. Generally (at least in psychology), ratings have no meaningful measurement units. See: Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The problem of units and the circumstance for POMP. Multivariate Behavioral Research, 34, 315–346. https://www.tandfonline.com/doi/abs/...27906MBR3403_2
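
            For reference, the POMP ("percent of maximum possible") score proposed there rescales a raw score to a 0-100 range, which removes the arbitrary units; a minimal sketch:

            Code:
                def pomp(score, minimum, maximum):
                    """Percent of maximum possible score (Cohen et al., 1999)."""
                    return 100 * (score - minimum) / (maximum - minimum)

                # A summed index over 10 items rated 1-5 ranges from 10 to 50
                print(pomp(40, 10, 50))  # 75.0
                print(pomp(50, 10, 50))  # 100.0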

