  • Creating a composite variable with 3 highly correlated variables with different scales

    Hi all,

    After days of searching online without success, I thought I'd put my question here on Statalist to see if anyone could help me out.

    I am running a panel data regression for 64 countries and 23 years. My independent variable is 'socio-economic development', and my dependent variable is Individualism (data gathered from the World Values Survey and aggregated to the country level).

    I want to create a composite measure for the independent variable 'socio-economic development'. The three components are (1) log of GDP per person employed, (2) urbanization rate in % and (3) life expectancy in years. As all these variables are measured on different scales, I am not sure how to make a composite. Should I log transform them all before adding them up?

    All help is really appreciated !! (and yes, I am a rookie as my question might suggest).

    Thank you!

  • #2
    Those three variables are on quite different scales and presumably have different distribution shapes. You are already logging GDP per person, which seems fair enough, but I wouldn't log urbanization or life expectancy without a very good specific reason.

    I would certainly not add those variables, even if they were all logged, as that just imparts arbitrary weights to the sum depending on the original units of measurement.
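
    A minimal sketch of the point about arbitrary weights, in plain Python. The two countries and all their values are hypothetical, made up purely for illustration. It shows that merely switching urbanization from percent to a proportion reverses which country has the larger raw sum:

```python
# Hypothetical values for two made-up countries (illustration only, not real data).
countries = {
    "A": {"log_gdp": 11.0, "urban_pct": 60.0, "life_exp_years": 80.0},
    "B": {"log_gdp": 9.0, "urban_pct": 85.0, "life_exp_years": 62.0},
}


def raw_sum(row, urban_scale=1.0):
    """Add the three components, rescaling urbanization by urban_scale."""
    return row["log_gdp"] + urban_scale * row["urban_pct"] + row["life_exp_years"]


def ranking(urban_scale):
    """Return country names ordered from highest to lowest raw sum."""
    return sorted(
        countries,
        key=lambda name: raw_sum(countries[name], urban_scale),
        reverse=True,
    )


rank_percent = ranking(urban_scale=1.0)      # urbanization in percent (0 to 100)
rank_proportion = ranking(urban_scale=0.01)  # urbanization as a proportion (0 to 1)

print("urbanization in percent:   ", rank_percent)
print("urbanization as proportion:", rank_proportion)
```

    Nothing about the countries changes between the two rankings; only the unit of one component does, which is why a raw sum is hard to defend.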

    People wanting a composite quite often use principal component analysis (PCA) and then take the first PC, but I am not persuaded that that really helps scientifically. There is still too much arbitrariness, and it's hard to get beyond the mushiness of "that predictor is a composite". It's hard to develop this point if it doesn't convince, except to call in the evidence of all the texts and papers that don't do this!
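
    For readers who want to see what "PCA and then the first PC" amounts to, here is a rough sketch in plain Python. The six rows of data are hypothetical, and the power-iteration routine is just one simple way to get the leading eigenvector of the correlation matrix; a statistics package would normally do this for you. Note that the sign of a principal component is arbitrary, so the code fixes it by convention:

```python
import math

# Hypothetical rows: (log GDP per person employed, urbanization %, life expectancy).
data = [
    (9.1, 45.0, 63.0),
    (9.8, 58.0, 69.5),
    (10.4, 66.0, 72.0),
    (10.9, 74.0, 76.5),
    (11.2, 81.0, 79.0),
    (11.5, 79.0, 81.5),
]


def standardize(rows):
    """Convert each column to z-scores (mean 0, sample SD 1)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(k)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix computed from already standardized rows."""
    n, k = len(z), len(z[0])
    return [
        [sum(row[a] * row[b] for row in z) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


def first_component(corr, iterations=500):
    """Leading eigenvector and eigenvalue of corr by power iteration."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)
    )
    if sum(v) < 0:  # sign is arbitrary; make the loadings sum positive
        v = [-x for x in v]
    return v, eigenvalue


z = standardize(data)
corr = correlation_matrix(z)
loadings, eigenvalue = first_component(corr)
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]
share = eigenvalue / len(loadings)  # proportion of total variance on PC1

print("loadings:", [round(l, 3) for l in loadings])
print("share of variance on PC1:", round(share, 3))
```

    The `scores` list is the composite. Its weights come from the correlations in this particular sample, which is part of the arbitrariness mentioned above.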

    You have a big enough dataset to use all three variables directly as predictors. I think the results will be easier to think about. If the predictors really are highly correlated, your model results should signal that one predictor should be omitted, but I doubt that will be the case.

    Conversely, if you are under instruction that you must use a single composite, then you will see that I think that the instructions are wrong-headed, statistically or even economically. (I am a geographer by background, but have read enough papers of this kind to have some opinions on what helps.)



    • #3
      Thank you so much for your helpful answer, Nick.

      I am not under direct instruction to create a composite; however, the idea was brought up a while ago.

      Would you suggest using VIFs to gauge how strongly the variables are correlated, in order to decide whether I can add them separately to the regression?

      Again, thank you very much!

      Last edited by Wessel Ster; 03 May 2022, 06:07.



      • #4
        Nothing against VIFs. I would always start with a correlation matrix and scatter plot matrix, but much depends on the model(s) you are using given variations between countries and between years.
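
        As a rough illustration of the two diagnostics mentioned here, the sketch below computes a correlation matrix and variance inflation factors in plain Python. The data are hypothetical, and the calculation is pooled: it ignores the country and year structure, so treat it as a first look only. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix, i.e. 1 / (1 - R_j^2):

```python
import math

# Hypothetical pooled observations (illustration only, not real data).
log_gdp = [9.1, 9.8, 10.4, 10.9, 11.2, 11.5, 9.5, 10.1]
urban = [45.0, 62.0, 60.0, 78.0, 70.0, 85.0, 55.0, 52.0]
life_exp = [63.0, 66.0, 74.0, 73.0, 80.0, 79.0, 70.0, 68.0]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations between the given columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF for each column: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


predictors = [log_gdp, urban, life_exp]
corr = correlation_matrix(predictors)
vif_values = vifs(predictors)

for name, row in zip(["log_gdp", "urban", "life_exp"], corr):
    print(name.ljust(9), [round(r, 3) for r in row])
print("VIFs:", [round(v, 2) for v in vif_values])
```

        A scatter plot matrix is still worth looking at alongside these numbers, since a correlation or VIF cannot show nonlinearity or outliers.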



        • #5
          In my experience, I've never dealt with variables that were so collinear that it made a very big difference. The three you have are obviously related, which is a good thing to a degree, but not so strongly that you'd need to be worried.

          I use PCA in a command I've written, and then I algorithmically select the top few singular values using cross-validation. I normalize all of the predictors, so outliers and stuff aren't a giant problem.


          Either way though, as Nick says, you likely don't need to do this, even though it's likely possible. PCA is useful in certain circumstances, but making an index is rarely needed (again, in my experience; development econ folks may object) for a paper to be sensibly done.

