
  • Should we control for the sub-items of a composite variable?

    Dear Statalist:

    I want to study the effect of X on Y.
    X is a composite variable, e.g. X = A*exp(B) - C.
    Should I control for A, B, and C when running the regression?
    On one hand, according to "A Crash Course in Good and Bad Controls", A, B, and C are "common causes" of X and Y.
    On the other hand, controlling for A, B, and C may create multicollinearity problems.

    So, in the regression, should we also control for the effects of A, B, and C separately?
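
    For concreteness, here is a minimal sketch of the two specifications in question. Everything below is hypothetical and for illustration only: the variables A, B, C, the assumed data-generating process for y, and the sample size are not taken from any real application.
    Code:
    * hypothetical data for illustration only
    clear
    set obs 500
    set seed 12345
    generate A = rnormal(1, 0.2)
    generate B = rnormal(0, 0.3)
    generate C = rnormal(0, 0.5)
    generate X = A*exp(B) - C       // the composite of interest
    generate y = 2*X + rnormal()    // assumed data-generating process

    * option 1: the composite alone
    regress y c.X

    * option 2: also control for the components A, B, and C
    regress y c.X c.A c.B c.C
    estat vif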

  • #2
    Qingfeng:
    the risk of your approach is regression overfitting, as in the following toy example (note the -rep78- and -foreign- predictors entered first as categorical and then as continuous):
    Code:
    . use "C:\Program Files\Stata18\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . gen wanted=( rep78*foreign )
    
    
    . regress wanted i.rep78 i.foreign
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(5, 63)        =    821.57
           Model |  274.400362         5  54.8800725   Prob > F        =    0.0000
        Residual |  4.20833333        63  .066798942   R-squared       =    0.9849
    -------------+----------------------------------   Adj R-squared   =    0.9837
           Total |  278.608696        68   4.0971867   Root MSE        =    .25845
    
    ------------------------------------------------------------------------------
          wanted | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           rep78 |
              2  |  -1.03e-14   .2043265    -0.00   1.000     -.408314     .408314
              3  |   -.087963   .1889489    -0.47   0.643    -.4655473    .2896213
              4  |   .0601852   .1974852     0.30   0.762    -.3344575    .4548279
              5  |   .9166667   .2110276     4.34   0.000     .4949618    1.338372
                 |
         foreign |
        Foreign  |    3.87963   .0869457    44.62   0.000     3.705883    4.053377
           _cons |   1.02e-14   .1827552     0.00   1.000    -.3652072    .3652072
    ------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
           rep78 |
              2  |      4.42    0.226230
              3  |      9.06    0.110343
              4  |      7.77    0.128738
              5  |      6.16    0.162226
       1.foreign |      1.65    0.604870
    -------------+----------------------
        Mean VIF |      5.81
    
    . regress wanted c.rep78 c.foreign
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(2, 66)        =   1139.65
           Model |  270.768288         2  135.384144   Prob > F        =    0.0000
        Residual |  7.84040724        66  .118794049   R-squared       =    0.9719
    -------------+----------------------------------   Adj R-squared   =    0.9710
           Total |  278.608696        68   4.0971867   Root MSE        =    .34467
    
    ------------------------------------------------------------------------------
          wanted | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           rep78 |   .2377382   .0523998     4.54   0.000     .1331186    .3423578
         foreign |   3.985004   .1119138    35.61   0.000     3.761561    4.208447
           _cons |  -.7181674   .1659245    -4.33   0.000    -1.049446   -.3868885
    ------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
         foreign |      1.54    0.649255
           rep78 |      1.54    0.649255
    -------------+----------------------
        Mean VIF |      1.54
    
    .
    It is noteworthy that, in both cases, the VIF is not alarming.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      It is by no means given that A, B, and C are common causes of X and Y. This would require that A, B, and C are related to Y independently of X. This is mainly a theoretical question, as it is hard (in practice, impossible) to figure that out empirically:

      The only reason that a model \(y=\beta_0 + \beta_1 A + \beta_2 B + \beta_3 C + \beta_4 X + \varepsilon \) is identified is the non-linear relationship between A, B, C, and X. Relying only on the functional form to identify a model is far too fragile for my taste. I am not alone in that: the common strategy is to leave A, B, and C out and thus assume that all of the effect of A, B, and C on Y runs through X. If you are really worried about that, then you should find a different way to identify the model: maybe a direct measurement of X that is independent of A, B, and C, or maybe an instrumental variable.
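
      As a minimal sketch of the instrumental-variable route mentioned above (the instrument z is purely hypothetical, as are X and y; nothing here comes from the original post):
      Code:
      * hypothetical instrument z for the composite X
      ivregress 2sls y (X = z), first
      estat endogenous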
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------

      • #4
        I agree with Maarten that this is mainly a theoretical question.

        The model \(y= b\left(Ae^{B}-C\right)\) imposes a linear relationship between the composite and the outcome. The model \(y=b_{1}A+b_{2}B+b_{3}C+b_{4}\left(Ae^{B}-C\right)\) implies, among other things, that the association between \(A\) and \(y\) depends on the values of \(B\) (and vice versa). But the (conditional) main effects seem misspecified. Anyway, you need to figure out the data-generating process that you wish to approximate with your model.
        Last edited by daniel klein; 08 Oct 2024, 04:10. Reason: I am not entirely sure whether the conditional main effects are indeed misspecified.
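
        In Stata terms, the two specifications above could be fitted as follows. This is a sketch only, reusing the hypothetical variables A, B, C, X, and y from the sketch under #1:
        Code:
        * y = b*(A*exp(B) - C): the composite enters proportionally, no intercept
        regress y c.X, noconstant

        * y = b1*A + b2*B + b3*C + b4*(A*exp(B) - C): components plus composite
        regress y c.A c.B c.C c.X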

        • #5
          Originally posted by Maarten Buis:
          It is by no means given that A, B, and C are common causes for X and Y. This would require that A, B, and C are related to Y independent of X.
          Perhaps more fundamentally, from the original post it is not even clear that X exists independently of A, B, and C. Think of the body mass index: it is probably quite controversial, and I believe nonsensical, to think of weight and height as "causes" of BMI.

          Edit: For those who are more interested, my statement relates closely to the discussion of reflective vs. formative models; I leave it at these keywords.
          Last edited by daniel klein; 08 Oct 2024, 04:35.
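
          To make the BMI illustration concrete, a small sketch with entirely made-up data (BMI is weight in kilograms divided by height in metres squared, so it is defined by, rather than caused by, its components):
          Code:
          * hypothetical data; BMI is defined by its components
          clear
          set obs 200
          set seed 2024
          generate height = rnormal(1.7, 0.1)   // metres
          generate weight = rnormal(70, 12)     // kilograms
          generate bmi = weight/height^2
          regress bmi c.weight c.height         // the association is definitional, not causal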
