
  • Large R-squared & collinearity

    I have a panel data set for 31 Chinese provinces, with one record for each of six 5-year periods (1990-1995, 1995-2000, 2000-2005, 2005-2010, 2010-2015, 2015-2020). Each record includes temperature, precipitation, and the outmigration rate.

    Question 1: I would like to regress the outmigration rate on temperature and precipitation to find out whether outmigration is correlated with these climate factors. In particular, I would like to see whether the effects differ for provinces of different income levels. I divided the provinces into three categories based on their GDP: poor (dummy g1), middle-income (dummy g2), and rich (dummy g3). Outmigration rate, temperature, and precipitation are all in natural-log form.

    Code:
    reghdfe lnomr lntemp lnprecip, a(province year) cl(province)
    reghdfe lnomr lntemp c.lntemp#i.g1 c.lntemp#i.g2  lnprecip  c.lnprecip#i.g1 c.lnprecip#i.g2, a(province year) cl(province)
    The two regressions are run without and with the income-level dummies. Both R-squared values turned out to be around 0.85 to 0.9. Are they too high? What could possibly be going wrong, and what method should I use to fix it? (I suspect it may be because of common time trends among temperature, precipitation, and the outmigration rate, but I don't know how to test for that.)

    Question 2:
    Furthermore, I am testing whether a province that depends more on agriculture is more likely to be affected by temperature and precipitation. Thus, I run this regression:
    Code:
    reghdfe lnomr c.lntemp#i.g1 c.lntemp#i.g1#i.agri c.lntemp#i.g2 c.lntemp#i.g2#i.agri c.lntemp#i.g3 c.lntemp#i.g3#i.agri c.lnprecip#i.g1 c.lnprecip#i.g1#i.agri c.lnprecip#i.g2 c.lnprecip#i.g2#i.agri c.lnprecip#i.g3 c.lnprecip#i.g3#i.agri, a(province year) cl(province)
    Here, agri is a dummy that equals one if a province is defined as agriculture-dependent. I expect the results to work as follows: the coefficient on c.lntemp#i.g1 tells me the temperature effect for poor, non-agriculture-dependent provinces; the coefficient on c.lntemp#i.g1#i.agri tells me the additional effect temperature has on outmigration for poor provinces that are also agriculture-dependent; and so on. However, I get this result:
    Code:
    note: 1.g2#c.lntemp omitted because of collinearity
    note: 0b.g2#0b.agri#co.lntemp omitted because of collinearity
    note: 1o.g2#0b.agri#co.lntemp omitted because of collinearity
    note: 0b.g3#c.lntemp omitted because of collinearity
    note: 1.g3#c.lntemp omitted because of collinearity
    note: 0b.g3#0b.agri#co.lntemp omitted because of collinearity
    note: 1.g2#c.lnprecip omitted because of collinearity
    note: 0b.g2#0b.agri#co.lnprecip omitted because of collinearity
    note: 1o.g2#0b.agri#co.lnprecip omitted because of collinearity
    note: 0b.g3#c.lnprecip omitted because of collinearity
    note: 1.g3#c.lnprecip omitted because of collinearity
    note: 0b.g3#0b.agri#co.lnprecip omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =        185
    Absorbing 2 HDFE groups                           F(  10,     30) =       8.83
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.8951
                                                      Adj R-squared   =     0.8612
                                                      Within R-sq.    =     0.1293
    Number of clusters (province) =         31        Root MSE        =     0.2505
    
                                        (Std. err. adjusted for 31 clusters in province)
    ------------------------------------------------------------------------------------
                       |               Robust
                 lnomr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
           g1#c.lntemp |
                    0  |  -7.011179   6.152577    -1.14   0.263    -19.57642     5.55406
                    1  |  -7.021618   8.863487    -0.79   0.434    -25.12327    11.08004
                       |
      g1#agri#c.lntemp |
                  0 1  |  -.1939885   .3708059    -0.52   0.605    -.9512751    .5632981
                  1 1  |  -2.915903   .4769751    -6.11   0.000    -3.890016    -1.94179
                       |
           g2#c.lntemp |
                    0  |   8.655452   6.368943     1.36   0.184    -4.351666    21.66257
                    1  |          0  (omitted)
                       |
      g2#agri#c.lntemp |
                  0 1  |          0  (omitted)
                  1 1  |          0  (omitted)
                       |
           g3#c.lntemp |
                    0  |          0  (omitted)
                    1  |          0  (omitted)
                       |
      g3#agri#c.lntemp |
                  0 1  |          0  (omitted)
                  1 1  |          0  (empty)
                       |
         g1#c.lnprecip |
                    0  |    .066395   .5887866     0.11   0.911    -1.136068    1.268858
                    1  |  -3.429277   1.359237    -2.52   0.017    -6.205209   -.6533449
                       |
    g1#agri#c.lnprecip |
                  0 1  |    .139488   .2353011     0.59   0.558    -.3410609    .6200369
                  1 1  |   1.711133   .2695761     6.35   0.000     1.160586    2.261681
                       |
         g2#c.lnprecip |
                    0  |   1.649785   .7958186     2.07   0.047     .0245069    3.275064
                    1  |          0  (omitted)
                       |
    g2#agri#c.lnprecip |
                  0 1  |          0  (omitted)
                  1 1  |          0  (omitted)
                       |
         g3#c.lnprecip |
                    0  |          0  (omitted)
                    1  |          0  (omitted)
                       |
    g3#agri#c.lnprecip |
                  0 1  |          0  (omitted)
                  1 1  |          0  (empty)
                       |
                 _cons |   7.405652    24.7511     0.30   0.767    -43.14283    57.95414
    ------------------------------------------------------------------------------------
    Why is there collinearity here? I haven't a clue. Also, how can I get Stata to show only the result for 1.g1#c.lntemp, and not 0.g1#c.lntemp, when I export the results using esttab? (Currently both are reported, which becomes annoying because there are a lot of rows to delete when the regression contains many dummies.)


    Could anyone help me with these two questions, please? I would really appreciate it!!
    Last edited by Kehan Yan; 19 Apr 2024, 03:55.

  • #2
    The collinearity problem arises because you are using the g1-g3 variables incorrectly. These variables are indicators for mutually exclusive and exhaustive conditions: in every observation you have g1 + g2 + g3 = 1, because exactly one of the three variables has value 1 and the other two have value 0. When you represent a categorical variable that has n levels, you use n-1 indicators ("dummies"), not n: one level has to be omitted as the base level. What you really should do is get rid of all three of those variables and replace them with a single variable g = 1 for poor, 2 for middle income, and 3 for rich. Then your regressions become:
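
    One way to build that single categorical variable from the existing indicators (a sketch, assuming g1-g3 are mutually exclusive 0/1 dummies as described above):

    Code:
    * collapse the three dummies into one categorical variable
    generate byte g = cond(g1 == 1, 1, cond(g2 == 1, 2, 3))
    label define glbl 1 "poor" 2 "middle income" 3 "rich"
    label values g glbl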

    Code:
    reghdfe lnomr lntemp lnprecip, a(province year) cl(province) // UNCHANGED FROM BEFORE
    reghdfe lnomr c.(lntemp lnprecip)##i.g, a(province year) cl(province)
    reghdfe lnomr c.(lntemp lnprecip)##i.g##i.agri, a(province year) cl(province)
    After the regressions, if you wish to test, for example, whether the lntemp effect differs among the income categories in your second regression, -testparm lntemp#i.g- will do that. Analogous use of -testparm- following the third regression will enable you to test whether agricultural dependence modifies the other effects.
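
    For instance, a joint test after the third regression might look like this (a sketch; the exact factor-variable spelling of each term can be copied from the regression output):

    Code:
    * do the climate effects differ by agricultural dependence across income groups?
    testparm i.g#i.agri#c.lntemp i.g#i.agri#c.lnprecip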



    • #3
      Clyde Schechter Thank you very much for your reply! The parenthesis binding makes my code more succinct and easier to read, and -testparm- turns out to be a very useful new tool for me!
      However, I still have some problems with collinearity in the third regression. My result looks like this:
      Code:
      reghdfe lnomr c.(lntemp lnprecip)##i.rank1##i.agri, a(province year) cl(province)
      (MWFE estimator converged in 3 iterations)
      note: 2bn.rank1 is probably collinear with the fixed effects (all partialled-out values are close to 
      > zero; tol = 1.0e-09)
      note: 3bn.rank1 is probably collinear with the fixed effects (all partialled-out values are close to 
      > zero; tol = 1.0e-09)
      warning: missing F statistic; dropped variables due to collinearity or too few clusters
      
      HDFE Linear regression                            Number of obs   =        185
      Absorbing 2 HDFE groups                           F(  12,     30) =          .
      Statistics robust to heteroskedasticity           Prob > F        =          .
                                                        R-squared       =     0.8965
                                                        Adj R-squared   =     0.8610
                                                        Within R-sq.    =     0.1410
      Number of clusters (province) =         31        Root MSE        =     0.2506
      
                                             (Std. err. adjusted for 31 clusters in province)
      ---------------------------------------------------------------------------------------
                            |               Robust
                      lnomr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      ----------------------+----------------------------------------------------------------
                     lntemp |   2.379587    7.64739     0.31   0.758    -13.23847    17.99764
                   lnprecip |  -1.875438   1.117377    -1.68   0.104    -4.157426    .4065501
                            |
                      rank1 |
                         2  |          0  (omitted)
                         3  |          0  (omitted)
                            |
             rank1#c.lntemp |
                         2  |  -8.908217   6.142866    -1.45   0.157    -21.45362    3.637189
                         3  |  -.3120696   6.966453    -0.04   0.965    -14.53947    13.91533
                            |
           rank1#c.lnprecip |
                         2  |   1.914983   1.173702     1.63   0.113    -.4820363    4.312003
                         3  |   3.565657   1.281518     2.78   0.009     .9484472    6.182867
                            |
                     1.agri |   .8688549   1.008739     0.86   0.396    -1.191265    2.928975
                            |
              agri#c.lntemp |
                         1  |  -3.689547   .9985237    -3.70   0.001    -5.728805    -1.65029
                            |
            agri#c.lnprecip |
                         1  |   2.029371   .4617998     4.39   0.000      1.08625    2.972492
                            |
                 rank1#agri |
                       2 1  |   6.032314   .9159596     6.59   0.000     4.161675    7.902953
                       3 1  |          0  (empty)
                            |
        rank1#agri#c.lntemp |
                       2 1  |   -.007151   .9850266    -0.01   0.994    -2.018844    2.004542
                       3 1  |          0  (empty)
                            |
      rank1#agri#c.lnprecip |
                       2 1  |  -.8240062   .4586232    -1.80   0.082     -1.76064    .1126274
                       3 1  |          0  (empty)
                            |
                      _cons |   5.527616   25.68912     0.22   0.831    -46.93657     57.9918
      ---------------------------------------------------------------------------------------
      
      Absorbed degrees of freedom:
      -----------------------------------------------------+
       Absorbed FE | Categories  - Redundant  = Num. Coefs |
      -------------+---------------------------------------|
          province |        31          31           0    *|
              year |         6           1           5     |
      -----------------------------------------------------+
      * = FE nested within cluster; treated as redundant for DoF computation
      Here, rank1 is the name I use for g (rank1 = 1 for poor, = 2 for middle, = 3 for rich). I understand why rank1 levels 2 and 3 are collinear with the fixed effects. However, if you look at the rows for rank1#agri#c.lntemp and rank1#agri#c.lnprecip, I don't know why the second rows for them are empty. Moreover, I am not sure whether I should include rank1#agri, agri, and agri#c.(lntemp lnprecip) in my regression.

      I also tried this code as well:
      Code:
      reghdfe lnomr c.lntemp c.lntemp#i.agri c.lntemp#i.g1 c.lntemp#i.g1#i.agri c.lntemp#i.g2 c.lntemp#i.g2#i.agri lnprecip c.lnprecip#i.agri c.lnprecip#i.g1 c.lnprecip#i.g1#i.agri c.lnprecip#i.g2 c.lnprecip#i.g2#i.agri  , a(province year) cl(province)
      I was hoping that, by deleting g3, I would get the temperature effect for rich, non-agriculture-dependent provinces from the coefficient on c.lntemp, and that the coefficient on c.lntemp#i.g1#i.agri would become the additional temperature effect for poor, agriculture-dependent provinces compared with rich, non-agriculture-dependent ones.
      However, the result omits g2#agri#c.lnprecip. Why is there collinearity there? I assume the reason may be similar to what you have said, but would you mind explaining it again in more detail, as I am still confused about it?
      Code:
      note: 1.g2#c.lntemp omitted because of collinearity
      note: 0b.g2#0b.agri#co.lntemp omitted because of collinearity
      note: 1o.g2#0b.agri#co.lntemp omitted because of collinearity
      note: 1.g2#c.lnprecip omitted because of collinearity
      note: 0b.g2#0b.agri#co.lnprecip omitted because of collinearity
      note: 1o.g2#0b.agri#co.lnprecip omitted because of collinearity
      
      HDFE Linear regression                            Number of obs   =        185
      Absorbing 2 HDFE groups                           F(  10,     30) =       8.83
      Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                        R-squared       =     0.8951
                                                        Adj R-squared   =     0.8612
                                                        Within R-sq.    =     0.1293
      Number of clusters (province) =         31        Root MSE        =     0.2505
      
                                          (Std. err. adjusted for 31 clusters in province)
      ------------------------------------------------------------------------------------
                         |               Robust
                   lnomr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------------+----------------------------------------------------------------
             g1#c.lntemp |
                      0  |  -7.011179   6.152577    -1.14   0.263    -19.57642     5.55406
                      1  |  -7.021618   8.863487    -0.79   0.434    -25.12327    11.08004
                         |
        g1#agri#c.lntemp |
                    0 1  |  -.1939885   .3708059    -0.52   0.605    -.9512751    .5632981
                    1 1  |  -2.915903   .4769751    -6.11   0.000    -3.890016    -1.94179
                         |
             g2#c.lntemp |
                      0  |   8.655452   6.368943     1.36   0.184    -4.351666    21.66257
                      1  |          0  (omitted)
                         |
        g2#agri#c.lntemp |
                    0 1  |          0  (omitted)
                    1 1  |          0  (omitted)
                         |
           g1#c.lnprecip |
                      0  |    .066395   .5887866     0.11   0.911    -1.136068    1.268858
                      1  |  -3.429277   1.359237    -2.52   0.017    -6.205209   -.6533449
                         |
      g1#agri#c.lnprecip |
                    0 1  |    .139488   .2353011     0.59   0.558    -.3410609    .6200369
                    1 1  |   1.711133   .2695761     6.35   0.000     1.160586    2.261681
                         |
           g2#c.lnprecip |
                      0  |   1.649785   .7958186     2.07   0.047     .0245069    3.275064
                      1  |          0  (omitted)
                         |
      g2#agri#c.lnprecip |
                    0 1  |          0  (omitted)
                    1 1  |          0  (omitted)
                         |
                   _cons |   7.405652    24.7511     0.30   0.767    -43.14283    57.95414
      ------------------------------------------------------------------------------------
      Cheers!!



      • #4
        With your third regression, you are losing observations because it seems there are no clusters for which rank1 == 3 & agri == 1. So your rank1#agri interaction is incomplete, and that is the cause of the various empty outputs. The missing model F statistic is probably attributable to some cluster that has only one observation. It is not a problem anyway: the model F statistic just tests the hypothesis that all of the coefficients are jointly 0, which is almost never of any interest.
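
        You can confirm the empty cell directly with a simple cross-tabulation (a quick sketch):

        Code:
        * check whether any observations have rank1 == 3 & agri == 1
        tabulate rank1 agri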

        As for the second set of outputs you show, you have revised your model so that rank1 == 3 is the "base case." For reasons I don't grasp, you have reverted to using g1 and g2 instead of rank1. You can use ib3.rank1 to use the rank1 variable and also have 3 be the base for that variable. But your analysis is now running into trouble because, as previously noted, this base case actually doesn't exist for the interactions with agri: there are no observations with rank1 == 3 and agri == 1. It is further complicated by the fact that this model fails to include the uninteracted variables g1 and g2 themselves. So it's going to be a nightmare to interpret even if you get it running the way you hope to. I would stick to the models I proposed in #2.
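
        Concretely, applying the ib3. prefix to the model from #2 would look like this (a sketch; as noted, the rank1 == 3 & agri == 1 cell will still be empty in this data):

        Code:
        reghdfe lnomr c.(lntemp lnprecip)##ib3.rank1##i.agri, a(province year) cl(province)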

        Your maneuverings are understandable: you want to get more information out of your results. But Stata doesn't withhold information from you. The reason you're not getting that information is because it isn't in the data, and no amount of coding machinations will change that fact. If you do succeed in coding some model that, with this data, gives you all those outputs you're struggling with, I can promise you that the model will, in fact, be just wrong. If it is possible to get more data that fills in the gaps, that's what will help you.



        • #5
          I get it now. I didn't know that I could use ib3.rank1 instead of creating a dummy for each group, but now I do. These answers are really helpful, thank you!
