Inconsistent results with -areg-, -xtreg- and factor variables.

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

21 Dec 2018, 12:11

Hi Statalist.

I am estimating a model with fixed effects and factor variables. I can estimate it using -areg-, -reg-, -xtreg- or -reghdfe-. With every command except -reghdfe-, I get different results when I use factor-variable notation directly in the regression than when I generate the variables beforehand and then include them in the regression. These are differences in coefficients within the same estimation method; differences in standard errors across methods are to be expected.

Here's the code I am using:

Code:

gen i0 = 1.intent#0.defier#1.m0
gen i1 = l1.1.intent#l1.0.defier#1.m1
gen i2 = l2.1.intent#l2.0.defier#1.m2
gen i3 = l3.1.intent#l3.0.defier#1.m3


foreach method in areg reghdfe {

# delimit ;
    `method' ${outcome}

    1.intent#0.defier#1.m0
    l1.1.intent#l1.0.defier#1.m1
    l2.1.intent#l2.0.defier#1.m2
    l3.1.intent#l3.0.defier#1.m3

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
    l2.c.running#l2.0.defier#1.m2
    l3.c.running#l3.0.defier#1.m3

    0.defier#m0
    0.l1.defier#m1
    0.l2.defier#m2
    0.l3.defier#m3
    
    i.year
    
            
    wave0 l1wave0 l2wave0 l3wave0
    l(4/6).rdtreat
    l(4/6).rdtreat2    
    
                        
    if inrange(year,1996,2006)
    & insample                
    , cluster(cty) absorb(cty);
# delimit cr
est store `method'1



# delimit ;
    `method' ${outcome}

    i0
    i1
    i2
    i3

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
    l2.c.running#l2.0.defier#1.m2
    l3.c.running#l3.0.defier#1.m3

    0.defier#m0
    0.l1.defier#m1
    0.l2.defier#m2
    0.l3.defier#m3
    
    i.year
    
            
    wave0 l1wave0 l2wave0 l3wave0
    l(4/6).rdtreat
    l(4/6).rdtreat2    
    
                        
    if inrange(year,1996,2006)
    & insample                
    , cluster(cty) absorb(cty);
# delimit cr
est store `method'2
}

The code for -xtreg- is similar; I am using the fe option. Method 1 uses the factor variables in the regression; method 2 creates them beforehand. Here's a table with the results. As you can see, the areg1 and areg2 columns differ (as do xtreg1 and xtreg2), but the reghdfe1 and reghdfe2 columns are the same.

Code:

. est tab areg* reghdfe* xtreg*, b(%9.3f)

--------------------------------------------------------------------------------------
    Variable |   areg1       areg2     reghdfe1    reghdfe2     xtreg1      xtreg2    
-------------+------------------------------------------------------------------------
      intent#|
   defier#m0 |
      1 0 1  |     0.052                   0.014                   0.052              
             |
    L.intent#|
 L.defier#m1 |
      1 0 1  |     0.164                   0.210                   0.164              
             |
   L2.intent#|
L2.defier#m2 |
      1 0 1  |    -0.489                  -0.566                  -0.489              
             |
   L3.intent#|
L3.defier#m3 |
      1 0 1  |    -0.445                  -0.382                  -0.445              
             |
   defier#m0#|
   c.running |
        0 1  |    -0.010      -0.010      -0.010      -0.010      -0.010      -0.010  
             |
 L.defier#m1#|
  cL.running |
        0 1  |     0.005       0.006       0.006       0.006       0.005       0.006  
             |
         L2. |
      defier#|
          m2#|
 cL2.running |
        0 1  |    -0.008      -0.010      -0.010      -0.010      -0.008      -0.010  
             |
         L3. |
      defier#|
          m3#|
 cL3.running |
        0 1  |     0.009       0.009       0.009       0.009       0.009       0.009  
             |
   defier#m0 |
        0 1  |     0.265       0.276       0.276       0.276       0.265       0.276  
             |
 L.defier#m1 |
        0 1  |    -0.255      -0.304      -0.304      -0.304      -0.255      -0.304  
             |
L2.defier#m2 |
        0 1  |     0.290       0.364       0.364       0.364       0.290       0.364  
             |
L3.defier#m3 |
        0 1  |    -0.178      -0.217      -0.217      -0.217      -0.178      -0.217  
             |
        year |
       1997  |    -0.653      -0.657      -0.657      -0.657      -0.653      -0.657  
       1998  |    -0.798      -0.801      -0.801      -0.801      -0.798      -0.801  
       1999  |    -1.030      -1.031      -1.031      -1.031      -1.030      -1.031  
       2000  |    -0.489      -0.491      -0.491      -0.491      -0.489      -0.491  
       2001  |     1.685       1.680       1.680       1.680       1.685       1.680  
       2002  |     2.961       2.957       2.957       2.957       2.961       2.957  
       2003  |     2.468       2.461       2.461       2.461       2.468       2.461  
       2004  |     1.608       1.602       1.602       1.602       1.608       1.602  
       2005  |     1.369       1.366       1.366       1.366       1.369       1.366  
       2006  |     0.954       0.947       0.947       0.947       0.954       0.947  
             |
       wave0 |     0.928       0.917       0.917       0.917       0.928       0.917  
     l1wave0 |     0.898       0.908       0.908       0.908       0.898       0.908  
     l2wave0 |     0.572       0.556       0.556       0.556       0.572       0.556  
     l3wave0 |     0.605       0.606       0.606       0.606       0.605       0.606  
             |
     rdtreat |
         L4. |    -0.828      -0.836      -0.836      -0.836      -0.828      -0.836  
         L5. |     0.063       0.054       0.054       0.054       0.063       0.054  
         L6. |    -1.002      -0.999      -0.999      -0.999      -1.002      -0.999  
             |
    rdtreat2 |
         L4. |    -0.200      -0.195      -0.195      -0.195      -0.200      -0.195  
         L5. |    -0.126      -0.126      -0.126      -0.126      -0.126      -0.126  
         L6. |    -0.364      -0.354      -0.354      -0.354      -0.364      -0.354  
             |
          i0 |                 0.014                   0.014                   0.014  
          i1 |                 0.210                   0.210                   0.210  
          i2 |                -0.566                  -0.566                  -0.566  
          i3 |                -0.382                  -0.382                  -0.382  
       _cons |     5.597       5.601                               5.597       5.601  
--------------------------------------------------------------------------------------

Thank you,

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Joro Kolev

Join Date: Aug 2018

Posts: 3047
#2

23 Dec 2018, 16:57

This is way too complicated, and nobody but you knows what all those variables, dummies and categories mean.

Try to reproduce the error in a simple setup where the meaning of the variables is transparent.

My guess is that you get different estimates because of different parametrisations.

Joro Kolev

Join Date: Aug 2018
Posts: 3047

23 Dec 2018, 17:00

Here is an example of the "same regression" in three different parametrisations:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. gen domestic = foreign==0

. reg price domestic, noheader
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    domestic |  -312.2587   754.4488    -0.41   0.680    -1816.225    1191.708
       _cons |   6384.682   632.4346    10.10   0.000     5123.947    7645.417
------------------------------------------------------------------------------

. reg price foreign, noheader
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
       _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
------------------------------------------------------------------------------

. reg price domestic foreign, nocons noheader
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    domestic |   6072.423    411.363    14.76   0.000     5252.386     6892.46
     foreign |   6384.682   632.4346    10.10   0.000     5123.947    7645.417
------------------------------------------------------------------------------
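A quick way to confirm that these three fits are one model written three ways is to compare fitted values rather than coefficients. This is only a minimal sketch, assuming the shipped auto dataset; the names xb1, xb2 and xb3 are made up for illustration.

```stata
* Sketch: three parametrisations, one set of fitted values
sysuse auto, clear
gen domestic = foreign==0

* Fit each parametrisation and keep its fitted values in double precision
quietly regress price domestic
predict double xb1, xb

quietly regress price foreign
predict double xb2, xb

quietly regress price domestic foreign, noconstant
predict double xb3, xb

* The coefficients differ, but the predictions should agree up to rounding
assert reldif(xb1, xb2) < 1e-6
assert reldif(xb1, xb3) < 1e-6
```

If both -assert- lines pass silently, the models are equivalent and only the labelling of the coefficients differs.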

Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#4

24 Dec 2018, 08:42

Thanks, I'll try to replicate the error on an example dataset. The different parametrization hypothesis wouldn't explain why -reghdfe- yields the same results with the different parametrizations, while -areg- and -xtreg- yield different results.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#5

24 Dec 2018, 09:37

I looked again at the results that you present, and they are all coefficient estimates on the dummy variables.

I think the easiest way for you (and all of us) to be convinced that these are the same regression, and that the different estimates on the dummies are simply due to parametrisation, is the following:

Include at least one continuous regressor apart from those dummies in your regressions.

The parameter estimates on this continuous regressor from all methods should be the same.

Joro Kolev

Join Date: Aug 2018
Posts: 3047

24 Dec 2018, 09:43

Look below at how I convince myself that the outcome of all three commands is the same by looking at the other coefficients beyond the dummies (the user-written command is beyond my scope; I don't think I have ever used a user-written command):

Code:

. sysuse auto, clear
(1978 Automobile Data)

. areg price weight length, absorb(rep78)

Linear regression, absorbing indicators         Number of obs     =         69
Absorbed variable: rep78                        No. of categories =          5
                                                F(   2,     62)   =      22.98
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4341
                                                Adj R-squared     =     0.3793
                                                Root MSE          =  2294.5106

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
      length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
       _cons |   10154.62   4270.525     2.38   0.021      1617.96    18691.27
------------------------------------------------------------------------------
F test of absorbed indicators: F(4, 62) = 2.079               Prob > F = 0.094

. bysort rep78: gen time = _n

. xtset rep78 time
       panel variable:  rep78 (unbalanced)
        time variable:  time, 1 to 30
                delta:  1 unit

. xtreg  price weight length, fe

Fixed-effects (within) regression               Number of obs     =         69
Group variable: rep78                           Number of groups  =          5

R-sq:                                           Obs per group:
     within  = 0.4258                                         min =          2
     between = 0.0011                                         avg =       13.8
     overall = 0.3578                                         max =         30

                                                F(2,62)           =      22.98
corr(u_i, Xb)  = -0.4394                        Prob > F          =     0.0000

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
      length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
       _cons |   10154.62   4270.525     2.38   0.021      1617.96    18691.27
-------------+----------------------------------------------------------------
     sigma_u |  1333.3033
     sigma_e |  2294.5106
         rho |  .25242509   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4, 62) = 2.08                       Prob > F = 0.0942

. reg price weight length i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(6, 62)        =      7.93
       Model |   250380669         6  41730111.5   Prob > F        =    0.0000
    Residual |   326416290        62  5264778.86   R-squared       =    0.4341
-------------+----------------------------------   Adj R-squared   =    0.3793
       Total |   576796959        68  8482308.22   Root MSE        =    2294.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
      length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
             |
       rep78 |
          2  |   1149.134   1821.448     0.63   0.530    -2491.889    4790.157
          3  |   1322.082    1677.64     0.79   0.434    -2031.472    4675.637
          4  |   2310.734   1714.842     1.35   0.183    -1117.186    5738.654
          5  |   3545.927   1793.553     1.98   0.052    -39.33398    7131.187
             |
       _cons |   8278.473   4525.821     1.83   0.072    -768.5151    17325.46
------------------------------------------------------------------------------

.

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

24 Dec 2018, 10:16

The differences I get are from different parametrizations with the same command. Across commands, with the same parametrization the results are the same. Here's a working example with a subset of my data:

Code:

* Input data
clear
input long cty float(year unemp_rate running intent defier m0 m1)
37005 1995    6  0 0 0 0 0
37005 1996    8 16 0 0 1 0
37005 1997  5.3 -2 1 0 1 1
37007 1995  9.2  0 0 0 0 0
37007 1996  7.1  1 0 0 1 0
37007 1997  6.1 -1 1 0 1 1
37015 1995  6.8  0 0 0 0 0
37015 1996  6.6 -2 1 0 1 0
37015 1997  5.6  2 0 1 1 1
37033 1995  3.6  0 0 0 0 0
37033 1996  3.5 23 0 0 1 0
37033 1997  3.6 25 0 0 1 1
37065 1995  8.1  0 0 0 0 0
37065 1996 11.8  5 0 0 1 0
37065 1997 11.2 -3 1 0 1 1
37075 1995 14.1  0 0 0 0 0
37075 1996   12 -6 1 0 1 0
37075 1997   11 -7 1 0 1 1
37083 1995    9  0 0 0 0 0
37083 1996  9.8  4 0 0 1 0
37083 1997    9 -5 1 0 1 1
37091 1995  5.7  0 0 0 0 0
37091 1996  5.6 -1 1 0 1 0
37091 1997  5.1 -1 1 0 1 1
37095 1995   10  0 0 0 0 0
37095 1996  9.1 -4 1 0 1 0
37095 1997  7.2  1 0 1 1 1
37117 1995  6.3  0 0 0 0 0
37117 1996 10.7 27 0 0 1 0
37117 1997  9.7 12 0 0 1 1
37121 1995  7.5  0 0 0 0 0
37121 1996  5.4 -3 1 0 1 0
37121 1997  5.8  4 0 1 1 1
37131 1995  7.4  0 0 0 0 0
37131 1996  7.7 -7 1 0 1 0
37131 1997  7.5 -6 1 0 1 1
37153 1995  9.5  0 0 0 0 0
37153 1996 11.3 -5 1 0 1 0
37153 1997    9 -8 1 0 1 1
37165 1995    7  0 0 0 0 0
37165 1996    7 12 0 0 1 0
37165 1997  7.2 11 0 0 1 1
37173 1995 16.7  0 0 0 0 0
37173 1996 17.8 -8 1 0 1 0
37173 1997 16.9 -9 1 0 1 1
37177 1995 10.1  0 0 0 0 0
37177 1996  9.2 -9 1 0 1 0
37177 1997  8.7  3 0 1 1 1
37181 1995  8.3  0 0 0 0 0
37181 1996    8 13 0 0 1 0
37181 1997  6.8 10 0 0 1 1
37185 1995  8.5  0 0 0 0 0
37185 1996   10  0 1 0 1 0
37185 1997    6 -5 1 0 1 1
37187 1995 10.4  0 0 0 0 0
37187 1996  7.3  3 0 0 1 0
37187 1997  6.1  4 0 0 1 1
37199 1995  5.8  0 0 0 0 0
37199 1996  5.9  9 0 0 1 0
37199 1997  5.1 15 0 0 1 1
end

* Declare panel

xtset cty year

* Define outcome

glo outcome "unemp_rate"

* Generate dummy variables

gen i0 = 1.intent#0.defier#1.m0
gen i1 = l1.1.intent#l1.0.defier#1.m1

* Method 1 for areg and reghdfe: Use factor variable notation

foreach method in areg reghdfe {

# delimit ;
    `method' ${outcome}

    1.intent#0.defier#1.m0
    l1.1.intent#l1.0.defier#1.m1
    

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
    
    
    , cluster(cty) absorb(cty);
# delimit cr
est store `method'1

* Method 2 for areg and reghdfe: Include previously generated dummy variables

# delimit ;
    `method' ${outcome}

    i0
    i1
    

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
    

    , cluster(cty) absorb(cty);
# delimit cr
est store `method'2
}

* Methods 1 and 2 for xtreg

# delimit ;
    xtreg ${outcome}

    1.intent#0.defier#1.m0
    l1.1.intent#l1.0.defier#1.m1
    

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
    , r fe;
# delimit cr
est store xtreg1


# delimit ;
    xtreg ${outcome}
    i0
    i1

    c.running#0.defier#1.m0
    l1.c.running#l1.0.defier#1.m1
        
    , r fe ;
# delimit cr
est store xtreg2

* Display table

est tab areg* reghdfe* xtreg*, b(%9.3f)

which returns

Code:

--------------------------------------------------------------------------------------
    Variable |   areg1       areg2     reghdfe1    reghdfe2     xtreg1      xtreg2    
-------------+------------------------------------------------------------------------
      intent#|
   defier#m0 |
      1 0 1  |    -5.713                  -0.414                  -5.713              
             |
    L.intent#|
 L.defier#m1 |
      1 0 1  |    -7.964                  -1.526                  -7.964              
             |
   defier#m0#|
   c.running |
        0 1  |    -0.757       0.070       0.070       0.070      -0.757       0.070  
             |
 L.defier#m1#|
  cL.running |
        0 1  |    -0.392      -0.024      -0.024      -0.024      -0.392      -0.024  
             |
          i0 |                -0.414                  -0.414                  -0.414  
          i1 |                -1.526                  -1.526                  -1.526  
       _cons |    15.535       8.626                              15.535       8.626  
--------------------------------------------------------------------------------------

Notice that I am not changing the levels of the dummies at all. I am simply generating them beforehand instead of using factor-variable notation in the regression, and I am using the same factor-variable notation in the generate syntax. So I don't think the coefficients on the dummies should change. There isn't an entirely continuous regressor in this regression, but running is continuous and it is interacted with the dummies, yet its coefficients are not the same across methods. They were not the same in the initial results either.
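One way to rule out the generated dummies themselves as the source of the discrepancy is to compare them with indicators built by hand. The following is a minimal sketch on made-up toy data: the values and the names i0_check and i1_check are hypothetical and not taken from the real dataset, while the -generate- lines follow the syntax used above.

```stata
* Sketch: check that dummies generated with factor-variable notation
* match indicators built by hand (toy data, hypothetical values)
clear
input long cty float(year intent defier m0 m1)
1 1995 0 0 0 0
1 1996 1 0 1 0
1 1997 1 0 1 1
2 1995 0 0 0 0
2 1996 0 0 1 0
2 1997 0 1 1 1
end
xtset cty year

* Same syntax as in the thread
gen i0 = 1.intent#0.defier#1.m0
gen i1 = l1.1.intent#l1.0.defier#1.m1

* Hand-built equivalents; i1_check is left missing where the lag is missing
gen i0_check = (intent == 1) * (defier == 0) * (m0 == 1)
gen i1_check = (L.intent == 1) * (L.defier == 0) * (m1 == 1) if !missing(L.intent, L.defier)

* Compare only where the lagged values exist
assert i0 == i0_check
assert i1 == i1_check if !missing(L.intent, L.defier)
```

If the assertions pass, the two ways of specifying the dummies feed identical regressors to the estimation command, so any difference in coefficients has to come from how the command handles the factor-variable terms.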

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Joro Kolev

Join Date: Aug 2018

Posts: 3047
#8

24 Dec 2018, 23:46

I toyed around with this quite a lot, and I am afraid I do not know what is going on.

If it is of any help, the problem does not arise with only the first two variables; it appears when the third and the fourth variables are included in the regression (whether I include them in factor notation in the regression, or pregenerate them and include the generated variables).

When all variables are included in the regression, the problem does not arise if I use -regress-, it arises when the fixed effects are included with -areg- say.

Along the way I learnt that factor variables are a bit of an issue, as some commands take them and some don't. And I do not see any logic to why some commands take them and some don't.

E.g., -xtsum- and -correlate- do not take them, while -summarize- and the regression commands do.

If I were you I would write to StataCorp to ask them what is going on here.

If they manage to figure out what is going on, please put their answer here, I think this issue might be of interest to other people as well.
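Since the problem shows up with -areg- but not with -regress-, one possible cross-check is to fit the same specification both ways and compare the slopes programmatically. This is a minimal sketch, assuming the shipped auto dataset, where no foreign car has rep78 equal to 1 or 2; the specification is a made-up stand-in for the real model, and whether it reproduces the discrepancy on an affected installation is untested.

```stata
* Sketch: cross-check -areg- against -regress- with explicit group dummies
sysuse auto, clear

* Absorbing rep78; single-level factor term plus an interaction with weight
quietly areg price 1.foreign c.weight#1.foreign, absorb(rep78)
scalar b1_areg = _b[1.foreign]
scalar b2_areg = _b[1.foreign#c.weight]

* Same model with the rep78 indicators entered explicitly
quietly regress price 1.foreign c.weight#1.foreign i.rep78
scalar b1_reg = _b[1.foreign]
scalar b2_reg = _b[1.foreign#c.weight]

* The slopes should be numerically identical up to rounding
assert reldif(scalar(b1_areg), scalar(b1_reg)) < 1e-6
assert reldif(scalar(b2_areg), scalar(b2_reg)) < 1e-6
```

A failed -assert- stops with an error, which would flag that the two commands disagree on the same model.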
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#9

26 Dec 2018, 08:28

Thank you, I will ask Stata tech support and post their answer here.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#10

10 Jan 2019, 08:44

Stata Tech support got back to me. This is a bug. It will be fixed in the next update.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#11

26 Feb 2019, 09:33

This was fixed in the latest update (I haven't tested it yet, though). From the update notes:

-------- update 20feb2019 --------

3. areg specified with a factor variable having a single chosen level, when that factor level was not observed in one or more of the absorption groups and the model contained subsequent interaction terms containing a continuous variable, produced incorrect coefficients and standard errors. This has been fixed.
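To see whether a given installation already includes this update, the installed version and update dates can be inspected before re-running the example above. This is a minimal sketch; the last command contacts stata.com, so it assumes internet access.

```stata
* Sketch: inspect the installed Stata version and update status
about
display "Stata version:   " c(stata_version)
display "Executable date: " c(born_date)

* Compare installed executable and ado-file dates with those available online
update query
```

If the reported dates are earlier than 20feb2019, the fix described in the note is presumably not installed yet.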

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com