  • Inconsistent results with -areg-, -xtreg- and factor variables.

    Hi Statalist.

    I am estimating a model with fixed effects and factor variables, which I can fit with -areg-, -reg-, -xtreg- or -reghdfe-. With every command except -reghdfe-, I get different coefficients depending on whether I write the factor-variable terms directly in the regression or generate the corresponding variables beforehand and include those instead. These are differences in coefficients within the same estimation method; differences in standard errors across methods are to be expected.

    Here's the code I am using:

    Code:
    gen i0 = 1.intent#0.defier#1.m0
    gen i1 = l1.1.intent#l1.0.defier#1.m1
    gen i2 = l2.1.intent#l2.0.defier#1.m2
    gen i3 = l3.1.intent#l3.0.defier#1.m3
    
    
    foreach method in areg reghdfe {
    
    # delimit ;
        `method' ${outcome}
    
        1.intent#0.defier#1.m0
        l1.1.intent#l1.0.defier#1.m1
        l2.1.intent#l2.0.defier#1.m2
        l3.1.intent#l3.0.defier#1.m3
    
        c.running#0.defier#1.m0
        l1.c.running#l1.0.defier#1.m1
        l2.c.running#l2.0.defier#1.m2
        l3.c.running#l3.0.defier#1.m3
    
        0.defier#m0
        0.l1.defier#m1
        0.l2.defier#m2
        0.l3.defier#m3
        
        i.year
        
                
        wave0 l1wave0 l2wave0 l3wave0
        l(4/6).rdtreat
        l(4/6).rdtreat2    
        
                            
        if inrange(year,1996,2006)
        & insample                
        , cluster(cty) absorb(cty);
    # delimit cr
    est store `method'1
    
    
    
    # delimit ;
        `method' ${outcome}
    
        i0
        i1
        i2
        i3
    
        c.running#0.defier#1.m0
        l1.c.running#l1.0.defier#1.m1
        l2.c.running#l2.0.defier#1.m2
        l3.c.running#l3.0.defier#1.m3
    
        0.defier#m0
        0.l1.defier#m1
        0.l2.defier#m2
        0.l3.defier#m3
        
        i.year
        
                
        wave0 l1wave0 l2wave0 l3wave0
        l(4/6).rdtreat
        l(4/6).rdtreat2    
        
                            
        if inrange(year,1996,2006)
        & insample                
        , cluster(cty) absorb(cty);
    # delimit cr
    est store `method'2
    }
    The code for -xtreg- is similar, with the fe option. Method 1 uses factor-variable notation in the regression; method 2 creates the variables beforehand. Here's a table with the results. As you can see, the areg1 and areg2 columns differ (as do xtreg1 and xtreg2), but the reghdfe1 and reghdfe2 columns are the same.

    Code:
    . est tab areg* reghdfe* xtreg*, b(%9.3f)
    
    --------------------------------------------------------------------------------------
        Variable |   areg1       areg2     reghdfe1    reghdfe2     xtreg1      xtreg2    
    -------------+------------------------------------------------------------------------
          intent#|
       defier#m0 |
          1 0 1  |     0.052                   0.014                   0.052              
                 |
        L.intent#|
     L.defier#m1 |
          1 0 1  |     0.164                   0.210                   0.164              
                 |
       L2.intent#|
    L2.defier#m2 |
          1 0 1  |    -0.489                  -0.566                  -0.489              
                 |
       L3.intent#|
    L3.defier#m3 |
          1 0 1  |    -0.445                  -0.382                  -0.445              
                 |
       defier#m0#|
       c.running |
            0 1  |    -0.010      -0.010      -0.010      -0.010      -0.010      -0.010  
                 |
     L.defier#m1#|
      cL.running |
            0 1  |     0.005       0.006       0.006       0.006       0.005       0.006  
                 |
             L2. |
          defier#|
              m2#|
     cL2.running |
            0 1  |    -0.008      -0.010      -0.010      -0.010      -0.008      -0.010  
                 |
             L3. |
          defier#|
              m3#|
     cL3.running |
            0 1  |     0.009       0.009       0.009       0.009       0.009       0.009  
                 |
       defier#m0 |
            0 1  |     0.265       0.276       0.276       0.276       0.265       0.276  
                 |
     L.defier#m1 |
            0 1  |    -0.255      -0.304      -0.304      -0.304      -0.255      -0.304  
                 |
    L2.defier#m2 |
            0 1  |     0.290       0.364       0.364       0.364       0.290       0.364  
                 |
    L3.defier#m3 |
            0 1  |    -0.178      -0.217      -0.217      -0.217      -0.178      -0.217  
                 |
            year |
           1997  |    -0.653      -0.657      -0.657      -0.657      -0.653      -0.657  
           1998  |    -0.798      -0.801      -0.801      -0.801      -0.798      -0.801  
           1999  |    -1.030      -1.031      -1.031      -1.031      -1.030      -1.031  
           2000  |    -0.489      -0.491      -0.491      -0.491      -0.489      -0.491  
           2001  |     1.685       1.680       1.680       1.680       1.685       1.680  
           2002  |     2.961       2.957       2.957       2.957       2.961       2.957  
           2003  |     2.468       2.461       2.461       2.461       2.468       2.461  
           2004  |     1.608       1.602       1.602       1.602       1.608       1.602  
           2005  |     1.369       1.366       1.366       1.366       1.369       1.366  
           2006  |     0.954       0.947       0.947       0.947       0.954       0.947  
                 |
           wave0 |     0.928       0.917       0.917       0.917       0.928       0.917  
         l1wave0 |     0.898       0.908       0.908       0.908       0.898       0.908  
         l2wave0 |     0.572       0.556       0.556       0.556       0.572       0.556  
         l3wave0 |     0.605       0.606       0.606       0.606       0.605       0.606  
                 |
         rdtreat |
             L4. |    -0.828      -0.836      -0.836      -0.836      -0.828      -0.836  
             L5. |     0.063       0.054       0.054       0.054       0.063       0.054  
             L6. |    -1.002      -0.999      -0.999      -0.999      -1.002      -0.999  
                 |
        rdtreat2 |
             L4. |    -0.200      -0.195      -0.195      -0.195      -0.200      -0.195  
             L5. |    -0.126      -0.126      -0.126      -0.126      -0.126      -0.126  
             L6. |    -0.364      -0.354      -0.354      -0.354      -0.364      -0.354  
                 |
              i0 |                 0.014                   0.014                   0.014  
              i1 |                 0.210                   0.210                   0.210  
              i2 |                -0.566                  -0.566                  -0.566  
              i3 |                -0.382                  -0.382                  -0.382  
           _cons |     5.597       5.601                               5.597       5.601  
    --------------------------------------------------------------------------------------
    Thank you,





    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com

  • #2
    This is way too complicated and nobody but you knows what all those variables, dummies, categories, mean.

    Try to generate the error in some simple set up where the meaning of variables is transparent.

    My guess is that you get different estimates because of a different parametrisation.



    • #3
      Here is an example of the "same regression" in three different parametrisations:

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . gen domestic = foreign==0
      
      . reg price domestic, noheader
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          domestic |  -312.2587   754.4488    -0.41   0.680    -1816.225    1191.708
             _cons |   6384.682   632.4346    10.10   0.000     5123.947    7645.417
      ------------------------------------------------------------------------------
      
      . reg price foreign, noheader
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           foreign |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
             _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
      ------------------------------------------------------------------------------
      
      . reg price domestic foreign, nocons noheader
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          domestic |   6072.423    411.363    14.76   0.000     5252.386     6892.46
           foreign |   6384.682   632.4346    10.10   0.000     5123.947    7645.417
      ------------------------------------------------------------------------------



      • #4
        Thanks, I'll try to replicate the error on an example dataset. The different parametrization hypothesis wouldn't explain why -reghdfe- yields the same results with the different parametrizations, while -areg- and -xtreg- yield different results.
        Jorge Eduardo Pérez Pérez
        www.jorgeperezperez.com



        • #5
          I looked again at the results that you present, and they are all coefficient estimates on the dummy variables.

          I think the easiest way for you (and all of us) to get convinced that these are the same regression and the different estimates on the dummies are simply due to parametrisation is the following:

          include at least one continuous regressor apart from those dummies in your regressions.

          The parameter estimates on this continuous regressor from all methods should be the same.
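A minimal sketch of the check suggested here, run on the auto dataset rather than the poster's data. The variable `domestic` and the scalar names `b_foreign` and `b_domestic` are made up for illustration. The same model is fitted with one command under two codings of the dummy, and the coefficient on the continuous regressor `weight` is compared.

```stata
sysuse auto, clear
gen byte domestic = foreign == 0

* Same model, two parametrisations of the dummy
areg price weight foreign, absorb(rep78)
scalar b_foreign = _b[weight]

areg price weight domestic, absorb(rep78)
scalar b_domestic = _b[weight]

* The coefficient on the continuous regressor should not depend on the coding
display b_foreign " " b_domestic
assert reldif(b_foreign, b_domestic) < 1e-8
```

If the two fits are only reparametrisations of each other, the assertion passes. A failure would point to something other than the coding of the dummies.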



          • #6
            Look below at how I convince myself that all three commands give the same outcome, by looking at the coefficients other than the dummies (the user-written command is beyond my scope; I don't think I have ever used a user-written command):

            Code:
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . areg price weight length, absorb(rep78)
            
            Linear regression, absorbing indicators         Number of obs     =         69
            Absorbed variable: rep78                        No. of categories =          5
                                                            F(   2,     62)   =      22.98
                                                            Prob > F          =     0.0000
                                                            R-squared         =     0.4341
                                                            Adj R-squared     =     0.3793
                                                            Root MSE          =  2294.5106
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
                  length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
                   _cons |   10154.62   4270.525     2.38   0.021      1617.96    18691.27
            ------------------------------------------------------------------------------
            F test of absorbed indicators: F(4, 62) = 2.079               Prob > F = 0.094
            
            . bysort rep78: gen time = _n
            
            . xtset rep78 time
                   panel variable:  rep78 (unbalanced)
                    time variable:  time, 1 to 30
                            delta:  1 unit
            
            . xtreg  price weight length, fe
            
            Fixed-effects (within) regression               Number of obs     =         69
            Group variable: rep78                           Number of groups  =          5
            
            R-sq:                                           Obs per group:
                 within  = 0.4258                                         min =          2
                 between = 0.0011                                         avg =       13.8
                 overall = 0.3578                                         max =         30
            
                                                            F(2,62)           =      22.98
            corr(u_i, Xb)  = -0.4394                        Prob > F          =     0.0000
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
                  length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
                   _cons |   10154.62   4270.525     2.38   0.021      1617.96    18691.27
            -------------+----------------------------------------------------------------
                 sigma_u |  1333.3033
                 sigma_e |  2294.5106
                     rho |  .25242509   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            F test that all u_i=0: F(4, 62) = 2.08                       Prob > F = 0.0942
            
            . reg price weight length i.rep78
            
                  Source |       SS           df       MS      Number of obs   =        69
            -------------+----------------------------------   F(6, 62)        =      7.93
                   Model |   250380669         6  41730111.5   Prob > F        =    0.0000
                Residual |   326416290        62  5264778.86   R-squared       =    0.4341
            -------------+----------------------------------   Adj R-squared   =    0.3793
                   Total |   576796959        68  8482308.22   Root MSE        =    2294.5
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  weight |   5.478309   1.158582     4.73   0.000     3.162337    7.794281
                  length |  -109.5065   39.26104    -2.79   0.007    -187.9882   -31.02482
                         |
                   rep78 |
                      2  |   1149.134   1821.448     0.63   0.530    -2491.889    4790.157
                      3  |   1322.082    1677.64     0.79   0.434    -2031.472    4675.637
                      4  |   2310.734   1714.842     1.35   0.183    -1117.186    5738.654
                      5  |   3545.927   1793.553     1.98   0.052    -39.33398    7131.187
                         |
                   _cons |   8278.473   4525.821     1.83   0.072    -768.5151    17325.46
            ------------------------------------------------------------------------------
            
            .



            • #7
              The differences I get are from different parametrizations with the same command. Across commands, with the same parametrization the results are the same. Here's a working example with a subset of my data:

              Code:
              * Input data
              clear
              input long cty float(year unemp_rate running intent defier m0 m1)
              37005 1995    6  0 0 0 0 0
              37005 1996    8 16 0 0 1 0
              37005 1997  5.3 -2 1 0 1 1
              37007 1995  9.2  0 0 0 0 0
              37007 1996  7.1  1 0 0 1 0
              37007 1997  6.1 -1 1 0 1 1
              37015 1995  6.8  0 0 0 0 0
              37015 1996  6.6 -2 1 0 1 0
              37015 1997  5.6  2 0 1 1 1
              37033 1995  3.6  0 0 0 0 0
              37033 1996  3.5 23 0 0 1 0
              37033 1997  3.6 25 0 0 1 1
              37065 1995  8.1  0 0 0 0 0
              37065 1996 11.8  5 0 0 1 0
              37065 1997 11.2 -3 1 0 1 1
              37075 1995 14.1  0 0 0 0 0
              37075 1996   12 -6 1 0 1 0
              37075 1997   11 -7 1 0 1 1
              37083 1995    9  0 0 0 0 0
              37083 1996  9.8  4 0 0 1 0
              37083 1997    9 -5 1 0 1 1
              37091 1995  5.7  0 0 0 0 0
              37091 1996  5.6 -1 1 0 1 0
              37091 1997  5.1 -1 1 0 1 1
              37095 1995   10  0 0 0 0 0
              37095 1996  9.1 -4 1 0 1 0
              37095 1997  7.2  1 0 1 1 1
              37117 1995  6.3  0 0 0 0 0
              37117 1996 10.7 27 0 0 1 0
              37117 1997  9.7 12 0 0 1 1
              37121 1995  7.5  0 0 0 0 0
              37121 1996  5.4 -3 1 0 1 0
              37121 1997  5.8  4 0 1 1 1
              37131 1995  7.4  0 0 0 0 0
              37131 1996  7.7 -7 1 0 1 0
              37131 1997  7.5 -6 1 0 1 1
              37153 1995  9.5  0 0 0 0 0
              37153 1996 11.3 -5 1 0 1 0
              37153 1997    9 -8 1 0 1 1
              37165 1995    7  0 0 0 0 0
              37165 1996    7 12 0 0 1 0
              37165 1997  7.2 11 0 0 1 1
              37173 1995 16.7  0 0 0 0 0
              37173 1996 17.8 -8 1 0 1 0
              37173 1997 16.9 -9 1 0 1 1
              37177 1995 10.1  0 0 0 0 0
              37177 1996  9.2 -9 1 0 1 0
              37177 1997  8.7  3 0 1 1 1
              37181 1995  8.3  0 0 0 0 0
              37181 1996    8 13 0 0 1 0
              37181 1997  6.8 10 0 0 1 1
              37185 1995  8.5  0 0 0 0 0
              37185 1996   10  0 1 0 1 0
              37185 1997    6 -5 1 0 1 1
              37187 1995 10.4  0 0 0 0 0
              37187 1996  7.3  3 0 0 1 0
              37187 1997  6.1  4 0 0 1 1
              37199 1995  5.8  0 0 0 0 0
              37199 1996  5.9  9 0 0 1 0
              37199 1997  5.1 15 0 0 1 1
              end
              
              * Declare panel
              
              xtset cty year
              
              * Define outcome
              
              glo outcome "unemp_rate"
              
              * Generate dummy variables
              
              gen i0 = 1.intent#0.defier#1.m0
              gen i1 = l1.1.intent#l1.0.defier#1.m1
              
              * Method 1 for areg and reghdfe: Use factor variable notation
              
              foreach method in areg reghdfe {
              
              # delimit ;
                  `method' ${outcome}
              
                  1.intent#0.defier#1.m0
                  l1.1.intent#l1.0.defier#1.m1
                  
              
                  c.running#0.defier#1.m0
                  l1.c.running#l1.0.defier#1.m1
                  
                  
                  , cluster(cty) absorb(cty);
              # delimit cr
              est store `method'1
              
              * Method 2 for areg and reghdfe: Include previously generated dummy variables
              
              # delimit ;
                  `method' ${outcome}
              
                  i0
                  i1
                  
              
                  c.running#0.defier#1.m0
                  l1.c.running#l1.0.defier#1.m1
                  
              
                  , cluster(cty) absorb(cty);
              # delimit cr
              est store `method'2
              }
              
              * Methods 1 and 2 for xtreg
              
              # delimit ;
                  xtreg ${outcome}
              
                  1.intent#0.defier#1.m0
                  l1.1.intent#l1.0.defier#1.m1
                  
              
                  c.running#0.defier#1.m0
                  l1.c.running#l1.0.defier#1.m1
                  , r fe;
              # delimit cr
              est store xtreg1
              
              
              # delimit ;
                  xtreg ${outcome}
                  i0
                  i1
              
                  c.running#0.defier#1.m0
                  l1.c.running#l1.0.defier#1.m1
                      
                  , r fe ;
              # delimit cr
              est store xtreg2
              
              * Display table
              
              est tab areg* reghdfe* xtreg*, b(%9.3f)
              which returns

              Code:
              --------------------------------------------------------------------------------------
                  Variable |   areg1       areg2     reghdfe1    reghdfe2     xtreg1      xtreg2    
              -------------+------------------------------------------------------------------------
                    intent#|
                 defier#m0 |
                    1 0 1  |    -5.713                  -0.414                  -5.713              
                           |
                  L.intent#|
               L.defier#m1 |
                    1 0 1  |    -7.964                  -1.526                  -7.964              
                           |
                 defier#m0#|
                 c.running |
                      0 1  |    -0.757       0.070       0.070       0.070      -0.757       0.070  
                           |
               L.defier#m1#|
                cL.running |
                      0 1  |    -0.392      -0.024      -0.024      -0.024      -0.392      -0.024  
                           |
                        i0 |                -0.414                  -0.414                  -0.414  
                        i1 |                -1.526                  -1.526                  -1.526  
                     _cons |    15.535       8.626                              15.535       8.626  
              --------------------------------------------------------------------------------------

              Notice that I am not changing the levels of the dummies at all. I am simply generating them beforehand instead of using factor-variable notation in the regression, and I use the same factor-variable notation in the generate syntax. So I don't think the coefficients on the dummies should change. There isn't an entirely continuous regressor in this regression, but running is continuous and is interacted with the dummies, yet its coefficients are not the same across methods. They were not the same in the initial results either.
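One way to double-check that the pre-generated variables really match the intended indicators is to rebuild them by hand and compare. This is only a sketch. It assumes the example data above are loaded and xtset, and that `i0` and `i1` already exist. The names `chk0` and `chk1` are made up for illustration.

```stata
* Hand-built versions of the two indicators (hypothetical names chk0, chk1)
gen byte chk0 = (intent == 1) & (defier == 0) & (m0 == 1)
gen byte chk1 = (L.intent == 1) & (L.defier == 0) & (m1 == 1)

* i0 should match everywhere; i1 is compared only where the lag exists
assert i0 == chk0
assert i1 == chk1 if !missing(L.intent, L.defier)

* Inspect how the first year of each panel (no lag available) was coded
tabulate i1 chk1, missing
```

If either assertion fails, the generated variable is not the indicator the regression term is meant to represent, and the comparison between methods 1 and 2 would need revisiting.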


              Jorge Eduardo Pérez Pérez
              www.jorgeperezperez.com



              • #8
                I toyed around with this quite a lot, and I am afraid I do not know what is going on.

                If it is of any help, the problem does not arise for the first two variables; it appears when the third and fourth variables are included in the regression (whether I include them in factor notation in the regression, or pregenerate them and include the generated variables).

                When all variables are included in the regression, the problem does not arise if I use -regress-, it arises when the fixed effects are included with -areg- say.
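A sketch of how this cross-check could be set up on the example from #7: fit the factor-variable version with -regress- and explicit county dummies, then put it next to the stored -areg- results. It assumes the code in #7 has already been run, so that the data are xtset and `areg1` and `areg2` are stored. The name `reg_fv` is made up for illustration. Only coefficients are compared, so no clustering is requested.

```stata
* Factor-variable specification with explicit county dummies
regress unemp_rate                      ///
    1.intent#0.defier#1.m0              ///
    l1.1.intent#l1.0.defier#1.m1        ///
    c.running#0.defier#1.m0             ///
    l1.c.running#l1.0.defier#1.m1       ///
    i.cty
estimates store reg_fv

* Side-by-side coefficients (the county dummies appear only in reg_fv)
estimates table reg_fv areg1 areg2, b(%9.3f)
```

The slope coefficients from `reg_fv` serve as the reference here, since -regress- with explicit dummies and -areg- fit the same model.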

                Along the way I learnt that factor variables are a bit of an issue, as some commands take them and some don't. And I do not see any logic in why some commands take them and some don't.

                E.g., -xtsum- and -correlate- do not take them, while -summ- and the regression commands do...

                If I were you I would write to StataCorp to ask them what is going on here.

                If they manage to figure out what is going on, please put their answer here, I think this issue might be of interest to other people as well.





                • #9
                  Thank you, I will ask Stata tech support and post their answer here.

                  Jorge Eduardo Pérez Pérez
                  www.jorgeperezperez.com



                  • #10
                    Stata Tech support got back to me. This is a bug. It will be fixed in the next update.

                    Jorge Eduardo Pérez Pérez
                    www.jorgeperezperez.com



                    • #11
                      This was fixed in the latest update (I haven't tested it yet, though).


                      -------- update 20feb2019 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


                      3. areg specified with a factor variable having a single chosen level, when that factor level was not observed in one or more of the absorption groups and the model contained subsequent interaction terms containing a continuous variable, produced incorrect coefficients and standard errors. This has been fixed.
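To see whether a given installation already includes this update, something along these lines may help. It is a sketch, not an official procedure: it assumes -areg- is shipped as an ado-file in your release (so that -which- can report its version line), and the fix may live in the ado-files rather than the executable.

```stata
* Compare installed and available updates (needs internet access)
update query

* Path and version line of the installed areg, if it is an ado-file
which areg

* Build date of the Stata executable
display c(born_date)
```

If the reported dates are earlier than 20feb2019, running -update all- and then rerunning the example from #7 would be the natural next step.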

                      Jorge Eduardo Pérez Pérez
                      www.jorgeperezperez.com
