Individually added factor variable levels

Gabor Mugge

Join Date: Apr 2021
Posts: 30


28 Mar 2024, 12:54

Hi,

Could someone please explain to me why the output of the first three regression commands below differs from the output of the last command? Based on the information provided in `help fvvarlist', I would expect all outputs to be the same as the one given by the last command.

Thank you,
Gabor

Code:

clear all

set obs 1000000

set seed 134

gen cat=mod(_n, 3)+1

gen y=rnormal(0,.1)
replace y=y+1 if cat==2
replace y=y+2 if cat==3

* first set of regressions
reg y 2.cat 3.cat
reg y i(2 3).cat
reg y i2.cat i3.cat

* last regression
reg y 2bn.cat 3bn.cat

Output of the first three commands:

Code:

      Source |       SS           df       MS      Number of obs   = 1,000,000
-------------+----------------------------------   F(1, 999998)    >  99999.00
       Model |  500055.387         1  500055.387   Prob > F        =    0.0000
    Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
-------------+----------------------------------   Adj R-squared   =    0.7389
       Total |  676792.563   999,999  .676793239   Root MSE        =     .4204

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       3.cat |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
       _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
------------------------------------------------------------------------------

Output of the last command:

Code:

. reg y 2bn.cat 3bn.cat

      Source |       SS           df       MS      Number of obs   = 1,000,000
-------------+----------------------------------   F(2, 999997)    >  99999.00
       Model |  666799.899         2  333399.949   Prob > F        =    0.0000
    Residual |  9992.66366   999,997  .009992694   R-squared       =    0.9852
-------------+----------------------------------   Adj R-squared   =    0.9852
       Total |  676792.563   999,999  .676793239   Root MSE        =    .09996

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         cat |
          2  |   1.000233   .0002449  4084.93   0.000     .9997533    1.000713
          3  |   2.000201   .0002449  8168.77   0.000     1.999721    2.000681
             |
       _cons |  -.0001443   .0001731    -0.83   0.404    -.0004837     .000195
------------------------------------------------------------------------------

Last edited by Gabor Mugge; 28 Mar 2024, 12:57.


Daniel Schaefer

Join Date: Mar 2020
Posts: 806

28 Mar 2024, 15:05

In the first three commands, category 1 is left out because you never list it, and Stata then treats the first level you do list (2) as the base, or reference, category. As a result, neither category 1 nor category 2 gets its own indicator, and both end up in the reference group. Your first three models are therefore equivalent to this:

Code:

gen cat3 = cat == 3
reg y cat3

Code:

. reg y cat3

      Source |       SS           df       MS      Number of obs   = 1,000,000
-------------+----------------------------------   F(1, 999998)    >  99999.00
       Model |  500055.387         1  500055.387   Prob > F        =    0.0000
    Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
-------------+----------------------------------   Adj R-squared   =    0.7389
       Total |  676792.563   999,999  .676793239   Root MSE        =     .4204

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        cat3 |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
       _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
------------------------------------------------------------------------------
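One way to see why the constant is about .5 and the coefficient about 1.5: in a regression on a single dummy, the constant is the mean of y in the reference group and the coefficient is the difference in means. Here the reference group pools categories 1 and 2, whose y values are centred on 0 and 1 with nearly equal group sizes, so the pooled mean is roughly (0 + 1)/2 = .5 and the gap to category 3 is roughly 2 - .5 = 1.5. A minimal sketch of that check, assuming y and cat from the original post are still in memory (the scalar names m12 and m3 are made up for illustration):

```stata
* Mean of y in the pooled reference group (categories 1 and 2)
summarize y if inlist(cat, 1, 2), meanonly
scalar m12 = r(mean)

* Mean of y in category 3
summarize y if cat == 3, meanonly
scalar m3 = r(mean)

* These should line up with _cons and the cat3 coefficient above
display "pooled mean, cat 1 and 2:     " m12
display "cat 3 minus pooled mean:      " (m3 - m12)
```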

By specifying bn (no base level), you instruct Stata to include the virtual dummy variables for categories 2 and 3 without treating either of them as an omitted base. Category 1, which you do not list, is then the only reference group. So your last model is equivalent to this:

Code:

gen cat2 = cat == 2
* cat3 already exists if the earlier block was run; capture skips the error
capture gen cat3 = cat == 3
reg y cat2 cat3

Code:

. reg y cat2 cat3

      Source |       SS           df       MS      Number of obs   = 1,000,000
-------------+----------------------------------   F(2, 999997)    >  99999.00
       Model |  666799.899         2  333399.949   Prob > F        =    0.0000
    Residual |  9992.66366   999,997  .009992694   R-squared       =    0.9852
-------------+----------------------------------   Adj R-squared   =    0.9852
       Total |  676792.563   999,999  .676793239   Root MSE        =    .09996

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        cat2 |   1.000233   .0002449  4084.93   0.000     .9997533    1.000713
        cat3 |   2.000201   .0002449  8168.77   0.000     1.999721    2.000681
       _cons |  -.0001443   .0001731    -0.83   0.404    -.0004837     .000195
------------------------------------------------------------------------------

To get a model using bn that is equivalent to the first three, simply include the third category as a virtual dummy variable with no base level (the excluded categories are just the 0 values of the dummy).

Code:

reg y 3bn.cat

Code:

. reg y 3bn.cat

      Source |       SS           df       MS      Number of obs   = 1,000,000
-------------+----------------------------------   F(1, 999998)    >  99999.00
       Model |  500055.387         1  500055.387   Prob > F        =    0.0000
    Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
-------------+----------------------------------   Adj R-squared   =    0.7389
       Total |  676792.563   999,999  .676793239   Root MSE        =     .4204

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       3.cat |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
       _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
------------------------------------------------------------------------------
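If the aim was coefficients for levels 2 and 3 relative to level 1, the usual factor-variable spellings should also work. A minimal sketch, assuming the same y and cat are in memory and that no other base has been set with fvset; output is not shown here, but with level 1 as the omitted base both commands should reproduce the estimates from reg y 2bn.cat 3bn.cat:

```stata
* i.cat uses the lowest level (1) as the base by default
reg y i.cat

* ib1.cat sets the base level to 1 explicitly
reg y ib1.cat
```

Spelling out the base with ib1. makes the reference group visible in the command itself, which avoids the surprise of a listed level being dropped.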

Last edited by Daniel Schaefer; 28 Mar 2024, 15:07.
