  • Individually added factor variable levels

    Hi,

    Could someone please explain to me why the output of the first three regression commands below differs from the output of the last command? Based on the information provided in `help fvvarlist', I would expect all outputs to be the same as the one given by the last command.

    Thank you,
    Gabor

    Code:
    clear all
    
    set obs 1000000
    
    set seed 134
    
    gen cat=mod(_n, 3)+1
    
    gen y=rnormal(0,.1)
    replace y=y+1 if cat==2
    replace y=y+2 if cat==3
    
    * first set of regressions
    reg y 2.cat 3.cat
    reg y i(2 3).cat
    reg y i2.cat i3.cat
    
    * last regression
    reg y 2bn.cat 3bn.cat
    Output of the first three commands:
    Code:
          Source |       SS           df       MS      Number of obs   = 1,000,000
    -------------+----------------------------------   F(1, 999998)    >  99999.00
           Model |  500055.387         1  500055.387   Prob > F        =    0.0000
        Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
    -------------+----------------------------------   Adj R-squared   =    0.7389
           Total |  676792.563   999,999  .676793239   Root MSE        =     .4204
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           3.cat |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
           _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
    ------------------------------------------------------------------------------
    Output of the last command:

    Code:
    . reg y 2bn.cat 3bn.cat
    
          Source |       SS           df       MS      Number of obs   = 1,000,000
    -------------+----------------------------------   F(2, 999997)    >  99999.00
           Model |  666799.899         2  333399.949   Prob > F        =    0.0000
        Residual |  9992.66366   999,997  .009992694   R-squared       =    0.9852
    -------------+----------------------------------   Adj R-squared   =    0.9852
           Total |  676792.563   999,999  .676793239   Root MSE        =    .09996
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             cat |
              2  |   1.000233   .0002449  4084.93   0.000     .9997533    1.000713
              3  |   2.000201   .0002449  8168.77   0.000     1.999721    2.000681
                 |
           _cons |  -.0001443   .0001731    -0.83   0.404    -.0004837     .000195
    ------------------------------------------------------------------------------
    Last edited by Gabor Mugge; 28 Mar 2024, 12:57.

  • #2
    In the first three commands you do not list category 1, so it is left out of the model. Stata then treats the first listed category (2) as the "base" or reference category, so it is omitted as well. The result is that categories 1 and 2 are both excluded and together form the reference group. Your first three models are equivalent to this:

    Code:
    gen cat3 = cat == 3
    reg y cat3
    Code:
    . reg y cat3
    
          Source |       SS           df       MS      Number of obs   = 1,000,000
    -------------+----------------------------------   F(1, 999998)    >  99999.00
           Model |  500055.387         1  500055.387   Prob > F        =    0.0000
        Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
    -------------+----------------------------------   Adj R-squared   =    0.7389
           Total |  676792.563   999,999  .676793239   Root MSE        =     .4204
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            cat3 |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
           _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
    ------------------------------------------------------------------------------
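    The estimates above can be checked against simple group means. This is only a sketch, assuming the simulated data from post #1 (variables y and cat) are still in memory; the scalar names m12 and m3 are made up for illustration. With a single dummy regressor, _cons is the mean of y in the pooled reference group (categories 1 and 2, roughly (0 + 1)/2 = 0.5 by construction), and the coefficient on the dummy is the mean of y in category 3 minus that pooled mean (roughly 2 - 0.5 = 1.5).

    ```stata
    * group means of y by category: about 0, 1 and 2 by construction
    tabstat y, by(cat) statistics(mean)

    * mean of y in the pooled reference group (categories 1 and 2)
    summarize y if inlist(cat, 1, 2), meanonly
    scalar m12 = r(mean)

    * mean of y in category 3
    summarize y if cat == 3, meanonly
    scalar m3 = r(mean)

    * compare these two numbers with _cons and the cat3 coefficient
    display "pooled reference mean: " m12
    display "category 3 minus pooled reference mean: " m3 - m12
    ```

    Up to rounding, the two displayed values should reproduce the _cons and cat3 estimates in the table above.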
    By specifying bn ("base not" or "no base") you instruct Stata to include the virtual dummy variables for categories 2 and 3 without treating either of them as an excluded base category. Category 1 is then the only reference group. So your last model is equivalent to this:

    Code:
    gen cat2 = cat == 2
    * drop cat3 first in case it was already created by the earlier block
    capture drop cat3
    gen cat3 = cat == 3
    reg y cat2 cat3
    Code:
    . reg y cat2 cat3
    
          Source |       SS           df       MS      Number of obs   = 1,000,000
    -------------+----------------------------------   F(2, 999997)    >  99999.00
           Model |  666799.899         2  333399.949   Prob > F        =    0.0000
        Residual |  9992.66366   999,997  .009992694   R-squared       =    0.9852
    -------------+----------------------------------   Adj R-squared   =    0.9852
           Total |  676792.563   999,999  .676793239   Root MSE        =    .09996
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            cat2 |   1.000233   .0002449  4084.93   0.000     .9997533    1.000713
            cat3 |   2.000201   .0002449  8168.77   0.000     1.999721    2.000681
           _cons |  -.0001443   .0001731    -0.83   0.404    -.0004837     .000195
    ------------------------------------------------------------------------------
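    The same model can also be requested with the usual factor-variable notation. As a sketch, assuming the data from post #1 are in memory: i.cat uses the lowest level (1) as the base by default, and ib1.cat names that base explicitly, so both should give the same coefficients on categories 2 and 3 as the 2bn.cat 3bn.cat model. The baselevels option is added here only so that the omitted base level is shown in the coefficient table.

    ```stata
    * default base level is the lowest value of cat, here 1
    regress y i.cat

    * same model with the base level named explicitly and displayed
    regress y ib1.cat, baselevels
    ```

    Naming the base explicitly avoids relying on which listed level Stata picks as the reference.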
    To get a model using bn that is equivalent to the first three, you can simply include the third category in the model as a virtual dummy variable without any excluded base (the excluded categories are just the 0 values on the dummy).

    Code:
    reg y 3bn.cat
    Code:
    . reg y 3bn.cat
    
          Source |       SS           df       MS      Number of obs   = 1,000,000
    -------------+----------------------------------   F(1, 999998)    >  99999.00
           Model |  500055.387         1  500055.387   Prob > F        =    0.0000
        Residual |  176737.175   999,998  .176737529   R-squared       =    0.7389
    -------------+----------------------------------   Adj R-squared   =    0.7389
           Total |  676792.563   999,999  .676793239   Root MSE        =     .4204
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           3.cat |   1.500083   .0008918  1682.07   0.000     1.498336    1.501831
           _cons |    .499973   .0005149   971.04   0.000     .4989639    .5009822
    ------------------------------------------------------------------------------
    Last edited by Daniel Schaefer; 28 Mar 2024, 15:07.
