Difference in specifying base level when including (#) vs. (##) interactions

Krista Lane

Join Date: Jun 2014

Posts: 81
#1

27 Jun 2022, 22:45

I'm estimating a model in which several variables are interacted many times. To spot mistakes more easily and to avoid seeing the repeated variables that Stata omits, I would like to write out "a b a#b" instead of "a##b". However, I am having trouble setting base levels consistently.

In the following example code (not the actual regression I'm running), the second regression in both attempt 1 and attempt 2 omits race==3 instead of race==1. The only way I could get the ## and # regressions to match was to recode my preferred base level to be larger than all other values of race (attempt 3).

I imagine I'm missing something obvious. Thank you!

clear
input float(y educ race age prg gdr)
0 1 1 2 1 0
0 1 2 3.2 1 0
0 3 2 7 1 0
0 2 2 6 0 0
1 2 2 1 0 1
1 2 2 45 0 1
1 1 2 2 0 0
1 1 2 1 0 0
0 3 2 3 1 0
0 3 2 2 1 1
0 1 1 1 0 1
0 2 1 43 0 1
1 2 1 2 0 0
1 2 1 1 1 0
1 3 1 3 0 0
1 1 1 2 0 1
0 1 3 1 0 1
0 2 3 43 0 1
1 2 3 2 1 0
1 2 3 1 0 0
1 3 3 3 1 0
1 1 3 2 0 1
1 1 1 2 0 1
0 1 3 1 0 1
end

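* attempt 1: set the base level globally with fvset
fvset base 1 race
reg y race##i1.prg
reg y i.race prg race#i1.prg

* attempt 2: set the base level inline with ib1.
reg y ib1.race##i1.prg
reg y ib1.race prg ib1.race#i1.prg

* attempt 3: recode the preferred base so it is the largest value of race
replace race = 6 if race==1
fvset base 6 race
reg y race##i1.prg
reg y i.race prg race#i1.prg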

Fei Wang

Join Date: Oct 2021
Posts: 726

28 Jun 2022, 00:27

Krista, I'm not sure why you prefer #, but personally I prefer ## to #. As shown below, both the code and the output are concise and well organized.

Code:

reg y b1.race##prg

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .1333333   .2794395     0.48   0.639    -.4537472    .7204139
          3  |  -.2666667   .2794395    -0.95   0.353    -.8537472    .3204139
             |
       1.prg |  -.1666667   .3767961    -0.44   0.664    -.9582859    .6249526
             |
    race#prg |
        2 1  |  -.6333333   .4876563    -1.30   0.210    -1.657861    .3911945
        3 1  |   .7666667   .5394899     1.42   0.172    -.3667596    1.900093
             |
       _cons |   .6666667   .1883981     3.54   0.002      .270857    1.062476
------------------------------------------------------------------------------

If you'd like to write the three terms separately, you need to tell Stata the type of "race" and "prg" consistently everywhere they appear. In the first command below, "prg" on its own is treated as a continuous variable, while in "race#prg" it is treated as a factor variable. As the output shows, the interaction then carries all three levels of "race", and Stata omits the last one (race==3) as collinear with the other terms, which is why race==3 rather than race==1 is dropped. One solution (the second command) is to add "i." to the standalone "prg" so that it is a factor variable throughout. Another solution (the third command) is to add "c." to "prg" inside "race#prg" so that it is a continuous variable throughout. The "c." solution is only technically correct, and it reproduces the same estimates only because "prg" is a 0/1 indicator; conceptually, the "i." solution is preferred because "prg" is a discrete factor variable.

Code:

reg y i.race prg b1.race#prg
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .1333333   .2794395     0.48   0.639    -.4537472    .7204139
          3  |  -.2666667   .2794395    -0.95   0.353    -.8537472    .3204139
             |
         prg |         .6   .3861011     1.55   0.138    -.2111684    1.411168
             |
    race#prg |
        1 1  |  -.7666667   .5394899    -1.42   0.172    -1.900093    .3667596
        2 1  |       -1.4   .4948812    -2.83   0.011    -2.439707   -.3602932
        3 1  |          0  (omitted)
             |
       _cons |   .6666667   .1883981     3.54   0.002      .270857    1.062476
------------------------------------------------------------------------------

reg y i.race i.prg b1.race#prg
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .1333333   .2794395     0.48   0.639    -.4537472    .7204139
          3  |  -.2666667   .2794395    -0.95   0.353    -.8537472    .3204139
             |
       1.prg |  -.1666667   .3767961    -0.44   0.664    -.9582859    .6249526
             |
    race#prg |
        2 1  |  -.6333333   .4876563    -1.30   0.210    -1.657861    .3911945
        3 1  |   .7666667   .5394899     1.42   0.172    -.3667596    1.900093
             |
       _cons |   .6666667   .1883981     3.54   0.002      .270857    1.062476
------------------------------------------------------------------------------

reg y i.race prg b1.race#c.prg
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   .1333333   .2794395     0.48   0.639    -.4537472    .7204139
          3  |  -.2666667   .2794395    -0.95   0.353    -.8537472    .3204139
             |
         prg |  -.1666667   .3767961    -0.44   0.664    -.9582859    .6249526
             |
  race#c.prg |
          2  |  -.6333333   .4876563    -1.30   0.210    -1.657861    .3911945
          3  |   .7666667   .5394899     1.42   0.172    -.3667596    1.900093
             |
       _cons |   .6666667   .1883981     3.54   0.002      .270857    1.062476
------------------------------------------------------------------------------
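As a rough check (a sketch of mine, not part of the output above), the mixed specification appears to be the same model under a different parameterization, so the overall fit should match even though the coefficient labels differ. This assumes the example data from post #1 are in memory as entered, before the recode in attempt 3; the scalar names r2_full and b_int are just illustrative.

```stata
* Reference fit: full factorial with race==1 as the base
quietly regress y ib1.race##i.prg
scalar r2_full = e(r2)
scalar b_int   = _b[3.race#1.prg]

* Mixed treatment: prg continuous on its own, factor inside the interaction
quietly regress y i.race prg b1.race#prg
display "R-squared, relative difference: " reldif(e(r2), r2_full)

* Consistent treatment: prg is a factor everywhere
quietly regress y i.race i.prg b1.race#prg
display "3.race#1.prg: " _b[3.race#1.prg] "  reference: " b_int
```

If the relative difference is essentially zero, only the omitted term has moved, not the fitted model.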

You can also set the base level with -fvset base- at the beginning, but the regression commands still need to treat "prg" consistently, as in either of the following, for the base setting to carry through.

Code:

reg y i.race i.prg race#prg
reg y i.race prg race#c.prg
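To see which base is actually in force, something like the following may help. It is only a sketch: it assumes the example data from post #1 are in memory as entered, and it uses base 3 purely so that the setting is visibly different from Stata's default (the lowest level).

```stata
* Set a non-default base so the effect of fvset is easy to spot
fvset base 3 race

* List the factor-variable settings currently stored for race
fvset report race

* allbaselevels displays base levels in main effects and interactions
regress y i.race i.prg race#prg, allbaselevels

* Restore the default settings for race
fvset clear race
```

The allbaselevels display option makes it easier to confirm that the main effect and the interaction agree on the base.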


Krista Lane

Join Date: Jun 2014

Posts: 81
#3

28 Jun 2022, 09:31

Thank you! This is very helpful. I agree that ## is better organized in this case. A better example of why I prefer # in my case is the comparison between

reg y race##c.age gdr##c.age
reg y i.race i.gdr age race#c.age gdr#c.age

in which the first regression reports the second occurrence of "age" as omitted.
Fei Wang

Join Date: Oct 2021

Posts: 726
#4

28 Jun 2022, 10:42

I see your point now. In this case, you can use the code below, which lists "age" only once and so avoids the omitted duplicate.

Code:

reg y (race gdr)##c.age

You may refer to "help fvvarlist" for more flexible syntax.
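If you want to convince yourself that the parenthesized form matches the written-out version from #3, a quick comparison along these lines should do. This is a sketch assuming the example data from post #1 are in memory; the scalar name r2_long is just illustrative.

```stata
* Written-out specification from post #3
quietly regress y i.race i.gdr age race#c.age gdr#c.age
scalar r2_long = e(r2)

* Parenthesized shorthand; age should appear once in the output
regress y (race gdr)##c.age
display "R-squared, relative difference: " reldif(e(r2), r2_long)
```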