
  • Important Bug Using Differencing and Interactions

    I'm using Stata 17 and, based on a post about testing for serial correlation in panel data after differencing, I think I've discovered an important bug in using the differencing operator. First, I know that factor notation is not allowed with differencing. Can someone from Stata explain why? There is no reason to exclude that, and I suspect this is partly the source of the misunderstanding some people have about whether it is okay to difference dummy variables in an equation. (Answer: Yes, because differencing with panel data is often done for estimating an equation that starts in levels.) It would be a big improvement in Stata 18 to simply difference anything that appears in D.(), whether it is an interaction of continuous variables, discrete variables, or combinations. And something like i.year should be allowed, too.
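
    In the meantime, a minimal sketch of the manual route for the year dummies (the yr* names are placeholders created by tabulate, and this assumes the year variable in airfare.dta, so yr2-yr4 would correspond to 1998-2000): create the dummies explicitly and let D.() difference them, which is exactly what the first-differenced levels equation calls for.

    Code:
    tabulate year, generate(yr)
    reg D.(lfare concen yr2 yr3 yr4), nocons vce(cluster id)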

    But not allowing factor notation is not the same as a bug. A real bug is that Stata drops interaction terms among continuous variables when using differencing if one of the variables doesn't change across time. Here's my Stata output, using airfare.dta that comes with my MIT Press book:

    Code:
    . sum ldist
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
           ldist |      4,596    6.696482    .6593177   4.553877   7.909857
    
    . gen ldist_dm = ldist - r(mean)
    
    . xtreg lfare concen c.concen#c.ldist_dm y98 y99 y00, fe vce(cluster id)
    
    Fixed-effects (within) regression               Number of obs     =      4,596
    Group variable: id                              Number of groups  =      1,149
    
    R-squared:                                      Obs per group:
         Within  = 0.1429                                         min =          4
         Between = 0.3048                                         avg =        4.0
         Overall = 0.2411                                         max =          4
    
                                                    F(5,1148)         =     104.09
    corr(u_i, Xb) = -0.6841                         Prob > F          =     0.0000
    
                                            (Std. err. adjusted for 1,149 clusters in id)
    -------------------------------------------------------------------------------------
                        |               Robust
                  lfare | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    --------------------+----------------------------------------------------------------
                 concen |   .1661329   .0484029     3.43   0.001     .0711647     .261101
                        |
    c.concen#c.ldist_dm |  -.2498619   .0828545    -3.02   0.003    -.4124252   -.0872987
                        |
                    y98 |   .0230874   .0041459     5.57   0.000      .014953    .0312218
                    y99 |   .0355923   .0051452     6.92   0.000     .0254972    .0456874
                    y00 |   .0975745   .0054655    17.85   0.000     .0868511    .1082979
                  _cons |    4.93797   .0317998   155.28   0.000     4.875578    5.000362
    --------------------+----------------------------------------------------------------
                sigma_u |  .50598297
                sigma_e |  .10605257
                    rho |  .95791776   (fraction of variance due to u_i)
    -------------------------------------------------------------------------------------
    
    . reg D.(lfare concen c.concen#c.ldist_dm y98 y99 y00), nocons vce(cluster id)
    note: cD.concen#cD.ldist_dm omitted because of collinearity.
    
    Linear regression                               Number of obs     =      3,447
                                                    F(4, 1148)        =     118.18
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0952
                                                    Root MSE          =     .12508
    
                                              (Std. err. adjusted for 1,149 clusters in id)
    ---------------------------------------------------------------------------------------
                          |               Robust
                  D.lfare | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ----------------------+----------------------------------------------------------------
                   concen |
                      D1. |   .1759764   .0430367     4.09   0.000     .0915371    .2604158
                          |
    cD.concen#cD.ldist_dm |          0  (omitted)
                          |
                      y98 |
                      D1. |   .0227692   .0041573     5.48   0.000     .0146124     .030926
                          |
                      y99 |
                      D1. |   .0364365    .005153     7.07   0.000      .026326    .0465469
                          |
                      y00 |
                      D1. |   .0978497   .0055468    17.64   0.000     .0869666    .1087328
    ---------------------------------------------------------------------------------------
    Note that fixed effects has no trouble with c.concen#c.ldist_dm, but differencing drops this term. The mistake stems from redefining the difference of the interaction as the interaction of the differences. So what should appear is the interaction between D.concen and ldist_dm, but Stata changes it to cD.concen#cD.ldist_dm. Why is Stata doing this? The variable ldist_dm doesn't change across time but concen does, and so I can easily include their interaction in the levels equation. I know how to fix this by using D.() differently, but it shouldn't need "fixing" because there's nothing wrong with the differencing command that I did use. Stata should not be changing my model. For the same reason, Stata should allow things like i.x1#c.x2 and simply difference this term, rather than differencing each component and then forming the interaction. It shouldn't matter whether one of x1 and x2 changes across time.
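
    For concreteness, here is a minimal sketch of one workaround (not the only one, and the generated variable name is just a placeholder): build the levels interaction as an ordinary variable, so that D.() differences the term itself instead of rewriting it as cD.concen#cD.ldist_dm.

    Code:
    * construct the levels interaction explicitly, then difference it
    gen concenXldist = concen*ldist_dm
    reg D.(lfare concen concenXldist y98 y99 y00), nocons vce(cluster id)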

  • #2
    Jeff and Sebastian have highlighted behavior of factor-variable notation and difference operators that I would like to expand upon. Part of the discussion is here, and part can also be found at https://www.statalist.org/forums/for...rrelation-test.

    Stata does not allow 'D.' (or 'S.') on factor variables:

    The time-series operators are commutative, meaning that the specified order of the operators does not matter. For example,

    FFD.cvar
    FDF.cvar
    DFF.cvar

    all mean the same thing, and Stata automatically reformats the operators into a standardized form:

    F2D.cvar
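
    A quick way to see this commutativity in practice (a sketch; cvar stands for any continuous variable in a tsset or xtset dataset):

    Code:
    * all three operator orderings produce the same series
    gen double v1 = FFD.cvar
    gen double v2 = FDF.cvar
    gen double v3 = DFF.cvar
    assert v1 == v2 & v1 == v3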

    When factor-variable notation was introduced in Stata 11, this commutativity requirement was enforced when mixing it with
    time-series operators. For example, the following yield the same expanded indicator variables:

    iF.fvar
    Fi.fvar

    and similarly

    iL.fvar
    Li.fvar

    Applying the lag/lead operator on the indicators for the levels of fvar yields the same thing as generating indicator variables for the lag/lead
    of fvar.
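
    One way to verify this (a sketch; fvar is a placeholder factor variable in an xtset panel) is to expand both spellings with fvexpand and compare the returned lists:

    Code:
    fvexpand iL.fvar
    display "`r(varlist)'"
    fvexpand Li.fvar
    display "`r(varlist)'"    // the same standardized list of lagged indicators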

    With the 'D.' and 'S.' operators, this property cannot hold for two reasons.

    1. There is the problem of possible negative values in

    D.fvar
    S.fvar

    Negative values are not allowed with 'i.'. Stata cannot support negative values in factor variables because the expanded list of
    indicator variables for i.fvar

    0.fvar
    1.fvar
    2.fvar
    3.fvar

    are valid variables you can put in Stata expressions, such as

    gen mpg_minus_1rep78 = mpg - 1.rep78

    Suppose negative values were allowed. Then

    -1.rep78

    would be ambiguous, because it might mean (1) the negative of 1.rep78, or (2) the indicator for when rep78 takes on the value -1. Since
    negative values are not allowed, the meaning is unambiguously (1).
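
    A small illustration of meaning (1), using the auto dataset (the generated variable name is arbitrary):

    Code:
    sysuse auto, clear
    gen neg_ind = -1.rep78    // the negative of the indicator 1.rep78
    tab neg_ind               // only the values -1 and 0 appear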

    2. Let's set aside negative values and assume

    D.fvar

    only takes on positive values. Without parentheses, one might argue that

    iD.fvar
    Di.fvar

    should mean the same thing. But which meaning should it take? There are two choices:

    1) apply the difference operator to fvar, then expand the levels of the result
    2) expand fvar into its individual indicator variables, then apply the difference operator to each indicator variable

    These do not yield the same thing. Thus 'i.' cannot be commutative with 'D.' (or 'S.' for similar reasons).
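
    A sketch of why the two readings differ (fvar is a placeholder factor variable in an xtset panel; the generated names are arbitrary):

    Code:
    * (1) difference first, then expand the levels of the differenced variable
    gen Dfvar = D.fvar
    tab Dfvar, gen(dlev)          // 0/1 indicators for each observed value of D.fvar

    * (2) expand fvar into indicators first, then difference each indicator
    tab fvar, gen(lev)
    foreach v of varlist lev* {
        gen D_`v' = D.`v'         // each difference takes on the values -1, 0, and 1
    }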

    Even with parentheses, where we try to enforce an order of operations, the following have the same meaning

    i.(L.fvar)
    L.(i.fvar)

    whereas the following could not have the same meaning, even if they were allowed

    i.(D.fvar)
    D.(i.fvar)
    ----

    Why does 'D.' distribute to each continuous variable in a parenthesis-bound interaction?

    D.(x1 x2 c.x1#c.x2) ==> D.x1 D.x2 cD.x1#cD.x2

    Time series operators (and factor variable operators) operate on variables, not on terms. Parentheses are a notational convenience that
    provides a shortcut for distributing these variable operators across a variable list.
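
    In other words (a sketch with placeholder names y, x1, x2 in an xtset panel), the parenthesized shorthand and the spelled-out list specify the same model:

    Code:
    reg D.y D.(x1 x2 c.x1#c.x2)
    reg D.y D.x1 D.x2 cD.x1#cD.x2    // identical expanded terms, identical estimates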

    We currently do not have a notation where 'D.' can be applied to an interaction term, as if it were a variable unto itself. We can see the
    need and convenience of such a thing; however, we have yet to develop a syntax/mechanism to support it.






    • #3
      Originally posted by Enrique Pinzon (StataCorp):
      We currently do not have a notation where 'D.' can be applied to an interaction term, as if it were a variable unto itself. We can see the
      need and convenience of such a thing; however, we have yet to develop a syntax/mechanism to support it.
      Thank you for these detailed explanations, Enrique. As mentioned elsewhere, I agree with Jeff that applying the differencing operator to factor variables is a desirable feature. Jeff's example of a panel data regression in first differences is a prime example. I also understand that this clashes with the syntactical logic you have set out above.

      My suggestion would be to introduce an additional syntax that allows for the use of operators in the way Jeff has proposed. How about using, say, curly brackets instead of parentheses:
      D.{x1 x2 c.x1#c.x2} ==> D.x1 D.x2 D.(c.x1#c.x2), where the last term is the first difference of the interaction itself rather than cD.x1#cD.x2

      This syntactical feature could be implemented more generally: The curly brackets in {fvarlist} could be a shortcut that prompts Stata to replace fvarlist with a list of temporary variables enclosed in parentheses, (tempvarlist), irrespective of whether it is used in combination with any operators. This would achieve what Jeff has in mind. Given that this would be a new feature, there would be no need for commutativity. D.{i.fvar} would be well defined, while i.{D.fvar} could result in an error message if D.fvar contains negative values; no problem at all.

      If for some reason curly brackets cannot be used, I am sure there are alternatives.
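
      One alternative already works today, at least approximately (a sketch with placeholder names y, x1, x2, id; this is not the proposed syntax): fvrevar replaces each term, including the interaction, with a temporary variable, and D.() then differences those variables as terms.

      Code:
      fvrevar x1 x2 c.x1#c.x2
      reg D.y D.(`r(varlist)'), nocons vce(cluster id)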



      • #4
        Thanks Enrique for the explanation. I like Sebastian's suggestion as a new feature.

        It does seem to me that, since D.i.fvar is always well defined and i.D.fvar is not, the natural default would be the former when one types D.i.fvar, or something like D.(y i.fvar).

        And if D. cannot be applied to an interaction as if the interaction were a variable unto itself -- which is the leading case when using D.() in a panel data context -- then that should probably generate an error message, because the user is likely not getting what they want. Moreover, they might not notice they're getting something they don't want. I only noticed it in my example because ldist doesn't change over time, and so the interaction of the differences dropped out. That made me look more closely at the output.

