
  • Important Bug Using Differencing and Interactions

    I'm using Stata 17 and, based on a post about testing for serial correlation in panel data after differencing, I think I've discovered an important bug in using the differencing operator. First, I know that factor notation is not allowed with differencing. Can someone from Stata explain why? There is no reason to exclude that, and I suspect this is partly the source of the misunderstanding some people have about whether it is okay to difference dummy variables in an equation. (Answer: Yes, because differencing with panel data is often done for estimating an equation that starts in levels.) It would be a big improvement in Stata 18 to simply difference anything that appears in D.(), whether it is an interaction of continuous variables, discrete variables, or combinations. And something like i.year should be allowed, too.
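
    In the meantime, a minimal sketch of the manual route for the year dummies (the yr* names are placeholders created by tabulate, and this assumes the year variable in airfare.dta, so yr2-yr4 would correspond to 1998-2000): create the dummies explicitly and let D.() difference them, which is exactly what the first-differenced levels equation calls for.

    Code:
    tabulate year, generate(yr)
    reg D.(lfare concen yr2 yr3 yr4), nocons vce(cluster id)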

    But not allowing factor notation is not the same as a bug. A real bug is that Stata drops interaction terms among continuous variables when using differencing if one of the variables doesn't change across time. Here's my Stata output, using airfare.dta that comes with my MIT Press book:

    Code:
    . sum ldist
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
           ldist |      4,596    6.696482    .6593177   4.553877   7.909857
    
    . gen ldist_dm = ldist - r(mean)
    
    . xtreg lfare concen c.concen#c.ldist_dm y98 y99 y00, fe vce(cluster id)
    
    Fixed-effects (within) regression               Number of obs     =      4,596
    Group variable: id                              Number of groups  =      1,149
    
    R-squared:                                      Obs per group:
         Within  = 0.1429                                         min =          4
         Between = 0.3048                                         avg =        4.0
         Overall = 0.2411                                         max =          4
    
                                                    F(5,1148)         =     104.09
    corr(u_i, Xb) = -0.6841                         Prob > F          =     0.0000
    
                                            (Std. err. adjusted for 1,149 clusters in id)
    -------------------------------------------------------------------------------------
                        |               Robust
                  lfare | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    --------------------+----------------------------------------------------------------
                 concen |   .1661329   .0484029     3.43   0.001     .0711647     .261101
                        |
    c.concen#c.ldist_dm |  -.2498619   .0828545    -3.02   0.003    -.4124252   -.0872987
                        |
                    y98 |   .0230874   .0041459     5.57   0.000      .014953    .0312218
                    y99 |   .0355923   .0051452     6.92   0.000     .0254972    .0456874
                    y00 |   .0975745   .0054655    17.85   0.000     .0868511    .1082979
                  _cons |    4.93797   .0317998   155.28   0.000     4.875578    5.000362
    --------------------+----------------------------------------------------------------
                sigma_u |  .50598297
                sigma_e |  .10605257
                    rho |  .95791776   (fraction of variance due to u_i)
    -------------------------------------------------------------------------------------
    
    . reg D.(lfare concen c.concen#c.ldist_dm y98 y99 y00), nocons vce(cluster id)
    note: cD.concen#cD.ldist_dm omitted because of collinearity.
    
    Linear regression                               Number of obs     =      3,447
                                                    F(4, 1148)        =     118.18
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0952
                                                    Root MSE          =     .12508
    
                                              (Std. err. adjusted for 1,149 clusters in id)
    ---------------------------------------------------------------------------------------
                          |               Robust
                  D.lfare | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ----------------------+----------------------------------------------------------------
                   concen |
                      D1. |   .1759764   .0430367     4.09   0.000     .0915371    .2604158
                          |
    cD.concen#cD.ldist_dm |          0  (omitted)
                          |
                      y98 |
                      D1. |   .0227692   .0041573     5.48   0.000     .0146124     .030926
                          |
                      y99 |
                      D1. |   .0364365    .005153     7.07   0.000      .026326    .0465469
                          |
                      y00 |
                      D1. |   .0978497   .0055468    17.64   0.000     .0869666    .1087328
    ---------------------------------------------------------------------------------------
    Note that fixed effects has no trouble with c.concen#c.ldist_dm, but differencing drops this term. The mistake stems from redefining the difference of the interaction as the interaction of the differences. So what should appear is the interaction between D.concen and ldist_dm, but Stata changes it to cD.concen#cD.ldist_dm. Why is Stata doing this? The variable ldist_dm doesn't change across time but concen does, and so I can easily include their interaction in the levels equation. I know how to fix this by using D.() differently, but it shouldn't need "fixing" because there's nothing wrong with the differencing command that I did use. Stata should not be changing my model. For the same reason, Stata should allow things like i.x1#c.x2 and simply difference this term, rather than differencing each component and then forming the interaction. It shouldn't matter whether one of x1 and x2 changes across time.
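
    For concreteness, here is a minimal sketch of one workaround (not the only one, and the generated variable name is just a placeholder): build the levels interaction as an ordinary variable, so that D.() differences the term itself instead of rewriting it as cD.concen#cD.ldist_dm.

    Code:
    * construct the levels interaction explicitly, then difference it
    gen concenXldist = concen*ldist_dm
    reg D.(lfare concen concenXldist y98 y99 y00), nocons vce(cluster id)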

  • #2
    Jeff and Sebastian have highlighted behavior of factor-variable notation and difference operators that I would like to expand upon. Part of the discussion is here, and part can also be found at https://www.statalist.org/forums/for...rrelation-test.

    Stata does not allow 'D.' (or 'S.') on factor variables:

    The time-series operators are commutative, meaning that the specified order of the operators does not matter. For example,

    FFD.cvar
    FDF.cvar
    DFF.cvar

    all mean the same thing, and Stata automatically reformats the operators into a standardized form:

    F2D.cvar
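
    A quick way to see this commutativity in practice (a sketch; cvar stands for any continuous variable in a tsset or xtset dataset):

    Code:
    * all three operator orderings produce the same series
    gen double v1 = FFD.cvar
    gen double v2 = FDF.cvar
    gen double v3 = DFF.cvar
    assert v1 == v2 & v1 == v3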

    When factor-variable notation was introduced in Stata 11, this commutativity requirement was enforced when mixing it with
    time-series operators. For example, the following yield the same expanded indicator variables:

    iF.fvar
    Fi.fvar

    and similarly

    iL.fvar
    Li.fvar

    Applying the lag/lead operator on the indicators for the levels of fvar yields the same thing as generating indicator variables for the lag/lead
    of fvar.
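
    One way to verify this (a sketch; fvar is a placeholder factor variable in an xtset panel) is to expand both spellings with fvexpand and compare the returned lists:

    Code:
    fvexpand iL.fvar
    display "`r(varlist)'"
    fvexpand Li.fvar
    display "`r(varlist)'"    // the same standardized list of lagged indicators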

    With the 'D.' and 'S.' operators, this property cannot hold for two reasons.

    1. There is the problem of possible negative values in

    D.fvar
    S.fvar

    Negative values are not allowed with 'i.'. Stata cannot support negative values in factor variables because the expanded list of
    indicator variables for i.fvar

    0.fvar
    1.fvar
    2.fvar
    3.fvar

    are valid variables you can put in Stata expressions, such as

    gen mpg_minus_1rep78 = mpg - 1.rep78

    Suppose negative values were allowed. Then

    -1.rep78

    would be ambiguous, because it might mean (1) the negative of 1.rep78, or (2) the indicator for when rep78 takes on the value -1. Since
    negative values are not allowed, the meaning is unambiguously (1).
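
    A small illustration of meaning (1), using the auto dataset (the generated variable name is arbitrary):

    Code:
    sysuse auto, clear
    gen neg_ind = -1.rep78    // the negative of the indicator 1.rep78
    tab neg_ind               // only the values -1 and 0 appear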

    2. Let's set aside negative values and assume

    D.fvar

    only takes on positive values. Without parentheses, one might argue that

    iD.fvar
    Di.fvar

    should mean the same thing. But which meaning should it take? There are two choices:

    1) apply the difference operator to fvar, then expand the levels of the result
    2) expand fvar into its individual indicator variables, then apply the difference operator to each indicator variable

    These do not yield the same thing. Thus 'i.' cannot be commutative with 'D.' (or 'S.' for similar reasons).
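
    A sketch of why the two readings differ (fvar is a placeholder factor variable in an xtset panel; the generated names are arbitrary):

    Code:
    * (1) difference first, then expand the levels of the differenced variable
    gen Dfvar = D.fvar
    tab Dfvar, gen(dlev)          // 0/1 indicators for each observed value of D.fvar

    * (2) expand fvar into indicators first, then difference each indicator
    tab fvar, gen(lev)
    foreach v of varlist lev* {
        gen D_`v' = D.`v'         // each difference takes on the values -1, 0, and 1
    }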

    Even with parentheses, where we try to enforce an order of operations, the following have the same meaning

    i.(L.fvar)
    L.(i.fvar)

    whereas the following could not have the same meaning, even if they were allowed

    i.(D.fvar)
    D.(i.fvar)
    ----

    Why does 'D.' distribute to each continuous variable in a parenthesis-bound interaction?

    D.(x1 x2 c.x1#c.x2) ==> D.x1 D.x2 cD.x1#cD.x2

    Time series operators (and factor variable operators) operate on variables, not on terms. Parentheses are a notational convenience that
    provides a shortcut for distributing these variable operators across a variable list.
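
    In other words (a sketch with placeholder names y, x1, x2 in an xtset panel), the parenthesized shorthand and the spelled-out list specify the same model:

    Code:
    reg D.y D.(x1 x2 c.x1#c.x2)
    reg D.y D.x1 D.x2 cD.x1#cD.x2    // identical expanded terms, identical estimates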

    We currently do not have a notation where 'D.' can be applied to an interaction term, as if it were a variable unto itself. We can see the
    need and convenience of such a thing; however, we have yet to develop a syntax/mechanism to support it.






    • #3
      Originally posted by Enrique Pinzon (StataCorp):
      We currently do not have a notation where 'D.' can be applied to an interaction term, as if it were a variable unto itself. We can see the
      need and convenience of such a thing; however, we have yet to develop a syntax/mechanism to support it.
      Thank you for these detailed explanations, Enrique. As mentioned elsewhere, I agree with Jeff that applying the differencing operator to factor variables is a desirable feature. Jeff's example of a panel data regression in first differences is a prime example. I also understand that this clashes with the syntactical logic you have set out above.

      My suggestion would be to introduce an additional syntax that allows for the use of operators in the way Jeff has proposed. How about using, say, curly brackets instead of parentheses:
      D.{x1 x2 c.x1#c.x2} ==> D.x1 D.x2 D.(c.x1#c.x2), where the last term is the first difference of the interaction itself rather than cD.x1#cD.x2

      This syntactical feature could be implemented more generally: The curly brackets in {fvarlist} could be a shortcut that prompts Stata to replace fvarlist with a list of temporary variables enclosed in parentheses, (tempvarlist), irrespective of whether it is used in combination with any operators. This would achieve what Jeff has in mind. Given that this would be a new feature, there would be no need for commutativity. D.{i.fvar} would be well defined, while i.{D.fvar} could result in an error message if D.fvar contains negative values; no problem at all.

      If for some reason curly brackets cannot be used, I am sure there are alternatives.
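
      One alternative already works today, at least approximately (a sketch with placeholder names y, x1, x2, id; this is not the proposed syntax): fvrevar replaces each term, including the interaction, with a temporary variable, and D.() then differences those variables as terms.

      Code:
      fvrevar x1 x2 c.x1#c.x2
      reg D.y D.(`r(varlist)'), nocons vce(cluster id)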



      • #4
        Thanks Enrique for the explanation. I like Sebastian's suggestion as a new feature.

        It does seem to me that, since D.i.fvar is always well defined and i.D.fvar is not, the natural default would be the former when one types D.i.fvar, or something like D.(y i.fvar).

        And if D. cannot be applied to an interaction as if the interaction were a variable unto itself -- which is the leading case when using D.() in a panel data context -- then that should probably generate an error message, because the user is likely not getting what they want. Moreover, they might not notice they're getting something they don't want. I only noticed it in my example because ldist doesn't change over time, and so the interaction of the differences dropped out. That made me look more closely at the output.

