
  • Help with regression where interaction term is omitted by Stata

    Hello everyone,

    I am running a regression to analyze the impact of election years and landslide elections on disclosure size, which is the log of website size. My model includes an interaction term between electionyear and landslide, but this interaction term is omitted due to collinearity. Here is a data sample:

    (Note on variables - ein is a unique identifier for each organization, logdisclsize is the log of website size, org is the type of organization, electionyear is a binary indicator for whether there was an election, landslide is a binary indicator for whether there was a landslide election, size and the winsorized variables are controls)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(logdisclsize electionyear landslide size wins_ExecComp wins_leverage wins_ContribReliance) long ein float year str16 state str4 org
     10.84887 0 0  16.34981  .06247739  .01618405  .7696673 10248780 2011 "ME" "ENV"
    10.694057 0 0 16.353073  .06970697 .011605377  .6577324 10248780 2013 "ME" "ENV"
    10.795568 1 0 16.415674  .08384993  .06066541  .7097442 10248780 2014 "ME" "ENV"
    10.870756 0 0 16.435535  .07682364  .05358697 .59489995 10248780 2015 "ME" "ENV"
     11.00408 0 0 16.493631  .10753528  .03700265  .7087289 10248780 2016 "ME" "ENV"
    10.983002 1 0   16.5836  .07915805 .037205808 .51900214 10248780 2018 "ME" "ENV"
    11.102217 0 0 16.619827 .073426425 .031817265  .6083745 10248780 2019 "ME" "ENV"
     10.12475 0 0 15.689817  .09311032  .02294062  .8391001 10270690 2013 "ME" "ENV"
    11.652548 0 0  15.76943  .09187462 .025827337  .8749067 10270690 2015 "ME" "ENV"
      11.8213 0 0 15.754642  .09122185 .035800748  .8071181 10270690 2016 "ME" "ENV"
    12.106854 0 0  15.84974  .08144714  .03633184  .9579841 10270690 2017 "ME" "ENV"
    12.061775 1 0  16.01647  .07850363 .031095315  .9270397 10270690 2018 "ME" "ENV"
    12.056853 0 0 16.306654 .073099725  .02553698   .931366 10270690 2019 "ME" "ENV"
    10.518646 0 0  15.52521 .031852446  .08306593  .7954195 10317679 2012 "ME" "REPR"
     10.63246 0 0 15.641062  .03481711   .1558381  .7699555 10317679 2013 "ME" "REPR"
    10.839287 1 0 15.705325 .024914693    .180689  .6935283 10317679 2014 "ME" "REPR"
     10.85532 0 0 15.618464  .02366554   .1922071  .7194572 10317679 2015 "ME" "REPR"
    10.849357 0 0 15.548765  .02367953  .19987574  .7202281 10317679 2016 "ME" "REPR"
    10.888838 0 0 15.575302 .033151954  .18535903  .7151666 10317679 2017 "ME" "REPR"
    11.097547 1 0 15.578058 .033256307   .1112484  .7204551 10317679 2018 "ME" "REPR"
    end
    Here is one of the regressions I run, and the output:

    Code:
     reghdfe logdisclsize electionyear##landslide size  wins_ExecComp wins_leverage wins
    > _ContribReliance if org == "REPR", absorb (ein year) cluster(state)
    (dropped 18 singleton observations)
    (MWFE estimator converged in 6 iterations)
    note: 1.electionyear#1.landslide omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =      1,323
    Absorbing 2 HDFE groups                           F(   6,     46) =       0.99
    Statistics robust to heteroskedasticity           Prob > F        =     0.4457
                                                      R-squared       =     0.9897
                                                      Adj R-squared   =     0.9874
                                                      Within R-sq.    =     0.0048
    Number of clusters (state)   =         47         Root MSE        =     0.5070
    
                                           (Std. err. adjusted for 47 clusters in state)
    ------------------------------------------------------------------------------------
                       |               Robust
          logdisclsize | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
        1.electionyear |   .0775608   .0513554     1.51   0.138    -.0258123    .1809339
           1.landslide |   .0218505   .0929655     0.24   0.815    -.1652793    .2089803
                       |
          electionyear#|
             landslide |
                  0 1  |          0  (empty)
                  1 1  |          0  (omitted)
                       |
                  size |   .0347288   .0279295     1.24   0.220    -.0214904     .090948
         wins_ExecComp |  -.0763669   .2829463    -0.27   0.788    -.6459082    .4931745
         wins_leverage |  -.0495822   .1025383    -0.48   0.631     -.255981    .1568167
    wins_ContribReli~e |  -.0879667   .1880076    -0.47   0.642    -.4664063    .2904729
                 _cons |   7.392996   .3775152    19.58   0.000     6.633097    8.152895
    ------------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
             ein |       234           0         234     |
            year |         9           1           8     |
    -----------------------------------------------------+
    I understand the interaction is omitted due to collinearity: every landslide is also an election year, and only 5.94% of observations are landslides (7% for this subset of REPR organizations). However, not all election years are landslides. How can I interpret these main effects, given that I ideally would have wanted the coefficient on the interaction term?

    Moreover, does anybody have advice on presenting this table in a paper, or on testing it differently? I am told that best academic practice is to run the regression with an interaction rather than just the main effects, but I have never seen a table presented with the interaction omitted as it is here. Thank you so much!

  • #2
    You partially misunderstand what is going on in the regression. Collinearity is not the primary problem--it is a consequence of the fact that the combination of landslide = 1 and electionyear = 0 can never occur. That's not a collinearity--it's an incomplete interaction. The interaction, which should normally have four values (electionyear 0 landslide 0; electionyear 1 landslide 0; electionyear 1 landslide 1; electionyear 0 landslide 1), has only the first three. So the 0.electionyear#1.landslide term is not omitted due to collinearity: it is never found anywhere in the data set. Notice that Stata calls it "empty," not "omitted." But now, with only three pieces of the interaction available, the 1.electionyear#1.landslide term is, indeed, collinear with the remaining terms and the constant.
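
    A quick way to see the empty cell for yourself is to cross-tabulate the two indicators. This is only a sketch, assuming the variable names and the org == "REPR" restriction from #1:

    ```stata
    * Cross-tabulate the two indicators in the estimation subsample.
    * One of the four cells (electionyear = 0, landslide = 1) should be empty.
    tabulate electionyear landslide if org == "REPR"

    * Count observations in that cell directly. If every landslide is also
    * an election year, as stated in #1, the count will be zero.
    count if electionyear == 0 & landslide == 1 & org == "REPR"
    ```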

    So, your data simply do not support an interaction analysis. You need to do something slightly different. You need to create a three-level variable, call it year_type: 0 for no election, 1 for an election but not a landslide, and 2 for an election with a landslide. Then you enter that three-level variable into the model, as i.year_type, instead of the electionyear##landslide interaction.
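
    A minimal sketch of that recoding and the revised model, assuming the variable names from #1. The name year_type and its value labels are only suggestions, and the /// line continuations assume the commands are run from a do-file:

    ```stata
    * Build the three-level variable. Any observation in the impossible cell
    * (electionyear = 0, landslide = 1) is left missing rather than recoded.
    generate byte year_type = 0 if electionyear == 0 & landslide == 0
    replace year_type = 1 if electionyear == 1 & landslide == 0
    replace year_type = 2 if electionyear == 1 & landslide == 1

    * Optional labels so the output is easier to read.
    label define year_type 0 "No election" 1 "Election, no landslide" 2 "Landslide election"
    label values year_type year_type

    * Same model as in #1, with i.year_type replacing electionyear##landslide.
    reghdfe logdisclsize i.year_type size wins_ExecComp wins_leverage ///
        wins_ContribReliance if org == "REPR", absorb(ein year) cluster(state)
    ```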



    • #3
      Hi Clyde,

      Thank you so much, that explanation is super helpful and clears up a lot! Just a quick follow-up to ensure I understand correctly: am I right that interpreting the coefficient on the year_type variable you described would be based on the assumption that there is a linear increase between the levels? In other words, the change in my dependent variable between no election and election (no landslide) would be the same as the change between election (no landslide) and landslide? Conceptually, that assumption wouldn't be appropriate for my setting. In that case, would the next best alternative simply be the original regression without the interaction, which gives the same main effects, so that I just have to compromise on breaking with the "usual" way of setting up such models? Thank you again!



      • #4
        "am I right that interpreting the coefficient on the year_type variable you described would be based on the assumption that there is a linear increase between the levels?"

        No, that's not right.

        If you introduced the variable just as year_type, that would be true. But if you introduce it as i.year_type, Stata will treat it as a discrete variable. The output for it will consist of two rows: one for level 1 and the other for level 2. The output for 1.year_type will be the expected difference in outcome between a non-landslide election year and a non-election year. The output for 2.year_type will be the expected difference in outcome between a landslide election year and a non-election year. If you also need the expected difference between a landslide election year and a non-landslide election year, then you can get that after the regression with -lincom 2.year_type - 1.year_type-.
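
        Put together, the sequence might look like the sketch below. It assumes year_type has already been created with the 0/1/2 coding described in #2, and that -lincom- is run immediately after the model so that it uses those estimates:

        ```stata
        * Fit the model with year_type entered as a factor variable.
        * Level 0 (no election) is the omitted base category.
        reghdfe logdisclsize i.year_type size wins_ExecComp wins_leverage ///
            wins_ContribReliance if org == "REPR", absorb(ein year) cluster(state)

        * Landslide election year vs. non-landslide election year:
        * the difference between the two reported coefficients.
        lincom 2.year_type - 1.year_type
        ```

        The -lincom- output gives the point estimate of that difference along with its standard error and confidence interval, so no separate regression with a different base category is needed.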

        I take it that you are not familiar with Stata's factor variable notation. Unless you are using an archaic version of Stata that predates its introduction, you really should get to know it. It saves you a lot of work (e.g. you never need to create "dummy" variables any more) and opens up simplified analysis of marginal effects and predictive margins through the -margins- command. See -help fvvarlist- to get started.



        • #5
          Hi Clyde,

          I've spent some time reading up, and this really helps, thank you so much! Very much appreciated!
