  • Difference-in-Difference - collinearity

    Dear Statalisters,


    I am relatively new to Stata, so please bear with me. Also, I am aware that this issue has already been addressed on this forum, but I don't seem to be able to find the solution to my problem.

    I am using logit in Stata 15.1 to understand whether migration changes employment outcomes. I am using an unbalanced dataset. For the purpose of this explanation, I will use the most basic specification (i.e. without any socio-economic control variables and without margins which I use at a later stage).

    I am typing:

    Code:
    logit employed i.migrant##i.migration i.year, cluster(ident)
    where:
    - employed is a binary variable equal to 1 for years when respondents were economically active, 0 otherwise.
    - migrant is a 'treatment': a time-invariant binary variable for control and treatment groups, which equals 1 for migrants (those who migrated); 0 for non-migrants (those who stayed behind).
    - migration is 'time' or 'post': a binary variable equal to 1 for years after migration, 0 for years before migration. As such, migration == 0 for both groups in the years before migration, but migration == 1 only for the one group that underwent the treatment, i.e. migrants.

    The problem I encounter is as follows: the interaction term is omitted due to collinearity (while both migrant & migration are estimated without problems). More specifically, I obtain the following output:

    Code:
     logit employed i.migrant##i.l_mig2 i.year, cluster(ident)
    
    note: 1950.year != 0 predicts success perfectly
          1950.year dropped and 1 obs not used
    
    note: 0.migrant#1.l_mig2 identifies no observations in the sample
    note: 1.migrant#1.l_mig2 omitted because of collinearity
    note: 2009.year omitted because of collinearity
    Iteration 0:   log pseudolikelihood = -67798.349  
    Iteration 1:   log pseudolikelihood = -67286.391  
    Iteration 2:   log pseudolikelihood = -67285.374  
    Iteration 3:   log pseudolikelihood = -67285.374  
    
    Logistic regression                             Number of obs     =    104,797
                                                    Wald chi2(60)     =     224.75
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -67285.374               Pseudo R2         =     0.0076
    
                                    (Std. Err. adjusted for 4,502 clusters in ident)
    --------------------------------------------------------------------------------
                   |               Robust
          employed |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
         1.migrant |   -.259598   .0554714    -4.68   0.000      -.36832    -.150876
          1.l_mig2 |   .4030797   .0643309     6.27   0.000     .2769934     .529166
                   |
    migrant#l_mig2 |
              0 1  |          0  (empty)
              1 1  |          0  (omitted)
                   |
              year |
             1950  |          0  (empty)
             1951  |  -.8324219   1.000956    -0.83   0.406     -2.79426    1.129416
             1952  |  -.9865726   .5585038    -1.77   0.077     -2.08122    .1080748
             1953  |  -.8025823   .3977641    -2.02   0.044    -1.582186   -.0229789
    Please note that I triple-checked the data to make sure they are coded correctly. An example of the data for a migrant would be:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 ident double year float(migrant migration)
    "B0000001" 1991 1 0
    "B0000001" 1992 1 0
    "B0000001" 1993 1 0
    "B0000001" 1994 1 0
    "B0000001" 1995 1 0
    "B0000001" 1996 1 0
    "B0000001" 1997 1 0
    "B0000001" 1998 1 0
    "B0000001" 1999 1 0
    "B0000001" 2000 1 0
    "B0000001" 2001 1 0
    "B0000001" 2002 1 0
    "B0000001" 2003 1 1
    "B0000001" 2004 1 1
    "B0000001" 2005 1 1
    "B0000001" 2006 1 1
    "B0000001" 2007 1 1
    "B0000001" 2008 1 1
    "B0000001" 2009 1 1
    end
    format %ty year
    A corresponding example for a non-migrant:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 ident double year float(migrant migration)
    "C0001002" 1995 0 0
    "C0001002" 1996 0 0
    "C0001002" 1997 0 0
    "C0001002" 1998 0 0
    "C0001002" 1999 0 0
    "C0001002" 2000 0 0
    "C0001002" 2001 0 0
    "C0001002" 2002 0 0
    "C0001002" 2003 0 0
    "C0001002" 2004 0 0
    "C0001002" 2005 0 0
    "C0001002" 2006 0 0
    "C0001002" 2007 0 0
    "C0001002" 2008 0 0
    "C0001002" 2009 0 0
    end
    format %ty year
    Is the problem driven by the fact that my time/post variable (here: migration) varies only for the treatment group (i.e. migrants)? Or is there any other issue I am not aware of? I will be most grateful for your help.

    Best wishes,

    Justyna

  • #2
    Is the problem driven by the fact that my time/post variable (here: migration) varies only for the treatment group (i.e. migrants)?

    In a word, yes.

    To do a classical difference in differences model you need to have all four combinations of treated and untreated crossed with pre- and post-treatment. In the classical DID model, there is a "post-treatment" period for the "untreated" because the treatment begins at a specific time. So the DID model actually compares the pre-to-post change among the treated with the change among the untreated over the same calendar period.
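
    A quick way to check which of the four combinations are populated is a cross-tabulation. The lines below are only a sketch: they assume the dataset from #1 is in memory, that migrant and migration are coded as described there with no missing values, and inter is a hypothetical new variable created just for the check.

    Code:
    * Assumes the data from #1 are in memory, with the variable names used there
    tabulate migrant migration
    * Classical DID needs observations in all four cells of that table;
    * count the untreated, post-treatment cell, which the logit note reports as empty
    count if migrant == 0 & migration == 1
    * If that cell is empty, migration == 1 implies migrant == 1, so the
    * interaction is an exact copy of migration and cannot be estimated separately
    generate byte inter = migrant * migration
    assert inter == migration
    drop inter
    If the assert passes without error, the interaction term carries no information beyond migration itself, which is what the "omitted because of collinearity" note is reporting.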

    Your data do not have that structure. I would imagine, also, that your migrants do not all migrate in the same year, so it is not possible to impose that structure on the data. These data require a generalized DID model instead.

    https://www.annualreviews.org/doi/pd...-040617-013507 (Generalized DID reference)
    will explain it for you.
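
    As a rough illustration only, one common way to set up a generalized DID in Stata is with person and year fixed effects plus the time-varying treatment indicator. The sketch below assumes the dataset from #1 is in memory with one observation per person-year, that migration is the post-migration indicator described there, and that id is a hypothetical numeric identifier created here because ident is a string. It is not necessarily the specification the reference recommends for these data.

    Code:
    * Assumes the data from #1 are in memory; id is a hypothetical new variable
    egen long id = group(ident)
    xtset id year
    * Person fixed effects absorb the time-invariant migrant indicator, and the
    * year indicators take the place of a single pre/post variable; the coefficient
    * on 1.migration is then the generalized DID estimate
    * Linear probability version with cluster-robust standard errors
    xtreg employed i.migration i.year, fe vce(cluster id)
    * Conditional fixed-effects logit as a nonlinear alternative
    xtlogit employed i.migration i.year, fe
    Note that the fixed-effects logit drops anyone whose employed status never changes, so the two commands may be estimated on different samples.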



    • #3
      Thanks very much Clyde, that is helpful.
      For anyone looking for more resources tackling the same problem (in addition to the useful paper shared by Clyde) - here's another one that I also found useful:
      https://www.nber.org/WNE/lect_10_diffindiffs.pdf
