Difference-in-Difference - collinearity

Justyna Hejman-Mancewicz

Join Date: Nov 2019
Posts: 4

08 Nov 2019, 18:27

Dear Statalisters,

I am relatively new to Stata, so please bear with me. Also, I am aware that this issue has already been addressed on this forum, but I don't seem to be able to find the solution to my problem.

I am using logit in Stata 15.1 to understand whether migration changes employment outcomes. I am using an unbalanced dataset. For the purpose of this explanation, I will use the most basic specification (i.e. without any socio-economic control variables and without margins which I use at a later stage).

I am typing:

Code:

logit employed i.migrant##i.migration i.year, cluster(ident)

where:
- employed is a binary variable equal to 1 for years when respondents were economically active, 0 otherwise.
- migrant is a 'treatment': a time-invariant binary variable for control and treatment groups, which equals 1 for migrants (those who migrated); 0 for non-migrants (those who stayed behind).
- migration is 'time' or 'post': a binary variable equal to 1 for years after migration, 0 for years before migration. As such, migration == 0 for both groups in the years before migration, but migration == 1 occurs only in the one group that underwent the treatment, i.e. migrants.

The problem I encounter is as follows: the interaction term is omitted due to collinearity (while both migrant and migration are estimated without problems). More specifically, I obtain the following output (here the migration variable described above is named l_mig2):

Code:

 logit employed i.migrant##i.l_mig2 i.year, cluster(ident)

note: 1950.year != 0 predicts success perfectly
      1950.year dropped and 1 obs not used

note: 0.migrant#1.l_mig2 identifies no observations in the sample
note: 1.migrant#1.l_mig2 omitted because of collinearity
note: 2009.year omitted because of collinearity
Iteration 0:   log pseudolikelihood = -67798.349  
Iteration 1:   log pseudolikelihood = -67286.391  
Iteration 2:   log pseudolikelihood = -67285.374  
Iteration 3:   log pseudolikelihood = -67285.374  

Logistic regression                             Number of obs     =    104,797
                                                Wald chi2(60)     =     224.75
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -67285.374               Pseudo R2         =     0.0076

                                (Std. Err. adjusted for 4,502 clusters in ident)
--------------------------------------------------------------------------------
               |               Robust
      employed |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
     1.migrant |   -.259598   .0554714    -4.68   0.000      -.36832    -.150876
      1.l_mig2 |   .4030797   .0643309     6.27   0.000     .2769934     .529166
               |
migrant#l_mig2 |
          0 1  |          0  (empty)
          1 1  |          0  (omitted)
               |
          year |
         1950  |          0  (empty)
         1951  |  -.8324219   1.000956    -0.83   0.406     -2.79426    1.129416
         1952  |  -.9865726   .5585038    -1.77   0.077     -2.08122    .1080748
         1953  |  -.8025823   .3977641    -2.02   0.044    -1.582186   -.0229789

Please note that I triple-checked the data to make sure they are coded correctly. An example of the data for a migrant would be:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 ident double year float(migrant migration)
"B0000001" 1991 1 0
"B0000001" 1992 1 0
"B0000001" 1993 1 0
"B0000001" 1994 1 0
"B0000001" 1995 1 0
"B0000001" 1996 1 0
"B0000001" 1997 1 0
"B0000001" 1998 1 0
"B0000001" 1999 1 0
"B0000001" 2000 1 0
"B0000001" 2001 1 0
"B0000001" 2002 1 0
"B0000001" 2003 1 1
"B0000001" 2004 1 1
"B0000001" 2005 1 1
"B0000001" 2006 1 1
"B0000001" 2007 1 1
"B0000001" 2008 1 1
"B0000001" 2009 1 1
end
format %ty year

A corresponding example for a non-migrant:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 ident double year float(migrant migration)
"C0001002" 1995 0 0
"C0001002" 1996 0 0
"C0001002" 1997 0 0
"C0001002" 1998 0 0
"C0001002" 1999 0 0
"C0001002" 2000 0 0
"C0001002" 2001 0 0
"C0001002" 2002 0 0
"C0001002" 2003 0 0
"C0001002" 2004 0 0
"C0001002" 2005 0 0
"C0001002" 2006 0 0
"C0001002" 2007 0 0
"C0001002" 2008 0 0
"C0001002" 2009 0 0
end
format %ty year

Is the problem driven by the fact that my time/post variable (here: migration) varies only for the treatment group (i.e. migrants)? Or is there any other issue I am not aware of? I will be most grateful for your help.

Best wishes,

Justyna

Clyde Schechter

Join Date: Apr 2014

Posts: 29794
#2

08 Nov 2019, 20:55

Is the problem driven by the fact that my time/post variable (here: migration) varies only for the treatment group (i.e. migrants)?

In a word, yes.

To do a classical difference in differences model you need to have all four combinations of treated and untreated crossed with pre- and post-treatment. In the classical DID model, there is a "post-treatment" period for the "untreated" because the treatment begins at a specific time. So the DID model actually compares the treated after treatment with the untreated in the times when the treated are receiving the treatment.
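A quick way to check whether all four combinations are present is a cross-tabulation. The following is only a sketch, assuming the variable names migrant and migration from the original post (the posted output uses l_mig2 for the latter). If the cell migrant == 0 & migration == 1 is empty, then migration == 1 implies migrant == 1, so the product of the two indicators is identical to migration itself and the interaction cannot be estimated separately.

Code:

* Cross-tabulate group membership against the post indicator;
* an empty cell means the classical DID interaction is not estimable
tabulate migrant migration

* If no non-migrant is ever observed with migration == 1, the product term
* is identical to migration, which is the collinearity Stata reports
generate byte inter = migrant * migration
assert inter == migration
drop inter

If the -assert- passes silently, the interaction carries no information beyond migration.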

Your data do not have that structure. I would imagine, also, that your migrants do not all migrate in the same year, so it is not possible to impose that structure on the data. These data require a generalized DID model instead.
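Whether the migration year really differs across migrants can be checked directly. This is a minimal sketch, again assuming the variable names from the original post; first_mig and one_per_person are hypothetical helper variables introduced only for illustration.

Code:

* First year in which each person is observed with migration == 1
* (missing for people who are never observed after migrating)
egen first_mig = min(cond(migration == 1, year, .)), by(ident)

* Flag one observation per person so each migrant is counted once
egen byte one_per_person = tag(ident)

* More than one distinct year here means staggered treatment timing
tabulate first_mig if one_per_person & migrant == 1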

https://www.annualreviews.org/doi/pd...-040617-013507 (Generalized DID reference)
will explain it for you.
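For orientation, a generalized DID specification along these lines might look like the sketch below. It is an illustration under assumptions, not a recommendation taken from the reference: it uses the variable names from the original post, introduces a hypothetical numeric identifier id (because ident is a string), and shows two possible estimators. Person fixed effects absorb the time-invariant migrant indicator, year indicators absorb common shocks, and the coefficient on 1.migration plays the role of the DID estimate.

Code:

* ident is a string in the example data, so build a numeric panel identifier
egen long id = group(ident)
xtset id year

* Fixed-effects (conditional) logit; people whose employed status
* never changes contribute nothing and are dropped from estimation
xtlogit employed i.migration i.year, fe

* Linear probability alternative with cluster-robust standard errors
xtreg employed i.migration i.year, fe vce(cluster id)

The choice between the two is a judgment call: the conditional logit keeps the logit functional form but discards people with no variation in the outcome, while the linear model uses everyone and is easier to interpret.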
Justyna Hejman-Mancewicz

Join Date: Nov 2019

Posts: 4
#3

13 Nov 2019, 06:29

Thanks very much Clyde, that is helpful.
For anyone looking for more resources tackling the same problem (in addition to the useful paper shared by Clyde) - here's another one that I also found useful:
https://www.nber.org/WNE/lect_10_diffindiffs.pdf