  • DID - a year with low number of observations

    Hello everyone,

    I learned that not having the same number of observations throughout the years used for a DID estimation is not a problem.
    But, I am still doubtful about something.

    I have a public policy that is implemented at the county level, and I am looking at the effect of this policy at the individual level using individual data. So the data are not a true panel; they are a pooled cross-section of different individuals born in different years.
    The data I use cover 3 years before the policy and 2 years after. The policy starts in January 1998, so I take 1995, 96, and 97 for the pre-period and 98 and 99 for the post-period.

    The problem is that in 1999 I have a very low number of observations: for example, in 1998 I have 110 observations, while in 1999 I have only 19. This is because those born in 1999 are not old enough at the time of the survey to report their outcomes (the outcome is a test score, and some of these individuals are only taking the test during the survey year, so not everyone has this information available yet).
    1. I still want to include those born in 1999, to increase the sample size, but does this pose a problem for my DID?
    2. And if I want to look at the effect by year of birth, using the interaction of the treatment variable and indicators for year of birth, is the coefficient for the year 1999 reliable or not?

    I would be really grateful for your answers!
    Let me know if you need more information about my data.
    Last edited by Marry Lee; 13 Feb 2022, 05:38.

  • #2
    Well, I think there are two distinct issues here. The simplest one is that the 1999 cohort is represented by a smaller number of observations. Consequently any analyses that focus specifically on that cohort may end up with imprecise estimates (large standard errors, wide confidence intervals). Whether the loss of precision is so large as to render the results useless or not is something you will just have to see when you get them. It depends on a lot of things.

    The other issue is whether the 1999 sample is sufficiently representative of the 1999 cohort. You indicated that the smaller size of that cohort in your data is due to age-related inability to participate in the measurement. The age-relatedness may be a problem if age itself is a determinant of the outcomes you are studying. You may have a group that over-represents the older members of the 1999 birth cohort and under-represents the younger ones. If age differences of a few weeks or months can affect the study outcome, then the 1999 data will be biased. You might be able to estimate the amount of bias and partially adjust for it by including an additional variable representing the month or week-within-year of birth, if that information is available.
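
    For concreteness, here is a minimal sketch of that kind of adjustment. All the names in it are placeholders, since I don't know what your variables are called.
    Code:
    * sketch only: outcome, treat, post, birth_month, and county are placeholder names
    * i.birth_month lets the model partially absorb within-year age differences
    probit outcome i.treat##i.post i.birth_month i.county, vce(cluster county)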



    • #3
      Thank you so much, Clyde Schechter!

      In fact, what is weird is that I get a small confidence interval or an interval that does not contain 0, for the 1999 cohort.
      Here is the code I run:
      Code:
      probit High_Q_S  i.TCZ#ib1997.year_birth (other X vars) `provinceXyearFE' i.coun,  cluster(coun)
      margins year_birth, dydx(TCZ) noestimcheck post
      marginsplot, yline(0) level(90)
      and here are the results for 2 outcomes:
      Attached files: Graph outcome 1.gph, Graph outcome 2.gph



      • #4
        Well, I don't know how to interpret these. First, other than seeing that High_Q_S is your outcome variable, and year_birth, I assume, has the obvious meaning, I don't know what anything else in your code refers to. I'm guessing that `provinceXyearFE' is a list of homebrew indicator variables that indicate combinations of province and year. If year, in that context, is year of birth, then already this is wrong: you have to use factor variable notation so that -margins- can handle them correctly. Also, the inclusion of `provinceXyearFE' suggests that you are trying to emulate a fixed-effects probit model by including indicators ("dummies") for your desired fixed effects--but that's not correct either. That only works correctly for linear models, or with very large samples. Unfortunately, the fixed-effects probit model is unwieldy to work with, and there is no Stata command for fixed-effects probit models, only random-effects or population-averaged ones. So if you are wedded to fixed effects, you are probably better off working with -xtlogit, fe-. If you are not wedded to fixed effects but are wedded to probit, you might use the random-effects probit model.
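
        For concreteness, a minimal sketch of what those two alternatives might look like, using the variable names from your code in #3 (other covariates omitted):
        Code:
        * sketch only, treating county as the group for the fixed or random effects
        xtset coun
        * conditional fixed-effects logit: the county effects are conditioned out, so no i.coun dummies
        xtlogit High_Q_S i.TCZ#ib1997.year_birth, fe
        * random-effects probit, if you prefer to stay with probit
        xtprobit High_Q_S i.TCZ#ib1997.year_birth, re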

        As for the graphs, they are two different graphs, but they have the same labeling and titling, so although they obviously represent different analyses, I can't tell what the difference between them represents.

        Anyway, I would need a lot more information about this in order to give specific advice about what is going on.



        • #5
          Thank you so much Clyde Schechter for the detailed answer!
          Well, I don't know how to interpret these. First, other than seeing that High_Q_S is your outcome variable, and year_birth, I assume, has the obvious meaning, I don't know what anything else in your code refers to. I'm guessing that `provinceXyearFE' is a list of homebrew indicator variables that indicate combinations of province and year. If year, in that context, is year of birth, then already this is wrong: you have to use factor variable notation so that -margins- can handle them correctly.
          You are right, I should have defined everything (the full command is spelled out after this list):
          • provinceXyearFE is in fact defined as local provinceXyearFE = "i.provinceXyear", so it is a list of indicators for province × year of birth
          • i.coun is a list of indicators for counties (the level below provinces)
          • TCZ is an indicator variable: TCZ = 1 if the county is affected by the policy and TCZ = 0 otherwise.
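
          So, spelled out, the command in #3 is effectively (other covariates omitted):
          Code:
          probit High_Q_S i.TCZ#ib1997.year_birth i.provinceXyear i.coun, cluster(coun)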

          Also, the inclusion of `provinceXyearFE' suggests that you are trying to emulate a fixed-effects probit model by including indicators ("dummies") for your desired fixed effects--but that's not correct either. That only works correctly for linear models, or with very large samples. Unfortunately, the fixed-effects probit model is unwieldy to work with, and there is no Stata command for fixed-effects probit models, only random-effects or population-averaged ones. So if you are wedded to fixed effects, you are probably better off working with -xtlogit, fe-. If you are not wedded to fixed effects but are wedded to probit, you might use the random-effects probit model.
          So, you are saying that probit with dummies is not totally wrong (we call this the unconditional fixed-effects probit estimator), but if the sample size is very small, then there is a problem of incidental parameter bias. Is this right?

          Choosing between -xtlogit, fe- and a random-effects probit model should be based on some arguments, not just on what I want to do, right?


          As for the graphs, they are two different graphs, but they have the same labeling and titling, so although they obviously represent different analyses, I can't tell what the difference between them represents.
          The only difference between the 2 graphs is that the dependent variable in each graph is different; they represent the coefficients for the interaction terms TCZ × year_birth for each outcome variable.



          • #6
            Thanks for the explanations. Makes a lot more sense to me now.

            So, you are saying that probit with dummies is not totally wrong (we call this the unconditional fixed-effects probit estimator), but if the sample size is very small, then there is a problem of incidental parameter bias. Is this right?
            Yes.

            Choosing between -xtlogit, fe- and a random-effects probit model should be based on some arguments, not just on what I want to do, right?
            Yes, but!

            Apart from scale, there is very little difference between the logistic and normal (probit) distributions. It takes a truly gargantuan sample to distinguish the subtle differences in their tails. So, if you have a strong theoretical reason to prefer the probit, OK, go for it. But I'm skeptical that you really do, because, in fact, in the real world almost nothing is going to be truly probit, nor truly logit, and any sample that is large enough to really distinguish them will probably also be large enough to clearly reject both. -logit- and -probit- are just convenience likelihoods for working with binary outcome data. To my mind, the basis for choosing between them will almost always come down to which is more convenient. (And on that score, the ability to explain logistic regression coefficients as log odds ratios, and the inability to explain probit coefficients as anything at all comprehensible to most people, means logit usually wins.)
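
            For example, your model could be re-fit with -logit- and its coefficients reported directly as odds ratios. A sketch, using the variable names from #3 (other covariates and the province-by-year terms omitted):
            Code:
            * sketch only: same kind of specification as in #3, but logit with odds-ratio output
            logit High_Q_S i.TCZ#ib1997.year_birth i.coun, vce(cluster coun) or
            margins year_birth, dydx(TCZ) noestimcheck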

            Now that you have explained that the graphs represent results for two different outcome variables, it seems that in one of them you get, as I predicted, a very wide standard error for year 1999, but for the other it's very, very narrow. My conclusion is that for the latter, there probably is very little variance in that outcome among the 1999 observations--a possibility that perhaps suggests the kind of bias I was worried about in #2. That is, bias affecting that outcome, but not the first, may have overwhelmed the small-sample effect on precision. (Another bias consideration is whether there is a lot of missing data on the outcome variable or other model variables in the 1999 cohort, causing a bias in the 1999 estimation sample, even if it isn't present in the entire sample of the 1999 cohort.)
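
            One quick way to check both possibilities is something like this (a sketch, using the variable names from #3 and #5):
            Code:
            * how much variation is there in the outcome among the 1999 births?
            tab High_Q_S TCZ if year_birth == 1999, missing
            * how much missingness is there in the model variables for that cohort?
            misstable summarize High_Q_S TCZ coun if year_birth == 1999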



            • #7
              Thank you so much Clyde Schechter for your helpful answer!

              I have two more follow-up questions,
              1. My problem with -xtlogit, fe- is that it does not allow for clustering (and I want to cluster at the county level: the policy is at the county level, but the outcomes are at the individual level).
              2. So, if there is no problem of representativeness, I can just keep the 1999 cohort in the whole DID model (where I define post = 1 if year_birth==1998 or year_birth==1999), right? A sketch of the specification I have in mind follows below.
              However, for the results by year of birth, I should clearly explain that the results for 1999 are not reliable (because the standard errors may be wrong), is this right?
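
              For reference, here is roughly the pooled specification I mean in question 2 (a sketch; other covariates omitted):
              Code:
              * post = 1 for the 1998 and 1999 birth cohorts, 0 for 1995-1997
              gen byte post = inlist(year_birth, 1998, 1999)
              * the TCZ main effect is absorbed by the county indicators; TCZ#post is the DID term
              probit High_Q_S i.TCZ##i.post i.coun, vce(cluster coun)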


              Considering the estimation sample only, I have the following distribution of observations by the outcome variable (binary) and year of birth, and then, for the 1999 cohort, by the policy status of the county where the child was born (TCZ = 1 if the policy is implemented and 0 otherwise):


              Code:
              . tab High_Q_S  year_birth if sample1==1
              
                         |                       year_birth
                High_Q_S |      1995       1996       1997       1998       1999 |     Total
              -----------+-------------------------------------------------------+----------
                       0 |       178        154        109         63         10 |       514
                       1 |       142        124        103         47          9 |       425
              -----------+-------------------------------------------------------+----------
                   Total |       320        278        212        110         19 |       939
              The 1999 cohort has the following distribution:
              Code:
               tab High_Q_S TCZ if year_birth==1999 & sample1==1
              
                         |          TCZ
                High_Q_S |         0          1 |     Total
              -----------+----------------------+----------
                       0 |         4          6 |        10
                       1 |         4          5 |         9
              -----------+----------------------+----------
                   Total |         8         11 |        19

              Considering the second outcome variable (which gives very small confidence interval):
              Code:
              . tab High_Q_S2  year_birth if sample2==1
              
                         |                       year_birth
               High_Q_S2 |      1995       1996       1997       1998       1999 |     Total
              -----------+-------------------------------------------------------+----------
                       0 |        73         49         40         22          3 |       187
                       1 |       185        169        109         52          7 |       522
              -----------+-------------------------------------------------------+----------
                   Total |       258        218        149         74         10 |       709
              The 1999 cohort, for this second outcome, has the following distribution:

              Code:
               tab High_Q_S2 TCZ if year_birth==1999 & sample2==1
              
                         |          TCZ
               High_Q_S2 |         0          1 |     Total
              -----------+----------------------+----------
                       0 |         2          1 |         3
                       1 |         2          5 |         7
              -----------+----------------------+----------
                   Total |         4          6 |        10



              • #8
                So, if there is no problem of representativeness, I can just keep the 1999 cohort in the whole DID model (where I define post = 1 if year_birth==1998 or year_birth==1999), right?
                Right.

                However, for the results by year of birth, I should clearly explain that the results for 1999 are not reliable (because the standard errors may be wrong), is this right?
                I would explain that the results for 1999 are not reliable. I wouldn't say it's that the standard errors may be wrong, but rather just emphasize the paucity of data for that year. In a sample that small, everything is subject to a lot of sampling variability. So I'd point that out. I don't think I would use the word "wrong." After all, strictly speaking, all the results of any regression are wrong. They're just estimates and there is almost always some degree of error in them. And the estimates in this small sample are, in that regard, no worse than those from large samples--it's just that the magnitude of variability is much larger.



                • #9
                  Thank you, Clyde Schechter!

