Difference in Differences analysis unbalanced sample

Linda Debert

Join Date: Jan 2021

Posts: 10
#1

22 Jan 2021, 17:34

Hello :-)

I have two large panel data sets of companies listed in Europe and the US, covering 2015-2019; the post period starts in 2018. I am starting with the European data, which should be my treated sample. Unfortunately, some companies are not observed at every time point, so my sample is unbalanced. Here are my questions, and I really hope you can help me get familiar with this analysis:

1. Am I right that each company in the control and treatment groups should appear at least twice in its group: once in the pre-period and once in the post-period? So I should drop observations from companies that are available only in the pre-period or only in the post-period, right?

2. Is it a problem that my sample is unbalanced, in that for some companies I have observations for all five years, whereas other companies are available for only two years? This "phenomenon" seems to be random. Can I leave the data as they are for the analysis? Is there something I must change when doing a diff-in-diff with unbalanced data?

3. I have to analyse the effect of a firm disclosure rule, which is effective for certain companies for reports issued from 2018 onwards. However, some European companies will not be affected. So, is it right to include in the European treatment group only those companies that are mandated by law to provide these reports in 2018 and/or 2019? Or should I include all European companies in my treatment group, whether or not they are affected?

I would be happy if you would help me.

Kind regards,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

22 Jan 2021, 18:13

1. Well, it's not strictly necessary to remove them, though removing them is often done. Here are the implications. If, as is usually the case, you use a fixed-effects regression model to do the DID estimation, a company that appears only in the pre- or only in the post- period will not contribute to the estimation of the treatment effect, as the group#period interaction term will be a constant for that company. It will, however, contribute to the calculation of standard errors. And if you use a random-effects model, then it even contributes to the treatment effect estimation. So you can leave these observations in if you like. Their contribution will be small, at most.

2. Unbalanced data, per se, is not a problem and requires no special handling.

All of that said, the real concern is why some companies are represented every time and others are not. If the circumstances that cause the company not to be observable at a certain time are also related to the outcome variable, then the data may be seriously biased, and there is no good remedy for that problem. For example, if your outcome is some measure of profitability and if companies are more likely to choose not to participate in data collection during periods when they are less profitable, then the entire data set has an optimistic bias. So it is important for you to gain an understanding of the reasons for the unobservability and think carefully about whether this is a source of bias, or is just an event that is independent of what you are trying to observe.

3. I don't quite understand this one. You say first that the European data "should be [your] treated sample." Then you say that some European companies are not actually subject to the treatment. Can you clarify what's going on here? Notwithstanding the answers to that, it would be very hard to justify including in the treated group a company that was not in fact treated. That company should be either in the control group, or perhaps be considered ineligible for inclusion in your study. But I don't have a sense of what the circumstances are. It also may well be that if I did have a sense of it, I would not be able to give you a clear answer as this really depends more on an understanding of the underlying science, which is beyond my domain of familiarity, rather than being a statistical matter.
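As a rough illustration of point 1, here is a minimal sketch on simulated data. Every variable name (firm, year, treated, post, y) and every number in it is an assumption made for the example, not something taken from the data discussed in this thread. It builds an unbalanced panel in which every tenth firm is observed only before 2018, shows the participation patterns, and fits the fixed-effects DID model with and without those firms.

```stata
* Simulated unbalanced panel; all names and numbers are hypothetical
clear
set seed 12345
set obs 200
gen long firm = _n
gen byte treated = firm <= 100          // firms 1-100 form the treated group
gen double u = rnormal()                // time-invariant firm effect
expand 5
bysort firm: gen int year = 2014 + _n   // years 2015-2019
gen byte post = year >= 2018
gen double y = u + 0.5*post + 1.0*treated*post + rnormal()

* Every tenth firm is observed only before 2018
drop if mod(firm, 10) == 0 & post

xtset firm year
xtdescribe                              // participation patterns across firms

* Fixed-effects DID; the interaction coefficient is the DID estimate
xtreg y i.treated##i.post, fe vce(cluster firm)
scalar b_all = _b[1.treated#1.post]

* Same model without the firms seen only in the pre-period
xtreg y i.treated##i.post if mod(firm, 10) != 0, fe vce(cluster firm)
scalar b_sub = _b[1.treated#1.post]

display "DID estimate, all firms:         " b_all
display "DID estimate, pre-only excluded: " b_sub
assert reldif(b_all, b_sub) < 1e-6      // the two agree up to rounding
```

In this specification the pre-only firms have no within-firm variation in post or in the interaction, so the point estimate is the same either way while the standard errors can differ. If year indicators were used in place of the single post dummy, those firms would help estimate the year effects and the DID estimate could move slightly.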
Linda Debert

Join Date: Jan 2021

Posts: 10
#3

23 Jan 2021, 05:01

Dear Clyde,

thank you so so much for your help!!

So, you would keep the observations which are not present in both pre- and post?

Regarding point 3: the Directive is effective only for certain firms, namely those that exceed a certain size threshold. Thus, although the Directive concerns all European companies, it is "really" effective only for the largest of them. I have sorted out which European companies actually have to prepare a report under the law. Should I drop the other European companies, which are not directly affected, even though the Directive is, in principle, valid for all European companies (but binding if and only if the threshold is exceeded)? I wonder whether the treatment should be the Directive per se, and thus all European companies, or only those that are really affected by the law and must prepare a report.

I have another question regarding Stata code. I am trying to run the following commands:
sort isin year
by isin: keep if (year == 2015 | year == 2016 | year == 2017) & (year == 2018 | year == 2019)

It drops all of my observations. What went wrong here? :-(

Kind regards,
Linda
Linda Debert

Join Date: Jan 2021

Posts: 10
#4

23 Jan 2021, 05:32

And just to make sure :-): it is not necessary to "balance" my unbalanced sample, for example by dropping the companies/observations that are not present in all five years, or by using imputation or another method to balance it? For the diff-in-diff I can leave the sample as it is? This will not matter?

Kind regards,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

23 Jan 2021, 13:22

Working in reverse of the order asked:

it is not necessary to "balance" my unbalanced sample, for example by dropping the companies/observations that are not present in all five years, or by using imputation or another method to balance it?

No, it is not necessary to balance the sample. There is no advantage at all to having a balanced sample here.

by isin: keep if (year == 2015 | year == 2016 | year == 2017) & (year == 2018 | year == 2019)
It drops all of my observations. What went wrong here? :-(

Think about what you have coded here. If year is 2015, 2016, or 2017, then it cannot be 2018 or 2019. And vice versa. So no value of year ever satisfies both of those conditions. I think what you want to do is:

Code:

by isin: keep if inrange(year, 2015, 2019)
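If the aim is instead the one described in #1 and #3, namely to keep only firms observed at least once before 2018 and at least once from 2018 on, the condition has to be evaluated per firm rather than per observation. A minimal sketch on a toy data set follows; the three firms and the helper variables has_pre and has_post are made up for illustration.

```stata
clear
input str1 isin int year
"A" 2015
"A" 2016
"A" 2018
"B" 2015
"B" 2016
"C" 2018
"C" 2019
end

* Flag, within each firm, whether any pre-period or post-period year exists
bysort isin: egen byte has_pre  = max(inrange(year, 2015, 2017))
bysort isin: egen byte has_post = max(inrange(year, 2018, 2019))

* Keep all observations of firms that appear in both periods
keep if has_pre & has_post

list isin year, sepby(isin)
assert isin == "A"                      // only firm A is seen in both periods
```

Here firm B (pre-period only) and firm C (post-period only) are dropped, and all three observations of firm A are kept.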

Should I drop the other European companies, which are not directly affected, even though the Directive is, in principle, valid for all European companies (but binding if and only if the threshold is exceeded)? I wonder whether the treatment should be the Directive per se, and thus all European companies, or only those that are really affected by the law and must prepare a report.

Only you can answer that question. What is your research goal? Are you trying to evaluate the effect of adopting that policy, or are you trying to evaluate the effect of having to file a report? Because of the way the policy was designed, those are different conditions. Which is the target of your research question? Those two effects will probably be different, at least quantitatively. If nothing else, since the policy is broader than the actual reporting, the effect of the policy will be "diluted out" somewhat by the firms that are not required to report. On the other hand, you might also consider whether having some firms required to file reports results in spillover effects on other firms in the same jurisdictions. These are all substantive considerations, not statistical ones, and only you or somebody very familiar with your research goals can answer them.

Just one point to be clear, though: whatever decision you make about including or excluding the companies small enough to be exempt from reporting, make sure you do the exact same thing in the control jurisdictions!
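To make "the exact same thing" concrete, here is a small sketch under stated assumptions: size is measured by a hypothetical variable employees, the cutoff of 500 is a placeholder, and firms are classified by their largest pre-2018 size. None of these choices comes from the thread; the point is only that one rule is applied to both jurisdictions.

```stata
clear
input str3 isin byte eu int year long employees
"EU1" 1 2016  900
"EU1" 1 2018  950
"EU2" 1 2016  120
"EU2" 1 2018  130
"US1" 0 2016 1500
"US1" 0 2018 1600
"US2" 0 2016   80
"US2" 0 2018   90
end

scalar cutoff = 500                     // hypothetical size threshold

* Classify each firm by its largest pre-2018 size (an assumption of this sketch)
bysort isin: egen long size_pre = max(cond(year < 2018, employees, .))
gen byte large = size_pre > scalar(cutoff) if !missing(size_pre)

* Option A: restrict BOTH jurisdictions to firms above the threshold
preserve
keep if large == 1
gen byte treated = eu
tabulate treated large
assert _N == 4                          // EU1 and US1 remain, two years each
restore

* Option B: keep small firms in BOTH jurisdictions
gen byte treated = eu
tabulate treated large
```

Which option is appropriate depends on the research question discussed above; the code only enforces that the size rule is identical for the European and the control firms.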

So, you would keep the observations which are not present in both pre- and post?

I would keep them, yes. I would not strongly criticize somebody for making the opposite decision, but I, myself, would keep them. I am averse to throwing away data unless I have reason to believe it is invalid or outside the universe of my study.
Linda Debert

Join Date: Jan 2021

Posts: 10
#6

24 Jan 2021, 11:06

Dear Clyde,

thank you very much. Once my preparation of the European data is finished, I intend to match it with my controls by using propensity score matching. If I choose the "right" size variables for matching, is this procedure what you mean when saying "make sure you do the exact same thing in the control jurisdictions"? Or do I have to prepare the control group similarly to the European sample before matching?

Kind regards,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

24 Jan 2021, 12:30

Well, it appears that size has a very large impact in your data set because small firms were not subject to the reporting requirement. So I would not rely on propensity score matching to deal with this, because it would allow some small firms to be matched with large firms and vice versa, if they agreed closely enough on other variables in the propensity formula.

What I originally meant, and still assert, is that if you include small European firms in the treated group, then you must also include small non-European firms in the control group. If you exclude them from the European sample, then you should similarly exclude them from the control sample. I would now add one more precaution: if you decide to include the small firms in the analysis, I recommend you separately do propensity score matching among the large and among the small to assure that no small firms are matched with large firms or vice versa.
Linda Debert

Join Date: Jan 2021

Posts: 10
#8

25 Jan 2021, 14:47

Dear Clyde,

again, thank you so so much for your comments and the way you make me think about my questions. This is really helpful to me.

kind regards,
Linda
Linda Debert

Join Date: Jan 2021

Posts: 10
#9

26 Jan 2021, 12:13

Dear Clyde,

I figured out how to test whether the missing values are MCAR by using mcartest. Unfortunately, this test shows that the missing values are not MCAR, both in the EU sample (only when including a categorical variable) and in the US sample. I did not expect this result. So, under these circumstances, can I still use a diff-in-diff analysis? Or should I treat the missing values first?

Second, as I said before, my European sample has a 1 for the treatment indicator both pre and post. I already told you that only some of these firms really get the treatment while others are too small. Is it nevertheless okay to give the European pre-observations a treatment indicator of 1, or will this procedure bias my results?

best,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

26 Jan 2021, 12:58

There are no good solutions to the problem of missing data. One tries to find the least bad solution for the situation at hand.

You actually shouldn't be surprised that your data are not MCAR. MCAR is pretty uncommon in the real world, and usually results from some kind of easily recognizable "act of God" interfering with the collection of data. For example, medical laboratory results might be MCAR if the missingness was due to an accident in the laboratory in which one tray of specimens was dropped and the specimen containers broke. The other common kind of MCAR arises when there is an ongoing stream of data over time, but the data set just terminates data collection at some earlier date that is chosen in a totally exogenous way. Other than that, it is pretty unusual for a real world data set to have data MCAR.

It is, in part, because MCAR is so unusual in real life that people have tried to develop ways of dealing with data that are not MCAR. A technique like multiple imputation relies on the much weaker assumption of MAR. It requires a fair amount of judgment, including a deep understanding of the nature of the variables in your data set, to decide whether MAR is a reasonable assumption for your data. Note that there is no statistical test for MAR. If you encounter a command that purports to test for this, it is based on a misunderstanding. The MAR assumption is based on your understanding of the relationship between the variables with missing data and the values of other variables in your data set, and in particular depends on the inherently unknowable values for the observations that are, in fact, missing. Multiple imputation is very popular because the MAR assumption is frequently reasonable, though I must say that it is also sometimes applied to situations where no thought has been given to that underlying assumption. You probably will want to confer with somebody in your discipline about whether this would be an appropriate approach to your situation.
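For readers who have not used Stata's mi commands, the following is a minimal sketch of the multiple imputation workflow on simulated data. All variable names and numbers are hypothetical, and the data are MAR by construction, which is exactly the assumption that cannot be checked in real data. The sketch also ignores the panel structure, which a real analysis would need to address.

```stata
clear
set seed 2021
set obs 400
gen byte treated = _n <= 200
gen byte post = mod(_n, 2)
gen double size = rnormal()
gen double y = 1 + 0.5*post + 1.0*treated*post + 0.3*size + rnormal()

* Missingness in size depends only on the observed variable post
replace size = . if runiform() < cond(post, 0.25, 0.05)

* Declare the mi structure and the roles of the variables
mi set mlong
mi register imputed size
mi register regular y treated post

* Imputation model includes the outcome and the DID terms
mi impute regress size y i.treated##i.post, add(20) rseed(2021)

* Fit the DID model in each imputed data set and combine the results
mi estimate: regress y i.treated##i.post size
```

The number of imputations (20) and the linear imputation model are arbitrary choices for the example; the appropriate imputation model depends on the variables that are actually missing.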

If that is not a good solution for your data, then another approach, which doesn't require any strong assumptions about the missingness mechanism, is to do a robustness analysis in which various approaches to imputing the missing data are used (best case scenarios, worst case scenarios, etc.) to see how sensitive the results are to the unknown missing values in the data.
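One crude way to set up such a best case/worst case comparison is sketched below on simulated data, with hypothetical variable names throughout. Missing outcomes are filled in with the observed minimum or maximum, arranged so that one scenario pushes the DID estimate as low as those fill-in values allow and the other pushes it as high. Using the observed extremes as fill-in values is an assumption of this sketch, not a recommendation from the thread.

```stata
clear
set seed 2021
set obs 400
gen byte treated = _n <= 200
gen byte post = mod(_n, 2)
gen double y = 1 + 0.5*post + 1.0*treated*post + rnormal()
replace y = . if runiform() < 0.15      // about 15% of outcomes go missing

summarize y
scalar y_lo = r(min)
scalar y_hi = r(max)

* DID = (T_post - T_pre) - (C_post - C_pre). Low values in the T_post and
* C_pre cells and high values in the other two cells pull the estimate down.
gen double y_low  = cond(missing(y), cond(treated == post, y_lo, y_hi), y)
gen double y_high = cond(missing(y), cond(treated == post, y_hi, y_lo), y)

regress y i.treated##i.post             // complete cases only
scalar b_cc = _b[1.treated#1.post]
regress y_low i.treated##i.post
scalar b_low = _b[1.treated#1.post]
regress y_high i.treated##i.post
scalar b_high = _b[1.treated#1.post]

display "complete cases: " b_cc
display "low scenario:   " b_low
display "high scenario:  " b_high
assert b_low <= b_high
```

If the substantive conclusion survives across such scenarios, the missing data are less of a concern; if the sign or size of the estimate changes, the results depend heavily on values that were never observed.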

The DID estimation is neither more nor less vulnerable to the effects of missing data than any other analysis. So you can be as confident, or timorous, as you would be about using any other analytic approach in this situation.

Second, as I said before, my European sample has a 1 for the treatment indicator both pre and post. I already told you that only some of these firms really get the treatment while others are too small. Is it nevertheless okay to give the European pre-observations a treatment indicator of 1, or will this procedure bias my results?

If there is a difference between this question and what I responded to in #5, I am missing it. If it's the same question, I stand by my previous answer.
Linda Debert

Join Date: Jan 2021

Posts: 10
#11

29 Jan 2021, 17:45

Dear Clyde,

OK, thank you very much. And Stata will omit all observations with missing data in one or more variables by default when doing an analysis, right? Can I change this?

Best,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

29 Jan 2021, 17:49

And Stata will omit all observations with missing data in one or more variables by default when doing an analysis, right? Can I change this?

That is right, and, no, there is no way to change that.
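A small sketch of how to see this behaviour for a given model, using a made-up data set with variables y, x1 and x2: after estimation, e(N) reports how many observations were used and e(sample) marks which ones.

```stata
clear
input y x1 x2
1 2 3
2 . 1
3 4 .
4 1 2
5 3 5
. 2 2
6 5 1
end

misstable summarize y x1 x2             // missing values per variable

regress y x1 x2
display "observations used: " e(N)
gen byte used = e(sample)               // 1 if the row entered the estimation
list y x1 x2 used

assert e(N) == 4                        // the three incomplete rows are omitted
```

Only rows with no missing value in any model variable enter the estimation, so the effective sample can shrink each time a variable is added to the model.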
Linda Debert

Join Date: Jan 2021

Posts: 10
#13

03 Feb 2021, 13:56

Ok :-) thank you so much.
Linda Debert

Join Date: Jan 2021

Posts: 10
#14

05 Feb 2021, 04:30

Dear Clyde,

I am trying to use psmatch2, and it appears to work. The program creates 8 variables, and I can see which of the treated observations get a partner among the controls. I use 1-1 matching since my control group is huge, and I end up with nearly 400 treated observations with a partner. Now I am wondering how to isolate these observations (800 in total) from the rest of my sample. I know how to drop observations without common support, but not how to drop those observations in my treated and untreated groups that do not have a partner. Can you please help me? Is it necessary to drop them, or does Stata work only with the observations that found a match in my further analysis (e.g. in a descriptive analysis of the data and the subsequent diff-in-diff)?

Best,
Linda
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#15

05 Feb 2021, 12:28

It is best not to address questions to a particular person, unless it is to follow up on that person's previous post to you. Here you have raised a new issue, on which I have not previously commented. As it happens, I have not used -psmatch2- in a very long time and I barely remember anything about how it works. I don't know the answer to your question. By beginning with "Dear Clyde" you may have discouraged others who could help you from even reading the rest of #14.

Also, this question is unrelated to the topic of this thread, except to the extent that it is a question arising in the same project you are working on. But it is important to keep threads on topic here. They are not just dialogs between a questioner and a respondent. They are public discussions. Other people come to the Forum and read threads whose titles interest them. Still others come to the Forum and do searches for specific questions. When a thread goes off topic, those people get misdirected (or can't find what they're looking for because the newer topic isn't captured in the thread title) and their time is wasted.

My suggestion is that you re-post this question in a new thread, and not addressed to anyone in particular.
