General diff-in-diff queries

Jonathan Eklund

Join Date: Sep 2017

Posts: 10
#1

29 Sep 2017, 03:35

Hi everyone,

I have general questions about the design of a difference-in-differences model.

I have data on the prices of roughly 300 products between 2011 and 2016 (one observation per product per year).

In 2016, a new pricing policy was introduced which was meant to put pressure on companies to drop their prices; this particular policy only applied to 33 of the products (~10% of total sample). This means I have 5 pre-treatment observations and 1 post-treatment observation.

I would like to use a difference-in-differences design to study the impact of this policy on prices, but I first have a few questions about how to organize the dataset and run the model.

I have an unbalanced dataset. Among the 33 products in the “treatment group”, 28 have data for both 2015 and 2016 (allowing for an estimate of the treatment effect). However, the number of pre-treatment observations varies between products. For example, product A might have observations for 2011-2016 (each year), product B only for 2011, 2013, 2014, 2015, 2016, whereas product C might have observations just for 2015 and 2016. Five of the products which were subjected to treatment do not have any data for 2016 (post-treated period), only for some or all of the pre-treatment periods.

Among the products in the “control group”, the situation is similar. A small subset of products do not have observations in both 2015 and 2016, whereas for others the number of pre-treatment observations varies. The reason for the unbalanced panel dataset is that not all products were purchased in each year.

How should I organize my dataset to run the diff-in-diff model?

Should I start by dropping all products which do not at least have data for both 2015 and 2016? Do I need to restrict the analysis to those products which have an equal number of pre-treatment periods?

These questions also relate to how I should graphically illustrate the price trends in order to examine whether the parallel trends assumption holds. Which products should I include when graphing the prices, and do you think that taking the average price is suitable?

I was then going to run a diff-in-diff model like the following

Code:

xtset product_id year
xtreg price_ln treated did i.year, vce(cluster product_id) fe

where “price_ln” is the natural log of the inflation-adjusted prices (outcome variable), “treated” is an indicator variable for those products subjected to the new policy, “did” is the variable of interest (I created this by multiplying treated*post, where “post” is a binary variable for the post-treatment period), and “year” is a time-fixed effect. I was planning to cluster the standard errors by the product ID to account for any heteroscedasticity or autocorrelation.

Any input would be greatly appreciated.

Last edited by Jonathan Eklund; 29 Sep 2017, 03:39.
Tags: None
Jonathan Eklund

Join Date: Sep 2017

Posts: 10
#2

29 Sep 2017, 04:45

I have two more questions I forgot to mention above:
Is it reasonable to include those products which were not influenced by the policy in the control group? The products in the control group by definition differ from those in the treated arm, because this particular policy cannot be applied to any product (i.e., it is not possible to randomly apply this policy to any product). In other words, the products in the treated arm therefore differ systematically from those in the control group. However, a priori, I would not expect there to be any difference in the price trends until the policy is implemented. My understanding of diff-in-diff models is that it is okay for there to be observed and unobserved differences between the two groups, as long as the trends are similar in the pre-treatment periods so that the diff-in-diff estimate removes the effect of market factors that affect both groups.

Is the sample size in the treated arm sufficient? This dataset includes all of the products in an entire country which are affected by this particular policy.

Last edited by Jonathan Eklund; 29 Sep 2017, 04:53.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

29 Sep 2017, 08:17

There is no requirement that you have equal numbers of observations for different entities in your study. Don't give that much thought. The only way in which it might be relevant is if the process that causes the amount of available data on different products to be different is also associated with price. In that case you have endogenous missingness, biased selection for inclusion, which is a serious problem. But balance per se is not an issue.

You don't seem to be worried about the fact that you have only one observation in the post-intervention era. But that's a big problem. It doesn't preclude doing the analysis, but it means that the only kind of effect you can examine with this model is a shift in price following the intervention; you cannot test for a change in trend, because you do not measure post-treatment trend at all. Are you really sure this whole research question is ripe for investigation?

Those observations that have only pre-intervention or post-intervention observations are not informative about the treatment effect. But you do not need to drop them from the data. They still provide information about the outcome distribution, and Stata will handle them appropriately.
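As a rough illustration of the point above, the following sketch flags which products are observed on both sides of the policy date without dropping anything. The data are simulated stand-ins for the thread's panel, and the names has_pre, has_post, informative and tag are hypothetical helpers, not variables from the original dataset.

```stata
* Hypothetical unbalanced panel standing in for the thread's data
clear
set seed 2017
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
expand 6
bysort product_id: generate year = 2010 + _n
drop if runiform() < 0.2                 // products not purchased in some years
generate byte post = year == 2016

* Flag products observed both before and after the policy
bysort product_id: egen byte has_pre  = max(post == 0)
bysort product_id: egen byte has_post = max(post == 1)
generate byte informative = has_pre & has_post

* One row per product: how many in each arm help identify the effect?
egen byte tag = tag(product_id)
tabulate treated informative if tag

* Pattern of observed years across products
xtset product_id year
xtdescribe
```

The flag is only descriptive here; all observations can stay in the estimation sample, as the reply above says.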

Don't calculate your own did variable. Use factor variable notation. Also your model is mis-specified because it omits the post variable itself. Here is how to do this correctly with factor-variable notation.

Code:

xtset product_id year
xtreg price_ln i.treated##i.post i.year, vce(cluster product_id) fe

Note: Stata will omit the treated variable because it is non-varying within product_id. Also, in addition to having one year omitted as the reference category for year, you will also lose another year due to collinearity with post. Neither of these things is any cause for concern.

See -help fvvarlist-. By doing this, you will be able to then use the -margins- command after you run -xtreg-:

Code:

margins treated#post
margins treated, dydx(post) pwcompare(cimargins effects)

This will give you the predicted value of price_ln in each group before and after, as well as the marginal effect of transition to the post-intervention era in each group, and then the difference between those (the did estimator of treatment effect). See -help margins-.
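For reference, the model being fitted can be written out as below. The notation is an assumption introduced here for illustration: \alpha_i is the product fixed effect, \gamma_t the year effect, and \delta the coefficient on the treated-by-post interaction.

```latex
\ln(\text{price}_{it}) = \alpha_i + \gamma_t
    + \delta\,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}

\delta = \bigl(E[y \mid \text{treated}=1, \text{post}=1] - E[y \mid \text{treated}=1, \text{post}=0]\bigr)
       - \bigl(E[y \mid \text{treated}=0, \text{post}=1] - E[y \mid \text{treated}=0, \text{post}=0]\bigr)
```

Written this way, it is easier to see why the main effect of treated is absorbed by \alpha_i and the main effect of post by the year effects \gamma_t, while \delta remains identified.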

DID analysis is commonly used in non-experimental settings. It is understood that there may well be systematic differences between the controls and the cases. To the extent, however, that the cases and controls have parallel trends prior to intervention, it is reasonable to believe that the DID estimator will provide an estimate of the true causal effect. It is not guaranteed, but when randomization is not possible, it is often the best one can do.
2 likes
Comment
Jonathan Eklund

Join Date: Sep 2017

Posts: 10
#4

02 Oct 2017, 05:11

Dear Clyde,

Many thanks for taking the time to give such a comprehensive response. I really appreciate it.

Running your model code above, I get a significant decline in prices in the treated group (interaction term). However, I am not able to run the margins commands. I get an error message that the margins are "not estimable".

I agree that the lack of data in the post-intervention period is problematic. It would be much better to have more post-treatment observations to estimate changes in the price trends. However, I have been asked to estimate preliminary results on the impact of the policy on prices, with the important caveat that this question should be revisited as additional years of data become available.

I should say that the products in question are medicines and the policy in question is a reference pricing system. This means that a particular health insurer groups medicines by therapeutic class (medicines within a class are viewed as being therapeutically equivalent and thus perfectly substitutable for most patients), and then agrees to reimburse only the cheapest price in the class.

I have a few follow-up questions:
What would be an appropriate way to graphically represent price trends (parallel trends assumption)? Other studies (examples 1 2 3) separately plot the average price among the treated and control medicines. However, it seems like these averages would be very susceptible to extreme prices. For example, if new on-patent medicines are purchased (which are more likely to end up in the control group because of how the pricing policy is designed), then this would raise the average price in the control group. This could then potentially lead to the parallel trends assumption not being met, but this would just be due to the changing composition of the control and treatment arms.
How can I account for this? Should I just be plotting the trends of those products for which we have at least a certain number of years of data (even if all of the data points can be used in the actual model, since as you said Stata will adjust for an unbalanced panel)?

To test empirically whether the parallel trends assumption is met, is one option to code 2015 as being in the post-intervention period (i.e., the variable "post" would take on a value of 1 for both 2015 and 2016, not just 2016) and then re-run the model?
When I do this, I also get a statistically significant effect for the interaction term (i.treated##i.post) in 2015, which doesn’t make much sense since nothing happened in that year that should lead to systematic differences. However, I’m getting a bit confused about what’s being picked up by the fixed effects, year trends, and interaction term.

Also, the effect of the policy on medicine prices would also differ depending on how long a product’s been on the market and its price prior to the intervention. For example, a generic medicine which has been around for a long time and is very cheap in the year prior to the intervention cannot, in general, drop as much in price as a newer product with a higher price. In other words, there is more room for price drops for products which have gone off patent more recently. Is this not something I would need to account for also, especially taking into account my questions above about the changing compositions of the groups?

I hope these questions make sense. Please don’t hesitate to let me know if you require any other details or explanations.

Last edited by Jonathan Eklund; 02 Oct 2017, 05:14.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

02 Oct 2017, 08:36

OK. Sorry about the -margins- problems. With -fe- estimation, this problem commonly arises. You can get the results by adding the -noestimcheck- option to the -margins- command. (Note: You should not do this indiscriminately whenever any -margins- command comes back with "not estimable." Usually, that message means there is a problem. But in fixed effects estimation, this problem is expected, and the use of -noestimcheck- is OK. You do have to understand, however, that in any fixed effects model with a constant, the fixed effects are not actually identifiable, consequently the predicted margins are also not properly identified. BUT the differences between them based on treatment vs control group or pre vs post are identified and can be relied upon.)
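A minimal sketch of the workflow described above, run on simulated data so that it can be executed as-is. The data-generating numbers (300 products, 33 treated, a true effect of -0.30 on log price) and the variable alpha are assumptions for illustration only; the estimation commands follow the thread, with -noestimcheck- added to -margins-.

```stata
* Simulated stand-in for the thread's panel (all numbers hypothetical)
clear
set seed 12345
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
generate alpha = rnormal()               // product-level fixed effect
expand 6
bysort product_id: generate year = 2010 + _n
generate byte post = year == 2016
generate price_ln = alpha + 0.02*(year - 2011) - 0.30*treated*post + rnormal(0, 0.1)

xtset product_id year
xtreg price_ln i.treated##i.post i.year, vce(cluster product_id) fe

* Predicted margins: levels are not identified under -fe-, differences are
margins treated#post, noestimcheck
margins treated, dydx(post) noestimcheck

* The DID estimate is the interaction coefficient itself
lincom 1.treated#1.post
```

The levels printed by the first -margins- call should be read with the caveat given above; the quantity of interest is the difference between the two marginal effects of post, which matches the -lincom- result.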

As for what happens when you change the pre-post breakpoint to 2015, I think that the reason you are still getting an effect is that the 2016 measurements (which, in fact, do occur after the intervention) are still part of the post-intervention period in this model, and the differences are large enough that even when diluted with the 2015 measurements, the difference remains large enough to detect.

My own preference for verifying the parallel trends assumption is to do it graphically. You raise an interesting question about plotting the means in the presence of structural outliers. If your sample is large enough, and if the outlier drugs are about equally likely to be in either the treatment or control groups, then I don't think this is a worry. But price distributions can have really, really long tails, and even samples of several hundred or low thousands of observations can easily hit or miss them in ways that leave things unbalanced. The law of large numbers does win out in the asymptotic long run, but these distributions are very funky and the N needed for the LLN to rule can be astronomical. So in this case you might be better off looking at medians.
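One way the median-based plot could be produced is sketched below, again on simulated data. The variable names mean_ln and med_ln and the simulated numbers are hypothetical; only product_id, year, treated and price_ln come from the thread.

```stata
* Simulated stand-in for the thread's panel (all numbers hypothetical)
clear
set seed 12345
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
generate alpha = rnormal()
expand 6
bysort product_id: generate year = 2010 + _n
generate price_ln = alpha + 0.02*(year - 2011) - 0.30*treated*(year == 2016) + rnormal(0, 0.1)

* Group-by-year means and medians, then plot the medians
preserve
collapse (mean) mean_ln = price_ln (median) med_ln = price_ln, by(treated year)
twoway (connected med_ln year if treated == 0) ///
       (connected med_ln year if treated == 1), ///
       xline(2015.5) ytitle("Median log price") ///
       legend(order(1 "Control" 2 "Treated"))
restore
```

Swapping med_ln for mean_ln in the -twoway- call gives the mean-based version, so the two can be compared side by side.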

More broadly, it does seem that you would be better off making your model a bit more complicated to take into account the likely different effect of the intervention on new and old drugs. I really don't know what the best way to do that is. Generically, you need some variable(s) identifying the age of the drug. But whether that should be some categorical variable (and if so, whether it's a dichotomy or involves several levels, and where the cutpoints for those levels might be) or whether it should be the actual age of the drug itself (and how is that defined operationally?) or some transform of age requires a deeper knowledge of the dynamics that drive drug pricing than I have. I think you need to consult with some colleagues who have real expertise in pharmacoeconomics about this. It's over my head, other than to say that, yes, it does seem you need to build this into the model. Once you know how to represent this in variables, it is highly likely that it comes into the model interacting with i.treated##i.pre_post.
Comment
Jonathan Eklund

Join Date: Sep 2017

Posts: 10
#6

03 Oct 2017, 09:01

Thank you for your continued help. I will work on the model a bit, and may come back with some more questions. Have a nice day.
Comment
Dani Vasquez

Join Date: Nov 2017

Posts: 5
#7

03 Nov 2017, 04:15

Dear Statalist, my question fits under this post, as I have a very general question on applying diff-in-diff. I have longitudinal data without pre-treatment information. Can I use diff-in-diff even though I have no pre-treatment data? I would set the first year as the baseline to estimate the impact of a policy on the outcome variable. Thanks in advance!
Comment
Jonatan Eklund

Join Date: Jan 2018

Posts: 3
#8

04 Jan 2018, 08:56

Hi, I have a follow-up query to my initial questions above. I thought it’d be best to ask my new query here so readers can read the background above if they wish, but could a moderator please let me know if I should start a new thread.

I have run the difference-in-differences analysis described above, and the results make sense. As a sensitivity analysis, I followed the approach suggested by Bertrand and colleagues (reference below) and collapsed all the pre-intervention data into one period to address any potential serial correlation. Again, the results hold at the 1% significance level.
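The collapsing step might look something like the sketch below. This is a guess at the mechanics on simulated data, not the code actually used for the results reported here: each product's observations are averaged within the pre and post periods, leaving at most two rows per product.

```stata
* Simulated stand-in for the thread's panel (all numbers hypothetical)
clear
set seed 12345
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
generate alpha = rnormal()
expand 6
bysort product_id: generate year = 2010 + _n
generate byte post = year == 2016
generate price_ln = alpha + 0.02*(year - 2011) - 0.30*treated*post + rnormal(0, 0.1)

* Average within product and period, then re-estimate on two periods
collapse (mean) price_ln, by(product_id treated post)
xtset product_id post
xtreg price_ln i.treated##i.post, fe vce(cluster product_id)
```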

The only thing I am having trouble with is how to present the information regarding the parallel trends assumption, which was discussed above briefly. Notably, Clyde Schechter said the following:

"My own preference for verifying the parallel trends assumption is to do it graphically. You raise an interesting question about plotting the means in the presence of structural outliers. If your sample is large enough, and if the outlier drugs are about equally likely to be in either the treatment or control groups, then I don't think this is a worry. But price distributions can have really, really long tails, and even samples of several hundred or low thousands of observations can easily hit or miss them in ways that leave things unbalanced. The law of large numbers does win out in the asymptotic long run, but these distributions are very funky and the N needed for the LLN to rule can be astronomical. So in this case you might be better off looking at medians."

So, when I plot both the median and mean prices over time (attached), the graphs show largely parallel trends (including in the intervention year – the last one in the graph). But the difficulty with these data is that because medicines can go off-patent at various times, you can end up with an unbalanced pattern if you do not control for the off-patent status of a drug (i.e., whether it is still patent-protected or available in generic form). I think it would be important to control for any relevant characteristic that can change over time at different rates between treated/control groups and is likely to impact prices.

So, in my model I controlled for the generic status of medicines. When I do this, I get a large effect in the year of the intervention, but when I look for treatment leads (i.e., to see if there was any significant "treatment effect" in any of the earlier years), none of the effects are significant at the 5% level. This is what I would have expected to find. In other words, only the interaction between treated units and the treatment year was significant, but not any interactions with the earlier years prior to the intervention.

Are these "placebo tests" a suitable way to demonstrate that the parallel trend holds, since otherwise the results would have shown a significant treatment effect in other years? I’m not sure how to plot graphically prices controlling for any changes over time in the composition of the two groups with respect to patent status.

Any insights on this matter would be much appreciated. Also, please let me know if something is not clear or if I should elaborate. Thank you.

(As an aside, Dani, you are unable to run a difference-in-differences analysis without pre-intervention data.)

Bertrand M, Duflo E, Mullainathan S (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1): 249-275.

Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#9

04 Jan 2018, 09:11

I think your "placebo tests" are fine and I would go with that. My preference for doing this graphically is just that, a preference, not a law, not even a guideline. What you've done is fine and convincing, and while it is possible to graphically deal with this situation, I think it would be a lot of unnecessary extra work.
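For readers who want to reproduce this kind of lead ("placebo") test, one possible specification is sketched below on simulated data. It interacts treated with every year, using 2015 as the reference year, and then jointly tests the pre-policy interactions. The simulated numbers are hypothetical, and the control for generic status discussed above is left out to keep the sketch short.

```stata
* Simulated stand-in for the thread's panel (all numbers hypothetical)
clear
set seed 12345
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
generate alpha = rnormal()
expand 6
bysort product_id: generate year = 2010 + _n
generate price_ln = alpha + 0.02*(year - 2011) - 0.30*treated*(year == 2016) + rnormal(0, 0.1)

xtset product_id year

* Treated-by-year interactions, with 2015 as the base year
xtreg price_ln i.treated##ib2015.year, fe vce(cluster product_id)

* Joint test that all pre-policy leads are zero
test 1.treated#2011.year 1.treated#2012.year 1.treated#2013.year 1.treated#2014.year
```

In this setup the coefficient on 1.treated#2016.year plays the role of the treatment effect, and the joint test summarizes the leads in a single statistic rather than four separate ones.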
Comment
Jonatan Eklund

Join Date: Jan 2018

Posts: 3
#10

05 Jan 2018, 03:07

Dear Clyde,
Thank you very much for taking the time to help me with these queries.
Best wishes,
Jonathan
Comment
Jonatan Eklund

Join Date: Jan 2018

Posts: 3
#11

10 Jan 2018, 03:44

Sorry to revive this thread once more. I realized I had one final question and then I think I have all the pieces clear in my head.

I included a fixed effects (fe) in all of the models I ran, which, if I understand correctly, controls for any individual characteristics of units—in both treatment and control arms—that affect price levels but do not vary over time. This includes characteristics such as the therapeutic value of a medicine, strength, etc. I think this is important to get more precise diff-in-diff estimates.

Is this a correct interpretation (and is my rationale for including a fixed effect correct)? Also, what is the difference between including a fixed effects and including all relevant time-invariant covariates like therapeutic area, strength, etc.? Does this achieve the same thing (in terms of the value of the difference-in-differences estimator)? My guess is that fixed effects is a good option when you can't adequately control for all the characteristics (for example, I don't know how I would adequately control for therapeutic value).

My confusion stems from the fact that various papers seem to include or not include any fixed effects (often without including a rationale), and researchers sometimes use different terms such as "product fixed effects" or "control for time-invariant product characteristics" which I take to mean the same thing.

Finally, is it inappropriate to include both a fixed effects and time-invariant covariates? I am thinking it is okay to include something like generic status (which varies over time) and a fixed effect, but not something like therapeutic area and a fixed effect since then you'll run into issues of collinearity. However, this is something I have seen done in other papers.

Thank you.

Last edited by Jonatan Eklund; 10 Jan 2018, 04:04.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#12

10 Jan 2018, 08:18

I included a fixed effects (fe) in all of the models I ran, which, if I understand correctly, controls for any individual characteristics of units—in both treatment and control arms—that affect price levels but do not vary over time. This includes characteristics such as the therapeutic value of a medicine, strength, etc. I think this is important to get more precise diff-in-diff estimates.

This is correct. The use of fixed-effects automatically results in adjustment for any time-invariant attribute of the units represented by the fixed effects.
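The reason can be seen from the within transformation that the fixed-effects estimator relies on. The notation below is introduced here for illustration: x_{it} collects time-varying regressors, z_i collects time-invariant ones, and bars denote product-level means over time.

```latex
y_{it} = \alpha_i + x_{it}'\beta + z_i'\theta + \varepsilon_{it}

\bar{y}_i = \alpha_i + \bar{x}_i'\beta + z_i'\theta + \bar{\varepsilon}_i

y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)
```

Subtracting the product mean removes both \alpha_i and z_i'\theta, so anything constant within a product is differenced out whether or not it was measured.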

Also, what is the difference between including a fixed effects and including all relevant time-invariant covariates like therapeutic area, strength, etc.? Does this achieve the same thing (in terms of the value of the difference-in-differences estimator)?

In theory, these would be the same. In reality, you do not have measures of all the relevant time-invariant attributes that you would need. In fact, in reality, you may not even know of the existence of some of them. The use of fixed-effects controls for them nevertheless, whether they are observed, observable but unobserved, unobservable, or unrecognized.

My confusion stems from the fact that various papers seem to include or not include any fixed effects (often without including a rationale), and researchers sometimes use different terms such as "product fixed effects" or "control for time-invariant product characteristics" which I take to mean the same thing.

The quality of methods reporting in the published literature is quite variable, and much that makes it into print is very vaguely described.

Finally, is it inappropriate to include both a fixed effects and time-invariant covariates? I am thinking it is okay to include something like generic status (which varies over time) and a fixed effect, but not something like therapeutic area and a fixed effect since then you'll run into issues of collinearity. However, this is something I have seen done in other papers.

I don't understand this question: you seem to be contradicting yourself here. If you attempt to add a time-invariant covariate to a model with fixed effects you will introduce collinearity. Stata (or any other software package) will notice this and will resolve the problem by dropping a variable: either one of the levels of the fixed effects will be omitted, or the time-invariant covariate will be omitted. Which it drops is not guaranteed to be predictable, although experience suggests that in Stata the one mentioned last in the varlist is usually omitted (and fixed effects can be thought of as always being last.)

But then you go on to talk about a covariate that does vary over time, like generic status. This will not introduce any problems and is recommended if you think it is relevant to your outcome variable.
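The contrast between the two kinds of covariate can be checked on simulated data, as in the sketch below. The variables area (a time-invariant therapeutic area) and generic (a status that switches on in a product-specific year) are hypothetical constructions, as are all the numbers.

```stata
* Simulated stand-in for the thread's panel (all numbers hypothetical)
clear
set seed 12345
set obs 300
generate product_id = _n
generate byte treated = product_id <= 33
generate byte area = mod(product_id, 4)            // time-invariant covariate
generate alpha = rnormal() + 0.3*area
expand 6
bysort product_id: generate year = 2010 + _n
generate byte post = year == 2016
generate byte generic = year >= 2011 + mod(product_id, 7)   // time-varying
generate price_ln = alpha + 0.02*(year - 2011) - 0.50*generic ///
    - 0.30*treated*post + rnormal(0, 0.1)

xtset product_id year

* area is collinear with the product fixed effects; generic is not
xtreg price_ln i.treated##i.post i.area i.generic i.year, fe vce(cluster product_id)
```

In the output, the area levels should be reported as omitted, while generic gets an ordinary coefficient because it varies within product.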
Comment
