Hi everyone,
I have general questions about the design of a difference-in-differences model.
I have data on the prices of roughly 300 products between 2011 and 2016 (one observation per product per year).
In 2016, a new pricing policy was introduced which was meant to put pressure on companies to lower their prices; this policy applied to only 33 of the products (~10% of the total sample). This means I have up to 5 pre-treatment observations and 1 post-treatment observation per product.
I would like to use a difference-in-differences design to study the impact of this policy on prices, but I first have a few questions about how to organize the dataset and run the model.
I have an unbalanced dataset. Among the 33 products in the “treatment group”, 28 have data for both 2015 and 2016 (allowing for an estimate of the treatment effect). However, the number of pre-treatment observations varies between products. For example, product A might have observations for every year from 2011 to 2016, product B only for 2011, 2013, 2014, 2015 and 2016, and product C just for 2015 and 2016. The remaining five treated products do not have any data for 2016 (the post-treatment period), only for some or all of the pre-treatment periods.
Among the products in the “control group”, the situation is similar. A small subset of products do not have observations in both 2015 and 2016, and for the others the number of pre-treatment observations varies. The panel is unbalanced because not all products were purchased in each year.
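To make the missing-data pattern concrete, here is a minimal sketch of how I would summarize it. It assumes the variable names used in my model (product_id, year, price_ln, treated); n_pre and one_per_product are hypothetical helper variables introduced only for this illustration.

```stata
* Declare the panel and list the most common patterns of observed years
xtset product_id year
xtdescribe, patterns(20)

* Count pre-treatment (2011-2015) observations with a non-missing price, per product
bysort product_id: egen n_pre = total(year < 2016 & !missing(price_ln))

* Tag one row per product so that each product is counted once in the table
egen byte one_per_product = tag(product_id)
tabulate n_pre treated if one_per_product
```

The table shows how many treated and control products have 0, 1, ..., 5 pre-treatment observations.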
How should I organize my dataset to run the diff-in-diff model?
Should I start by dropping all products which do not have data for at least both 2015 and 2016? Do I need to restrict the analysis to those products which have an equal number of pre-treatment periods?
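For reference, this is roughly how I would flag the two candidate samples rather than delete any rows. It is only a sketch: has_2015, has_2016, in_both, pre_count and full_pre are hypothetical names, and I am assuming a year counts as observed when price_ln is non-missing.

```stata
* Equal to 1 on every row of a product that has a price in the given year
bysort product_id: egen byte has_2015 = max(year == 2015 & !missing(price_ln))
bysort product_id: egen byte has_2016 = max(year == 2016 & !missing(price_ln))
gen byte in_both = has_2015 & has_2016

* Stricter variant: products observed in all five pre-treatment years
bysort product_id: egen pre_count = total(year < 2016 & !missing(price_ln))
gen byte full_pre = (pre_count == 5)

* Same model on the restricted sample, leaving the full dataset intact
xtset product_id year
xtreg price_ln treated did i.year if in_both, fe vce(cluster product_id)
```

Using an if qualifier instead of drop would let me compare the estimates across the full, in_both, and full_pre samples.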
These questions also relate to how I should graphically illustrate the price trends in order to examine whether the parallel trends assumption holds. Which products should I include when graphing the prices, and do you think that taking the average price is suitable?
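The plot I have in mind would look something like the sketch below: the unweighted mean of price_ln by group and year. The names mean_price_ln and n_products are hypothetical, and the vertical line at 2015.5 simply marks the start of the policy in 2016.

```stata
preserve
* Mean log price and number of contributing products, by group and year
collapse (mean) mean_price_ln = price_ln (count) n_products = price_ln, by(treated year)

twoway (connected mean_price_ln year if treated == 1) ///
       (connected mean_price_ln year if treated == 0), ///
       xline(2015.5, lpattern(dash)) ///
       legend(order(1 "Treated" 2 "Control")) ///
       ytitle("Mean log real price") xtitle("Year")
restore
```

Because the panel is unbalanced, n_products differs across years, so changes in the group means may partly reflect changes in which products are observed.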
I was then going to run a diff-in-diff model like the following:
Code:
* declare the panel: product_id is the unit, year is the time variable
xtset product_id year
* product fixed-effects regression with year dummies and SEs clustered by product
xtreg price_ln treated did i.year, fe vce(cluster product_id)
where “price_ln” is the natural log of the inflation-adjusted price (outcome variable), “treated” is an indicator variable for the products subjected to the new policy (it is time-invariant, so Stata will omit it once the product fixed effects from “fe” are included), “did” is the variable of interest (created by multiplying treated*post, where “post” is a binary variable for the post-treatment period), and “i.year” adds year fixed effects. I was planning to cluster the standard errors by product ID to account for heteroscedasticity and autocorrelation within products.
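Written out, the specification I have in mind is roughly the following. This is a sketch of my understanding of what the xtreg command estimates; the symbols $\alpha_i$, $\lambda_t$, $\delta$ and $\varepsilon_{it}$ are my own notation.

```latex
\ln(\text{price}_{it}) = \alpha_i + \lambda_t
  + \delta \,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}
```

Here $\alpha_i$ are the product fixed effects (the fe option), $\lambda_t$ are the year effects (i.year), and $\delta$ is the coefficient on did. The main effects of treated and post do not appear separately because they are absorbed by $\alpha_i$ and $\lambda_t$ respectively.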
Any input would be greatly appreciated.