Hi all,
I am trying to figure out how best to analyze the data structure shown below. Basically, I would like to assess whether a worker who acquires a specific piece of knowledge by taking a course (the treatment) ends up in a better position in the labor market. As the data example shows, the treatment takes place in a given year, which differs across people. Assume that I have good control (those who do not take the course) and treatment groups. There are several observations per person, but they are not equally spaced (unlike standard panel data, where every unit is measured each year). Also, some observations have no end date (see the missing values).
I know this is a very open question, but which type of analysis should be used here?
1) Can I run xtset id_person occasion and then apply FE estimation even though the data are not equally spaced?
xtreg Y i.treatment##c.time, fe
2) Or should I treat the data within a longitudinal/multilevel framework:
mixed Y i.treatment##c.time || id_person:
Bear in mind that the variable "time" is not in the data example; I have just written here what I have read in other posts about how to proceed when the data have a panel structure and a DID is what you want to estimate.
3) For assessing the causal impact of the treatment variable, a DID might be advisable. Basically, I think the empirical analysis should follow the two commands above. But I have read that the data need to be equally spaced (as in panel data). So maybe the solution would be to aggregate the observations by year (build a yearly panel data set) and then apply the DID?
4) Because some people could argue that the control group is not a good counterfactual (because of innate ability, for instance), I would like to ask your opinion about comparing the treatment group before/after the treatment instead of comparing control and treatment groups. I think the counterfactual would be better this way, since it is the same subject, which controls for innate ability and other time-invariant characteristics. What do you think? And, if possible, is there any example you would advise me to follow to see how to do this in Stata?
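For point 4, this is roughly what I have in mind, as a sketch only. Treat is the indicator from the data example, Y is a placeholder for the outcome (it is not in the example), and ever_treated is a name I made up. I am assuming that keeping only people who are treated at some point, plus person fixed effects, is what gives the before/after comparison.

```stata
* Sketch only: Y is a placeholder outcome, ever_treated is a hypothetical name
* Flag people who are treated at some point in the sample
bysort id_person: egen ever_treated = max(Treat)

* Declare the panel using the within-person ordering of the observations
xtset id_person occasion

* Within-person before/after comparison, restricted to the ever-treated
xtreg Y i.Treat if ever_treated == 1, fe vce(cluster id_person)
```

I realize this specification has no comparison group to absorb common time trends, so I am not sure it is enough on its own.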
Thanks a lot for all your help and suggestions.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id_person firm_id start_day start_month) int start_year byte(end_day end_month) int end_year byte(x1 x3 x4 x5 Treat) float occasion
1  1 18 11 2005 18 11 2005 0 0 40 32 0  1
1  1 16 12 2005 16 12 2005 0 0 20 32 0  2
1  1 12  1 2006 12  1 2006 0 0 40 32 0  3
1  1  1  2 2006  1  2 2006 0 0 35 32 0  4
1  1 21  3 2006 21  3 2006 0 0 40 32 0  5
1  1 17  1 2007 17  1 2007 0 0 40 32 0  6
1  1  8  2 2007  8  2 2007 0 0 16 32 0  7
1  1 14  2 2007 15  2 2007 0 0 40 32 0  8
1  2 25  6 2008 24  9 2008 0 0 25 54 0  9
1  3  2  7 2009  1 10 2009 0 1 16 33 1 10
1  4 30  7 2009  .  .    . 0 0 40 55 1 11
1  5 15  3 2010  .  .    . 0 0 40 32 1 12
1  6 11  5 2010 31  8 2010 0 0 30 33 1 13
1  7  2 11 2010 24 12 2010 0 0 40 23 1 14
1  7 21  1 2011 23  4 2011 0 0 40 23 1 15
1  7 26  4 2011  8  5 2011 0 0 35 33 1 16
1  8  2  5 2011 30  9 2011 0 0 40 23 1 17
2  9 31 12 2006  1  1 2007 0 1  5 11 0  1
2 10 20  4 2007 22  4 2007 0 0 40 32 0  2
2 10  5  5 2007  6  5 2007 0 0 40 80 0  3
2 10 11  5 2007 27  5 2007 0 1 24 32 0  4
2 11 30  4 2008 12 10 2008 0 0 40 33 0  5
2 12 19 12 2008  .  .    . 0 1 20 32 0  6
2 13  5  5 2009 13  9 2009 0 0 40 54 0  7
2 14 10  5 2010 16  9 2010 0 0 20 54 0  8
2 15  9  3 2011  8  9 2011 0 0 20 23 1  9
3 16 28  7 2008  .  .    . 0 0 40 55 0  1
3 17  8  3 2010  .  .    . 0 0 40 55 0  2
3 17  1  4 2011  .  .    . 0 0 35 55 0  3
3 18  1  1 2014  .  .    . 0 0 40 54 1  4
3 19 14  1 2019 30  6 2019 0 1  4 55 1  5
3 20  1  6 2019  .  .    . 1 0 10 54 1  6
4 21  2  4 2000  5  5 2001 0 0 25 55 0  1
4 21  4 10 2001  .  .    . 0 0 15 55 0  2
4 21 20 11 2001  .  .    . 0 0 40 55 0  3
4 22 12 12 2001 13 12 2001 0 1 30 33 0  4
4 23 31  5 2002  .  .    . 1 0 40 48 0  5
end
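For point 3, this is a rough sketch of how I imagine building the yearly panel from the example above. The names start_date, end_date, year and n_spells are my own inventions. Counting a person-year as treated when any spell starting in that year has Treat == 1 is just an assumption on my part, not something I am sure is right.

```stata
* Daily dates from the day/month/year components
gen start_date = mdy(start_month, start_day, start_year)
gen end_date   = mdy(end_month, end_day, end_year)   // missing when the spell has no end date
format start_date end_date %td

* One row per person-year, assigning each spell to the year in which it starts
gen year = start_year
collapse (max) Treat (count) n_spells = occasion, by(id_person year)

* Declare the yearly panel; years with no spell start are simply absent (gaps)
xtset id_person year
```

Years in which a person starts no spell do not appear after the collapse, so the yearly panel would still have gaps unless I fill them somehow (for example with tsfill), and I do not know whether that matters for the DID.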