Hi,
I have a panel data structure in Stata with gaps. I've used tsfill to expand the data set and this created missing observations for the years that were missing. I've looked at the documentation for tsset and tsfill, but I'm unsure of how to fill in the missing data appropriately for my situation. As always, any help is greatly appreciated! Below I will try to reproduce a working example.
Suppose I have the following data set:
As you can see there are two countries with years spanning 2005 - 2009, but each country is missing some years in between. In addition, there are some duplicate values at the country-year level, so actually the cross-sectional component is actually each country-headcount pair. Thus, I need to create a new variable for each country-headcount pair:
Now I can run tsset using country_hc and year and then use tsfill to expand the data set.
I want to leave headcount missing; I will impute missing values of headcount using average values given from a different file -- I think I can figure that out on my own. However, I am having immense trouble trying to fill in the value for country. Basically, I want the values of Cambodia and China to auto fill, but keep headcount missing (for now) during the years that were initially missing in the original data.
So in other words, in the end, I want a data set that looks like:
Is there an easy way to do this? I've been trying to do this in a more "brute force" method, but my true data set is actually almost 4 million observations, so computational speed is definitely a factor here. I've looked at the documentation for tsset and tsfill to no avail, but is there something simple I am missing?
Thank you for all your help!
Vincent
I have a panel data structure in Stata with gaps. I've used tsfill to expand the data set and this created missing observations for the years that were missing. I've looked at the documentation for tsset and tsfill, but I'm unsure of how to fill in the missing data appropriately for my situation. As always, any help is greatly appreciated! Below I will try to reproduce a working example.
Suppose I have the following data set:
Code:
clear input str8 country year headcount "Cambodia" 2007 99.99 "Cambodia" 2007 100 "Cambodia" 2008 99.99 "Cambodia" 2008 100 "Cambodia" 2009 99.99 "Cambodia" 2009 100 "China" 2005 99.99 "China" 2005 100 "China" 2008 99.99 "China" 2008 100 "China" 2009 99.99 "China" 2009 100 end list, sepby(country) +----------------------------+ | country year headcount | |----------------------------| 1. | Cambodia 2007 99.99 | 2. | Cambodia 2007 100 | 3. | Cambodia 2008 99.99 | 4. | Cambodia 2008 100 | 5. | Cambodia 2009 99.99 | 6. | Cambodia 2009 100 | |----------------------------| 7. | China 2005 99.99 | 8. | China 2005 100 | 9. | China 2008 99.99 | 10. | China 2008 100 | 11. | China 2009 99.99 | 12. | China 2009 100 | +----------------------------+
Code:
. egen country_hc = group(country headcount) . sort country_hc year . list, sepby(country) +---------------------------------------+ | country year headcount country_hc | |---------------------------------------| 1. | Cambodia 2007 99.99 1 | 2. | Cambodia 2008 99.99 1 | 3. | Cambodia 2009 99.99 1 | 4. | Cambodia 2007 100 2 | 5. | Cambodia 2008 100 2 | 6. | Cambodia 2009 100 2 | |---------------------------------------| 7. | China 2005 99.99 3 | 8. | China 2008 99.99 3 | 9. | China 2009 99.99 3 | 10. | China 2005 100 4 | 11. | China 2008 100 4 | 12. | China 2009 100 4 | +---------------------------------------+
Code:
. tsset country_hc year panel variable: country_hc (weakly balanced) time variable: year, 2005 to 2009, but with gaps delta: 1 unit . tsfill, full . list +---------------------------------------+ | country year headcount country_hc | |---------------------------------------| 1. | 2005 . 1 | 2. | 2006 . 1 | 3. | Cambodia 2007 99.99 1 | 4. | Cambodia 2008 99.99 1 | 5. | Cambodia 2009 99.99 1 | |---------------------------------------| 6. | 2005 . 2 | 7. | 2006 . 2 | 8. | Cambodia 2007 100 2 | 9. | Cambodia 2008 100 2 | 10. | Cambodia 2009 100 2 | |---------------------------------------| 11. | China 2005 99.99 3 | 12. | 2006 . 3 | 13. | 2007 . 3 | 14. | China 2008 99.99 3 | 15. | China 2009 99.99 3 | |---------------------------------------| 16. | China 2005 100 4 | 17. | 2006 . 4 | 18. | 2007 . 4 | 19. | China 2008 100 4 | 20. | China 2009 100 4 | +---------------------------------------+
I want to leave headcount missing; I will impute missing values of headcount using average values given from a different file -- I think I can figure that out on my own. However, I am having immense trouble trying to fill in the value for country. Basically, I want the values of Cambodia and China to auto fill, but keep headcount missing (for now) during the years that were initially missing in the original data.
So in other words, in the end, I want a data set that looks like:
Code:
+---------------------------------------+ | country year headcount country_hc | |---------------------------------------| 1. | Cambodia 2005 . 1 | 2. | Cambodia 2006 . 1 | 3. | Cambodia 2007 99.99 1 | 4. | Cambodia 2008 99.99 1 | 5. | Cambodia 2009 99.99 1 | |---------------------------------------| 6. | Cambodia 2005 . 2 | 7. | Cambodia 2006 . 2 | 8. | Cambodia 2007 100 2 | 9. | Cambodia 2008 100 2 | 10. | Cambodia 2009 100 2 | |---------------------------------------| 11. | China 2005 99.99 3 | 12. | China 2006 . 3 | 13. | China 2007 . 3 | 14. | China 2008 99.99 3 | 15. | China 2009 99.99 3 | |---------------------------------------| 16. | China 2005 100 4 | 17. | China 2006 . 4 | 18. | China 2007 . 4 | 19. | China 2008 100 4 | 20. | China 2009 100 4 | +---------------------------------------+
Thank you for all your help!
Vincent
Comment