Automatically using panel data commands in Stata to fill in missing values?

Vincent La

Join Date: Jan 2015
Posts: 33

Automatically using panel data commands in Stata to fill in missing values?

16 Feb 2015, 17:15

Hi,

I have a panel data structure in Stata with gaps. I've used tsfill to expand the data set and this created missing observations for the years that were missing. I've looked at the documentation for tsset and tsfill, but I'm unsure of how to fill in the missing data appropriately for my situation. As always, any help is greatly appreciated! Below I will try to reproduce a working example.

Suppose I have the following data set:

Code:

clear
input str8 country year headcount
"Cambodia" 2007 99.99
"Cambodia"    2007 100
"Cambodia"    2008 99.99
"Cambodia"    2008 100
"Cambodia"    2009 99.99
"Cambodia"    2009 100
"China"    2005 99.99
"China"    2005 100
"China"    2008 99.99
"China"    2008 100
"China"    2009 99.99
"China"    2009 100
end

list, sepby(country)


     +----------------------------+
     |  country   year   headcount |
     |----------------------------|
  1. | Cambodia   2007      99.99 |
  2. | Cambodia   2007        100 |
  3. | Cambodia   2008      99.99 |
  4. | Cambodia   2008        100 |
  5. | Cambodia   2009      99.99 |
  6. | Cambodia   2009        100 |
     |----------------------------|
  7. |    China   2005      99.99 |
  8. |    China   2005        100 |
  9. |    China   2008      99.99 |
 10. |    China   2008        100 |
 11. |    China   2009      99.99 |
 12. |    China   2009        100 |
     +----------------------------+

As you can see there are two countries with years spanning 2005 - 2009, but each country is missing some years in between. In addition, there are some duplicate values at the country-year level, so actually the cross-sectional component is actually each country-headcount pair. Thus, I need to create a new variable for each country-headcount pair:

Code:

. egen country_hc = group(country headcount)

. sort country_hc year

. list, sepby(country)

     +---------------------------------------+
     |  country   year   headcount   country_hc |
     |---------------------------------------|
  1. | Cambodia   2007      99.99          1 |
  2. | Cambodia   2008      99.99          1 |
  3. | Cambodia   2009      99.99          1 |
  4. | Cambodia   2007        100          2 |
  5. | Cambodia   2008        100          2 |
  6. | Cambodia   2009        100          2 |
     |---------------------------------------|
  7. |    China   2005      99.99          3 |
  8. |    China   2008      99.99          3 |
  9. |    China   2009      99.99          3 |
 10. |    China   2005        100          4 |
 11. |    China   2008        100          4 |
 12. |    China   2009        100          4 |
     +---------------------------------------+

Now I can run tsset using country_hc and year and then use tsfill to expand the data set.

Code:

. tsset country_hc year
       panel variable:  country_hc (weakly balanced)
        time variable:  year, 2005 to 2009, but with gaps
                delta:  1 unit

. tsfill, full

. list

     +---------------------------------------+
     |  country   year   headcount   country_hc |
     |---------------------------------------|
  1. |            2005          .          1 |
  2. |            2006          .          1 |
  3. | Cambodia   2007      99.99          1 |
  4. | Cambodia   2008      99.99          1 |
  5. | Cambodia   2009      99.99          1 |
     |---------------------------------------|
  6. |            2005          .          2 |
  7. |            2006          .          2 |
  8. | Cambodia   2007        100          2 |
  9. | Cambodia   2008        100          2 |
 10. | Cambodia   2009        100          2 |
     |---------------------------------------|
 11. |    China   2005      99.99          3 |
 12. |            2006          .          3 |
 13. |            2007          .          3 |
 14. |    China   2008      99.99          3 |
 15. |    China   2009      99.99          3 |
     |---------------------------------------|
 16. |    China   2005        100          4 |
 17. |            2006          .          4 |
 18. |            2007          .          4 |
 19. |    China   2008        100          4 |
 20. |    China   2009        100          4 |
     +---------------------------------------+

I want to leave headcount missing; I will impute missing values of headcount using average values given from a different file -- I think I can figure that out on my own. However, I am having immense trouble trying to fill in the value for country. Basically, I want the values of Cambodia and China to auto fill, but keep headcount missing (for now) during the years that were initially missing in the original data.

So in other words, in the end, I want a data set that looks like:

Code:

     
     +---------------------------------------+
     |  country   year   headcount   country_hc |
     |---------------------------------------|
  1. | Cambodia   2005          .          1 |
  2. | Cambodia   2006          .          1 |
  3. | Cambodia   2007      99.99          1 |
  4. | Cambodia   2008      99.99          1 |
  5. | Cambodia   2009      99.99          1 |
     |---------------------------------------|
  6. | Cambodia   2005          .          2 |
  7. | Cambodia   2006          .          2 |
  8. | Cambodia   2007        100          2 |
  9. | Cambodia   2008        100          2 |
 10. | Cambodia   2009        100          2 |
     |---------------------------------------|
 11. |    China   2005      99.99          3 |
 12. |    China   2006          .          3 |
 13. |    China   2007          .          3 |
 14. |    China   2008      99.99          3 |
 15. |    China   2009      99.99          3 |
     |---------------------------------------|
 16. |    China   2005        100          4 |
 17. |    China   2006          .          4 |
 18. |    China   2007          .          4 |
 19. |    China   2008        100          4 |
 20. |    China   2009        100          4 |
     +---------------------------------------+

Is there an easy way to do this? I've been trying to do this in a more "brute force" method, but my true data set is actually almost 4 million observations, so computational speed is definitely a factor here. I've looked at the documentation for tsset and tsfill to no avail, but is there something simple I am missing?

Thank you for all your help!

Vincent

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35418
#2

16 Feb 2015, 17:27

tsfill does nothing to turn missing values into non-missing values beyond its mission. Its purpose is to add observations to the dataset if there are gaps, but the only non-missing values ever created are in identifier and time variables. In your case, you have an easy task to fill in missing values for a string variable which you know should be constant.

That's one line

Code:

by country_hc (country) : replace country = country[_N]

That said, your other data management looks strange to me, but it's late locally and I will leave it there.
2 likes
Comment
Vincent La

Join Date: Jan 2015

Posts: 33
#3

16 Feb 2015, 18:04

Hi Nick,

That's exactly what I needed. What does putting the parenthesis around (country) do? I've never quite seen that before?

Thanks!

Vincent
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

16 Feb 2015, 18:46

The specific syntax is well explained using help by. In Nick's example, the by prefix confirms the data to be in order by country_hc and within that by country. Since country is enclosed within parentheses, the by prefix ignores country when forming the by groups. Thus country[_N] will be the value of country for the final observation within group defined by country_hc; since the data was sorted by country within each country_hc group, it will necessarily be the one of the non-blank values of country for the group,
Comment
Vincent La

Join Date: Jan 2015

Posts: 33
#5

16 Feb 2015, 21:55

I see. So when Stata sorts a string variable, it will put the missings first and then the non-missings last, whereas if it's a numeric variable, the non-missings go first and missings go last?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#6

17 Feb 2015, 02:11

Correct. Reverting to my earlier post, I can't see that it makes any substantive sense to have separate panels for countries merely because one value is, or displays as, 99.99 and the other 100. You need to clean that up to avoid an absurd analysis.

Last edited by Nick Cox; 17 Feb 2015, 02:22.
Comment
Vincent La

Join Date: Jan 2015

Posts: 33
#7

18 Feb 2015, 07:40

Hi Nick,

Thanks for your help!

The 99.99 and the other 100 were just placeholders for actual meaningful values. Next time, I'll be sure to make a data set that can't be misconstrued as something absurd. Thanks!

Vincent
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#8

18 Feb 2015, 14:51

Good to hear that.
Comment
Chiraz KARAMTI

Join Date: Oct 2016

Posts: 59
#9

14 Nov 2016, 06:50

Hi Mr Cox,
I'm aware that tsfill doesn't fill missing observations. Say I have panel data with 6 countries and my variable GDP is missing for some years. What shall I do in this case?
thanks!
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#10

14 Nov 2016, 09:01

I am not Mr Cox, but I think it's a question we can't answer for you. Missing data is missing data. You can choose to ignore it and just use your unbalanced or gapped data, you can choose to impute those values, but this requires strong assumptions, you can choose to only use countries where GDP isn't missing, ... it all depends on what you're doing, what people in your field usually do and so on. We can't answer these questions for you, we don't even know which countries you're looking at when!
Comment
Chiraz KARAMTI

Join Date: Oct 2016

Posts: 59
#11

14 Nov 2016, 09:37

Hi Jesse,
Thank you for your response. I know that there are several ways to impute missing data depending on the nature (random or not) of the missing values. I thought that may be for panel data there is a special treatment. But ok, got it.
thanks again
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#12

14 Nov 2016, 12:56

I agree with Jesse in broad terms. I'd add a mention of interpolation, which has big pluses and minuses too.

On a small matter of form, I'd recommend using whatever names people use to describe themselves. Using titles (Mr, Ms, Miss, Dr, Prof, whatever) is usually more formal than anybody wants, and there is a high probability of getting them wrong any way.
Comment
Alex Rengifo

Join Date: Feb 2019

Posts: 2
#13

18 Apr 2019, 20:35

¿Cómo puedo solucionar mi problema en Stata? Me sale este error.
repeated time values within panel
r(451);
Comment
Ujjawal Sawarn

Join Date: Sep 2020

Posts: 26
#14

10 Oct 2020, 04:35

Hi,
For the unbalanced panel, I have used tsfill and created my balanced panel with identifiers filled. I want to fill the global variables, which will have same value for all countries. How can I fill those.
Eg I have for "Country A" data from 1960 to 2010 and I want same value to be filled in other countries in my panel dataset.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#15

10 Oct 2020, 05:00

#14 isn't clear to me without an example. Otherwise put, how is this different from other questions in this thread already answered? Please make concrete what you want with an example spannng at least two countries. FAQ Advice #12 shows you how to give a data example.
Comment

Announcement

Automatically using panel data commands in Stata to fill in missing values?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment