Add missing observations to the panel data

nguyenchihieu82vn

Join Date: May 2014
Posts: 4

Add missing observations to the panel data

29 May 2014, 12:44

Hello,

I would like to add the missing observations for the missing years to the panel data below (my data set is very large, and this would help me easily manage and calculate the growth rates).

The question is similar to this post (http://www.stata.com/statalist/archi.../msg00002.html); but this is a panel, and I cannot use –tsset– and –tsfill–

The missing years are: 2004, 2005, 2009

ccode	cname	pdtcode	partcode	year	export	Import
NGA	Nigeria	98	Africa	2000	16865.06	163984.4
NGA	Nigeria	98	Africa	2001	38273.86	462252.9
NGA	Nigeria	98	Africa	2002	417154.6	313921.3
NGA	Nigeria	98	Africa	2003	60113.79	642668.6
NGA	Nigeria	98	Africa	2006	399306.9	684482.1
NGA	Nigeria	98	Africa	2007	229000	1453937
NGA	Nigeria	98	Africa	2008	935363.9	1646058
NGA	Nigeria	98	Africa	2010	1988731	1629383
NGA	Nigeria	98	WLD	2000	65469.92	5474002
NGA	Nigeria	98	WLD	2001	60095.69	7344401
NGA	Nigeria	98	WLD	2002	1327962	8281683
NGA	Nigeria	98	WLD	2003	598970	14333524
NGA	Nigeria	98	WLD	2006	4060341	21208807
NGA	Nigeria	98	WLD	2007	1775116	29903932
NGA	Nigeria	98	WLD	2008	5111720	27254810
NGA	Nigeria	98	WLD	2010	18959458	42759921

The expected result looks like this:

ccode	cname	pdtcode	partcode	year	export	import
NGA	Nigeria	98	Africa	2000	16865.06	163984.4
NGA	Nigeria	98	Africa	2001	38273.86	462252.9
NGA	Nigeria	98	Africa	2002	417154.6	313921.3
NGA	Nigeria	98	Africa	2003	60113.79	642668.6
NGA	Nigeria	98	Africa	2004
NGA	Nigeria	98	Africa	2005
NGA	Nigeria	98	Africa	2006	399306.9	684482.1
NGA	Nigeria	98	Africa	2007	229000	1453937
NGA	Nigeria	98	Africa	2008	935363.9	1646058
NGA	Nigeria	98	Africa	2009
NGA	Nigeria	98	Africa	2010	1988731	1629383
NGA	Nigeria	98	WLD	2000	65469.92	5474002
NGA	Nigeria	98	WLD	2001	60095.69	7344401
NGA	Nigeria	98	WLD	2002	1327962	8281683
NGA	Nigeria	98	WLD	2003	598970	14333524
NGA	Nigeria	98	WLD	2004
NGA	Nigeria	98	WLD	2005
NGA	Nigeria	98	WLD	2006	4060341	21208807
NGA	Nigeria	98	WLD	2007	1775116	29903932
NGA	Nigeria	98	WLD	2008	5111720	27254810
NGA	Nigeria	98	WLD	2009
NGA	Nigeria	98	WLD	2010	18959458	42759921

Many thanks.
Hieu

Tags: None

FernandoRios

Join Date: Apr 2014

Posts: 2411
#2

29 May 2014, 13:00

Hi nguyenchihieu82vn,
You can start with the task in a different way.
use data,
keep ccode
duplicates drop ccode
expand 11
bysort ccode:gen year=1999+_n
sort ccode year
merge ccode year using data,
This should work for the data you show.
Best
1 like
Comment
nguyenchihieu82vn

Join Date: May 2014

Posts: 4
#3

29 May 2014, 23:16

Thanks a lot FernandosRios,
It works well that way.
Best,
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35211

30 May 2014, 10:48

I don't know why you say

this is a panel, and I cannot use tsset and tsfill

Consider this example:

Code:

 
. webuse grunfeld, clear

. l if company == 7 & year == 1940

     +--------------------------------------------------+
     | company   year   invest   mvalue   kstock   time |
     |--------------------------------------------------|
126. |       7   1940    33.71    145.1    222.9      6 |
     +--------------------------------------------------+

. drop if company == 7 & year == 1940
(1 observation deleted)

. tsfill

. l if company == 7 & year == 1940

     +--------------------------------------------------+
     | company   year   invest   mvalue   kstock   time |
     |--------------------------------------------------|
126. |       7   1940        .        .        .      . |
     +--------------------------------------------------+

This is a panel; it was previously tsset; and tsfill does what it is supposed to. That's not much use, but it works.

Comment

nguyenchihieu82vn

Join Date: May 2014

Posts: 4
#5

30 May 2014, 11:36

@Nick

My problem is that I cannot tsset the data.

. tsset year
repeated time values in sample
r(451);
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#6

30 May 2014, 11:51

Quite. As it is a panel, you must specify a panel identifier as well as a time variable. This is documented in the help for tsset.

Last edited by Nick Cox; 30 May 2014, 11:53.
Comment
nguyenchihieu82vn

Join Date: May 2014

Posts: 4
#7

30 May 2014, 12:08

Thanks a lot Nick. I would read more on that.
Anyway, I find the suggestion of FernandosRios above really helpful for this task as I can quickly check the missing years and transform the data to appropriate form.
Comment
Monica White

Join Date: Jan 2017

Posts: 98
#8

19 Jan 2017, 18:18

Originally posted by FernandoRios View Post

Hi nguyenchihieu82vn,
You can start with the task in a different way.
use data,
keep ccode
duplicates drop ccode
expand 11
bysort ccode:gen year=1999+_n
sort ccode year
merge ccode year using data,
This should work for the data you show.
Best

I actually have this problem myself with a dataset that I am using. I have some years that are missing similar to Nguyen (I need to add the missing observations for the missing years). I don't understand Fernando's explanation/breakdown for Nguyen's question.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#9

19 Jan 2017, 19:38

The code you show, written by Fernando Rios, starts by reading your data set and then keeping only the ccode variable, and only one observation for each ccode. It then expands that to 11 observations for each ccode, and goes on to create a new variable with consecutive values 2000, 2001,...,2010 for each ccode. That resulting "frame" of ccodes and years is then -merge-d with the original data. Where a given ccode-year combination already existed in the original data, that data will be brought back. Where there is no corresponding combination of ccode and year in the original data, there will now be an observation containing the ccode and year, but the other variables will have missing values, as there is no information about them.
1 like
Comment
Monica White

Join Date: Jan 2017

Posts: 98
#10

19 Jan 2017, 21:04

Originally posted by Clyde Schechter View Post

The code you show, written by Fernando Rios, starts by reading your data set and then keeping only the ccode variable, and only one observation for each ccode. It then expands that to 11 observations for each ccode, and goes on to create a new variable with consecutive values 2000, 2001,...,2010 for each ccode. That resulting "frame" of ccodes and years is then -merge-d with the original data. Where a given ccode-year combination already existed in the original data, that data will be brought back. Where there is no corresponding combination of ccode and year in the original data, there will now be an observation containing the ccode and year, but the other variables will have missing values, as there is no information about them.

Thank you Clyde. I tried to do something similar from Fernando's code and logic, but with a dyad-year unit of analysis (two countries). I got an r4 error "no; data in memory would be lost."

I would like to add the missing years and observations for the dataset for the years 2000-2013, very similar to Nyugen. Where importer1 and importer2 are the countries.

Code:

use A.dta keep importer1 importer2 duplicates drop importer1 importer2 expand 14 bysort importer1 importer2:gen year=1999+_n sort importer1 importer2 year merge importer1 importer2 year use A.dta

From what I read from Nick's example, it seems that just erased observations that were already there?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#11

19 Jan 2017, 23:11

So I see two problems here. In the first line, if there is already data in memory that has not been saved since its last modification, Stata will give you precisely the error message you describe. Stata never permits you to clobber unsaved data in memory without either saving it first or indicating that you are aware you are throwing it away and are OK with that. So the first line needs to be -use A.dta, clear-.

There is another error in the last line. But I think it would give you a syntax error rather than the message you got. It should read:

Code:

merge import1 importer2 year using A.dta

A couple of things can also be simplified. -duplicates drop importer1 importer2- can be simplified to just -duplicates drop- because at that point in the code, importer1 and importer2 are the only variables in the data in memory. And the -sort importer1 importer2 year- command is not needed (unless you are using a pretty old version of Stata). Modern Stata automatically sorts the data, if necessary, when it does the -merge-. To be clear, neither of these things is an error; you can leave them the way they are. I'm just pointing out a way to make the code a tad cleaner.
Comment
Monica White

Join Date: Jan 2017

Posts: 98
#12

21 Jan 2017, 11:28

Originally posted by Clyde Schechter View Post

So I see two problems here. In the first line, if there is already data in memory that has not been saved since its last modification, Stata will give you precisely the error message you describe. Stata never permits you to clobber unsaved data in memory without either saving it first or indicating that you are aware you are throwing it away and are OK with that. So the first line needs to be -use A.dta, clear-.

There is another error in the last line. But I think it would give you a syntax error rather than the message you got. It should read:

Code:

merge import1 importer2 year using A.dta

A couple of things can also be simplified. -duplicates drop importer1 importer2- can be simplified to just -duplicates drop- because at that point in the code, importer1 and importer2 are the only variables in the data in memory. And the -sort importer1 importer2 year- command is not needed (unless you are using a pretty old version of Stata). Modern Stata automatically sorts the data, if necessary, when it does the -merge-. To be clear, neither of these things is an error; you can leave them the way they are. I'm just pointing out a way to make the code a tad cleaner.

I'm all about clean data.

Something is wrong during the merge it seems. I even cut out a very small snippet of the original dataset, and it's still not merging properly. I tried your clean code:

Code:

use A.dta, clear keep countrya countryb duplicates drop expand 14 bysort countrya countryb:gen year=1999+_n merge countrya countryb year using A.dta

here is the "snippet" data:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(year ccode1 ccode2) str11 countrya str10 countryb byte(applied accepted other rejected) 2008 339 240 "Afghanistan" "Turkey" 1 0 0 0 2000 339 241 "Afghanistan" "Tajikistan" 0 0 0 1 2001 339 241 "Afghanistan" "Tajikistan" 6 0 0 6 2002 339 241 "Afghanistan" "Tajikistan" 5 0 0 5 2008 339 241 "Afghanistan" "Tajikistan" 1 0 0 0 2013 339 241 "Afghanistan" "Tajikistan" 1 0 0 0 end

I get the master data not sorted error again. Only the countries and year variables stay

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str11 countrya str10 countryb float year "Afghanistan" "Tajikistan" 2000 "Afghanistan" "Tajikistan" 2001 "Afghanistan" "Tajikistan" 2002 "Afghanistan" "Tajikistan" 2003 "Afghanistan" "Tajikistan" 2004 "Afghanistan" "Tajikistan" 2005 "Afghanistan" "Tajikistan" 2006 "Afghanistan" "Tajikistan" 2007 "Afghanistan" "Tajikistan" 2008 "Afghanistan" "Tajikistan" 2009 "Afghanistan" "Tajikistan" 2010 "Afghanistan" "Tajikistan" 2011 "Afghanistan" "Tajikistan" 2012 "Afghanistan" "Tajikistan" 2013

etc

So, it's providing the years needed, but it's not merging. I'm not sure how it's not sorted correctly.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#13

21 Jan 2017, 11:56

Sorry I didn't pick this up earlier. There's another error in the -merge- statement. In modern Stata, we specify -merge-'s as 1:1, 1:m, or m:1 (-m:m- is legal too but is dangerous as it produces mostly gibberish). When that specification is omitted, Stata's parser thinks you are using old -merge- syntax. Back then, you also had to sort both data sets on the merge key before you could run the merge. So that's why you're getting that error. It should be:

Code:

merge m:1 countrya countryb year using A.dta
1 like
Comment
Monica White

Join Date: Jan 2017

Posts: 98
#14

21 Jan 2017, 13:17

Originally posted by Clyde Schechter View Post

Sorry I didn't pick this up earlier. There's another error in the -merge- statement. In modern Stata, we specify -merge-'s as 1:1, 1:m, or m:1 (-m:m- is legal too but is dangerous as it produces mostly gibberish). When that specification is omitted, Stata's parser thinks you are using old -merge- syntax. Back then, you also had to sort both data sets on the merge key before you could run the merge. So that's why you're getting that error. It should be:

Code:

merge m:1 countrya countryb year using A.dta

Clyde, thank you! I'm actually just learning about merging data in general, so I'm still trying to wrap my head around them all.

In the case of Nyugen and myself, we are using a many to one merge because data we're adding to (using A.dta) is unique, and the data we created is not unique (because it's longitudinal)?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#15

21 Jan 2017, 13:21

In the case of Nyugen and myself, we are using a many to one merge because data we're adding to (using A.dta) is unique, and the data we created is not unique (because it's longitudinal)?

Precisely so.
Comment

Announcement

Add missing observations to the panel data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment