Unbalanced to Balanced panel

Zariab Hossain

Join Date: Oct 2020

Posts: 48
#1

07 Feb 2024, 13:14

Dear Stata users,

I am trying to convert an unbalanced panel into a balanced panel, but I have some questions. Let me explain with a real example.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 dnr202115 int redovar str3 ssyk_3d
"100000001" 1996 "822"
"100000001" 1997 "822"
"100000001" 1998 "311"
"100000001" 2001 "131"
"100000001" 2002 "131"
"100000001" 2003 "131"
"100000001" 2004 "311"
"100000001" 2005 "311"
"100000001" 2006 "311"
"100000001" 2008 "311"
"100000001" 2009 "311"
"100000001" 2010 "311"
"100000005" 1996 "122"
"100000005" 1997 "122"
"100000005" 1999 "122"
"100000005" 2000 "122"
"100000005" 2002 "347"
"100000005" 2003 "347"
"100000005" 2004 "241"
"100000005" 2005 "241"
"100000005" 2006 "241"
"100000005" 2007 "241"
"100000005" 2008 "213"
"100000005" 2010 "419"
"100000009" 1996 "513"
"100000009" 1997 "513"
"100000009" 1998 "513"
"100000009" 1999 "513"
"100000009" 2000 "513"
"100000009" 2001 "513"
end

Here dnr202115 is the person ID, redovar is the year, and ssyk_3d is the occupational code. You can see that some years are missing, because in some years a person was unemployed. I want to make a balanced panel out of this that covers 1996-2010, with each gap year taking its ssyk_3d from the most recent earlier year that has one. I want four things to be taken care of in this imputation:

a) If, for example, 2008 is missing and we have information for 2007 and 2009, then 2008's ssyk_3d should be filled with the 2007 value.
b) If 2003-2006 are missing, then 2003-2006 should all be filled with 2002's ssyk_3d.
c) If a person's employment history starts in 2000, the 1996-1999 cells for that person should remain empty.
d) All three of these cases can occur at the same time for any one person.

I would highly appreciate it if anyone could give me a hint.

Thanks
Zariab Hossain
Uppsala University
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#2

07 Feb 2024, 13:32

Code:

encode dnr202115, gen(id)
drop dnr202115
xtset id redovar
tsfill, full
by id (redovar), sort: replace ssyk_3d = ssyk_3d[_n-1] if missing(ssyk_3d)

Note: This code assumes that there are fewer than 65,536 distinct values of dnr202115 in your data set. If that limit is exceeded, you will have to do it slightly differently:

Code:

egen `c(obs_t)' id = group(dnr202115)
xtset id redovar
tsfill, full
by id (redovar), sort: replace ssyk_3d = ssyk_3d[_n-1] if missing(ssyk_3d)
by id (redovar): replace dnr202115 = dnr202115[_n-1] if missing(dnr202115)

That said, are you sure you want to do this? Imputation of missing data by last observation carried forward will usually produce an inaccurate, biased data set. Moreover, with modern software, there are few analyses that actually require a balanced data set. For most purposes, you would probably be better off leaving your data the way it is.
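One thing that may be worth checking in the second version: its last line carries dnr202115 forward only, so any years that tsfill adds before a person's first observed year (case c in the question) would keep an empty dnr202115. The following is a minimal sketch of one way to fill the identifier in both directions. The toy data are made up for illustration and are not from the original post, and `long` is assumed to be a sufficient storage type for id.

```stata
* Hypothetical toy data: person 2 is first observed in 1997.
clear
input str9 dnr202115 int redovar str3 ssyk_3d
"100000001" 1996 "822"
"100000001" 1998 "311"
"100000002" 1997 "513"
"100000002" 1998 "513"
end

egen long id = group(dnr202115)
xtset id redovar
tsfill, full

* Within id, empty strings sort first, so the last row holds the real ID.
bysort id (dnr202115): replace dnr202115 = dnr202115[_N]
sort id redovar

* Checks: no empty IDs, and every person has all 3 years (1996-1998).
assert !missing(dnr202115)
by id: assert _N == 3
```

On the real data, the final check would use 15 rather than 3, assuming the data as a whole span 1996-2010.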
Zariab Hossain

Join Date: Oct 2020

Posts: 48
#3

07 Feb 2024, 13:48

Thanks a lot Clyde. I have more than a million distinct observations, so I am going to use the second code. You are right that this kind of imputation may lead to problems in many cases. However, I am going to create a new variable called "most recent occupation" and I believe your code will help me do that.
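A "most recent occupation" variable can be built alongside ssyk_3d rather than by overwriting it, so the observed values stay distinguishable from the carried-forward ones. The sketch below assumes this approach; the names recent_occ and observed, and the toy data, are hypothetical and not from the original post.

```stata
* Hypothetical toy data: person 1 has a gap in 1997,
* person 2 is first observed in 1998.
clear
input str9 dnr202115 int redovar str3 ssyk_3d
"100000001" 1996 "822"
"100000001" 1998 "311"
"100000001" 1999 "311"
"100000002" 1998 "513"
"100000002" 1999 "513"
end

egen long id = group(dnr202115)
xtset id redovar
tsfill, full

* Flag rows where the occupation was actually observed.
gen byte observed = !missing(ssyk_3d)

* Copy ssyk_3d, then carry the last known value forward within person.
gen str3 recent_occ = ssyk_3d
by id (redovar), sort: replace recent_occ = recent_occ[_n-1] if missing(recent_occ)

list id redovar ssyk_3d recent_occ observed, sepby(id)
```

Years before a person's first observed occupation keep an empty recent_occ, and ssyk_3d itself is left as recorded.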