  • Unbalanced to Balanced panel

    Dear Stata users,

    I am trying to turn an unbalanced panel into a balanced one, and I have some questions. Let me explain with a real example.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 dnr202115 int redovar str3 ssyk_3d
    "100000001" 1996 "822"
    "100000001" 1997 "822"
    "100000001" 1998 "311"
    "100000001" 2001 "131"
    "100000001" 2002 "131"
    "100000001" 2003 "131"
    "100000001" 2004 "311"
    "100000001" 2005 "311"
    "100000001" 2006 "311"
    "100000001" 2008 "311"
    "100000001" 2009 "311"
    "100000001" 2010 "311"
    "100000005" 1996 "122"
    "100000005" 1997 "122"
    "100000005" 1999 "122"
    "100000005" 2000 "122"
    "100000005" 2002 "347"
    "100000005" 2003 "347"
    "100000005" 2004 "241"
    "100000005" 2005 "241"
    "100000005" 2006 "241"
    "100000005" 2007 "241"
    "100000005" 2008 "213"
    "100000005" 2010 "419"
    "100000009" 1996 "513"
    "100000009" 1997 "513"
    "100000009" 1998 "513"
    "100000009" 1999 "513"
    "100000009" 2000 "513"
    "100000009" 2001 "513"
    end
    Here dnr202115 is the person id, redovar is the year, and ssyk_3d is the occupational code. Some years are missing because the person was unemployed in those years. I want to build a balanced panel covering 1996-2010 in which each gap year takes ssyk_3d from the most recent earlier year. The imputation should handle four cases:

    a) If 2008 is missing and we have information for 2007 and 2009, then 2008's ssyk_3d is filled from 2007.
    b) If 2003-2006 are all missing, then each of those years is filled with 2002's ssyk_3d.
    c) If a person's employment history starts in 2000, the 1996-1999 cells for that person stay empty.
    d) All three of these can occur together for the same person.

    I would highly appreciate any hints.

    Thanks
    Zariab Hossain
    Uppsala University

  • #2
    Code:
    encode dnr202115, gen(id)
    drop dnr202115
    xtset id redovar
    tsfill, full
    by id (redovar), sort: replace ssyk_3d = ssyk_3d[_n-1] if missing(ssyk_3d)
    Note: This code assumes that there are fewer than 65,536 distinct values of dnr202115 in your data set. If that limit is exceeded, you will have to do it slightly differently:
    Code:
    egen `c(obs_t)' id = group(dnr202115)
    xtset id redovar
    tsfill, full
    by id (redovar), sort: replace ssyk_3d = ssyk_3d[_n-1] if missing(ssyk_3d)
    * Copy the string id to every row of a person, including rows added before the first observed year
    by id (dnr202115), sort: replace dnr202115 = dnr202115[_N]
    sort id redovar
    That said, are you sure you want to do this? Imputation of missing data by last observation carried forward will usually produce an inaccurate, biased data set. Moreover, with modern software, there are few analyses that actually require a balanced data set. For most purposes, you would probably be better off leaving your data the way it is.
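
    If you do go ahead, it is worth checking the result on a small case first. The following is only a sketch on made-up toy data (three rows, not the real file): person 1 has a gap in 1997 and person 2 starts in 1998. It checks that the panel ends up balanced, that the gap year takes the previous year's code, and that the years before a person's first record stay empty.

    ```stata
    * Toy data, invented for illustration only
    clear
    input str9 dnr202115 int redovar str3 ssyk_3d
    "100000001" 1996 "822"
    "100000001" 1998 "311"
    "100000005" 1998 "122"
    end

    egen long id = group(dnr202115)
    xtset id redovar
    tsfill, full
    by id (redovar), sort: replace ssyk_3d = ssyk_3d[_n-1] if missing(ssyk_3d)

    * Balanced: every person has one row per year 1996-1998
    by id: assert _N == 3

    * Gap year is filled from the previous year
    assert ssyk_3d == "822" if id == 1 & redovar == 1997

    * Years before the first record stay empty
    assert ssyk_3d == "" if id == 2 & inlist(redovar, 1996, 1997)

    list id redovar ssyk_3d, sepby(id)
    ```

    Because -replace- works through the observations in order, a run of several missing years is filled one after another from the same earlier value, which is what case b) asks for.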

    • #3
      Thanks a lot, Clyde. I have more than a million distinct values of dnr202115, so I am going to use the second code. You are right that this kind of imputation may lead to problems in many cases. However, I am going to create a new variable called "most recent occupation", and I believe your code will help me do that.
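
      A minimal sketch of such a variable, on made-up toy data. The names recent_occ and observed are hypothetical, and the idea (an assumption, not something from #2) is to carry the code forward into a new variable so that ssyk_3d itself stays untouched and the filled years can still be told apart.

      ```stata
      * Toy data, invented for illustration only: 2008 is missing
      clear
      input long id int redovar str3 ssyk_3d
      1 2006 "241"
      1 2007 "241"
      1 2009 "213"
      end

      xtset id redovar
      tsfill, full

      * Flag rows that were really observed (hypothetical variable name)
      gen byte observed = !missing(ssyk_3d)

      * Carry the last known code forward in a separate variable (hypothetical name)
      gen str3 recent_occ = ssyk_3d
      by id (redovar), sort: replace recent_occ = recent_occ[_n-1] if missing(recent_occ)

      list id redovar ssyk_3d recent_occ observed, sepby(id)
      ```

      Here the 2008 row keeps an empty ssyk_3d and observed == 0, while recent_occ takes "241" from 2007.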
