From unbalanced to balanced panel data set

Duy To

Join Date: Sep 2020

Posts: 43
#1

From unbalanced to balanced panel data set

01 Oct 2022, 19:44

Dear all,

I have an unbalanced panel dataset, which means that not all entities have data for all years.

I am aware that it is okay to have unbalanced data to run regressions. However, I am curious about the entities that have data for every year (or should I say the repeated entities?); and I want to create a subsample from the unbalanced data which only includes those repeated entities. Am I looking for a strongly balanced data set here?

If yes, is there any way to convert a data set from unbalanced to strongly balanced?

Thank you for your help in advance.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17678

02 Oct 2022, 04:12

Duy:
although what you have in mind is, in general, a very bad idea, because the resulting subsample often has little in common with the original dataset, you may want to try something like:

Code:

. use "https://www.stata-press.com/data/r17/nlswork.dta"
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. tab year

  Interview |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         68 |      1,375        4.82        4.82
         69 |      1,232        4.32        9.14
         70 |      1,686        5.91       15.05
         71 |      1,851        6.49       21.53
         72 |      1,693        5.93       27.47
         73 |      1,981        6.94       34.41
         75 |      2,141        7.50       41.91
         77 |      2,171        7.61       49.52
         78 |      1,964        6.88       56.40
         80 |      1,847        6.47       62.88
         82 |      2,085        7.31       70.18
         83 |      1,987        6.96       77.15
         85 |      2,085        7.31       84.45
         87 |      2,164        7.58       92.04
         88 |      2,272        7.96      100.00
------------+-----------------------------------
      Total |     28,534      100.00

. bysort idcode: drop if _N<15
(27,244 observations deleted)

.
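The `_N<15` cutoff above works because nlswork has 15 distinct survey years and at most one observation per idcode per year. A minimal sketch that derives the number of periods from the data instead of hard-coding it, and flags the balanced panels before dropping the rest, might look like this (the variable name `balanced` and the local `T` are illustrative and not part of the code above):

```stata
use "https://www.stata-press.com/data/r17/nlswork.dta", clear

* Confirm there is at most one observation per panel-year;
* isid exits with an error if (idcode, year) is not unique
isid idcode year

* Count the distinct survey years present in the data
quietly levelsof year, local(years)
local T : word count `years'

* Flag panels observed in every survey year
bysort idcode: generate byte balanced = (_N == `T')

* Keep only the balanced subsample and let xtset report the panel structure
keep if balanced
xtset idcode year
```

Keeping the indicator rather than dropping rows immediately makes it easier to compare the balanced panels with the rest of the sample before discarding anything.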

Kind regards,
Carlo
(Stata 19.0)


Justin Niakamal

Join Date: Aug 2017
Posts: 757

02 Oct 2022, 07:26

Also, search xtbalance for a couple of options

Code:

xtbalance from http://fmwww.bc.edu/RePEc/bocode/x
    'XTBALANCE': module to transform the dataset into balanced Panel Data /
    Transform the unbalanced Panel Data into balanced Panel Data / with sample
    range specified by option range.  / KW: panel data / KW: balanced panel /
    Requires: Stata version 8.2 / Distribution-Date: 20091118 / Author: Yujun

xtbalance2 from http://fmwww.bc.edu/RePEc/bocode/x
    'XTBALANCE2': module to create a balanced subsample from unbalanced panel
    data / xtbalance2 creates an indicator variable to identify a balanced /
    subsample from an unbalanced dataset. The program tries to / maximise the
    numbers of observations with respect to either the / time dimension of the


Duy To

Join Date: Sep 2020

Posts: 43
#4

02 Oct 2022, 16:32

Hi Carlo, that is exactly what I was looking for. Thank you for that. But could you please explain more about why my original thought was a bad idea?

Thank you Justin for your suggestions as well. I will have a closer look at it.

Last edited by Duy To; 02 Oct 2022, 16:36.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17678
#5

02 Oct 2022, 22:58

Duy:
there are different types of missing data (missing completely at random, missing at random and missing not at random) that are not equivalent and should be dealt with differently.
If you do not diagnose the mechanism that drives the missingness, you are ignoring this feature of your dataset and running your regression as a complete case analysis (that is, on those panels which have no missing data over the entire timespan of your dataset). But your complete case analysis focuses only on the "cream" of your original sample and, as such, the inference you make is not valid for the panels with missing data which are present in your dataset.
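
One rough way to look at the "cream" problem in the data is to compare panels observed in every year with those that are not. The sketch below does this for nlswork. The variable names `complete`, `tag` and `mean_wage` are illustrative, and the cutoff of 15 is the number of survey years in that dataset. A difference in means is only descriptive evidence that the complete cases are not representative; it does not identify which missingness mechanism is at work.

```stata
use "https://www.stata-press.com/data/r17/nlswork.dta", clear

* Flag panels observed in all 15 survey years (complete cases)
bysort idcode: generate byte complete = (_N == 15)

* Mark one observation per panel so each woman is counted once
egen byte tag = tag(idcode)

* How many panels are complete versus incomplete?
tabulate complete if tag

* Panel-level mean of log wage, then compare the two groups
bysort idcode: egen mean_wage = mean(ln_wage)
ttest mean_wage if tag, by(complete)
```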

Kind regards,
Carlo
(Stata 19.0)
Linh Nguyen

Join Date: Nov 2017

Posts: 85
#6

27 Jul 2023, 21:24

Carlo Lazzaro:
I know that Stata can handle both balanced and unbalanced panel datasets. However, I wonder whether the regression results are still BLUE if I use Stata regression commands (e.g., xtreg y x1 x2, fe vce(cluster cvar)) on unbalanced panel datasets.

--------------------
(Stata 15.1 MP)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17678
#7

28 Jul 2023, 00:09

Linh:
yes, as far as I know.

Kind regards,
Carlo
(Stata 19.0)
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#8

28 Jul 2023, 15:42

Linh: If this is not a question on a problem set or take-home exam then I will provide a complete answer. The partial answer is: not necessarily.
Linh Nguyen

Join Date: Nov 2017

Posts: 85
#9

28 Jul 2023, 22:09

Jeff Wooldridge
The problem is just an argument between me and one of my colleagues. I said that Stata can handle both balanced and unbalanced panel datasets. However, he said no, because the regression results or predictions will not be exact (not BLUE). I do not know of any document or article stating that results from Stata commands for unbalanced panel data are still BLUE.

If you can provide a complete answer, that would be wonderful.

--------------------
(Stata 15.1 MP)
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#11

28 Jul 2023, 22:32

The FE estimator is BLUE only when you can rule out serial correlation and heteroskedasticity in the idiosyncratic errors (and assuming the conditional mean is correctly specified). This is true on both balanced and unbalanced panels, provided the selection can be ignored. If you need to use vce(cluster id) then the estimator is not BLUE in general. The exact properties are not often discussed. I discuss the asymptotic efficiency in Chapter 10 of my MIT Press book, and then in the later chapter on sample selection problems. If X is all covariates and S is the set of all complete-case indicators, you can show that, under the assumptions that E(Y|X,S,C) does not depend on S and Var(Y|X,S,C) has the usual scalar form, FE is BLUE conditional on X and S.
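
As a rough illustration of the point about vce(cluster id), the sketch below fits the same FE model on the unbalanced nlswork panel with conventional and with cluster-robust standard errors. The outcome and regressors (ln_wage, age, tenure) and the stored-estimate names are chosen only for illustration. The coefficients are identical in both fits; only the standard errors change. A large gap between the two sets of standard errors suggests serial correlation or heteroskedasticity, in which case the scalar-variance condition fails. Neither fit says anything about whether selection can be ignored.

```stata
use "https://www.stata-press.com/data/r17/nlswork.dta", clear
xtset idcode year

* FE with conventional standard errors (these rely on the scalar-variance assumption)
xtreg ln_wage age tenure, fe
estimates store fe_conv

* Same model with standard errors clustered on the panel identifier
xtreg ln_wage age tenure, fe vce(cluster idcode)
estimates store fe_clus

* Side-by-side coefficients and standard errors, plus panel-size summaries
estimates table fe_conv fe_clus, b(%9.4f) se(%9.4f) stats(N N_g g_min g_max)
```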
Linh Nguyen

Join Date: Nov 2017

Posts: 85
#12

29 Jul 2023, 02:08

Thank you so much.

--------------------
(Stata 15.1 MP)