  • From unbalanced to balanced panel data set

    Dear all,

    I have an unbalanced panel dataset, which means that not all entities have data for all years.

    I am aware that it is okay to have unbalanced data to run regressions. However, I am curious about the entities that have data for every year (or should I say the repeated entities?); and I want to create a subsample from the unbalanced data which only includes those repeated entities. Am I looking for a strongly balanced data set here?

    If yes, is there any way to convert a data set from unbalanced to strongly balanced?

    Thank you for your help in advance.

  • #2
    Duy:
    bearing in mind that what you propose is, in general, a very bad idea, as the resulting subsample often has little in common with the original dataset, you may want to try something like:
    Code:
    . use "https://www.stata-press.com/data/r17/nlswork.dta"
    (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
    
    . tab year
    
      Interview |
           year |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             68 |      1,375        4.82        4.82
             69 |      1,232        4.32        9.14
             70 |      1,686        5.91       15.05
             71 |      1,851        6.49       21.53
             72 |      1,693        5.93       27.47
             73 |      1,981        6.94       34.41
             75 |      2,141        7.50       41.91
             77 |      2,171        7.61       49.52
             78 |      1,964        6.88       56.40
             80 |      1,847        6.47       62.88
             82 |      2,085        7.31       70.18
             83 |      1,987        6.96       77.15
             85 |      2,085        7.31       84.45
             87 |      2,164        7.58       92.04
             88 |      2,272        7.96      100.00
    ------------+-----------------------------------
          Total |     28,534      100.00
    
    . bysort idcode: drop if _N<15
    (27,244 observations deleted)
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)
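
    A non-destructive variant of the same idea is to flag the complete panels rather than drop the rest, so the full sample stays available for comparison. This is only a sketch: the variable name balanced and the local macro T are made up for illustration, and it assumes each idcode appears at most once per year (which isid checks).

    Code:
    use "https://www.stata-press.com/data/r17/nlswork.dta", clear

    * stop with an error if idcode-year pairs are not unique
    isid idcode year

    * count the distinct survey years instead of hard-coding 15
    quietly levelsof year, local(years)
    local T : word count `years'

    * balanced = 1 for women observed in every survey year
    bysort idcode: generate byte balanced = (_N == `T')

    * later commands can be restricted with "if balanced"
    count if balanced

    Given the 27,244 deletions reported above out of 28,534 observations, count should report 1,290 observations in the balanced subsample.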



    • #3
      Also, search xtbalance for a couple of options

      Code:
      xtbalance from http://fmwww.bc.edu/RePEc/bocode/x
          'XTBALANCE': module to transform the dataset into balanced Panel Data /
          Transform the unbalanced Panel Data into balanced Panel Data / with sample
          range specified by option range.  / KW: panel data / KW: balanced panel /
          Requires: Stata version 8.2 / Distribution-Date: 20091118 / Author: Yujun
      
      xtbalance2 from http://fmwww.bc.edu/RePEc/bocode/x
          'XTBALANCE2': module to create a balanced subsample from unbalanced panel
          data / xtbalance2 creates an indicator variable to identify a balanced /
          subsample from an unbalanced dataset. The program tries to / maximise the
          numbers of observations with respect to either the / time dimension of the
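
      Both are community-contributed commands, so they have to be installed before use. A minimal sketch, assuming they are still hosted on SSC and that xtbalance takes the range() option mentioned in its description. The xtbalance call below, and in particular how it treats the skipped survey years in this dataset, is an assumption to check against the help file.

      Code:
      ssc install xtbalance
      ssc install xtbalance2
      help xtbalance
      help xtbalance2

      * assumed syntax; inspect the result with xtdescribe afterwards
      use "https://www.stata-press.com/data/r17/nlswork.dta", clear
      xtset idcode year
      xtbalance, range(68 88)
      xtdescribe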



      • #4
        Hi Carlo, that is exactly what I was looking for. Thank you for that. But could you please explain more about why my original thought was a bad idea?


        Thank you Justin for your suggestions as well. I will have a closer look at it.
        Last edited by Duy To; 02 Oct 2022, 16:36.



        • #5
          Duy:
          there are different types of missing data (missing completely at random, missing at random and missing not at random) that are not equivalent and should be dealt with differently.
          If you do not diagnose the mechanism that drives the missingness, you are actually ignoring this feature of your dataset and running your regression as a complete case analysis (that is, on those panels which have no missing data over the entire timespan your dataset covers). But your complete case analysis is focused only on the "cream" of your original sample and, as such, the inference you make is not valid for the panels with missing data which are present in your dataset.
          Kind regards,
          Carlo
          (Stata 19.0)
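
          A first, purely descriptive look at whether the complete panels differ from the rest can be sketched as follows. It is not a formal test of the missingness mechanism; the variable names complete and first are made up for illustration, and 15 is the number of survey years in nlswork.dta.

          Code:
          use "https://www.stata-press.com/data/r17/nlswork.dta", clear

          * complete = 1 for women observed in all 15 survey years
          bysort idcode: generate byte complete = (_N == 15)

          * observation-level comparison of outcome and covariates
          tabstat ln_wage age grade ttl_exp tenure, by(complete) statistics(mean sd n)

          * person-level comparison: keep one row per woman
          egen byte first = tag(idcode)
          tabulate race complete if first, column

          Marked differences between the two groups would suggest that the complete-case subsample is not representative of the original one.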



          • #6
            Carlo Lazzaro:
            I know that Stata can handle both balanced and unbalanced panel datasets. However, I wonder whether the regression results are still BLUE if I use Stata regression commands (e.g., xtreg y x1 x2, fe vce(cluster cvar)) on unbalanced panel datasets.
            --------------------
            (Stata 15.1 MP)



            • #7
              Linh:
              yes, as far as I know.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Linh: If this is not a question on a problem set or take-home exam then I will provide a complete answer. The partial answer is: not necessarily.



                • #9
                  Jeff Wooldridge:
                  The problem is just an argument between me and a colleague. I said that Stata can handle both balanced and unbalanced panel datasets. He disagreed, arguing that the regression results or predictions will not be exact (not BLUE). I do not know of any document or article stating that results from Stata commands for unbalanced panel data are still BLUE.

                  If you can provide a complete answer, that would be wonderful.
                  --------------------
                  (Stata 15.1 MP)



                    • #11
                      The FE estimator is BLUE only when you can rule out serial correlation and heteroskedasticity in the idiosyncratic errors (and assuming the conditional mean is correctly specified). This is true on both balanced and unbalanced panels, provided the selection can be ignored. If you need to use vce(cluster id), then the estimator is not BLUE in general. The exact properties are not often discussed. I discuss the asymptotic efficiency in Chapter 10 of my MIT Press book, and then in the later chapter on sample selection problems. If X denotes all covariates and S all complete-case indicators, you can show that, under the assumptions that E(Y|X,S,C) does not depend on S and Var(Y|X,S,C) has the usual scalar form, FE is BLUE conditional on X and S.
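
                      One way to write out the two conditions in this post, as a sketch only: the subscripted notation, the selection indicator s_{it} (equal to 1 when observation (i,t) is used) and the symbol \sigma_u^2 are assumptions added for illustration and do not come from the thread.

                      \[
                      y_{it} = x_{it}\beta + c_i + u_{it}, \qquad t = 1, \dots, T, \qquad s_{it} \in \{0, 1\}
                      \]
                      \[
                      E(y_{it} \mid X_i, S_i, c_i) = x_{it}\beta + c_i, \qquad
                      \operatorname{Var}(y_i \mid X_i, S_i, c_i) = \sigma_u^2 I_T
                      \]

                      The first condition says that the conditional mean is correctly specified and does not depend on S_i, so selection can be ignored. The second is the scalar variance form: no heteroskedasticity and no serial correlation in u_{it}. When both hold, FE applied to the observations with s_{it} = 1 is BLUE conditional on (X_i, S_i). When only the first holds, the second is exactly what vce(cluster id) guards against, and FE is then no longer BLUE in general.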



                      • #12
                        Thank you so much.
                        --------------------
                        (Stata 15.1 MP)

