Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Lag/Forward Operator in Unbalanced Panel

    Dear Statalist:

    Consider a panel data set with monthly data in which the variable $x$ in month $t+1$ reports an outcome of month $t$. Suppose the panel is unbalanced such that Jan, Feb, and Apr, but not Mar, are observed.

    To sync outcomes with calendar dates, I use the forward operator in this example:
    ***
    gen outcome = f.x
    ***

    Given that Mar is not in my data set, I am dropping the information about the outcome (reported in Apr) even though it is observed.

    I am looking for a clever way to generate an observation for Mar that contains this information. The fillin command would be an option to fill in all gaps in the panel, but given that my data set is very large I only want to fill in obs for which the outcome is available.

    Thanks,
    Peter


  • #2
    You'll increase your chances of a useful answer if you follow the faq on asking questions - provide Stata code in code delimiters, Stata output, and sample data using dataex.

    There are better ways to handle missing data - look at the multiple imputations documentation.

    I don't fully understand your problem. You can always do multiple leads or lags with conditions (if outcome==. or whatever).

    Comment


    • #3
      If I understand the problem correctly, some months of the year never appear in the data (if for example data is not collected in March). If there are no missing observations otherwise, you could simply generate a sequential date and use that when declaring the data as panel.

      If there are some real missing observations in the data on top of the always missing March observations, you have to create the equivalent of a monthly business calendar using all the information found in the data. To understand some of the code, you have to be familiar with the information in help datetime.

      Code:
      * create fake data for 10 panel over 120 months.
      clear
      set seed 2133
      set obs 10
      gen long panel = _n
      expand 120
      bysort panel: gen mdate = ym(1999, 12) + _n
      format %tm mdate
      gen x = runiform()
      
      * drop some random months and obs in March
      drop if runiform() < .1
      gen m = month(dofm(mdate))
      drop if m == 3
      tab m
      save "statalist_example.dta", replace
      
      * reduce to one observation per month in the data
      contract mdate, freq(mdate2use)
      replace mdate2use = _n
      
      * merge new monthly date with main dataset
      merge 1:m mdate using "statalist_example.dta", assert(match) nogen
      
      * declare data as panel
      tsset panel mdate2use
      
      * tag real missing using lags
      bysort panel: gen realmis = mi(L.x) & _n > 1

      Comment


      • #4
        Thanks for your help and sorry for the delay in responding. In order to clarify my problem, I have adopted Robert's example as follows:

        Code:
        * create fake data for 10 panel over 120 months.
        clear
        set seed 2133
        set obs 10
        gen long panel = _n
        expand 120
        bysort panel: gen mdate = ym(1999, 12) + _n
        format %tm mdate
        gen x = runiform()
        
        * drop some random months and obs in March
        drop if runiform() < .1
        gen m = month(dofm(mdate))
        *drop if m == 3
        tab m
        
        * declare data as panel
        tsset panel mdate
        
        * copy period t outcome reported in t+1 to period t
        bysort panel: gen outcome = f.x
        
        list in 25/28
        In this example, some month are randomly missing and thus the number of observations per month varies. Now suppose that the x reported in month-year t+1 refers to an outcome in month-year t. I want that this outcome reported in t+1 is recorded in t, hence the use of the forward operator.

        If you use the same seed as in my example, you will see that 2002m4 is missing for panel 1 (i.e. there is no observation). However, 2002m5 is observed and the value reported in that month contains the outcome of x in 2002m4. By using the forward operator and because there is no observation for 2002m4, STATA discards the value of x observed in 2002m5.

        What I want to do now is generate a new observation for 2002m4 that contains this value.

        Comment


        • #5
          Various further fixes and dodges are possible, but one simple device here is to think that this is an interpolation problem. In the language of mipolate (SSC) (see e.g. https://www.statalist.org/forums/for...-interpolation) I think you want backward interpolation.

          Comment


          • #6
            I am not sure whether mipolate helps here, since it will only interpolate a missing value (the outcome in March in my example) but not add the missing observation.

            As I wrote in my first post, the "fillin" command would do the job:

            Code:
            * create fake data for 10 panel over 120 months.
            clear
            set seed 2133
            set obs 10
            gen long panel = _n
            expand 120
            bysort panel: gen mdate = ym(1999, 12) + _n
            format %tm mdate
            gen x = runiform()
            
            * drop some random months and obs in March
            drop if runiform() < .1
            gen m = month(dofm(mdate))
            *drop if m == 3
            tab m
            
            * declare data as panel
            tsset panel mdate
            
            * copy period t outcome reported in t+1 to period t
            bysort panel: gen outcome = f.x
            
            list in 25/29
            
            gen y = year(dofm(mdate))
            fillin panel m y
            replace mdate = ym(y,m)
            
            sort panel mdate
            
            bysort panel: replace outcome = f.x
            
            list in 25/29
            However, my data set is very large and there is lots of attrtion. So I only want to fill in obs for which the outcome is available.

            Thanks for your help,
            Peter

            Comment


            • #7
              You're right that mipolate won't add new observations. fillin first.

              Comment


              • #8
                Many thanks for your help, Nick! It seems that tsfill is a more efficient option than fillin because it only fills in observations between the first and the last occurence of a panel observation (suppose you have 20 years of monthly survey data and a survey respondent participates in only 3 month, then fillin would create a lot of observations with missings; given that my panel has more than a millon observations and is highly unbalanced, you would create tens of thousands of observations with misisngs and computing time would be too long).

                However, tsfill does not add observations before the first occurence of a panel observation. Going back to my example, the first observation on panel 1 is in 2000m2 and since x is the outcome of a variable in 2000m1, I would like to add an observation for that month too.

                Ideally, the code that I am looking for would create a new time observation for each panel in case there is none only if f.x is non-missing.
                Last edited by Peter Zorn; 29 Jun 2017, 04:32.

                Comment


                • #9
                  Hi everyone,

                  I am struggling with the lag/future operators. When I use the lag operator, I seem to get the future value (Lagged_Excess_USD_w) . When I use the future value, I get the lagged value (Future_Excess_USD_w).

                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input long Permo float(ymdate Excess_USD_w Lagged_Excess_USD_w Future_Excess_USD_w)
                  1 517 -.009649734 . .07638893
                  1 518 .07638893 -.009649734 -.04564878
                  1 519 -.04564878 .07638893 .04588351
                  1 520 .04588351 -.04564878 -.06459396
                  1 521 -.06459396 .04588351 .00938858
                  1 522 .00938858 -.06459396 -.0007833333
                  1 523 -.0007833333 .00938858 -.03948195
                  1 524 -.03948195 -.0007833333 .00965719
                  1 525 .00965719 -.03948195 -.011107817
                  1 526 -.011107817 .00965719 .
                  1 535 -.007147548 . .008691667
                  1 536 .008691667 -.007147548 .06196634

                  I actually need to calculate the future value, but I want to make sure that I do it correctly. Thank you.

                  Comment

                  Working...
                  X