  • Mixed frequency data in xtdpdgmm package

    Dear Statalisters and Sebastian Kripfganz,


    I have a panel data set that consists of weekly observations of income and monthly observations of a depression score for 800 individuals. I would like to estimate a dynamic GMM model with depression_score as the dependent variable and its lag and income as independent variables. However, I observe these variables at different frequencies (irregular time intervals), and I have missing values. For example, individual 1 has two mental health interviews, at week 5 (wave 1) and week 10 (wave 2), while individual 2 has the same interviews at week 3 (wave 1) and week 7 (wave 2). Thus, I cannot decide how to define the time variable in xtset for xtdpdgmm (week or wave?). I could keep only the observations with nonmissing depression_score and use week as the time variable. Would xtdpdgmm then take into account the time elapsed between two interviews, which differs across individuals? Individual 1 has a 5-week gap between Y and L.Y, while individual 2 has a 4-week gap. Would that be a problem for the estimation?

    Or should I use wave instead? This keeps the time lag between Y and L.Y the same for all individuals (weeks 3 and 5 in wave 1, weeks 7 and 10 in wave 2). Do you have any suggestions?


    Code:
    input float(id week wave) double income float depression_score
    1 1 1  100 .
    1 2  1 . .
    1 3  1 50 .
    1 4  1 . .
    1 5  1 60 12
    1 6  2 . .
    1 7  2 . .
    1 8  2 80 .
    1 9  2 . .
    1 10 2  100 10
    2  1 1  . .
    2  2 1 50 .
    2  3 1  90 8
    2  4 1  . .
    2  5 1  60 .
    2  6 2  . .
    2  7 2  100 12
    2  8 2  . .
    2  9 2  . .
    2 10 2  . .
    
    end
    Thanks in advance!

    Best regards,
    Last edited by Nursena Sagir; 20 Jan 2022, 07:38.

  • #2
    I believe you would need to collapse your data by wave, e.g.
    Code:
    collapse (sum) income (firstnm) depression_score, by(id wave)
    xtset id wave
    xtdpdgmm and other panel data commands do not automatically ignore the missing observations when calculating the differences.
    https://www.kripfganz.de/stata/


    • #3
      Thanks Sebastian. I have one further question. If I have missing values in the wave variable, will xtdpdgmm or any other panel data command treat L.Y as Y[_n-1] or as Y at wave[_n-1], which are not necessarily the same? Depending on that, would it make sense to create a variable like "gen lag_Y = L.Y if wave[_n] == wave[_n-1] + 1" to keep only the true lag of Y in the regression?


      • #4
        With your current data set, you probably have set week as the time variable. L.Y would then correspond to Y in the previous week, not wave.

        Also, if wave[_n] is nonmissing in your case, then wave[_n-1] is always missing. Thus, you cannot generate the lag of Y in the way you proposed. As an alternative to collapse, you could drop all observations for which wave is missing, and then set wave as the time identifier:
        Code:
        drop if missing(wave)
        xtset id wave
        However, you would then lose some information from the income variable.
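
One way to limit that loss, sketched here under assumptions, is to aggregate income within each wave before dropping rows. The variable name income_wave and the four toy rows are illustrative only, and the drop uses missing(depression_score), following the correction given in #6.

```stata
* Toy rows for one individual, taken from the example in #1
clear
input id week wave income depression_score
1 3 1 50 .
1 5 1 60 12
1 8 2 80 .
1 10 2 100 10
end

* Total income per individual and wave; with the missing option the
* total is missing (not 0) when every income value in the wave is missing
bysort id wave: egen income_wave = total(income), missing

* Keep one row per interview, then declare wave as the time variable
drop if missing(depression_score)
xtset id wave
list id wave income_wave depression_score

* Hypothetical specification for the full data set (needs more than two
* waves; treating income_wave as predetermined is an assumption):
* xtdpdgmm L(0/1).depression_score income_wave, model(diff) collapse ///
*     gmm(depression_score, lag(2 .)) gmm(income_wave, lag(1 .)) twostep vce(robust)
```

Whether a sum, a mean, or the last weekly value is the right wave-level income measure depends on the research question; the sum is used here only to mirror the collapse suggestion in #2.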
        https://www.kripfganz.de/stata/


        • #5
          Originally posted by Sebastian Kripfganz
          ... you could drop all observations for which wave is missing, and then set wave as the time identifier:
          Code:
          drop if missing(wave)
          xtset id wave
          However, you would then lose some information from the income variable.
          Yes, this is exactly what I did. My question was: if I drop the missing waves with this command, how do panel data commands treat L.Y? Let's say individual 1 has nonmissing Y values in wave 1 and wave 3, and we drop the missing wave 2. What is L.Y in the estimation for Y at wave 1 then? Is it wave 3, or does Stata automatically drop individual 1 because there is no wave 2?

          Moreover, so as not to lose information about the income variable, I can generate the required lag of income as follows: gen lag_income=income[_n-1]. I think this is the most efficient way. What do you think?


          • #6
            I just noticed that in my previous post it should have been missing(depression_score), not missing(wave).

            If wave is the time identifier and it can take values 1, 2, 3, then for wave 3 the operation L.Y refers to Y from wave 2, irrespective of whether that observation is available. If that observation is not available, then L.Y is missing for wave 3, and the wave-3 observation will not be included in any calculation that uses L.Y.
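
A small check of this behavior, using made-up numbers in which individual 1 has no wave 2 (the values and the name lag_Y are illustrative, not from the original data):

```stata
* Toy data (made up): individual 1 is observed in waves 1 and 3 only
clear
input id wave Y
1 1 12
1 3 10
2 1 8
2 2 12
2 3 9
end

xtset id wave
generate lag_Y = L.Y
list id wave Y lag_Y, sepby(id)

* lag_Y is missing for id 1 at wave 3 because wave 2 is absent;
* the wave-1 value is not carried over into its place
```

Any estimation command that uses L.Y would therefore skip the wave-3 row of individual 1 rather than pair it with wave 1.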

            When working with panel data, you need to be careful with the _n indicator. _n-1 always refers to the previous row in the data set. Yet, this could belong to a different individual. Ideally, define your time variable appropriately and use the time-series lag operator. Alternatively, use the prefix by id:
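
The three ways of building a lag can be compared side by side. This is a minimal sketch with invented income values; lag_row, lag_by, and lag_ts are hypothetical names chosen for the illustration.

```stata
* Toy data (made up): two individuals, two waves each
clear
input id wave income
1 1 110
1 2 180
2 1 140
2 2 100
end

sort id wave

* Plain _n-1: previous row in the data set, regardless of individual.
* For id 2 at wave 1 this picks up 180, which belongs to id 1.
generate lag_row = income[_n-1]

* by-prefix: previous row within the same individual.
* The first row of each id is missing, as it should be.
bysort id (wave): generate lag_by = income[_n-1]

* Time-series operator: value from the previous wave within the same id.
xtset id wave
generate lag_ts = L.income

list, sepby(id)
```

With no gaps in wave, lag_by and lag_ts agree. If a wave were missing for some individual, lag_by would still take the previous available row, while lag_ts would be missing, so the time-series operator is the safer choice for a true one-period lag.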
            https://www.kripfganz.de/stata/


            • #7
              Thanks Sebastian, your answer perfectly clarifies my question.
