

  • How to generate a region-year unbalanced panel where I only want to drop region-year cells below a certain total observation threshold?

    Dear Statalist,

    I would like to produce an unbalanced region-year panel for which I drop only those region-year cells below a certain total observation threshold (50 observations).
    I tried the following code:
    fillin region year 
    bysort region year: egen freq = total(!_fillin) 
    by region year: egen minfreq = min(freq)
    drop if minfreq <50 
    drop if _fillin
    After running it, I still have some region-year cells with fewer than 50 observations, although I thought I had dropped those cells.
    Please note that I do not want to drop the entire region when a region-year cell is below 50, I just want to drop that particular year for a given region.
    This should generate an unbalanced panel dataset.
    I would really appreciate your help.
    Have a great day!

  • #2
    Well, the only thing that strikes me about your code is the unnecessary creation of the variable minfreq, which is always equal to freq. Other than that, I don't see why it wouldn't work, and when I tried to replicate your difficulty with a toy data set, I could not reproduce the problem. See:
    . //      CREATE TOY DATA SET
    . clear*
    . set obs 10
    Number of observations (_N) was 0, now 10.
    . gen region = _n
    . expand 10
    (90 observations created)
    . by region, sort: gen year = 2004 + _n
    . set seed 1234
    . drop if runiform() < 0.25
    (29 observations deleted)
    . gen n_obs = runiformint(40, 100)
    . expand n_obs
    (4,750 observations created)
    . by region year, sort: gen obs_no = _n
    . //      RUN THE (MODIFIED) CODE
    . fillin region year
    . bysort region year: egen freq = total(!_fillin)
    . drop if freq < 50 // NOTE THE ELIMINATION OF MINFREQ
    (658 observations deleted)
    . drop if _fillin
    (0 observations deleted)
    . by region year, sort: assert _N >= 50 & !_fillin
    The absence of any response from Stata following the final command verifies that each region-year combination has at least 50 observations that weren't filled in.
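If the assert did fail on real data, one way to see which cells are responsible is to reduce the data to one row per cell and list the small ones. This is only a minimal sketch: it assumes the data in memory contain variables named region and year, as in this thread, and it relies on -contract- storing the cell count in _freq by default.

```stata
* Assumption: the data in memory contain variables region and year
preserve
contract region year                  // one row per region-year cell; count is in _freq
list region year _freq if _freq < 50, sepby(region)
restore                               // bring back the full data set
```

An empty listing means no region-year cell has fewer than 50 observations.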

    All of that said, I don't understand why you are doing this. Why use -fillin- in the first place when, in the end, you drop all of the observations it creates anyway? Wouldn't you do just as well by running -by region year, sort: drop if _N < 50- on the original data?
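The simpler approach can be sketched on a small toy data set. The data below are made up purely for illustration (3 regions, 4 years, cell sizes drawn between 40 and 60), and the 50-observation threshold is the one from the original question.

```stata
* Hypothetical toy data: 3 regions x 4 years, 40 to 60 observations per cell
clear
set seed 1234
set obs 3
gen region = _n
expand 4
by region, sort: gen year = 2004 + _n
gen n_obs = runiformint(40, 60)
expand n_obs

* Drop only the region-year cells with fewer than 50 observations
by region year, sort: drop if _N < 50

* Verify: every remaining region-year cell has at least 50 observations
by region year, sort: assert _N >= 50
```

Because _N under the by: prefix is the number of observations in the current region-year cell, only the undersized cells are dropped and the other years of the same region are kept, which leaves an unbalanced panel.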


    • #3
      Hi Clyde,
      I would like to thank you once again for your help; this is not the first time you have helped me.
      I will have a look at my dataset. It seems I overlooked something somewhere.
      And yes, I think you are correct: I made things unnecessarily complicated, as your last coding suggestion makes clear.
      Have a great day!
      And thank you very much.

