  • How to generate a region-year unbalanced panel where I only want to drop region-year cells below a certain total observation threshold?

    Dear Statalist,

    I would like to produce an unbalanced region-year panel for which I drop only those region-year cells below a certain total observation threshold (50 observations).
    I tried the following code:
    Code:
    fillin region year 
    bysort region year: egen freq = total(!_fillin) 
    by region year: egen minfreq = min(freq)
    drop if minfreq <50 
    drop if _fillin
    After running it, I still have some region-year cells with fewer than 50 observations, although I thought I had dropped those cells.
    Please note that I do not want to drop the entire region when a region-year cell is below 50, I just want to drop that particular year for a given region.
    This should generate an unbalanced panel dataset.
    I would really appreciate your help.
    Have a great day!
    Best,
    Nico


  • #2
    Well, the only thing that strikes me about your code is the unnecessary creation of the variable minfreq, which is always equal to freq because freq is already constant within each region-year group. But other than that, I don't see why it wouldn't work, and when I tried to replicate your difficulty with a toy data set, I could not reproduce the problem. See:
    Code:
    . //      CREATE TOY DATA SET
    . clear*
    
    . set obs 10
    Number of observations (_N) was 0, now 10.
    
    . gen region = _n
    
    .
    . expand 10
    (90 observations created)
    
    . by region, sort: gen year = 2004 + _n
    
    .
    . set seed 1234
    
    . drop if runiform() < 0.25
    (29 observations deleted)
    
    .
    . gen n_obs = runiformint(40, 100)
    
    . expand n_obs
    (4,750 observations created)
    
    . by region year, sort: gen obs_no = _n
    
    .
    .
    . //      RUN THE (MODIFIED) CODE
    . fillin region year
    
    . bysort region year: egen freq = total(!_fillin)
    
    . drop if freq < 50 // NOTE THE ELIMINATION OF MINFREQ
    (658 observations deleted)
    
    . drop if _fillin
    (0 observations deleted)
    
    . // VERIFY CORRECTNESS OF RESULTS
    . by region year, sort: assert _N >= 50 & !_fillin
    
    .
    The absence of any response from Stata following the final command verifies that each region-year combination has at least 50 observations that weren't filled in.

    All of that said, I don't understand why you are doing this. Why are you using -fillin- in the first place when, in the end, you drop all of the observations it creates anyway? Wouldn't you do just as well by running -by region year, sort: drop if _N < 50- in the original data?
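
    A minimal sketch of that one-line approach, run on made-up data. The cell sizes typed into -input- below are hypothetical and chosen only so that some region-year cells fall below the threshold of 50; the variable names region, year, and n_obs follow the toy example above.

    ```stata
    * Toy data: 3 regions x 2 years with hand-picked (hypothetical) cell sizes
    clear
    input region year n_obs
    1 2005 40
    1 2006 60
    2 2005 50
    2 2006 49
    3 2005 75
    3 2006 10
    end

    * One row per observation in each region-year cell
    expand n_obs

    * Drop only the region-year cells with fewer than 50 observations
    bysort region year: drop if _N < 50

    * Verify: every remaining region-year cell has at least 50 observations
    bysort region year: assert _N >= 50

    * Inspect the resulting unbalanced panel
    tabulate region year
    ```

    Under -by-, _N is the number of observations in the current region-year group, so the -drop- removes whole cells and leaves the other years of the same region untouched.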



    • #3
      Hi Clyde,
      I would like to thank you once again for your help; this was not the first time.
      I will have a look at my dataset; it seems I overlooked something somewhere.
      And yes, I think you are correct: I made things unnecessarily complicated, as your last coding suggestion makes clear.
      Have a great day!
      And thank you very much.
      Best,
      Nico
