Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Panel data with duplicates

    Dear All,
    Please, I have the following kind of panel:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(ID Year freqm2 freqm3 freqm4 freqm5 freqm6 freqm7 freqm8 freqm9 freqm10 freqm11)
    98 2002 91  4 20  .  . 75 64 80 17 77
    52 2001 53  . 70  .  .  .  . 71  .  .
    42 2000 15 34 55 48 94  .  . 78 44 52
    83 2001  5  .  .  . 64 55 56 26 57 59
    75 2002  .  .  .  .  .  . 76 32  . 92
    75 2002  3  5  7 18 22 90 76 32 88 92
    98 2002  .  .  . 14 25  .  .  .  .  .
    98 2003 10  5  .  .  .  .  .  .  .  3
    98 2003  .  . 52 36 87 95  0 47  .  .
    end
    And I would like to transform it into:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id int year byte(freqm2 freqm3 freqm4 freqm5 freqm6 freqm7 freqm8 freqm9 freqm10 freqm11)
    98 2002 91  4 20 14 25 75 64 80 17 77
    52 2001 53  . 70  .  .  .  . 71  .  .
    42 2000 15 34 55 48 94  .  . 78 44 52
    83 2001  5  .  .  . 64 55 56 26 57 59
    75 2002  .  .  .  .  .  . 76 32  . 92
    75 2002  3  5  7 18 22 90 76 32 88 92
    98 2003 10  5 52 36 87 95  0 47  .  3
    end
    freqm correspond to a value between 0-100 for attendance per month. So each column corresponds to the attendance per month (excluding January and December). But in the first data set some same IDs and years repeat themselves because some months are in one row and others in other rows.
    I want to organise this panel. Could anyone help me, please?

  • #2
    I'm a bit confused what you want to do. In general it seems that you want to reduce the data to one observation per ID per Year, and to do that by pulling the non-missing values from the otherwise duplicative observations. But for some reason that is unclear to me, in your example results you retained separate observations for ID 75 Year 2002. Is that a mistake? Or is there something special about that case that should be accounted for in the code? If so, how do we identify in the data the cases that should not be reduced?

    On the assumption that ID 75 Year 2002 was just an error, you can do this:
    Code:
    collapse (firstnm) freqm*, by(ID Year)
    I should emphasize that doing this is only sensible if you are quite sure that the same ID and Year never has multiple observations with different non-missing values for some freqm* variable. Any such contradictory observations should be resolved before doing the -collapse- so that you don't just end up with invalid data. You might want to check that first with:
    Code:
    foreach v of varlist freqm* {
        by ID Year (`v'), sort: assert `v' == `v'[1] | missing(`v')
    }
    And, looking ahead, you might want to take the next step and make this full monthly-panel data with:
    Code:
    reshape long freqm, i(ID Year) j(month)
    gen mdate = ym(Year, month)
    format mdate %tm
    drop Year month
    Most things you will want to do with this data will work better, or only, if you do this.

    Comment


    • #3
      Dear Clyde, Thank you very much!

      The ID 75 was indeed my mistake.

      What if I use

      Code:
      collapse (max)

      instead of
      Code:
      collapse (firstnm)
      Would this give me the same results?

      Regarding the monthly panel reshape, I cannot do this because my main data set is yearly. So, what I was thinking is to create annual averages out of these monthly attendances. Or creating dummies for certain thresholds of this attendances.

      Many thanks!

      Comment


      • #4
        If the non-missing values of the freqm* variables are always constant within each combination of ID and Year, then, yes -collapse (max)- will give the same result as -collapse (firstnm)-. But if sometimes there are different non-missing values of a freqm* variable for the same combination of ID and Year, then -collapse (max)- will, evidently, leave you with the largest one, whereas -collapse (firstnm)- will give you the one that is first in the current sort order.

        I understand your reasoning for not taking the data to fully long monthly layout. It seems reasonable.

        Comment


        • #5
          Thank you so much!!

          Comment

          Working...
          X