
  • Splitting spell data by year

    Dear all,

    I am once again at my wits' end. To supplement my panel data, I'd like to incorporate some information from a spell data set that looks like this:
    pid beginy endy spelltyp
    201 1984 1990 (3) lives with partner
    201 1990 1991 (4) Partner not in Household
    201 1991 1995 (5) single
    201 1995 2008 (4) lives with partner

    To do so, I have to split the spells by year to yield something like this:
    pid year spelltyp
    201 1990 (4) partner not in household
    201 1991 (4) partner not in household
    I tried variations of this code:

    gen duration = endy - beginy + 1
    expand duration
    bysort pid: gen x = _n - 1
    gen syear = beginy + x

    but I always end up with implausible results (in the example above, the years 1991 and 1993) and/or too few rows (1 instead of 2, 5 instead of 14).

    What am I doing wrong?



  • #2
    It appears that the spell is not inclusive of the endpoint, in which case:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int(pid beginy endy) str28 spelltyp
    201 1984 1990 "(3) lives with partner"      
    201 1990 1991 "(4) Partner not in Household"
    201 1991 1995 "(5) single"                  
    201 1995 2008 "(4) lives with partner"      
    end
    
    bys pid (beginy): assert beginy==endy[_n-1] if _n>1
    gen toexpand= endy- beginy
    expand toexpand
    bys pid (beginy): gen year= cond(_n==1, beginy[1], beginy[1]+_n-1), after(pid)
    drop beginy endy toexpand
    Make sure that the -assert- (the first line after the data input) passes before proceeding.


    Result:

    Code:
    . l, sepby(pid spelltyp)
    
         +-------------------------------------------+
         | pid   year                       spelltyp |
         |-------------------------------------------|
      1. | 201   1984         (3) lives with partner |
      2. | 201   1985         (3) lives with partner |
      3. | 201   1986         (3) lives with partner |
      4. | 201   1987         (3) lives with partner |
      5. | 201   1988         (3) lives with partner |
      6. | 201   1989         (3) lives with partner |
         |-------------------------------------------|
      7. | 201   1990   (4) Partner not in Household |
         |-------------------------------------------|
      8. | 201   1991                     (5) single |
      9. | 201   1992                     (5) single |
     10. | 201   1993                     (5) single |
     11. | 201   1994                     (5) single |
         |-------------------------------------------|
     12. | 201   1995         (4) lives with partner |
     13. | 201   1996         (4) lives with partner |
     14. | 201   1997         (4) lives with partner |
     15. | 201   1998         (4) lives with partner |
     16. | 201   1999         (4) lives with partner |
     17. | 201   2000         (4) lives with partner |
     18. | 201   2001         (4) lives with partner |
     19. | 201   2002         (4) lives with partner |
     20. | 201   2003         (4) lives with partner |
     21. | 201   2004         (4) lives with partner |
     22. | 201   2005         (4) lives with partner |
     23. | 201   2006         (4) lives with partner |
     24. | 201   2007         (4) lives with partner |
         +-------------------------------------------+
    
    .
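A note on why the code in #1 misbehaves: `bysort pid: gen x = _n - 1` numbers the rows across all of a person's spells, so only the first spell gets sensible years. Below is a minimal sketch of a per-spell counter instead. It assumes, as #2 does, that endy belongs to the following spell, and that pid and beginy together identify a spell; the variable name nyears is introduced here only for illustration.

```stata
clear
input int(pid beginy endy) str28 spelltyp
201 1984 1990 "(3) lives with partner"
201 1990 1991 "(4) Partner not in Household"
201 1991 1995 "(5) single"
201 1995 2008 "(4) lives with partner"
end

* Assumption: a spell covers the years beginy through endy - 1
gen int nyears = endy - beginy

* expand keeps a single copy when nyears <= 1, so spells with
* beginy == endy need a separate decision; the example has none
assert nyears >= 1

* Assumption: (pid, beginy) uniquely identifies a spell
isid pid beginy

expand nyears

* Count within each spell, not within each person
bysort pid beginy: gen int year = beginy + _n - 1

drop beginy endy nyears
sort pid year
list, sepby(pid spelltyp)
```

Because the counter restarts at each spell's own beginy, this variant does not rely on consecutive spells being contiguous; years that fall into a gap are simply absent from the result.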



    • #3
      See https://journals.sagepub.com/doi/pdf...867X1301300116 for more discussion of Andrew Musau's method.



      • #4
        O.P. does not explain what "supplement my panel data with additional data" means. But it is plausible that she intends to -merge- the file she shows with a panel data set that she is working with. In that case, depending on the details, it may not be necessary to do the expansion shown in #2 and commented on in #3. It might be easier to do this with Robert Picard's -rangejoin- command (available from SSC):

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int(pid beginy endy) str28 spelltyp
        201 1984 1990 "(3) lives with partner"      
        201 1990 1991 "(4) Partner not in Household"
        201 1991 1995 "(5) single"                  
        201 1995 2008 "(4) lives with partner"      
        end
        
        by pid (beginy), sort: assert endy == beginy[_n+1] if _n < _N
        replace endy = endy-1
        
        rangejoin year beginy endy using panel_data_set, by(pid)
        This will do the same thing as expansion followed by -merge-ing on pid and year.

        Note: To use -rangejoin- it is also necessary to install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
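For comparison, the expansion-then-merge route that #4 describes as equivalent can be sketched with built-in commands only. The toy panel and its variable income are hypothetical, and the sketch assumes contiguous spells in which endy belongs to the next spell.

```stata
* Hypothetical toy panel standing in for the real panel data set
clear
input int(pid year) double income
201 1989 100
201 1990 110
201 1991 120
end
tempfile panel
save `panel'

* Spell data from the example
clear
input int(pid beginy endy) str28 spelltyp
201 1984 1990 "(3) lives with partner"
201 1990 1991 "(4) Partner not in Household"
201 1991 1995 "(5) single"
201 1995 2008 "(4) lives with partner"
end

* One row per spell-year, with year counted within each spell
gen int nyears = endy - beginy
expand nyears
bysort pid beginy: gen int year = beginy + _n - 1
keep pid year spelltyp

* Keep every panel observation, matched or not; drop spell-years
* that have no counterpart in the panel
merge 1:1 pid year using `panel', keep(match using) nogenerate
sort pid year
list
```

The merge is 1:1 on pid and year, so it will stop with an error if overlapping spells produce the same person-year twice, which is a useful check in its own right.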



        • #5
          Thanks for the great answers! However, in both cases the assert is false. I'm assuming that some spells might have the same year as beginy and endy. How can I proceed anyway? Andrew Musau Clyde Schechter Thanks in advance!



          • #6
            An observation with beginy == endy would not cause the -assert- to be false. The -assert- is looking at endy in one observation and beginy in the next observation. The point being that in the example data shown, it seems that the spells overlap on the end years. You can't really proceed with what you are doing if there is overlap in these spells. For example, taking the data example shown literally, there is no way to ascertain the spell type in years 1990, 1991, or 1995 because in each of those years the data specifies two contradictory spell types, and we have no way to tell which is the right one. The idea behind the code is to disambiguate by assuming that the end year actually belongs to the later spell.

            Now, the failure of the assert can occur in two different ways. If the beginy value in the next observation is larger than endy in the current observation, that means there is a gap in the data. This is not a serious problem--it just means that when you -merge- or -rangejoin- it to the panel data set, some years will go unmatched.

            On the other hand, if the beginy value in the next observation is smaller than endy in the current observation, we have an overlap that is even longer than just that one year, meaning that the data are contradictory for even longer periods--which is really untenable.

            On the optimistic view that the -assert- failures are all due to gaps, and not to extensive overlap, you can just change the first two lines of the code in #4 to:
            Code:
            by pid (beginy), sort: assert endy <= beginy[_n+1] if _n < _N
            by pid (beginy): replace endy = endy-1 if endy == beginy[_n+1]
            If this assert still fails, then you have substantially self-contradictory data and cannot proceed. The data will need to be fixed by reference to some other, consistent source of information.
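To see which case applies before changing anything, the transitions between consecutive spells can be classified. This is a minimal sketch on made-up data in which pid 201 has one contiguous transition, one gap, and one overlap; the variables nextbegin and link are names introduced here for illustration.

```stata
* Made-up spells: contiguous, then a gap, then an overlap
clear
input int(pid beginy endy)
201 1984 1990
201 1990 1991
201 1993 1995
201 1994 2008
end

* Start year of the person's next spell (missing for the last spell)
bysort pid (beginy): gen int nextbegin = beginy[_n+1]

gen str10 link = "last" if missing(nextbegin)
replace link = "contiguous" if nextbegin == endy
* Missing values sort above all numbers, so exclude them explicitly
replace link = "gap" if nextbegin > endy & !missing(nextbegin)
replace link = "overlap" if nextbegin < endy

tabulate link
list pid beginy endy nextbegin if link == "overlap"
```

Only the rows flagged as overlap are the ones that make the relaxed -assert- fail; rows flagged as gap merely leave some panel years unmatched.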

