Splitting spell data by year

Juliane Dold

Join Date: Mar 2024

Posts: 11
#1

08 Apr 2024, 13:13

Dear all,

I am once again at my wits' end. To supplement my panel data with additional data, I'd like to incorporate some information from a spell data set, which looks like this:

pid  beginy  endy  spelltyp
201  1984    1990  (3) lives with partner
201  1990    1991  (4) Partner not in Household
201  1991    1995  (5) single
201  1995    2008  (4) lives with partner

To do so, I have to split up the spells by year to yield something like this:

pid  year  spelltyp
201  1990  (4) partner not in household
201  1991  (4) partner not in household

I tried variations of this code:

gen duration = endy - beginy + 1
expand duration
bysort pid: gen x = _n - 1
gen syear = beginy + x

but I always end up with unrealistic results (in the example above, the years 1991 and 1993) and/or too few rows (1 instead of 2, 5 instead of 14).

What am I doing wrong?

Andrew Musau

Join Date: Oct 2014
Posts: 9945
#2

08 Apr 2024, 15:21

It appears that the spell is not inclusive of the endpoint, in which case:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(pid beginy endy) str28 spelltyp
201 1984 1990 "(3) lives with partner"      
201 1990 1991 "(4) Partner not in Household"
201 1991 1995 "(5) single"                  
201 1995 2008 "(4) lives with partner"      
end

bys pid (beginy): assert beginy==endy[_n-1] if _n>1
gen toexpand= endy- beginy
expand toexpand
bys pid (beginy): gen year= cond(_n==1, beginy[1], beginy[1]+_n-1), after(pid)
drop beginy endy toexpand

Make sure that the assertion (the -assert- command, the first line after the data example) holds before proceeding.

Res.:

Code:

. l, sepby(pid spelltyp)

     +-------------------------------------------+
     | pid   year                       spelltyp |
     |-------------------------------------------|
  1. | 201   1984         (3) lives with partner |
  2. | 201   1985         (3) lives with partner |
  3. | 201   1986         (3) lives with partner |
  4. | 201   1987         (3) lives with partner |
  5. | 201   1988         (3) lives with partner |
  6. | 201   1989         (3) lives with partner |
     |-------------------------------------------|
  7. | 201   1990   (4) Partner not in Household |
     |-------------------------------------------|
  8. | 201   1991                     (5) single |
  9. | 201   1992                     (5) single |
 10. | 201   1993                     (5) single |
 11. | 201   1994                     (5) single |
     |-------------------------------------------|
 12. | 201   1995         (4) lives with partner |
 13. | 201   1996         (4) lives with partner |
 14. | 201   1997         (4) lives with partner |
 15. | 201   1998         (4) lives with partner |
 16. | 201   1999         (4) lives with partner |
 17. | 201   2000         (4) lives with partner |
 18. | 201   2001         (4) lives with partner |
 19. | 201   2002         (4) lives with partner |
 20. | 201   2003         (4) lives with partner |
 21. | 201   2004         (4) lives with partner |
 22. | 201   2005         (4) lives with partner |
 23. | 201   2006         (4) lives with partner |
 24. | 201   2007         (4) lives with partner |
     +-------------------------------------------+

.
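For reference, the attempt in #1 goes wrong in two places. `duration = endy - beginy + 1` counts the end year as part of the spell, and `bysort pid: gen x = _n - 1` counts rows across all of a person's spells rather than within each spell, so the offset added to beginy keeps growing. Below is a minimal sketch of the same -expand- idea with the counter reset for each spell. The names spell_id and nyears are hypothetical helpers, and the sketch assumes, as above, that the end year belongs to the following spell. It does not rely on the spells being contiguous.

```stata
* Sketch only: spell_id and nyears are hypothetical helper names
clear
input int(pid beginy endy) str28 spelltyp
201 1984 1990 "(3) lives with partner"
201 1990 1991 "(4) Partner not in Household"
201 1991 1995 "(5) single"
201 1995 2008 "(4) lives with partner"
end

gen long spell_id = _n          // one id per original spell, created before -expand-
gen int nyears = endy - beginy  // assumes the end year belongs to the next spell
drop if nyears < 1              // assumption: a zero-length spell covers no year
expand nyears                   // nyears copies of each spell
bysort spell_id: gen int year = beginy + _n - 1  // counter restarts within each spell
sort pid year
list pid year spelltyp, sepby(pid spelltyp)
```

For the example data this gives the same 24 person-years as the listing above.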

Nick Cox

Join Date: Mar 2014

Posts: 35212
#3

08 Apr 2024, 19:32

See https://journals.sagepub.com/doi/pdf...867X1301300116 for more discussion of Andrew Musau's method.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#4

08 Apr 2024, 20:43

O.P. does not explain what "supplement my panel data with additional data" means. But it is plausible that she intends to -merge- the file she shows with a panel data set that she is working with. In that case, depending on the details, it may not be necessary to do the expansion shown in #2 and commented on in #3. It might be easier to do this with Robert Picard's -rangejoin- command (available from SSC):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(pid beginy endy) str28 spelltyp
201 1984 1990 "(3) lives with partner"
201 1990 1991 "(4) Partner not in Household"
201 1991 1995 "(5) single"
201 1995 2008 "(4) lives with partner"
end

by pid (beginy), sort: assert endy == beginy[_n+1] if _n < _N
replace endy = endy-1
rangejoin year beginy endy using panel_data_set, by(pid)

This will do the same thing as expansion followed by -merge-ing on pid and year.

Note: To use -rangejoin- it is also necessary to install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
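Both packages can be installed from within Stata. A minimal sketch, assuming an internet connection and write access to the ado directory:

```stata
* Install the two community-contributed packages from SSC
ssc install rangestat
ssc install rangejoin

* Check the syntax before use
help rangejoin
```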
Juliane Dold

Join Date: Mar 2024

Posts: 11
#5

09 Apr 2024, 12:43

Thanks for the great answers! However, in both cases the assert turns out to be false. I'm assuming that some spells might have the same year as beginy and endy. How can I proceed anyway? Andrew Musau, Clyde Schechter: thanks in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#6

09 Apr 2024, 12:59

An observation with beginy == endy would not cause the -assert- to be false. The -assert- is looking at endy in one observation and beginy in the next observation. The point being that in the example data shown, it seems that the spells overlap on the end years. You can't really proceed with what you are doing if there is overlap in these spells. For example, taking the data example shown literally, there is no way to ascertain the spell type in years 1990, 1991, or 1995 because in each of those years the data specifies two contradictory spell types, and we have no way to tell which is the right one. The idea behind the code is to disambiguate by assuming that the end year actually belongs to the later spell.

Now, the failure of the assert can occur in two different ways. If the beginy value in the next observation is larger than endy in the current observation, that means there is a gap in the data. This is not a serious problem--it just means that when you -merge- or -rangejoin- it to the panel data set, some years will go unmatched.

On the other hand, if the beginy value in the next observation is smaller than endy in the current observation, we have an overlap that is even longer than just that one year, meaning that the data are contradictory for even longer periods--which is really untenable.

On the optimistic view that the -assert- failures are all due to gaps, and not to extensive overlap, you can just change the first two lines after the data example in the code in #4 to:

Code:

by pid (beginy), sort: assert endy <= beginy[_n+1] if _n < _N
by pid (beginy): replace endy = endy-1 if endy == beginy[_n+1]

If this assert still fails, then you have substantially self-contradictory data and cannot proceed. The data will need to be fixed by reference to some other, consistent source of information.
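To see which of the two situations applies before deciding how to proceed, the spells can be flagged rather than asserted. This is a rough sketch run on the spell data set; next_beginy, gap, and overlap are hypothetical variable names, not part of the original data.

```stata
* Sketch only: next_beginy, gap and overlap are hypothetical helper variables
by pid (beginy), sort: gen next_beginy = beginy[_n+1]  // missing for a person's last spell

gen byte gap     = endy < next_beginy if !missing(next_beginy)  // years covered by no spell
gen byte overlap = endy > next_beginy if !missing(next_beginy)  // more than the shared end year

count if gap == 1
count if overlap == 1

* Inspect the contradictory spells, if any
list pid beginy endy next_beginy spelltyp if overlap == 1, sepby(pid)
```

If the second count is zero, all failures of the original -assert- are gaps and the modified code above applies.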