Time dummies to estimate the length of relationships

Clyde Schechter

Join Date: Apr 2014
Posts: 29906

#61

06 Dec 2020, 10:24

I don't understand the variables spell_duration and total_duration_this pair, as they do not look like they could have been created by the code we have discussed in this thread. So I'm just going to ignore them.

If I understand your question correctly, you just want a variable that shows the total number of waves in which a couple reports being in a relationship, whether those waves are consecutive or not. If so, that's not too hard:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
100001  100002  1 1 1
100001  100002  2 1 1
100001  100002  3 1 1
100001  100002  4 1 1
100003  100004  1 1 1
100003  100004  2 1 1
100003  100004  3 1 1
100003  100004  4 1 1
100006 1000842 10 2 2
100006 1000842 11 2 2
100006 1000842 12 2 2
100006 1000842 13 2 2
100006 1000842 14 2 2
100006 1000842 15 1 1
100006 1000842 16 1 1
100006 1000842 17 1 1
100006 1000842 18 1 1
100007 1106359 11 2 2
100008  100009  1 1 1
100008  100009  2 1 1
100008  100009  3 1 1
100008  100009  4 1 1
100008  100009  5 1 1
100008  100009  6 1 1
100008  100009  7 1 1
100011  600231  6 2 2
100011  600231  7 2 2
100011  600231  8 2 2
100011  600231  9 2 2
100016  200179  2 2 2
100016  200179  3 2 2
100016  200179  4 2 2
100016  200179  5 1 1
100016  200179  6 1 1
100016  200179  7 1 1
100016  200179  8 1 1
100016  200179  9 1 1
100016  200179 10 1 1
100016 1800788 18 2 2
end

gen byte in_relationship = inlist(mrcurr1, 1, 2) if !missing(mrcurr1)
replace in_relationship = inlist(mrcurr2, 1, 2) if missing(in_relationship)

by id p_id (wave), sort: egen number_of_waves_in_relationship = total(in_relationship)


Chris Boulis

Join Date: Feb 2019

Posts: 355
#62

06 Dec 2020, 18:03

Hi Clyde Schechter. Thank you for that. (Actually, these two variables are part of this thread, e.g. #32 p.3). "number_of_waves_in_relationship" appears to provide the same values as "seq" (#56). Using -browse- I tried to check if this variable ever obtains different values than "seq" but it included observations when they were equal

Code:

br id p_id wave begin end seq number_of_waves_in_relationship mrcurr1 mrcurr2 if seq[_N] != number_of_waves_in_relationship

I'm not sure why? Do you have a suggestion?

I think with survival analysis, if a couple was to separate, then have another relationship then reunite with a previous partner, the count of their second time together should restart (from zero) as they did fail by definition. A continued count would seem more reasonable for a couple that did not change partners - would this code achieve that?

Code:

by id p_id (wave), sort: egen number_of_waves_in_relationship = total(in_relationship) if p_id == p_id[_n-1]

(Stata v.15.1).

Last edited by Chris Boulis; 06 Dec 2020, 18:09.
Clyde Schechter

Join Date: Apr 2014

Posts: 29906
#63

06 Dec 2020, 18:14

I'm not sure why? Do you have a suggestion?

Well, I would infer from this that there are couples for whom one or more of the following obtains:
1. seq does not start from 1
2. there are non-consecutive values of seq
3. there are duplicate values of seq

These situations might have arisen either during the creation of that variable--by now a dusty memory for me-- or as a result of subsequent data management either changing that variable or deleting observations or inserting new ones.
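A minimal sketch of how those three conditions might be checked, assuming seq is meant to run 1, 2, 3, ... within each id/p_id pair in wave order. The flag names (bad_start, bad_step, bad_couple) are hypothetical. Note also that in the -browse- command in #62, seq[_N] is not under a by: prefix, so it refers to the last observation of the whole data set rather than the last observation of each couple, which may be part of why observations with equal values were shown.

```stata
* Sketch only: assumes seq should count 1, 2, 3, ... within each id-p_id pair.
* bad_start, bad_step and bad_couple are hypothetical variable names.

* Flag couples whose seq does not start at 1 (a missing first seq is flagged too)
by id p_id (wave), sort: gen byte bad_start = (seq[1] != 1)

* Flag observations where seq is not exactly one more than the previous value;
* this catches both gaps and duplicates (left missing for the first observation)
by id p_id (wave): gen byte bad_step = (seq != seq[_n-1] + 1) if _n > 1

* Mark every observation of a couple that has any problem, then inspect them
by id p_id (wave): egen byte bad_couple = max(bad_start | bad_step == 1)
browse id p_id wave seq if bad_couple
```

Marking the whole couple rather than just the offending observation makes it possible to see the full sequence of seq values in context.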

I think with survival analysis, if a couple was to separate, then have another relationship then reunite with a previous partner, the count of second time together should restart (from zero) as they did fail by definition right?

My informed layman's opinion on this is that, yes, it would be sensible to restart from zero. But this is another question where you should seek the advice of an expert in your field: what is the more meaningful construct to study in pursuit of your research goals? Is it the total time a couple is in a relationship, or is it the separate duration of each episode of being in a relationship? That's not a statistical question.

The code you show will not do what you ask. What you want is:

Code:

by id (wave), sort: gen byte episode = sum(p_id != p_id[_n-1])
by id episode, sort: gen episode_duration = wave[_N] - wave[1] + 1
Chris Boulis

Join Date: Feb 2019

Posts: 355
#64

06 Dec 2020, 19:55

Thank you Clyde Schechter. With respect to

I would infer from this that there are couples for whom one or more of the following obtains:

Point 2 is correct. Is there a way I can test if this (any) code is working - potentially using -browse- or other function rather than searching through the full set of data? In particular, how could I ask Stata to show me all observations where "episode_duration != seq[_N]" (seq #56).

This code calculates the time between the first and last wave of any couple so long as they have the same id and p_id, correct? Can I control for gaps in observations? E.g., I have one couple only present in three waves (7, 14, 15) and this code calculates the duration as 9 (waves), whereas at most I'd count that as 3 waves. Can I take into account the length of a potential gap in the data (missing waves)? That is, anything longer than a two year gap would trigger a new count.

Last edited by Chris Boulis; 06 Dec 2020, 20:07.
Clyde Schechter

Join Date: Apr 2014

Posts: 29906
#65

06 Dec 2020, 21:00

Point 2 is correct. Is there a way I can test if this (any) code is working - potentially using -browse- or other function rather than searching through the full set of data? In particular, how could I ask Stata to show me all observations where "episode_duration != seq[_N]" (seq #56).

That would be easy enough to do, but it would not help you, because you really need to see the entire episode to know what's going on with seq.

Code:

by id episode (wave), sort: gen byte to_show = (episode_duration != seq[_N])
browse if to_show

Can I take into account the length of a potential gap in the data (missing waves)? That is, anything longer than a two year gap would trigger a new count.

Sure.

Code:

by id (wave), sort: gen byte episode = sum((p_id != p_id[_n-1]) | (wave > wave[_n-1] + 2))
by id episode, sort: gen episode_duration = wave[_N] - wave[1] + 1

Note: I assume 1 wave = 1 year as there is no year variable in play.

Chris Boulis

Join Date: Feb 2019
Posts: 355

#66

07 Dec 2020, 17:10

Thank you Clyde Schechter. Yes 1 wave = 1 year. Could you kindly explain the last two lines in #65. My guess of "episode" is (sorting id by wave) "sum the number of times (waves) that p_id appears unchanged" but it's written as "not equal" so I'm confused. Also I think the OR part may be why it's not picking up on changes in partners. The two wave gap limit is not being followed - should this condition be applied to the code for "episode_duration"?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(mrcurr1 mrcurr2 begin end to_show wave seq) float episode_duration
100005  700268 2 2 1 1 1  7  1  9
100005  700268 6 2 1 1 1 14  .  9
100005  700268 2 2 1 0 1 15  3  9
100138  400310 2 2 1 0 1  4  1 15
100138  400310 2 2 0 0 1 12  2 15
100138  400310 2 2 0 0 1 13  3 15
100138  400310 2 2 0 0 1 14  4 15
100138  400310 2 2 0 0 1 15  5 15
100138  400310 2 2 0 0 1 16  6 15
100138  400310 2 2 0 0 1 17  7 15
100138  400310 2 2 0 0 1 18  8 15
100016  200179 2 2 1 0 1  2  1 17
100016  200179 2 2 0 0 1  3  2 17
100016  200179 2 2 0 0 1  4  3 17
100016  200179 1 1 0 0 1  5  4 17
100016  200179 1 1 0 0 1  6  5 17
100016  200179 1 1 0 0 1  7  6 17
100016  200179 1 1 0 0 1  8  7 17
100016  200179 1 1 0 0 1  9  8 17
100016  200179 1 1 0 1 1 10  9 17
100016 1800788 2 2 1 0 1 18  1 17
100048  100049 1 1 1 0 1  1  1 11
100048  100049 1 1 0 0 1  2  2 11
100048  100049 1 1 0 0 1  3  3 11
100048  100049 1 1 0 0 1  4  4 11
100048  100049 1 1 0 0 1  5  5 11
100048  100049 1 1 0 1 1  6  6 11
100048  800809 2 2 1 0 1  8  1 11
100048  800809 2 2 0 0 1  9  2 11
100048  800809 2 2 0 0 1 11  3 11
100109  100110 1 1 1 0 1  1  1  5
100109  100110 1 1 0 0 1  2  2  5
100109  100110 1 1 0 0 1  3  3  5
100109  100110 1 1 0 0 1  5  4  5
end

The more important issue is that "end" is working well.

Code:

bys id (wave): gen byte end = ((p_id[_n+1] != p_id) | (spell_num[_n+1] != spell_num)) & _n < _N

Can you advise how I could test that "end==1" only due to relationship failure and not other reasons, such as missing waves or where the reported marital status is !inlist(mrcurr1, 1, 2)?

Code:

bys id end (wave): gen byte to_end = ?? 
browse to_end


Clyde Schechter

Join Date: Apr 2014
Posts: 29906

#67

07 Dec 2020, 17:59

First, I don't know how you arrived at the data you show in #66, but it wasn't by applying the code in #65 to the values of id, p_id, and wave in the example data from #66. When I do that, it gives very different results for episode_duration. So you must have done something rather different. Look:

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input long(id p_id) byte(mrcurr1 mrcurr2 wave)

               id          p_id   mrcurr1   mrcurr2      wave
  1. 100005  700268 2 2  7
  2. 100005  700268 6 2 14
  3. 100005  700268 2 2 15
  4. 100016  200179 2 2  2
  5. 100016  200179 2 2  3
  6. 100016  200179 2 2  4
  7. 100016  200179 1 1  5
  8. 100016  200179 1 1  6
  9. 100016  200179 1 1  7
 10. 100016  200179 1 1  8
 11. 100016  200179 1 1  9
 12. 100016  200179 1 1 10
 13. 100016 1800788 2 2 18
 14. 100048  100049 1 1  1
 15. 100048  100049 1 1  2
 16. 100048  100049 1 1  3
 17. 100048  100049 1 1  4
 18. 100048  100049 1 1  5
 19. 100048  100049 1 1  6
 20. 100048  800809 2 2  8
 21. 100048  800809 2 2  9
 22. 100048  800809 2 2 11
 23. 100109  100110 1 1  1
 24. 100109  100110 1 1  2
 25. 100109  100110 1 1  3
 26. 100109  100110 1 1  5
 27. 100138  400310 2 2  4
 28. 100138  400310 2 2 12
 29. 100138  400310 2 2 13
 30. 100138  400310 2 2 14
 31. 100138  400310 2 2 15
 32. 100138  400310 2 2 16
 33. 100138  400310 2 2 17
 34. 100138  400310 2 2 18
 35. end

.
. by id (wave), sort: gen byte episode = sum((p_id != p_id[_n-1]) | (wave > wave[_n-1] + 2))

. by id episode, sort: gen episode_duration = wave[_N] - wave[1] + 1

.
. list, noobs clean

        id      p_id   mrcurr1   mrcurr2   wave   episode   episod~n  
    100005    700268         2         2      7         1          1  
    100005    700268         6         2     14         2          2  
    100005    700268         2         2     15         2          2  
    100016    200179         2         2      2         1          9  
    100016    200179         2         2      3         1          9  
    100016    200179         2         2      4         1          9  
    100016    200179         1         1      5         1          9  
    100016    200179         1         1      6         1          9  
    100016    200179         1         1      7         1          9  
    100016    200179         1         1      8         1          9  
    100016    200179         1         1      9         1          9  
    100016    200179         1         1     10         1          9  
    100016   1800788         2         2     18         2          1  
    100048    100049         1         1      1         1          6  
    100048    100049         1         1      2         1          6  
    100048    100049         1         1      3         1          6  
    100048    100049         1         1      4         1          6  
    100048    100049         1         1      5         1          6  
    100048    100049         1         1      6         1          6  
    100048    800809         2         2      8         2          4  
    100048    800809         2         2      9         2          4  
    100048    800809         2         2     11         2          4  
    100109    100110         1         1      1         1          5  
    100109    100110         1         1      2         1          5  
    100109    100110         1         1      3         1          5  
    100109    100110         1         1      5         1          5  
    100138    400310         2         2      4         1          1  
    100138    400310         2         2     12         2          7  
    100138    400310         2         2     13         2          7  
    100138    400310         2         2     14         2          7  
    100138    400310         2         2     15         2          7  
    100138    400310         2         2     16         2          7  
    100138    400310         2         2     17         2          7  
    100138    400310         2         2     18         2          7

So you can see the values of episode duration are very different from what you showed.

Anyway, to explain the last two lines of code in #65, I am defining episodes of relationship to be groups of observations where the id and p_id remain the same and there are no gaps of greater than 2 waves between consecutive observations. So the first line of code runs through each group of observations with the same id (but not necessarily the same p_id) and defines a new episode to begin either when the p_id changes or when the wave number jumps by more than 2. The second line calculates the duration of the relationship as the interval between the first and last wave included in the episode.
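To make the "not equal" logic concrete, here is a small sketch that splits the first line into two steps, using the three observations for id 100005 from the listing above. The intermediate variable new_episode is a hypothetical name introduced only for illustration: it is 1 on any observation that starts a new episode, and the running sum of those 1s numbers the episodes.

```stata
* Sketch only: new_episode is a hypothetical helper variable.
clear
input long(id p_id) byte wave
100005 700268  7
100005 700268 14
100005 700268 15
end

* 1 when the partner changes or the wave jumps by more than 2.
* For the first observation of each id, p_id[_n-1] is missing, so the
* comparison p_id != p_id[_n-1] is true and an episode always starts there.
by id (wave), sort: gen byte new_episode = (p_id != p_id[_n-1]) | (wave > wave[_n-1] + 2)

* The running sum of the start flags gives the episode number
by id (wave): gen byte episode = sum(new_episode)

list, noobs clean
```

Here new_episode is 1, 1, 0 (wave 14 is more than 2 waves after wave 7, while wave 15 is not more than 2 after wave 14), so episode is 1, 2, 2, matching the listing above.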

Can you advise how I could test that "end==1" only due to relationship failure and not other reasons, such as missing waves or where the reported marital status is !inlist(mrcurr1, 1, 2)?

I'm not sure how you define a relationship failure, but the way to see what is going on just around an observation that has end = 1 is:

Code:

isid id p_id wave, sort
gen byte to_browse = inlist(1, end, end[_n-1], end[_n+1])
browse if to_browse

This will give you a view of the observations that have end == 1 along with the observations immediately preceding and following.


Chris Boulis

Join Date: Feb 2019

Posts: 355
#68

07 Dec 2020, 21:07

Wonderful Clyde Schechter. Yes, something went amiss when I ran the code, my apologies. I now obtain the same results as you show. Thank you also for clarifying the code - that makes sense. Could "episode_duration" be a time variable? Or must the time variable vary by couple (as "wave" does - my current timevar)?

I'm not sure how you define a relationship failure

I define relationship failure as "end==1".

To check my understanding of the final piece of code: (1) checks whether id, p_id and wave uniquely identify the observations, and sorts on all three. (2) generates a variable that shows the observations (on, before and after) when end==1. Kind regards, Chris
Clyde Schechter

Join Date: Apr 2014

Posts: 29906
#69

07 Dec 2020, 21:30

Could "episode duration" be a time variable?

Yes, it could. The question is whether it is the right time variable for your purposes. I gather your goal is to study factors affecting the duration of relationships. You can conceptualize relationships in different ways. One way is to just look at the time from the first day to the last, ignoring whether in between there may have been gaps, or even gaps where one or both partners partnered with somebody else. That's one way to do it: that's not what episode_duration measures.

Another way to think about the length of a relationship is to say that it extends only so long as the couple remains continuously partnered with each other, with no gaps (or only short gaps due to missing data, not due to timeout from the relationship) or interludes with other people. In that view, the first view of the length of a relationship would sometimes be broken into shorter pieces. And each of those shorter pieces would be a separate "episode" whose duration might be the time variable you are interested in.

If you were to use episode duration as your time variable, then you would have to account for the fact that some episodes involve the same people -- so these are not independent observations. That lack of independence necessitates modifying the analysis to reflect it: like a frailty factor in a Cox proportional hazards model, or a random effect in a parametric survival model.

Whether you want the total time from first to last day of a relationship as your outcome variable, or the episode duration, is not a question I can help you answer. It is a non-statistical issue: it's about what you're actually trying to study. And I would expect that the results of analyses using these two different outcome variables would differ materially, as factors that would count for prolonging episodes in an on and off relationship might be different from factors that support the overall longevity of the relationship. So you have to think about which concept is the one you are trying to learn about.

Concerning your understanding of the final piece of code, you have that correct.
Chris Boulis

Join Date: Feb 2019

Posts: 355
#70

09 Dec 2020, 17:23

Thanks Clyde Schechter. Yes correct. And yes I do define the duration of a relationship as you state:

it extends only so long as the couple remains continuously partnered with each other, with no gaps (or only short gaps due to missing data, not due to timeout from the relationship) or interludes with other people

What do you mean by:

In that view, the first view of the length of a relationship would sometimes be broken into shorter pieces. And each of those shorter pieces would be a separate "episode" whose duration might be the time variable you are interested in

In my case, does a separate "episode" refer to a separate year a couple is together, or in my case, a wave, therefore, "wave" is a time variable?

Based on my definition (as quoted above), would it be reasonable to use "wave" as my time variable OR given that "episode_duration" counts the short gaps in missing data, not captured by "wave", would it likely act as a better time variable? To that end, if I was to use episode_duration, I would enter it as my timevar in stset? When I did so

Code:

stset episode_duration, id(couple) failure(end==1) origin(begin==1) exit(time .)

and Stata provided:

Code:

------------------------------------------------------------------------------
     87,309  total observations
     84,827  multiple records at same instant               PROBABLE ERROR
             (episode_duration[_n-1]==episode_duration)
      2,482  observations end on or before enter()
Clyde Schechter

Join Date: Apr 2014

Posts: 29906
#71

12 Dec 2020, 21:52

What do you mean by:
In that view, the first view of the length of a relationship would sometimes be broken into shorter pieces. And each of those shorter pieces would be a separate "episode" whose duration might be the time variable you are interested in

In my case, does a separate "episode" refer to a separate year a couple is together, or in my case, a wave, therefore, "wave" is a time variable?

I mean that with this conceptualization, an episode would refer to a series of waves during which the couple remains in a relationship, where that series of waves is entirely consecutive or at worst has only small (i.e., at most 2-year) gaps.

Based on my definition (as quoted above), would it be reasonable to use "wave" as my time variable OR given that "episode_duration" counts the short gaps in missing data, not captured by "wave", would it likely act as a better time variable? To that end, if I was to use episode_duration, I would enter it as my timevar in stset? When I did so

Well, the first question I must ask is whether or not your analysis will include time-varying predictors. If not, then the simplest approach is to reduce the data set to a single observation per episode:

Code:

by id p_id episode (wave), sort: keep if _n == _N // (wave) ensures the last wave of each episode is the one kept
egen couple_id = group(id p_id)
stset episode_duration, failure(end == 1)

and then account for multiplicity of episodes within the same couple by adding the -shared(couple_id)- option to your -stcox- command. (If you are planning to use a parametric survival analysis instead of -stcox-, then use -mestreg- and specify random effects for id crossed with p_id and episode nested within. See below for more details.)

But if you have time-varying predictors you need multiple observations per episode, and then you would not use episode_duration as the time variable in -stset-. In fact, it would play no role at all. And you would need to specify the actual episode as the -id()- variable. It would look something like this:

Code:

egen couple_episode = group(id p_id episode)
by couple_episode (wave), sort: gen origin_time = wave[1] // first wave of the episode
stset wave, id(couple_episode) failure(end == 1) origin(origin_time)

Now, the complication introduced here is that your couple_episode units are not independent, because sometimes the same couple has multiple episodes. For that matter, the same person may have episodes with multiple partners. So this structure gets complicated. If you are planning a Cox proportional hazards model, I think the least incorrect way to deal with this is to:

Code:

egen couple_id = group(id p_id)

and use the -shared(couple_id)- option.
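As a rough sketch, the resulting command might look like the following. The covariates x1 and x2 are hypothetical placeholders for whatever predictors you are studying, and the data are assumed to have already been -stset- as shown earlier in this post.

```stata
* Sketch only: x1 and x2 are hypothetical covariates, not variables from this thread.
* Assumes the data are already -stset- and couple_id was created with -egen, group()-.
stcox x1 x2, shared(couple_id)
```

The -shared()- option fits a shared-frailty Cox model, so episodes belonging to the same couple share a common frailty term.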

This doesn't account for the problem of the same person having episodes with different partners. But I think it's the best you can do within the confines of -stcox-. If you are going to do a parametric survival model, however, you can use -mestreg- and specify an appropriate nesting structure:

Code:

mestreg whatever || _all: R.id || p_id: || episode: