setting my data up to calculate transitional probabilities, problem with part of the code (gen var=f.status)

Rose Matthews

Join Date: Aug 2023
Posts: 154


18 Mar 2024, 07:52

I was first using a test dataset; I am now using a dummy dataset that represents my research data.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id str7 event float(treatment dead revised year op) long status float nextyr
 1 "op"      1 1 1 2001 1 2 .
 1 "revised" 1 1 1 2004 1 3 1
 1 "death"   1 1 1 2005 1 1 .
 2 "op"      0 0 1 2001 1 2 .
 2 "revised" 0 0 1 2007 1 3 .
19 "op"      0 1 0 2008 1 2 .
19 "death"   0 1 0 2016 1 1 .
45 "op"      0 0 1 2005 1 2 .
45 "revised" 0 0 1 2008 1 3 .
46 "op"      1 1 0 2007 1 2 .
46 "death"   1 1 0 2020 1 1 .
54 "op"      1 0 0 2001 1 2 .
76 "op"      1 1 0 2009 1 2 .
76 "death"   1 1 0 2015 1 1 .
89 "op"      1 1 0 2006 1 2 .
89 "death"   1 1 0 2010 1 1 .
end
format %ty year
label values treatment q1
label def q1 0 "control", modify
label def q1 1 "treatment", modify
label values dead q2
label def q2 0 "alive", modify
label def q2 1 "dead", modify
label values revised q3
label def q3 0 "success", modify
label def q3 1 "revised", modify
label values status status
label def status 1 "death", modify
label def status 2 "op", modify
label def status 3 "revised", modify

Code:

// start transition probabilities

// create a dataset of probabilities using the example data
// declare the data to be panel data

xtset id year, yearly

// takes the value of status in the following row -- as you can see from the data provided in dataex, this only works for observation 2, id = 1.

generate nextyr=f.status

// My question: why doesn't, e.g., id = 2, observation 4 take the value nextyr = 3?
// The same can be said for id = 19 (observations 6 and 7): I would have expected nextyr in observation 6 to be 1.
// Why is this not happening?

// Just FYI, this is my plan for the rest of the data

drop if missing(nextyr)
generate f = 1

// count the transitions from status to nextyr (the new state) within each year
collapse (sum) f, by(year status nextyr)

// total count of transitions out of each status within each year
bysort year status: egen all = total(f)

// divide the count of transitions (f) by the total count of transitions from the same starting state (all);
// this gives the proportion of transitions to each possible next state, conditional on the current state and year
generate p = f/all

// review intermediate output
// format to 3 decimal places (9 characters wide)
format %9.3f p

Last edited by Rose Matthews; 18 Mar 2024, 08:04.


Rose Matthews

Join Date: Aug 2023

Posts: 154
#2

18 Mar 2024, 08:33

Ah, I figured out why this is not working.

The reason is that I have gaps between years.
The only one which doesn't have a gap is the one with the red arrow, and that is why the following code behaves as expected there:

Code:

generate nextyr=f.status

Any thoughts on how I can make 'nextyr' take the status of the next row within the same id, even if there are gaps in 'year'?

The aim is to calculate the transitional probabilities. If death/revision has not taken place, the patient is assumed to remain in the same state ('op' = alive).
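
To illustrate the behaviour, here is a minimal sketch on a made-up three-row panel (the values are hypothetical, not my research data). As far as I can tell, F.status looks exactly one year ahead, so it is missing whenever the next record is more than one year away. Note that -clear- wipes the data in memory, so run this in a fresh session.

```stata
* Toy panel (hypothetical values): one id, records in 2001, 2004 and 2005
clear
input float(id year status)
1 2001 2
1 2004 3
1 2005 1
end
xtset id year, yearly

* F.status is the value of status at year + 1, not the value in the next row
generate nextyr = F.status
list id year status nextyr, sepby(id)

* 2001 has no 2002 record, so nextyr is missing there; 2004 -> 2005 is found
assert missing(nextyr) if year == 2001
assert nextyr == 1 if year == 2004
```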
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

18 Mar 2024, 09:59

So you have to create a new variable that is sequential when the data are in chronological order within id.

Code:

by id (year), sort: gen int seq = _n
xtset id seq
gen nextyr:status = F.status
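
A small sketch of the same idea on a made-up panel (hypothetical values), which also keeps track of how many calendar years separate consecutive records. The names next_status and years_to_next are only illustrative. Note that -clear- wipes the data in memory.

```stata
* Toy panel (hypothetical values): one id with gaps between years
clear
input float(id year status)
1 2001 2
1 2004 3
1 2005 1
end

* Number the records within id in chronological order, then xtset on that counter
by id (year), sort: generate int seq = _n
xtset id seq

* F. now refers to the next record, whatever its calendar year
generate next_status = F.status
generate years_to_next = F.year - year

assert next_status == 3 & years_to_next == 3 if year == 2001
assert next_status == 1 & years_to_next == 1 if year == 2004
```

Keeping years_to_next around makes it visible that the "next" record may be several years away, which matters if the result is later read as a one-year transition.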
Rose Matthews

Join Date: Aug 2023

Posts: 154
#4

18 Mar 2024, 11:17

Thanks for this insight. However, I do have an additional question.

How can I differentiate between the observations that are missing in the new variable -nextyr-?

E.g. ID = 1, year 2005: nextyr = . (patient is dead, so this can be dropped; the death is already recorded in the 2004 row)

vs.

ID = 54, year 2001: nextyr = . (patient is alive = success story)

(1) How can I keep those that are missing but alive (success story), as opposed to those that are missing because they are dead and have exited the study?

(2) If I may ask another question, which I have touched upon in another post (and perhaps raised in a different manner in post 2), as seen here:

https://www.statalist.org/forums/for...-gaps-in-years

Does it matter if there are gaps in the years when calculating transitional probabilities for a markov model?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

18 Mar 2024, 11:44

Yes, it matters a great deal if there are gaps in the years when calculating transitional probabilities for a Markov model. Evidently if it takes three years to go from state A to state B, the transition probability is different than if you go from state A to state B in 1 year. Now, I don't know what program you are using to do these calculations, or if you are crafting your own code to do it. But gaps in the years should either be prohibited (i.e. you must fill in the gaps before running the program), or the code should account for the gaps in the calculation. I have never had occasion to use Stata for this particular purpose, so I can't advise you more specifically than that about this. Assuming you are using a pre-existing program, you should consult the help file to understand how it deals with this situation.
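
To make the point about gap length concrete, here is a small worked example. The assumptions are mine for illustration, not taken from your data: a time-homogeneous chain with one-year transition matrix $P$, and an annual probability $p$ of moving from state A to an absorbing state B, with no other way out of A.

```latex
P^{(k)} = P^{k}, \qquad
\Pr(A \to B \text{ within } k \text{ years}) = 1 - (1 - p)^{k}
% Example: p = 0.10, k = 3 gives 1 - 0.9^{3} = 1 - 0.729 = 0.271
```

Under those assumptions, a transition observed across a three-year gap estimates something close to 0.271, not the one-year probability of 0.10.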

I don't understand your first question. The patient who died has an observation in which they show status = dead. The other one does not. That is how they differ. I don't know why you would drop the observation showing the death from your study, but perhaps that has something to do with the particulars of the program you are using to calculate transition probabilities.

Rose Matthews

Join Date: Aug 2023
Posts: 154
#6

18 Mar 2024, 11:50

Thanks for this. As you can see from post 1, that was my code for calculating transition probabilities in Stata... why would I use other software?

However, I suppose the year gaps are a problem I need to account for. From your post 5, I take it you don't have any further advice on what else I can do to address the gaps?

Code:

// start transition probabilities
// create a dataset of probabilities using the example data
// declare the data to be panel data

xtset id year, yearly

// takes the value of status in the following row -- as you can see from the data provided in dataex, this only works for observation 2, id = 1.

generate nextyr=f.status


// Just FYI, this is my plan for the rest of the data

drop if missing(nextyr)
generate f = 1

// count the transitions from status to nextyr (the new state) within each year
collapse (sum) f, by(year status nextyr)

// total count of transitions out of each status within each year
bysort year status: egen all = total(f)

// divide the count of transitions (f) by the total count of transitions from the same starting state (all);
// this gives the proportion of transitions to each possible next state, conditional on the current state and year
generate p = f/all

// review intermediate output
// format to 3 decimal places (9 characters wide)
format %9.3f p


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

18 Mar 2024, 12:26

I hadn't looked at your code in #1, I was picking up from #2.

I didn't think you were using other software for the transition probabilities. I thought you might have used some Stata program, perhaps something user-written, for the purpose.

For your over-arching problem of computing transition probabilities, I think the most straightforward way to get this right is to fill in the missing years. Now, you have to make some assumption about what happened during the missing years. I think for the situations you are working with here, you can fairly assume that the status in any given year that was not directly observed is the same as the status in the most recent preceding observed year. In effect, you are assuming that you have observed all ops, revisions, and deaths: there were no such events that occurred that were not in the original data. Then from there you can calculate the transition probabilities from each state to each other state in each year by:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id str7 event float(treatment dead revised year op) long status
 1 "op"      1 1 1 2001 1 2
 1 "revised" 1 1 1 2004 1 3
 1 "death"   1 1 1 2005 1 1
 2 "op"      0 0 1 2001 1 2
 2 "revised" 0 0 1 2007 1 3
19 "op"      0 1 0 2008 1 2
19 "death"   0 1 0 2016 1 1
45 "op"      0 0 1 2005 1 2
45 "revised" 0 0 1 2008 1 3
46 "op"      1 1 0 2007 1 2
46 "death"   1 1 0 2020 1 1
54 "op"      1 0 0 2001 1 2
76 "op"      1 1 0 2009 1 2
76 "death"   1 1 0 2015 1 1
89 "op"      1 1 0 2006 1 2
89 "death"   1 1 0 2010 1 1
end
format %ty year
label values treatment q1
label def q1 0 "control", modify
label def q1 1 "treatment", modify
label values dead q2
label def q2 0 "alive", modify
label def q2 1 "dead", modify
label values revised q3
label def q3 0 "success", modify
label def q3 1 "revised", modify
label values status status
label def status 1 "death", modify
label def status 2 "op", modify
label def status 3 "revised", modify

// FILLIN MISSING YEARS, CARRYING STATUS FORWARD
xtset id year
tsfill
by id (year), sort: replace status = L1.status if missing(status)

gen next_status:status = F1.status
drop if missing(next_status)

collapse (count) n_transitions = id, by(year status next_status)
by year status: egen all_transitions_out = total(n_transitions)
gen transition_probability = n_transitions/all_transitions_out

Note: Calculating separate transition probabilities for each year is relying on very scanty data. Unless you have good reasons to believe that the probabilities really do change from one calendar year to the next, I recommend just calculating a single set of transition probabilities for all years. For that, you don't even have to write all of this code. After you have done the -tsfill- and -...replace status...- commands you can just run -xttrans status- and get the results directly (as percentages, not probabilities). If you really do need different transition probabilities for each year, then I strongly recommend getting a richer data set before proceeding.
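
As a rough illustration of how thin the cells can be, -xttrans- accepts a -freq- option that prints the counts behind each percentage. A minimal sketch on a made-up, gap-free panel (the ids, years, and statuses are hypothetical; -clear- wipes the data in memory):

```stata
* Toy gap-free panel (hypothetical values)
clear
input float(id year status)
1 2001 2
1 2002 2
1 2003 3
2 2001 2
2 2002 1
end
xtset id year

* Pooled over all years: rows are the current status, columns the next status;
* freq adds the underlying counts to the row percentages
xttrans status, freq

* The same counts by hand: three transitions leave state 2, one to each state
generate next_status = F.status
count if status == 2 & !missing(next_status)
assert r(N) == 3
```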
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

18 Mar 2024, 12:43

On further thought, rather than dropping the final observation for each id due to lack of information about the next status, if we take seriously the assumption that the original data is not missing any revisions or deaths, then it is fair to assume that for the final observation the next status will simply be the same as the current status: if there were a revision or death, there would have been another observation to show that. (I realize that it is a bit tenuous to assume that no transitions have gone unobserved, but I think without that assumption, given all the gaps in the data, you would be in no realistic position to calculate transition probabilities at all, and although the assumption is probably not strictly true, given the particular events in question here, it is not altogether unreasonable.)

So, if you want to go this route, the code changes slightly to:
Code:

// FILLIN MISSING YEARS, CARRYING STATUS FORWARD
xtset id year
tsfill
by id (year), sort: replace status = L1.status if missing(status)

// xttrans status

isid id year, sort
gen next_status:status = F1.status
replace next_status = cond(!dead, status, `="death":status') ///
    if missing(next_status)

collapse (count) n_transitions = id, by(status next_status)
by status: egen all_transitions_out = total(n_transitions)
gen transition_probability = n_transitions/all_transitions_out
Rose Matthews

Join Date: Aug 2023

Posts: 154
#9

20 Mar 2024, 11:18

Dear Clyde,

As always, thanks for your insight.
I've been chewing over your answers for the past 48 hours.

I do want to clarify what you mention here (post 7):

"I recommend just calculating a single set of transition probabilities for all years."

Do you mean calculating one set of transition probabilities for treatment (1) and one for control (0), covering 2003-2021, for status = alive, status = dead, and status = revised (complete dataset)?

That is, ending up with simply 3 values (dead, alive, revised) for treatment = 1 and another 3 values for treatment = 0

(which is really what I would like to carry forward into my Markov model).

However, I'm not sure if I understood you correctly.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

20 Mar 2024, 11:58

Yes, that is what I meant. Sorry for not being clearer about that.
Rose Matthews

Join Date: Aug 2023

Posts: 154
#11

20 Mar 2024, 23:54

Thank you for clarifying. Indeed, using the code below, rather than all the code I generated above, is so much simpler, considering I just need one set of transition probabilities for treatment (1) and one for control (0), covering 2003-2021, for status = alive, status = dead, and status = revised (complete dataset), therefore ending up with simply 3 values (dead, alive, revised) for treatment = 1 and another 3 values for treatment = 0.

I have used the code below with the dataset provided in post 1

Code:

xtset id year
tsfill
by id (year), sort: replace status = L1.status if missing(status)
xttrans status

I have obtained the following output:

Using the presentation by Peter Austin
https://www.stata.com/meeting/boston...14_nichols.pdf

I just wanted to check that my interpretation is correct:
Transition probability from State 1 to 2: 11.54% therefore 0.12
Transition probability from State 1 to 3 : 100% therefore 1.00
Transition probability from State 2 to 2 : 84.62%
Transition probability from State 2 to 3: 0.00%
Transition probability from State 3 to 2: 3.85%
Transition probability from state 3 to 3: 0.00%

Is this correct?

Q1. However what is the probability of remaining in State 1....?

My mistake here is that I kept both treatment and control in the same dataset, when I should have run the code in this post after first running:

Code:

keep if treatment == 1 // this would give me the transition probabilities (as percentages) for treatment == 1; then I repeat the above with treatment == 0 (giving me the probabilities for the control)

Q2. Do you agree with this?

Appreciate your insight, many thanks
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

21 Mar 2024, 09:36

You have used the code correctly. But your interpretation of the -xttrans- output has things reversed. The cells give the probability of transition from the state in the rowstub to the state in the column header. So, the probability of transition from state 2 to state 1 is 11.54%, from state 2 to state 2 (i.e. stay in state 2) is 84.62%, and from state 2 to state 3 is 3.85%. Similarly, the probability of transition from state 3 to state 1 is 100% and to either state 2 or state 3 is 0%. As for transitions out of state 1, none were observed in the data, so nothing is reported in the -xttrans- output. This is not surprising since earlier in the thread we can see that state 1 is "dead." Dead is an absorbing state: once you are dead, you stay dead.
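
To make the row-to-column reading explicit, here is a minimal sketch on a made-up, gap-free panel (hypothetical values; -clear- wipes the data in memory). It builds the next status by hand and tabulates it with row percentages, which is the same orientation that -xttrans- uses: each row is a starting state and its percentages sum to 100 across the columns.

```stata
* Toy gap-free panel (hypothetical values)
clear
input float(id year status)
1 2001 2
1 2002 2
1 2003 3
1 2004 1
2 2001 2
2 2002 1
end
xtset id year
generate next_status = F.status

* Row percentages: each row is a starting state, each column a destination
tabulate status next_status, row nofreq

* Check by hand: 3 transitions start in state 2, and 1 of them ends in state 1
count if status == 2 & !missing(next_status)
local from2 = r(N)
count if status == 2 & next_status == 1
assert r(N) == 1 & `from2' == 3
```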

That said, I think there is a problem with your data. It is strange to observe that from state 3 ("revised") there is 100% probability of transition to state 1 ("dead".) The implication is that nobody ever survives a revision! Looking at your example data, I see that whenever an id comes to state 3, that is the final observation of their data--you have no follow-up on anybody beyond the revision. Given the results you got for the transition probabilities, it seems likely that the same is true in the entire data set. You need, therefore, to augment the data. The best way to do that would be to go back and get real data on what happened to these people in the year(s) after their revisions. If that is not feasible, then for every person whose data ends in a state other than death, you need to add another observation with a new state: lost to follow-up.

Code:

// FILLIN MISSING YEARS, CARRYING STATUS FORWARD
xtset id year
tsfill
by id (year), sort: replace status = L1.status if missing(status)

// ADD LOST TO FOLLOW-UP AS FINAL STATE FOR THOSE NOT DEAD AT END OF DATA
label define status 0 "ltfu", add
by id (year): gen expander = cond(_n == _N & status != 1, 2, 1)
expand expander
by id (year), sort: replace year = year +1 if _n == _N & expander == 2
by id (year), sort: replace status = 0 if _n == _N & status != 1

xttrans status

As for your second question, yes, you should have done this separately for treatment == 1 and treatment == 0. To make this work, however, the above code needs some additional modification to spread the value of treatment to all of the filled in observations. So, it becomes:

Code:

// FILLIN MISSING YEARS, CARRYING STATUS FORWARD
xtset id year
tsfill
by id (treatment), sort: replace treatment = treatment[1]
by id (year), sort: replace status = L1.status if missing(status)

// ADD LOST TO FOLLOW-UP AS FINAL STATE FOR THOSE NOT DEAD AT END OF DATA
label define status 0 "ltfu", add
by id (year): gen expander = cond(_n == _N & status != 1, 2, 1)
expand expander
by id (year), sort: replace year = year +1 if _n == _N & expander == 2
by id (year), sort: replace status = 0 if _n == _N & status != 1

xttrans status if treatment
xttrans status if !treatment
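
If you would rather have the results as a dataset of probabilities for each arm (for example, to feed into a Markov model) than as -xttrans- percentages, something along the following lines might work. This is only a sketch on a made-up, gap-free panel; the variable names n, total_out, p, and check are illustrative, and -clear- wipes the data in memory.

```stata
* Toy gap-free panel (hypothetical values); treatment is constant within id
clear
input float(id year status treatment)
1 2001 2 1
1 2002 2 1
1 2003 3 1
2 2001 2 0
2 2002 1 0
3 2001 2 0
3 2002 2 0
end
xtset id year
generate next_status = F.status
drop if missing(next_status)

* One row per (treatment, from-state, to-state) combination, with its count
contract treatment status next_status, freq(n)
bysort treatment status: egen total_out = total(n)
generate p = n / total_out

* Within each arm and starting state the probabilities should sum to 1
bysort treatment status: egen check = total(p)
assert abs(check - 1) < 1e-6
list, sepby(treatment status)
```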