wide to long...but different.

mathieu nacher

Join Date: Jan 2019

Posts: 38
#1

31 Oct 2019, 13:51

Hello, I am working with a data set that I would like to analyze using the st commands. However, the data I was given are not in the right format.
There is one line per patient, with several dates that correspond to different types of events. -reshape long- would apply if the variables all measured the same thing (say, hemoglobin at day 0, 1, 10, ..., n),
but here there are failure events and different types of measurements, and each has its own date variable.
Any suggestions as to how to reshape the data set so that I have a single time variable, with the different events all referring to it?
Thank you,
Mathieu
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

31 Oct 2019, 19:12

I can't understand your situation. If you want to -stset- your data you need to have a computable definition of the failure event and a way of identifying the time at which either the event occurred or the entity under observation was censored. You apparently have observations containing numerous time points and numerous events, but you have given no indication which, if any, of these events constitutes a failure, nor any way to identify censoring.

So please post back with a clearer explanation. Even the clearest explanation will almost surely be inadequate without also showing an example of your data. To show that, please use the -dataex- command.
If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
mathieu nacher

Join Date: Jan 2019

Posts: 38
#3

01 Nov 2019, 07:27

Hello, thanks for your reply. The data set is very large and has several possible failure events, depending on the question. The person who collected the data used an Excel file with one line per patient and several date variables that correspond to clinical events, biological results, or treatment events. What I would like is a long data set with a single time variable and the potential failure events (type 1, type 2, type 3, ...), so that I can -stset- to failure type 2 or type 3 depending on my research question.
I have made a data set with 3 time variables: the first is treatment, the second is becoming at risk, and the third is the failure event. If I wish to look at time to failure, I would want the third as the failure variable; if I want time to becoming at risk after treatment, I would use the second as the failure variable.
I hope this is clearer. Below is a very simplified sample (I actually have 15 time variables for 15 different things).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str78 patnume int(date_treatment datebecomesrisk datefailureevent)
"5698" 20949 13815 .
"87498963" 14368 14340 .
"1441" 20370 14962 .
"152568996" 19796 13170 13599
"589641750" 18044 14360 .
"227382839" 18156 14846 15487
"343151272" 21035 13625 .
"342144251" 20923 18441 18590
"247127712" 19318 16706 .
"518520534" 15939 15953 .
"1420" 21117 15383 16832
"880888223" 20563 13478 13534
"385851622" 18770 16595 17463
"421101868" 16301 16421 .
"2610" 20972 17469 .
"404781520" 21112 15475 15690
"143866957" 21130 17059 .
"393688381" 20867 15775 16156
"843654811" 17371 15553 .
"987209498" 20361 14374 .
end
format %tdnn/dd/CCYY date_treatment
format %tdnn/dd/CCYY datebecomesrisk
format %tdnn/dd/CCYY datefailureevent
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#4

01 Nov 2019, 22:29

Thank you for showing the example data. This is one of those unusual instances where I think that you should retain the wide layout you currently have. You can do the appropriate -stset- commands with the data just as it is, making use of the origin() and enter() options of that command. For example, if you want to look at time to becoming at risk after treatment, it would be:

Code:

stset datebecomesrisk, origin(date_treatment)

If you want to analyze time from becoming at risk to failure event it would be

Code:

stset datefailureevent, origin(datebecomesrisk)

For each of the intervals you wish to analyze, there will be an appropriate pair of variables to plug into the -stset- command in this way.

You could reshape the data to long and use the multiple observations per patient version of -stset-, but that is just a lot more complicated and offers you no advantage that I can see here.
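
One point the commands above leave open is censoring: -stset- excludes any observation whose time variable is missing, so patients with no recorded failure date drop out instead of being treated as censored. Here is a minimal sketch of one way to handle that. It assumes, hypothetically, that follow-up ends for everyone on a single administrative cutoff date; the cutoff value and the variable names failed and exitdate are illustrative and do not come from the data.

Code:

* ASSUMPTION: follow-up ends for every patient on this (made-up) cutoff date
local cutoff = td(31dec2018)

* 1 if a failure date was recorded, 0 if the patient is censored
gen byte failed = !missing(datefailureevent)

* exit date: the failure date if observed, otherwise the cutoff
gen int exitdate = cond(failed, datefailureevent, `cutoff')
format %td exitdate

* time runs from becoming at risk to failure or censoring
stset exitdate, failure(failed) origin(time datebecomesrisk)

If each patient has a date of last contact, that variable would take the place of the single cutoff.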
mathieu nacher

Join Date: Jan 2019

Posts: 38
#5

02 Nov 2019, 03:31

Hello, wide works for 3 dates, but if I have 10 dates (each with a different variable name) for 10 potential failures, plus time-varying explanatory variables that I wish to include in a Cox model, I need to go to the long format with multiple observations per patient. But this is not like the usual -reshape- situation (there is no common stub), so there seems to be no way to reshape the data.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#6

02 Nov 2019, 15:58

I guess I don't understand what you want to do. You can't have 10 different failure events in a single analysis, unless you are doing a competing risks model. So each failure event must be analyzed separately and requires a new -stset- command no matter how the data are laid out.
mathieu nacher

Join Date: Jan 2019

Posts: 38
#7

03 Nov 2019, 02:14

Some of the time variables are potential failure events that can be analyzed in different models, but most of them are the dates of biological measurements, which I would like to include in a Cox model as time-varying adjustment variables.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#8

03 Nov 2019, 17:03

OK, in that case it really is a standard reshape wide-to-long, followed by an -stset- for multiple records per subject. You mentioned in the beginning that reshaping is difficult because the variable names are idiosyncratic rather than having a pattern. The solution is to rename them so as to create a pattern. If you use -dataex- to post an example of your data, I can illustrate the code.

If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

mathieu nacher

Join Date: Jan 2019
Posts: 38
#9

04 Nov 2019, 09:26

OK, thanks for your help! Here is a simplified extract:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int(datestartarv dateprurigo) float(cd4real cd8real cd4date cd8date datecureprurigo viralloaddate viralload)
     5698 20949 13815    74    . 13767     . .      .    .
 87498963 14368 14340    12   72     .     . .      .    .
     1441 20370 14962   263 2790     .     . .      .    .
152568996 19796 13170   8.1  326     .     . .      .    .
589641750 18044 14360 285.6  469     .     . .      .    .
227382839 18156 14846   3.3  379     .     . . -21007  1.7
343151272 21035 13625   167 2579 13791 13791 .      .    .
342144251 20923 18441    17  991     .     . . -17902  2.7
247127712 19318 16706 256.8 1721     .     . .      .    .
518520534 15939 15953   159  644 15958 15958 .      .    .
     1420 21117 15383    39  789     .     . .      . 2.81
880888223 20563 13478   224  630     .     . .  13593 3.41
385851622 18770 16595    23  541 16654 16654 . -19062 5.67
421101868 16301 16421 534.6  447     .     . .      .    .
     2610 20972 17469   241 1884     .     . .      .    .
404781520 21112 15475  13.1 1026     .     . . -20835  1.7
143866957 21130 17059    20    .     .     . .      .    .
393688381 20867 15775    86  407     .     . . -20258 4.48
843654811 17371 15553    20 1035     .     . .      .    .
987209498 20361 14374 183.3  659 14431 14431 .      .    .
end
format %tdnn/dd/CCYY datestartarv
format %tdnn/dd/CCYY dateprurigo


Clyde Schechter

Join Date: Apr 2014
Posts: 29796

#10

04 Nov 2019, 16:46

I don't normally use the term "hot mess," but that is what you have here. You will be fortunate indeed if you find a way to do time-to-event analyses on this data, as most of the events you show have only missing values for the associated dates. The code shown below is a step in that direction. The bulk of the work, however, is the renaming. To the extent that there are "subsystems" in the way the original variables were named, it may be possible to shorten this code. For example, I could have written -rename cd?real valuecd?- instead of two separate commands for cd4 and cd8. But you describe the naming of the variables as irregular, and the more irregular it is, the more work it will be for you to write the -rename- commands. What is critical is that for every event in the data there must be a variable whose name is date followed by the event name (e.g. datecd4) and, if the event is associated with a numerical outcome (like a cd4 count), that outcome's variable must be named value followed by the same event name (e.g. valuecd4).

Now your data includes a number of different events. Some of these events just happen (e.g. starting arv), but some are actually measurements that produce values. The latter get their numerical values placed in a new variable called value. The former don't.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int(datestartarv dateprurigo) float(cd4real cd8real cd4date cd8date datecureprurigo viralloaddate viralload)
     5698 20949 13815    74    . 13767     . .      .    .
 87498963 14368 14340    12   72     .     . .      .    .
     1441 20370 14962   263 2790     .     . .      .    .
152568996 19796 13170   8.1  326     .     . .      .    .
589641750 18044 14360 285.6  469     .     . .      .    .
227382839 18156 14846   3.3  379     .     . . -21007  1.7
343151272 21035 13625   167 2579 13791 13791 .      .    .
342144251 20923 18441    17  991     .     . . -17902  2.7
247127712 19318 16706 256.8 1721     .     . .      .    .
518520534 15939 15953   159  644 15958 15958 .      .    .
     1420 21117 15383    39  789     .     . .      . 2.81
880888223 20563 13478   224  630     .     . .  13593 3.41
385851622 18770 16595    23  541 16654 16654 . -19062 5.67
421101868 16301 16421 534.6  447     .     . .      .    .
     2610 20972 17469   241 1884     .     . .      .    .
404781520 21112 15475  13.1 1026     .     . . -20835  1.7
143866957 21130 17059    20    .     .     . .      .    .
393688381 20867 15775    86  407     .     . . -20258 4.48
843654811 17371 15553    20 1035     .     . .      .    .
987209498 20361 14374 183.3  659 14431 14431 .      .    .
end
format %tdnn/dd/CCYY datestartarv
format %tdnn/dd/CCYY dateprurigo
format %tdnn/dd/CCYY viralloaddate

rename *date date*

//  THE FOLLOWING IS EASY ENOUGH WITH SO FEW VARIABLES
//  IN YOUR REAL DATA THIS WILL BE A LOT OF MESSY WORK
rename cd4real valuecd4
rename cd8real valuecd8
rename viralload valueviralload


//  RESHAPE
reshape long date value, i(id) j(event) string
drop if missing(date) // THESE CANNOT BE USED IN A TIME-TO-EVENT ANALYSIS

I hope this gives you a start in the right direction.
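
As a rough sketch of the step that would follow, here is one way the multiple-records-per-subject -stset- might look on data shaped like the reshape output. The three-column layout (event, date, value) matches the code above, but the rows themselves are made up for illustration, and the choice of startarv as the time origin and cureprurigo as the failure event is an assumption, as are the indicator names is_start and is_fail.

Code:

* HYPOTHETICAL long-format rows, shaped like the output of the reshape above
clear
input long id str12 event int date float value
1 "startarv"    18000   .
1 "cd4"         18100 250
1 "cureprurigo" 18300   .
2 "startarv"    19000   .
2 "cd4"         19050 120
end
format %td date

* ASSUMPTION: analysis time starts at ARV initiation, failure is cure of prurigo
gen byte is_start = (event == "startarv")
gen byte is_fail  = (event == "cureprurigo")

* one subject = several records; origin and failure are picked out by the indicators
sort id date
stset date, id(id) failure(is_fail == 1) origin(is_start == 1)

Two things to watch for in real data: -stset- flags records that share the same id and date as probable errors, so same-day events need to be combined first; and each record describes the interval that ends at its date, so a measured value usually has to be carried forward to the next record before it can serve as a time-varying covariate.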


mathieu nacher

Join Date: Jan 2019

Posts: 38
#11

07 Nov 2019, 07:20

Thank you very much, I will try!
