How to create binary variables from a composite string variable

Shamil Cooray

Join Date: Jul 2019

Posts: 12
#1

How to create binary variables from a composite string variable

03 Jul 2019, 05:41

Dear Community,
I am relatively new to Stata and am still learns how to use macros and loops which I suspect may be a part of the required solution.

I need to transform a string variable in a large dataset (~40,000 records) into a form that I can analyse. The string variable (e.g. procedure) is currently a composite string variable where each response is separated by a line return [(char(10)]. Each response is a phrase (e.g. cholecystectomy). For each patient record the variable may be missing, have one response (cholecystectomy) or multiple responses (cholecystectomy (char(10)) appendectomy etc.). I would like to create new binary variables for each of the procedures present. There are approximately 100 different procedure values. I don't have access to a reference list of all possible values this variable can take. Rather I would like the new variable names to be generated from the values already within this variable.

This dataset is coded in this way for about 20 other string variables so the method I use naturally needs to be easily reproducible.

I can use the split command to parse by char(10) but this creates many new variables (e.g. procedure1, procedure2 etc...) but these are not binary (e.g. cholecystectomy, appendectomy).

Code:

split procedure, parse(`=char(10)')

A solution to a similar problem "How to create binary variables from words in phrases?" was posted here but is designed around the new variables of interest being 'words' within phrases. I haven't been able to modify the code to make it suit phrases delimited by line breaks within a long string. Here is the solution posted by Robert Pickard that seems to be on the right track:

Originally posted by Robert Picard View Post

Here's another approach

Code:

clear input byte id str17 phrase 1 "new video post" 2 "newer tweet" 3 "add 12 new photos" 4 "removed 1 photo" end format %-17s phrase split phrase local nwords = r(nvars) forvalues i = 1/`nwords' { levelsof phrase`i', clean foreach word in `r(levels)' { // in case a word is not a valid Stata name local vname = strtoname("`word'") // ignore error if var already exists cap gen byte `vname' = 0 // look for space delimited words qui replace `vname' = 1 if strpos(" " + phrase`i' + " "," `word' ") } }

Thank you for sharing your expertise.

Shamil

Last edited by Shamil Cooray; 03 Jul 2019, 05:43. Reason: fixed typo in title
Tags: None

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

03 Jul 2019, 06:18

Code:

*a data example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float patientID str39 procedure
1 "cholecystectomy?appendectomy"          
2 "cholecystectomy?something?somethingelse"
3 "appendectomy"                          
end
replace procedure = subinstr(procedure, "?",char(10),.)

*splitting your lists of procedures
split procedure, parse(`=char(10)')
drop procedure
reshape long procedure, i(patientID) j(procedure_no)
drop if procedure ==""

*making your binary variables:
levelsof procedure, local(procedurelist)
foreach thing of local procedurelist{
gen `thing'=0
replace `thing'=1 if procedure=="`thing'"
}

* EDIT: If you want end up wit a single line per patientID:
collapse (max) procedure_no `procedurelist', by(patientID)

Note: this assumes you procedure names are all a single word. If they have spaces, you'll need to get rid of those to make them valid Stata names. You would also need to get rid of them in the values for the procedure variable, as they would otherwise not be matched with the newly created variables.

Last edited by Jorrit Gosens; 03 Jul 2019, 06:48.

Comment

Shamil Cooray

Join Date: Jul 2019

Posts: 12
#3

03 Jul 2019, 06:54

Dear Jorrit,
Thank you for your solution.
Most procedure names are not single words but rather procedures, e.g. "in vitro fertilisation". Sorry for not making this clear in my initial post. Perhaps I could replace all " " with "_" within the variables of interest? Is there a better way of doing this?
Shamil
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

03 Jul 2019, 07:02

Code:

*a data example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float patientID str48 procedure
1 "cholecystectomy?appendectomy"          
2 "cholecystectomy?something?in vitro fertilisation"
3 "appendectomy"                          
end
replace procedure = subinstr(procedure, "?",char(10),.)

*splitting your lists of procedures
split procedure, parse(`=char(10)')
drop procedure
reshape long procedure, i(patientID) j(procedure_no)
drop if procedure ==""

*making your binary variables:
replace procedure =strtoname(procedure)
levelsof procedure, local(procedurelist)
foreach thing of local procedurelist{
gen `thing'=0
replace `thing'=1 if procedure=="`thing'"
}
*If you want end up wit a single line per patientID:
collapse (max) procedure_no `procedurelist', by(patientID)

But you might want to do

Code:

gen somevar =strtoname(procedure)
browse if somevar!=procedure

to check what sort of changes are being made. See explanation here: https://www.stata.com/manuals13/m-5strtoname.pdf

Comment

Shamil Cooray

Join Date: Jul 2019

Posts: 12
#5

03 Jul 2019, 07:06

additionally, would there be a way of doing this independent of patientID, for instance, because patients may have more than record due to multiple admissions? the reshape command is returning an error at present because the ID variable doesn't uniquely identify the observations.
Shamil
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#6

03 Jul 2019, 07:12

Depends on what you'd like the result to be.
A single line with all treatments for each patient independent of admission number? If so, should multiple treatments be counted? I.e., should the 'dummy variables' have values higher than one?
Or single lines for each admission for each patient?
Comment
Shamil Cooray

Join Date: Jul 2019

Posts: 12
#7

03 Jul 2019, 07:14

the later, single lines for each admission for each patient

Last edited by Shamil Cooray; 03 Jul 2019, 07:14. Reason: typo corrected
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

03 Jul 2019, 07:27

Code:

*a data example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float patientID str48 procedure
1 "cholecystectomy?appendectomy"          
2 "cholecystectomy?something?in vitro fertilisation"
3 "appendectomy"                          
3 "something else"                          
end
replace procedure = subinstr(procedure, "?",char(10),.)

*create admisison number if needed
bys patientID: gen admission_no=_n

*splitting your lists of procedures
split procedure, parse(`=char(10)')
drop procedure
reshape long procedure, i(patientID admission_no) j(procedure_no)
drop if procedure ==""

*making your binary variables:
replace procedure =strtoname(procedure)

levelsof procedure, local(procedurelist)
foreach thing of local procedurelist{
gen `thing'=0
replace `thing'=1 if procedure=="`thing'"
}
*If you want end up wit a single line per patientID (and admisison):
collapse (max) procedure_no `procedurelist', by(patientID admission_no)

Comment

Shamil Cooray

Join Date: Jul 2019

Posts: 12
#9

03 Jul 2019, 09:33

Dear Jorrit,
That's brilliant. It works well. How would collapse to a single line per patientID (and admission) with the newly created binary variables for procedures rather than the number of procedures (or in addition to the number of procedures)?

Thank you very much. I'm learning a lot from your code.

Shamil
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#10

03 Jul 2019, 09:59

I dont really understand your question.
The current code gives you, for each patient and admission:
- the total number of procedures in that admission
- a dummy indicating which procedures were done.

What would you like instead?

Note that the current line:

Code:

collapse (max) procedure_no `procedurelist', by(patientID admission_no)

collapses all individual procedure variables.
`procedurelist' is a list of all those variables names.
Comment
Shamil Cooray

Join Date: Jul 2019

Posts: 12
#11

04 Jul 2019, 02:25

My mistake, on running the code again it does seem to do what I need. I think I may have made an error last time. Thank you very much.

When I run this code on more entire dataset the reshape command takes a long time given the computation involved (dataset has ~26,000 observations and 70 variables). To get around this issue I've been testing it on a limited sample of my dataset (all observations but on 3 variables).

Do you have any suggestions on the workflow to run this code on about 20 other variables which are multi-line composite string variables like "procedure"? Presumably, I could run the code through each of the 20 variables sequentially using a macro?

I'm running Stat/IC 15.1 for Windows (64-bit). My machine is about 18 months old and unfortunately upgrading it isn't an option at this stage.

Thank you very much for your time.

Shamil
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

#12

04 Jul 2019, 03:51

That is a little more complex, in particular if you want to end up with single lines per patient/admission.
Running collapse gets rid of any variables not specified in the command, so you cant immediately run the above code on all variables with your complex strings.
Below approach gets around this by running the above code on a smaller set of your data (a single one of your string vars), and then sticking the results in a temporary file. That temporary file gets updated each time the loop uses a new variable. In the code below, replace

Code:

foreach item in procedure othervar{

With

Code:

foreach item in all your variable names{

Code:

*a data example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float patientID str50(procedure othervar)
1 "cholecystectomy?appendectomy"                     "other thing one?other thing two"                    
2 "cholecystectomy?something?in vitro fertilisation" "other thing two?other thing three?other thing four"
3 "appendectomy"                                     "other thing one"                                    
3 "something else"                                   "other thing four"                                  
end
replace procedure = subinstr(procedure, "?",char(10),.)
replace othervar= subinstr(othervar, "?",char(10),.)
*create admisison number if needed
bys patientID: gen admission_no=_n
* End creating example data

*Code to run:
tempfile original
save `original'

preserve
keep patientID admission_no
sort patientID admission_no
tempfile merged
save `merged'
restore

foreach item in procedure othervar{
    use `original', clear
    *splitting your lists of procedures
    split `item', parse(`=char(10)')
    drop `item'
    reshape long `item', i(patientID admission_no) j(`item'_no)
    drop if `item' ==""

    *making your binary variables:
    replace `item' =strtoname(`item')

    levelsof `item', local(listofthings)
    foreach thing of local listofthings{
        gen `thing'=0
        replace `thing'=1 if `item'=="`thing'"
    }
    collapse (max) `item'_no `listofthings', by(patientID admission_no)
    merge 1:1 patientID admission_no using `merged', nogenerate
    save `merged', replace
}
use `merged', clear
save somefilename

And then just let it run.
Go for a coffee, or lunch, or read some articles. Patience is a virtue. Or at least you can keep that in mind until you can afford a quicker machine.

Edit:
You should be able to speed this up a bunch by keeping only the required variables when you reshape.

Code:

foreach item in procedure othervar{
    use `original', clear
    *splitting your lists of procedures
    split `item', parse(`=char(10)')
    drop `item'
    keep patientID admission_no `item'*
    reshape long `item', i(patientID admission_no) j(`item'_no)
    drop if `item' ==""

    *making your binary variables:
    replace `item' =strtoname(`item')

    levelsof `item', local(listofthings)
    foreach thing of local listofthings{
        gen `thing'=0
        replace `thing'=1 if `item'=="`thing'"
    }
    collapse (max) `item'_no `listofthings', by(patientID admission_no)
    merge 1:1 patientID admission_no using `merged', nogenerate
    save `merged', replace
}

But make sure to include all the variables that you want to keep.

Last edited by Jorrit Gosens; 04 Jul 2019, 04:13.

Comment

Shamil Cooray

Join Date: Jul 2019

Posts: 12
#13

08 Jul 2019, 05:58

Hi Jorrit,
This solution looks great. Sorry for the delay- it has taken me some time to work through it.
The code obviously works perfectly using the test data however I am getting an error when I translate it to my actual dataset.

This is the code as translated to my dataset:

Code:

tempfile original save `original' preserve keep episodeid babyepisodeid sort episodeid babyepisodeid tempfile merged save `merged' restore * after "foreach item in" add in variables that need to be split foreach item in obstetriccomplications{ use `original', clear *splitting your lists of procedures split `item', parse(`=char(10)') drop `item' keep episodeid babyepisodeid `item' *add list of variables to keep reshape long `item', i(episodeid babyepisodeid) j(`item'_no) drop if `item' =="" *making your binary variables: replace `item' =strtoname(`item') levelsof `item', local(listofthings) foreach thing of local listofthings{ gen `thing'=0 replace `thing'=1 if `item'=="`thing'" } collapse (max) `item'_no `listofthings', by(episodeid babyepisodeid) merge 1:1 episodeid babyepisodeid using `merged', nogenerate save `merged', replace } use `merged', clear save BOS2test.dta

This the Stata output from the first -foreach- command:

. * after "foreach item in" add in variables that need to be split
. foreach item in obstetriccomplications{
2. use `original', clear
3. *splitting your lists of procedures
. split `item', parse(`=char(10)')
4. drop `item'
5. keep episodeid babyepisodeid `item' *add list of variables to keep
6. reshape long `item', i(episodeid babyepisodeid) j(`item'_no)
7. drop if `item' ==""
8.
. *making your binary variables:
. replace `item' =strtoname(`item')
9.
. levelsof `item', local(listofthings)
10. foreach thing of local listofthings{
11. gen `thing'=0
12. replace `thing'=1 if `item'=="`thing'"
13. }
14. collapse (max) `item'_no `listofthings', by(episodeid babyepisodeid)
15. merge 1:1 episodeid babyepisodeid using `merged', nogenerate
16. save `merged', replace
17. }
variables created as string:
obstetricc~1 obstetricc~3 obstetricc~5 obstetricc~7 obstetricc~9
obstetricc~2 obstetricc~4 obstetricc~6 obstetricc~8 obstetric~10
obstetriccomplications ambiguous abbreviation
r(111);

I can't work out why the obstetriccomplications variable is being lost after the -split- command such that when it is called in the -drop- command in the following line it cannot be found?

I had it working perfectly prior the most recent modification which built in the 2nd loop. I know it must be something really simple but I can't seem to nut it out...

Thanks for your expertise.

Shamil
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#14

08 Jul 2019, 06:10

In the line:

Code:

keep patientID admission_no `item'*

The asterisk is important.
You still have an asterisk in that line now, but you include a space, which means Stata will evaluate that asterisk and everything after it as a comment.

Assuming `item'==obstetriccomplications
The difference is:

Code:

keep patientID admission_no obstetriccomplications *some note

-> Stata will assume obstetriccomplications is an exact variable name, or an abbreviation of a single variable, and it finds many variables that are abbreviated as obstetriccomplications (obstetriccomplications1, obstetriccomplications2, etc)

Code:

keep patientID admission_no obstetriccomplications* *some note

-> Stata accepts this to be a wildcard meaning all variables abbreviated as obstetriccomplications

And one more tip:
if you have code with an error that doenst really help pinpoint where it goes wrong, include the line:

Code:

set trace on

as a first line in your ado file. This gives more detailed output, and will show you exactly where in the loop Stata didnt understand instructions.

Last edited by Jorrit Gosens; 08 Jul 2019, 06:13.
1 like
Comment
Shamil Cooray

Join Date: Jul 2019

Posts: 12
#15

08 Jul 2019, 06:13

Following further experimentation I have narrowed the cause of the error down to the -keep- command that preceeds -reshape-. Can't quite work out what I have done wrong?
Comment

Announcement