Dear Community,
I am relatively new to Stata and am still learns how to use macros and loops which I suspect may be a part of the required solution.
I need to transform a string variable in a large dataset (~40,000 records) into a form that I can analyse. The string variable (e.g. procedure) is currently a composite string variable where each response is separated by a line return [(char(10)]. Each response is a phrase (e.g. cholecystectomy). For each patient record the variable may be missing, have one response (cholecystectomy) or multiple responses (cholecystectomy (char(10)) appendectomy etc.). I would like to create new binary variables for each of the procedures present. There are approximately 100 different procedure values. I don't have access to a reference list of all possible values this variable can take. Rather I would like the new variable names to be generated from the values already within this variable.
This dataset is coded in this way for about 20 other string variables so the method I use naturally needs to be easily reproducible.
I can use the split command to parse by char(10) but this creates many new variables (e.g. procedure1, procedure2 etc...) but these are not binary (e.g. cholecystectomy, appendectomy).
A solution to a similar problem "How to create binary variables from words in phrases?" was posted here but is designed around the new variables of interest being 'words' within phrases. I haven't been able to modify the code to make it suit phrases delimited by line breaks within a long string. Here is the solution posted by Robert Pickard that seems to be on the right track:
Thank you for sharing your expertise.
Shamil
I am relatively new to Stata and am still learns how to use macros and loops which I suspect may be a part of the required solution.
I need to transform a string variable in a large dataset (~40,000 records) into a form that I can analyse. The string variable (e.g. procedure) is currently a composite string variable where each response is separated by a line return [(char(10)]. Each response is a phrase (e.g. cholecystectomy). For each patient record the variable may be missing, have one response (cholecystectomy) or multiple responses (cholecystectomy (char(10)) appendectomy etc.). I would like to create new binary variables for each of the procedures present. There are approximately 100 different procedure values. I don't have access to a reference list of all possible values this variable can take. Rather I would like the new variable names to be generated from the values already within this variable.
This dataset is coded in this way for about 20 other string variables so the method I use naturally needs to be easily reproducible.
I can use the split command to parse by char(10) but this creates many new variables (e.g. procedure1, procedure2 etc...) but these are not binary (e.g. cholecystectomy, appendectomy).
Code:
split procedure, parse(`=char(10)')
Originally posted by Robert Picard
View Post
Thank you for sharing your expertise.
Shamil
Comment