Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to create binary variables from a composite string variable

    Dear Community,
    I am relatively new to Stata and am still learns how to use macros and loops which I suspect may be a part of the required solution.

    I need to transform a string variable in a large dataset (~40,000 records) into a form that I can analyse. The string variable (e.g. procedure) is currently a composite string variable where each response is separated by a line return [(char(10)]. Each response is a phrase (e.g. cholecystectomy). For each patient record the variable may be missing, have one response (cholecystectomy) or multiple responses (cholecystectomy (char(10)) appendectomy etc.). I would like to create new binary variables for each of the procedures present. There are approximately 100 different procedure values. I don't have access to a reference list of all possible values this variable can take. Rather I would like the new variable names to be generated from the values already within this variable.

    This dataset is coded in this way for about 20 other string variables so the method I use naturally needs to be easily reproducible.

    I can use the split command to parse by char(10) but this creates many new variables (e.g. procedure1, procedure2 etc...) but these are not binary (e.g. cholecystectomy, appendectomy).

    Code:
    split procedure, parse(`=char(10)')
    A solution to a similar problem "How to create binary variables from words in phrases?" was posted here but is designed around the new variables of interest being 'words' within phrases. I haven't been able to modify the code to make it suit phrases delimited by line breaks within a long string. Here is the solution posted by Robert Pickard that seems to be on the right track:

    Originally posted by Robert Picard View Post
    Here's another approach

    Code:
    clear
    input byte id str17 phrase
    1 "new video post"
    2 "newer tweet"
    3 "add 12 new photos"
    4 "removed 1 photo"
    end
    format %-17s phrase
    
    split phrase
    local nwords = r(nvars)
    forvalues i = 1/`nwords' {
    
    levelsof phrase`i', clean
    foreach word in `r(levels)' {
    // in case a word is not a valid Stata name
    local vname = strtoname("`word'")
    // ignore error if var already exists
    cap gen byte `vname' = 0
    // look for space delimited words
    qui replace `vname' = 1 if strpos(" " + phrase`i' + " "," `word' ")
    }
    
    }

    Thank you for sharing your expertise.

    Shamil
    Last edited by Shamil Cooray; 03 Jul 2019, 05:43. Reason: fixed typo in title

  • #2
    Code:
    *a data example:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float patientID str39 procedure
    1 "cholecystectomy?appendectomy"          
    2 "cholecystectomy?something?somethingelse"
    3 "appendectomy"                          
    end
    replace procedure = subinstr(procedure, "?",char(10),.)
    
    *splitting your lists of procedures
    split procedure, parse(`=char(10)')
    drop procedure
    reshape long procedure, i(patientID) j(procedure_no)
    drop if procedure ==""
    
    *making your binary variables:
    levelsof procedure, local(procedurelist)
    foreach thing of local procedurelist{
    gen `thing'=0
    replace `thing'=1 if procedure=="`thing'"
    }
    
    * EDIT: If you want end up wit a single line per patientID:
    collapse (max) procedure_no `procedurelist', by(patientID)
    Note: this assumes you procedure names are all a single word. If they have spaces, you'll need to get rid of those to make them valid Stata names. You would also need to get rid of them in the values for the procedure variable, as they would otherwise not be matched with the newly created variables.
    Last edited by Jorrit Gosens; 03 Jul 2019, 06:48.

    Comment


    • #3
      Dear Jorrit,
      Thank you for your solution.
      Most procedure names are not single words but rather procedures, e.g. "in vitro fertilisation". Sorry for not making this clear in my initial post. Perhaps I could replace all " " with "_" within the variables of interest? Is there a better way of doing this?
      Shamil

      Comment


      • #4
        Code:
        *a data example:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float patientID str48 procedure
        1 "cholecystectomy?appendectomy"          
        2 "cholecystectomy?something?in vitro fertilisation"
        3 "appendectomy"                          
        end
        replace procedure = subinstr(procedure, "?",char(10),.)
        
        *splitting your lists of procedures
        split procedure, parse(`=char(10)')
        drop procedure
        reshape long procedure, i(patientID) j(procedure_no)
        drop if procedure ==""
        
        *making your binary variables:
        replace procedure =strtoname(procedure)
        levelsof procedure, local(procedurelist)
        foreach thing of local procedurelist{
        gen `thing'=0
        replace `thing'=1 if procedure=="`thing'"
        }
        *If you want end up wit a single line per patientID:
        collapse (max) procedure_no `procedurelist', by(patientID)
        But you might want to do
        Code:
        gen somevar =strtoname(procedure)
        browse if somevar!=procedure
        to check what sort of changes are being made. See explanation here: https://www.stata.com/manuals13/m-5strtoname.pdf

        Comment


        • #5
          additionally, would there be a way of doing this independent of patientID, for instance, because patients may have more than record due to multiple admissions? the reshape command is returning an error at present because the ID variable doesn't uniquely identify the observations.
          Shamil

          Comment


          • #6
            Depends on what you'd like the result to be.
            A single line with all treatments for each patient independent of admission number? If so, should multiple treatments be counted? I.e., should the 'dummy variables' have values higher than one?
            Or single lines for each admission for each patient?

            Comment


            • #7
              the later, single lines for each admission for each patient
              Last edited by Shamil Cooray; 03 Jul 2019, 07:14. Reason: typo corrected

              Comment


              • #8
                Code:
                *a data example:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float patientID str48 procedure
                1 "cholecystectomy?appendectomy"          
                2 "cholecystectomy?something?in vitro fertilisation"
                3 "appendectomy"                          
                3 "something else"                          
                end
                replace procedure = subinstr(procedure, "?",char(10),.)
                
                *create admisison number if needed
                bys patientID: gen admission_no=_n
                
                *splitting your lists of procedures
                split procedure, parse(`=char(10)')
                drop procedure
                reshape long procedure, i(patientID admission_no) j(procedure_no)
                drop if procedure ==""
                
                *making your binary variables:
                replace procedure =strtoname(procedure)
                
                levelsof procedure, local(procedurelist)
                foreach thing of local procedurelist{
                gen `thing'=0
                replace `thing'=1 if procedure=="`thing'"
                }
                *If you want end up wit a single line per patientID (and admisison):
                collapse (max) procedure_no `procedurelist', by(patientID admission_no)

                Comment


                • #9
                  Dear Jorrit,
                  That's brilliant. It works well. How would collapse to a single line per patientID (and admission) with the newly created binary variables for procedures rather than the number of procedures (or in addition to the number of procedures)?

                  Thank you very much. I'm learning a lot from your code.

                  Shamil

                  Comment


                  • #10
                    I dont really understand your question.
                    The current code gives you, for each patient and admission:
                    - the total number of procedures in that admission
                    - a dummy indicating which procedures were done.

                    What would you like instead?

                    Note that the current line:
                    Code:
                    collapse (max) procedure_no `procedurelist', by(patientID admission_no)
                    collapses all individual procedure variables.
                    `procedurelist' is a list of all those variables names.

                    Comment


                    • #11
                      My mistake, on running the code again it does seem to do what I need. I think I may have made an error last time. Thank you very much.

                      When I run this code on more entire dataset the reshape command takes a long time given the computation involved (dataset has ~26,000 observations and 70 variables). To get around this issue I've been testing it on a limited sample of my dataset (all observations but on 3 variables).

                      Do you have any suggestions on the workflow to run this code on about 20 other variables which are multi-line composite string variables like "procedure"? Presumably, I could run the code through each of the 20 variables sequentially using a macro?

                      I'm running Stat/IC 15.1 for Windows (64-bit). My machine is about 18 months old and unfortunately upgrading it isn't an option at this stage.

                      Thank you very much for your time.

                      Shamil

                      Comment


                      • #12
                        That is a little more complex, in particular if you want to end up with single lines per patient/admission.
                        Running collapse gets rid of any variables not specified in the command, so you cant immediately run the above code on all variables with your complex strings.
                        Below approach gets around this by running the above code on a smaller set of your data (a single one of your string vars), and then sticking the results in a temporary file. That temporary file gets updated each time the loop uses a new variable. In the code below, replace
                        Code:
                        foreach item in procedure othervar{
                        With
                        Code:
                        foreach item in all your variable names{

                        Code:
                        *a data example:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input float patientID str50(procedure othervar)
                        1 "cholecystectomy?appendectomy"                     "other thing one?other thing two"                    
                        2 "cholecystectomy?something?in vitro fertilisation" "other thing two?other thing three?other thing four"
                        3 "appendectomy"                                     "other thing one"                                    
                        3 "something else"                                   "other thing four"                                  
                        end
                        replace procedure = subinstr(procedure, "?",char(10),.)
                        replace othervar= subinstr(othervar, "?",char(10),.)
                        *create admisison number if needed
                        bys patientID: gen admission_no=_n
                        * End creating example data
                        
                        *Code to run:
                        tempfile original
                        save `original'
                        
                        preserve
                        keep patientID admission_no
                        sort patientID admission_no
                        tempfile merged
                        save `merged'
                        restore
                        
                        foreach item in procedure othervar{
                            use `original', clear
                            *splitting your lists of procedures
                            split `item', parse(`=char(10)')
                            drop `item'
                            reshape long `item', i(patientID admission_no) j(`item'_no)
                            drop if `item' ==""
                        
                            *making your binary variables:
                            replace `item' =strtoname(`item')
                        
                            levelsof `item', local(listofthings)
                            foreach thing of local listofthings{
                                gen `thing'=0
                                replace `thing'=1 if `item'=="`thing'"
                            }
                            collapse (max) `item'_no `listofthings', by(patientID admission_no)
                            merge 1:1 patientID admission_no using `merged', nogenerate
                            save `merged', replace
                        }
                        use `merged', clear
                        save somefilename
                        And then just let it run.
                        Go for a coffee, or lunch, or read some articles. Patience is a virtue. Or at least you can keep that in mind until you can afford a quicker machine.


                        Edit:
                        You should be able to speed this up a bunch by keeping only the required variables when you reshape.
                        Code:
                        foreach item in procedure othervar{
                            use `original', clear
                            *splitting your lists of procedures
                            split `item', parse(`=char(10)')
                            drop `item'
                            keep patientID admission_no `item'*
                            reshape long `item', i(patientID admission_no) j(`item'_no)
                            drop if `item' ==""
                        
                            *making your binary variables:
                            replace `item' =strtoname(`item')
                        
                            levelsof `item', local(listofthings)
                            foreach thing of local listofthings{
                                gen `thing'=0
                                replace `thing'=1 if `item'=="`thing'"
                            }
                            collapse (max) `item'_no `listofthings', by(patientID admission_no)
                            merge 1:1 patientID admission_no using `merged', nogenerate
                            save `merged', replace
                        }
                        But make sure to include all the variables that you want to keep.
                        Last edited by Jorrit Gosens; 04 Jul 2019, 04:13.

                        Comment


                        • #13
                          Hi Jorrit,
                          This solution looks great. Sorry for the delay- it has taken me some time to work through it.
                          The code obviously works perfectly using the test data however I am getting an error when I translate it to my actual dataset.

                          This is the code as translated to my dataset:

                          Code:
                          tempfile original
                          save `original'
                          
                          preserve
                          keep episodeid babyepisodeid
                          sort episodeid babyepisodeid
                          tempfile merged
                          save `merged'
                          restore
                          
                          * after "foreach item in" add in variables that need to be split
                          foreach item in obstetriccomplications{
                              use `original', clear
                              *splitting your lists of procedures
                              split `item', parse(`=char(10)')
                              drop `item'
                              keep episodeid babyepisodeid `item' *add list of variables to keep
                              reshape long `item', i(episodeid babyepisodeid) j(`item'_no)
                              drop if `item' ==""
                          
                              *making your binary variables:
                              replace `item' =strtoname(`item')
                          
                              levelsof `item', local(listofthings)
                              foreach thing of local listofthings{
                                  gen `thing'=0
                                  replace `thing'=1 if `item'=="`thing'"
                              }
                              collapse (max) `item'_no `listofthings', by(episodeid babyepisodeid)
                              merge 1:1 episodeid babyepisodeid using `merged', nogenerate
                              save `merged', replace
                          }
                          use `merged', clear
                          save BOS2test.dta

                          This the Stata output from the first -foreach- command:


                          . * after "foreach item in" add in variables that need to be split
                          . foreach item in obstetriccomplications{
                          2. use `original', clear
                          3. *splitting your lists of procedures
                          . split `item', parse(`=char(10)')
                          4. drop `item'
                          5. keep episodeid babyepisodeid `item' *add list of variables to keep
                          6. reshape long `item', i(episodeid babyepisodeid) j(`item'_no)
                          7. drop if `item' ==""
                          8.
                          . *making your binary variables:
                          . replace `item' =strtoname(`item')
                          9.
                          . levelsof `item', local(listofthings)
                          10. foreach thing of local listofthings{
                          11. gen `thing'=0
                          12. replace `thing'=1 if `item'=="`thing'"
                          13. }
                          14. collapse (max) `item'_no `listofthings', by(episodeid babyepisodeid)
                          15. merge 1:1 episodeid babyepisodeid using `merged', nogenerate
                          16. save `merged', replace
                          17. }
                          variables created as string:
                          obstetricc~1 obstetricc~3 obstetricc~5 obstetricc~7 obstetricc~9
                          obstetricc~2 obstetricc~4 obstetricc~6 obstetricc~8 obstetric~10
                          obstetriccomplications ambiguous abbreviation
                          r(111);

                          I can't work out why the obstetriccomplications variable is being lost after the -split- command such that when it is called in the -drop- command in the following line it cannot be found?

                          I had it working perfectly prior the most recent modification which built in the 2nd loop. I know it must be something really simple but I can't seem to nut it out...

                          Thanks for your expertise.

                          Shamil

                          Comment


                          • #14
                            In the line:
                            Code:
                                keep patientID admission_no `item'*
                            The asterisk is important.
                            You still have an asterisk in that line now, but you include a space, which means Stata will evaluate that asterisk and everything after it as a comment.

                            Assuming `item'==obstetriccomplications
                            The difference is:
                            Code:
                                keep patientID admission_no obstetriccomplications *some note
                            -> Stata will assume obstetriccomplications is an exact variable name, or an abbreviation of a single variable, and it finds many variables that are abbreviated as obstetriccomplications (obstetriccomplications1, obstetriccomplications2, etc)

                            Code:
                                keep patientID admission_no obstetriccomplications* *some note
                            -> Stata accepts this to be a wildcard meaning all variables abbreviated as obstetriccomplications

                            And one more tip:
                            if you have code with an error that doenst really help pinpoint where it goes wrong, include the line:
                            Code:
                            set trace on
                            as a first line in your ado file. This gives more detailed output, and will show you exactly where in the loop Stata didnt understand instructions.
                            Last edited by Jorrit Gosens; 08 Jul 2019, 06:13.

                            Comment


                            • #15
                              Following further experimentation I have narrowed the cause of the error down to the -keep- command that preceeds -reshape-. Can't quite work out what I have done wrong?

                              Comment

                              Working...
                              X