Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variables with Multiple Choice Options

    Hi all, I have been trying to trouble shoot an issue with some variables, to no avail and hoping someone on this group can help me out.
    I have several string variables in an excel file (str8, str13) where the information is stored as numerical options that corresponded with a choice option from a select all that apply question. See picture

    There are a few issues I need to address for data management:
    1. Remove the commas
    2. Create new variable that is numerical or byte, so numbers can be labeled (e.g. 1 "awareness" 2 "education" etc) like a categorical variable (this would be a new variable).
    3. Then, should I create new variables (wide format) for each of the 9 possible responses (Q shorterm) - seems like alot --
    I would like to use the MRTAB function or something similar to assess a) which option was selected most frequently? b) were certain choices most often selected together?

    For our analysis, we are seeking to answer if response choices were correlated with demographic characteristics and the frequency of each response option itself.
    Steps 1 & 2 are the primary objectives (remove comma if needed) and create non-string variable so I can label 1 - 9 responses (many to one) ... which I cannot do while it is a string (and if I do it after I encode, it labels the observations with the label text, which is not correct).

    Any assistance would be greatly appreciated! String variables and I are not getting along :-)
    Thanks,
    Ashlee

    Click image for larger version

Name:	STR Variables NANN.png
Views:	1
Size:	56.8 KB
ID:	1676972

  • #2
    Hi Ashlee,

    Welcome to the forum.

    1. Please provide a data example as specified in the FAQ above, and paste in into a comment on this thread. You can generate the example with the -dataex- command. An image of some data is not very useful. It is much more helpful to have data that we can copy and paste into a do file.

    2. If you want to use -mrtab- it will probably be easiest to turn each of these columns into a set of indicator variables. It is not enough to simply "remove the commas."

    3. Absolutely do not use -encode- to change a string representation of a number into a numeric representation. That is not what -encode- is for; -encode- is for creating factor variables, and it will not maintain your string encoded numbering system. This will screw up your analysis. If you want to turn a string into a number, use the real() function instead. Remember that if you want to use real, you need to make sure you have one number per column first!

    4. You should look over this piece of documentation if you haven't already.

    5. It is often better to ask a specific question about specific problem you are having. It sounds a bit like you are asking us to do your project for you.

    Comment


    • #3
      Thank you Daniel for this information.
      I am not asking for someone to do my project, rather, trying to understand and learn the best practices around data management for string variables and ensuring I create a dataset that is useful.

      My specific question remains - 1. What command would allow me to change/modify/recode the str variables I have with multiple choice options (numbers), into a format, where I can label each unique option (much like a categorical variable). The variables of concern in the example below would be the first 3 listed.

      Here is an example generated using dataex.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
      "5, 9, 10, 11, 12" "7"       "1, 2, 3" 8 3 1
      ""                 "4, 5, 7" "6"       4 6 1
      ""                 ""        ""        . . .
      ""                 "4, 5, 7" "4"       6 4 1
      ""                 ""        ""        . . .
      ""                 ""        ""        . . .
      ""                 ""        ""        . . .
      ""                 "10"      "2"       8 7 1
      ""                 "3"       "4, 5"    8 5 1
      ""                 ""        ""        . . .
      end
      label values POSITION rolelbl
      label def rolelbl 4 "Educator", modify
      label def rolelbl 6 "Nurse Manager", modify
      label def rolelbl 8 "Staff Nurse", modify
      label values WORKED workyrlbl
      label def workyrlbl 3 "3 - 5 years", modify
      label def workyrlbl 4 "6 - 10 years", modify
      label def workyrlbl 5 "11 - 15 years", modify
      label def workyrlbl 6 "16 - 20 years", modify
      label def workyrlbl 7 "More than 20 years", modify
      label values RACE race
      label def race 1 "White", modify

      Comment


      • #4
        Thank you Ashlee,

        Perhaps someone else knows of a simple way to do this with a single command. I would generate a new indicator variable for each possible value. Take for example RACIALDISPARITY:

        forv i = 1/10{
        gen RACIALDISPARITY_`i' = 0
        replace RACIALDISPARITY_`i' = 1 if strpos(RACIALDISPARITY, "`i'") != 0
        }
        As long as the local `i' covers all of the options for racial disparity, this should give you an indicator variables for each option. If I understand the -mrtab- command documentation correctly, this should be what you need to use it in indicator mode. I used your data example to test and it looks like this works on my end.

        Edit: Just realized I could simplify the above code and remove an unneeded line.
        Last edited by Daniel Schaefer; 08 Aug 2022, 20:20.

        Comment


        • #5
          The method in #4 could be simplified to

          Code:
          forv i = 1/10 {
               gen RACIALDISPARITY_`i' = strpos(RACIALDISPARITY, "`i'") != 0
          }
          but it's very dangerous as a character "1" will be found within "10", say,even when "1" is not present as itself. The very first observation in the data example in #3 shows this problem.

          To get tabulations of frequencies, you can use tabsplit from tab_chi on SSC. I have less experience with the mrtab command (not function) published in the Stata Journal (but updated on SSC).


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
          "5, 9, 10, 11, 12" "7"       "1, 2, 3" 8 3 1
          ""                 "4, 5, 7" "6"       4 6 1
          ""                 ""        ""        . . .
          ""                 "4, 5, 7" "4"       6 4 1
          ""                 ""        ""        . . .
          ""                 ""        ""        . . .
          ""                 ""        ""        . . .
          ""                 "10"      "2"       8 7 1
          ""                 "3"       "4, 5"    8 5 1
          ""                 ""        ""        . . .
          end
          label values POSITION rolelbl
          label def rolelbl 4 "Educator", modify
          label def rolelbl 6 "Nurse Manager", modify
          label def rolelbl 8 "Staff Nurse", modify
          label values WORKED workyrlbl
          label def workyrlbl 3 "3 - 5 years", modify
          label def workyrlbl 4 "6 - 10 years", modify
          label def workyrlbl 5 "11 - 15 years", modify
          label def workyrlbl 6 "16 - 20 years", modify
          label def workyrlbl 7 "More than 20 years", modify
          label values RACE race
          label def race 1 "White", modify
          
          foreach v in INSTITUTIONADOPTED RACIALDISPARITY  BIGGESTBARRIERS { 
              tabsplit `v', sort parse(,)
          }
          
          
          INSTITUTION |
              ADOPTED |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   10 |          1       20.00       20.00
                   11 |          1       20.00       40.00
                   12 |          1       20.00       60.00
                    5 |          1       20.00       80.00
                    9 |          1       20.00      100.00
          ------------+-----------------------------------
                Total |          5      100.00
          
          RACIALDISPA |
                 RITY |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    7 |          3       33.33       33.33
                    4 |          2       22.22       55.56
                    5 |          2       22.22       77.78
                   10 |          1       11.11       88.89
                    3 |          1       11.11      100.00
          ------------+-----------------------------------
                Total |          9      100.00
          
          BIGGESTBARR |
                 IERS |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    2 |          2       25.00       25.00
                    4 |          2       25.00       50.00
                    1 |          1       12.50       62.50
                    3 |          1       12.50       75.00
                    5 |          1       12.50       87.50
                    6 |          1       12.50      100.00
          ------------+-----------------------------------
                Total |          8      100.00
          If you want separate indicators (dummy variables in a common but horrible jargon), then here is one fairly simple method. First create more suitable versions of each composite string

          1. replace commas with spaces.
          2. add leading and trailing spaces.

          You can do both at once with

          Code:
          gen foo2 = " " + subinstr(foo, ",", " ", .) + "  "
          Now you can apply Daniel's idea by including leading and trailing spaces so that a search for say " 1 " will not find " 1 " within " 11 " or " 12 " and so on.

          There are other methods. A discussion of this problem is in press for Stata Journal 22(4).

          Comment


          • #6
            Thank you both for your generous response. I will review the code and examples.

            Comment


            • #7
              Originally posted by Nick Cox View Post
              The method in #4 could be simplified to

              Code:
              forv i = 1/10 {
              gen RACIALDISPARITY_`i' = strpos(RACIALDISPARITY, "`i'") != 0
              }
              but it's very dangerous as a character "1" will be found within "10", say,even when "1" is not present as itself. The very first observation in the data example in #3 shows this problem.

              To get tabulations of frequencies, you can use tabsplit from tab_chi on SSC. I have less experience with the mrtab command (not function) published in the Stata Journal (but updated on SSC).


              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
              "5, 9, 10, 11, 12" "7" "1, 2, 3" 8 3 1
              "" "4, 5, 7" "6" 4 6 1
              "" "" "" . . .
              "" "4, 5, 7" "4" 6 4 1
              "" "" "" . . .
              "" "" "" . . .
              "" "" "" . . .
              "" "10" "2" 8 7 1
              "" "3" "4, 5" 8 5 1
              "" "" "" . . .
              end
              label values POSITION rolelbl
              label def rolelbl 4 "Educator", modify
              label def rolelbl 6 "Nurse Manager", modify
              label def rolelbl 8 "Staff Nurse", modify
              label values WORKED workyrlbl
              label def workyrlbl 3 "3 - 5 years", modify
              label def workyrlbl 4 "6 - 10 years", modify
              label def workyrlbl 5 "11 - 15 years", modify
              label def workyrlbl 6 "16 - 20 years", modify
              label def workyrlbl 7 "More than 20 years", modify
              label values RACE race
              label def race 1 "White", modify
              
              foreach v in INSTITUTIONADOPTED RACIALDISPARITY BIGGESTBARRIERS {
              tabsplit `v', sort parse(,)
              }
              
              
              INSTITUTION |
              ADOPTED | Freq. Percent Cum.
              ------------+-----------------------------------
              10 | 1 20.00 20.00
              11 | 1 20.00 40.00
              12 | 1 20.00 60.00
              5 | 1 20.00 80.00
              9 | 1 20.00 100.00
              ------------+-----------------------------------
              Total | 5 100.00
              
              RACIALDISPA |
              RITY | Freq. Percent Cum.
              ------------+-----------------------------------
              7 | 3 33.33 33.33
              4 | 2 22.22 55.56
              5 | 2 22.22 77.78
              10 | 1 11.11 88.89
              3 | 1 11.11 100.00
              ------------+-----------------------------------
              Total | 9 100.00
              
              BIGGESTBARR |
              IERS | Freq. Percent Cum.
              ------------+-----------------------------------
              2 | 2 25.00 25.00
              4 | 2 25.00 50.00
              1 | 1 12.50 62.50
              3 | 1 12.50 75.00
              5 | 1 12.50 87.50
              6 | 1 12.50 100.00
              ------------+-----------------------------------
              Total | 8 100.00
              If you want separate indicators (dummy variables in a common but horrible jargon), then here is one fairly simple method. First create more suitable versions of each composite string

              1. replace commas with spaces.
              2. add leading and trailing spaces.

              You can do both at once with

              Code:
              gen foo2 = " " + subinstr(foo, ",", " ", .) + " "
              Now you can apply Daniel's idea by including leading and trailing spaces so that a search for say " 1 " will not find " 1 " within " 11 " or " 12 " and so on.

              There are other methods. A discussion of this problem is in press for Stata Journal 22(4).
              Consistent with this response, I am using the code below to split my variables of interest and to generate the frequencies for the split variables. I am getting the error of: too few variables specified.


              foreach v in ICD10s AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB {
              tabsplit `v', sort parse(,)
              }
              Last edited by Nader Mehri; 14 Feb 2024, 20:05.

              Comment


              • #8
                I guess you're running lines one by one from a do-file editor. So tabsplit can't see the local macro contents. See Stata tip 138: Local macros have local scope - Nicholas J. Cox, 2020 (sagepub.com)

                Comment

                Working...
                X