Variables with Multiple Choice Options

Ashlee Va

Join Date: Aug 2022

Posts: 8
#1

Variables with Multiple Choice Options

08 Aug 2022, 15:17

Hi all, I have been trying to trouble shoot an issue with some variables, to no avail and hoping someone on this group can help me out.
I have several string variables in an excel file (str8, str13) where the information is stored as numerical options that corresponded with a choice option from a select all that apply question. See picture

There are a few issues I need to address for data management:
1. Remove the commas
2. Create new variable that is numerical or byte, so numbers can be labeled (e.g. 1 "awareness" 2 "education" etc) like a categorical variable (this would be a new variable).
3. Then, should I create new variables (wide format) for each of the 9 possible responses (Q shorterm) - seems like alot --
I would like to use the MRTAB function or something similar to assess a) which option was selected most frequently? b) were certain choices most often selected together?

For our analysis, we are seeking to answer if response choices were correlated with demographic characteristics and the frequency of each response option itself.
Steps 1 & 2 are the primary objectives (remove comma if needed) and create non-string variable so I can label 1 - 9 responses (many to one) ... which I cannot do while it is a string (and if I do it after I encode, it labels the observations with the label text, which is not correct).

Any assistance would be greatly appreciated! String variables and I are not getting along :-)
Thanks,
Ashlee
Tags: None
Daniel Schaefer

Join Date: Mar 2020

Posts: 806
#2

08 Aug 2022, 15:54

Hi Ashlee,

Welcome to the forum.

1. Please provide a data example as specified in the FAQ above, and paste in into a comment on this thread. You can generate the example with the -dataex- command. An image of some data is not very useful. It is much more helpful to have data that we can copy and paste into a do file.

2. If you want to use -mrtab- it will probably be easiest to turn each of these columns into a set of indicator variables. It is not enough to simply "remove the commas."

3. Absolutely do not use -encode- to change a string representation of a number into a numeric representation. That is not what -encode- is for; -encode- is for creating factor variables, and it will not maintain your string encoded numbering system. This will screw up your analysis. If you want to turn a string into a number, use the real() function instead. Remember that if you want to use real, you need to make sure you have one number per column first!

4. You should look over this piece of documentation if you haven't already.

5. It is often better to ask a specific question about specific problem you are having. It sounds a bit like you are asking us to do your project for you.
Comment

Ashlee Va

Join Date: Aug 2022
Posts: 8

08 Aug 2022, 16:30

Thank you Daniel for this information.
I am not asking for someone to do my project, rather, trying to understand and learn the best practices around data management for string variables and ensuring I create a dataset that is useful.

My specific question remains - 1. What command would allow me to change/modify/recode the str variables I have with multiple choice options (numbers), into a format, where I can label each unique option (much like a categorical variable). The variables of concern in the example below would be the first 3 listed.

Here is an example generated using dataex.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
"5, 9, 10, 11, 12" "7"       "1, 2, 3" 8 3 1
""                 "4, 5, 7" "6"       4 6 1
""                 ""        ""        . . .
""                 "4, 5, 7" "4"       6 4 1
""                 ""        ""        . . .
""                 ""        ""        . . .
""                 ""        ""        . . .
""                 "10"      "2"       8 7 1
""                 "3"       "4, 5"    8 5 1
""                 ""        ""        . . .
end
label values POSITION rolelbl
label def rolelbl 4 "Educator", modify
label def rolelbl 6 "Nurse Manager", modify
label def rolelbl 8 "Staff Nurse", modify
label values WORKED workyrlbl
label def workyrlbl 3 "3 - 5 years", modify
label def workyrlbl 4 "6 - 10 years", modify
label def workyrlbl 5 "11 - 15 years", modify
label def workyrlbl 6 "16 - 20 years", modify
label def workyrlbl 7 "More than 20 years", modify
label values RACE race
label def race 1 "White", modify

Comment

Daniel Schaefer

Join Date: Mar 2020

Posts: 806
#4

08 Aug 2022, 19:33

Thank you Ashlee,

Perhaps someone else knows of a simple way to do this with a single command. I would generate a new indicator variable for each possible value. Take for example RACIALDISPARITY:

forv i = 1/10{
gen RACIALDISPARITY_`i' = 0
replace RACIALDISPARITY_`i' = 1 if strpos(RACIALDISPARITY, "`i'") != 0
}

As long as the local `i' covers all of the options for racial disparity, this should give you an indicator variables for each option. If I understand the -mrtab- command documentation correctly, this should be what you need to use it in indicator mode. I used your data example to test and it looks like this works on my end.

Edit: Just realized I could simplify the above code and remove an unneeded line.

Last edited by Daniel Schaefer; 08 Aug 2022, 20:20.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35213

09 Aug 2022, 04:42

The method in #4 could be simplified to

Code:

forv i = 1/10 {
     gen RACIALDISPARITY_`i' = strpos(RACIALDISPARITY, "`i'") != 0
}

but it's very dangerous as a character "1" will be found within "10", say,even when "1" is not present as itself. The very first observation in the data example in #3 shows this problem.

To get tabulations of frequencies, you can use tabsplit from tab_chi on SSC. I have less experience with the mrtab command (not function) published in the Stata Journal (but updated on SSC).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
"5, 9, 10, 11, 12" "7"       "1, 2, 3" 8 3 1
""                 "4, 5, 7" "6"       4 6 1
""                 ""        ""        . . .
""                 "4, 5, 7" "4"       6 4 1
""                 ""        ""        . . .
""                 ""        ""        . . .
""                 ""        ""        . . .
""                 "10"      "2"       8 7 1
""                 "3"       "4, 5"    8 5 1
""                 ""        ""        . . .
end
label values POSITION rolelbl
label def rolelbl 4 "Educator", modify
label def rolelbl 6 "Nurse Manager", modify
label def rolelbl 8 "Staff Nurse", modify
label values WORKED workyrlbl
label def workyrlbl 3 "3 - 5 years", modify
label def workyrlbl 4 "6 - 10 years", modify
label def workyrlbl 5 "11 - 15 years", modify
label def workyrlbl 6 "16 - 20 years", modify
label def workyrlbl 7 "More than 20 years", modify
label values RACE race
label def race 1 "White", modify

foreach v in INSTITUTIONADOPTED RACIALDISPARITY  BIGGESTBARRIERS { 
    tabsplit `v', sort parse(,)
}


INSTITUTION |
    ADOPTED |      Freq.     Percent        Cum.
------------+-----------------------------------
         10 |          1       20.00       20.00
         11 |          1       20.00       40.00
         12 |          1       20.00       60.00
          5 |          1       20.00       80.00
          9 |          1       20.00      100.00
------------+-----------------------------------
      Total |          5      100.00

RACIALDISPA |
       RITY |      Freq.     Percent        Cum.
------------+-----------------------------------
          7 |          3       33.33       33.33
          4 |          2       22.22       55.56
          5 |          2       22.22       77.78
         10 |          1       11.11       88.89
          3 |          1       11.11      100.00
------------+-----------------------------------
      Total |          9      100.00

BIGGESTBARR |
       IERS |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |          2       25.00       25.00
          4 |          2       25.00       50.00
          1 |          1       12.50       62.50
          3 |          1       12.50       75.00
          5 |          1       12.50       87.50
          6 |          1       12.50      100.00
------------+-----------------------------------
      Total |          8      100.00

If you want separate indicators (dummy variables in a common but horrible jargon), then here is one fairly simple method. First create more suitable versions of each composite string

1. replace commas with spaces.
2. add leading and trailing spaces.

You can do both at once with

Code:

gen foo2 = " " + subinstr(foo, ",", " ", .) + "  "

Now you can apply Daniel's idea by including leading and trailing spaces so that a search for say " 1 " will not find " 1 " within " 11 " or " 12 " and so on.

There are other methods. A discussion of this problem is in press for Stata Journal 22(4).

Comment

Ashlee Va

Join Date: Aug 2022

Posts: 8
#6

09 Aug 2022, 07:42

Thank you both for your generous response. I will review the code and examples.
Comment

Nader Mehri

Join Date: Jun 2019
Posts: 189

14 Feb 2024, 20:02

Originally posted by Nick Cox View Post

The method in #4 could be simplified to

Code:

forv i = 1/10 {
gen RACIALDISPARITY_`i' = strpos(RACIALDISPARITY, "`i'") != 0
}

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str45 INSTITUTIONADOPTED str8 RACIALDISPARITY str13 BIGGESTBARRIERS byte(POSITION WORKED RACE)
"5, 9, 10, 11, 12" "7" "1, 2, 3" 8 3 1
"" "4, 5, 7" "6" 4 6 1
"" "" "" . . .
"" "4, 5, 7" "4" 6 4 1
"" "" "" . . .
"" "" "" . . .
"" "" "" . . .
"" "10" "2" 8 7 1
"" "3" "4, 5" 8 5 1
"" "" "" . . .
end
label values POSITION rolelbl
label def rolelbl 4 "Educator", modify
label def rolelbl 6 "Nurse Manager", modify
label def rolelbl 8 "Staff Nurse", modify
label values WORKED workyrlbl
label def workyrlbl 3 "3 - 5 years", modify
label def workyrlbl 4 "6 - 10 years", modify
label def workyrlbl 5 "11 - 15 years", modify
label def workyrlbl 6 "16 - 20 years", modify
label def workyrlbl 7 "More than 20 years", modify
label values RACE race
label def race 1 "White", modify

foreach v in INSTITUTIONADOPTED RACIALDISPARITY BIGGESTBARRIERS {
tabsplit `v', sort parse(,)
}


INSTITUTION |
ADOPTED | Freq. Percent Cum.
------------+-----------------------------------
10 | 1 20.00 20.00
11 | 1 20.00 40.00
12 | 1 20.00 60.00
5 | 1 20.00 80.00
9 | 1 20.00 100.00
------------+-----------------------------------
Total | 5 100.00

RACIALDISPA |
RITY | Freq. Percent Cum.
------------+-----------------------------------
7 | 3 33.33 33.33
4 | 2 22.22 55.56
5 | 2 22.22 77.78
10 | 1 11.11 88.89
3 | 1 11.11 100.00
------------+-----------------------------------
Total | 9 100.00

BIGGESTBARR |
IERS | Freq. Percent Cum.
------------+-----------------------------------
2 | 2 25.00 25.00
4 | 2 25.00 50.00
1 | 1 12.50 62.50
3 | 1 12.50 75.00
5 | 1 12.50 87.50
6 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00

Code:

gen foo2 = " " + subinstr(foo, ",", " ", .) + " "

Consistent with this response, I am using the code below to split my variables of interest and to generate the frequencies for the split variables. I am getting the error of: too few variables specified.

foreach v in ICD10s AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB {
tabsplit `v', sort parse(,)
}

Last edited by Nader Mehri; 14 Feb 2024, 20:05.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35213
#8

15 Feb 2024, 03:07

I guess you're running lines one by one from a do-file editor. So tabsplit can't see the local macro contents. See Stata tip 138: Local macros have local scope - Nicholas J. Cox, 2020 (sagepub.com)
1 like
Comment

Announcement