Combining dichotomous variables to categorical ones

Regina Bock

Join Date: Jul 2018

Posts: 4
#1

Combining dichotomous variables to categorical ones

25 Jul 2018, 15:46

Dear community, I am currently dealing with a somewhat problematic dataset (at least for a beginner like me).
So there are many dichotomous variables in this dataset. I'll give you two examples: Example 1: The question in the questionnaire was: "Which rooms do you use?" and this is what I see in the dataset: V1: Kitchen (1: quoted, 0: not quoted) V2: WC (1: quoted, 0: not quoted) V3: living room (1: quoted, 0: not quoted) V4: corridor (1: quoted, 0: not quoted). Now I want to make out of these 4 variables (V1-V4) one categorical variable (let's call it V5 "used rooms") with the values 1 = kitchen, 2 = WC, 3 = living room and 4 = hallway. So that every "1" in one variable is a category in the new variable V5.
So my first question is: With which command is it possible to merge these variables in the way I described above? Example 2: They asked in the questionnaire about the income in categories: "What is your daily wage?" V6: more than 150 (1: quoted, 0: not quoted) V7: 101-150 (1: quoted, 0: not quoted) V8: 51-100 (1: quoted, 0: not quoted) V9: 1-50 (1: quoted, 0: not quoted) V10: no payment (1: quoted, 0: not quoted) Now I need to combine these 5 variables (V6-V10) into one variable (V11: "Daily wage") in the same way like the example above. But now there is the problem that there are cases where people have selected multiple variables to this question. So my second question is: Is there a command to tell STATA, for example, to select the upper variable selected from the person and discard all the others (with the theoretical assumption that the first statement is closest to the true value)? So for example someone who choose "more than 150" and "101-150", so we assume that the "more than 150" is closer to the true value. Does anyone have a solution to this dilemma or can help me somehow? I am happy about any considerations! THANK YOU!!!
Tags: None

Carole J. Wilson

Join Date: Jan 2015
Posts: 932

25 Jul 2018, 16:58

The following previous post addresses the problem you encounter in Example 1:

https://www.statalist.org/forums/for...rical-variable

Re-purposing Clyde Schechter's code for your situation yields the following code:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(v1 v2 v3 v4)
1 0 0 0
1 0 0 0
0 0 1 0
0 0 1 0
1 0 0 0
end

//    VERIFY ONLY ONE OPTION SELECTED PER OBSERVATION
egen n_count = rowtotal(v1-v4)
assert n_count <= 1
drop n_count

//    COMBINE TO A SINGLE VARIABLE
gen v5 = .
local i = 1
foreach v of varlist v1-v4 {
    replace v5 = `i' if `v' == 1
    local ++i
}
list, noobs clean

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1

Comment

Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

25 Jul 2018, 17:09

Actually, the above code works perfectly for your Example 2 as well because it loops through the variables and replaces earlier value with later values if there is more than one. You will need to remove the section to verify only one option selected.

EXAMPLE 2

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(v6 v7 v8 v9) 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 end // COMBINE TO A SINGLE VARIABLE gen v10 = . local i = 1 foreach v of varlist v6-v9 { replace v10 = `i' if `v' == 1 local ++i } list, noobs clean

If you have a situation where you have multiple responses in which you DON'T want later values recorded, DO NOT run this code. For example, in Example 1, if the respondent could mention more than one room, then if they select both kitchen and hallway, you will only see hallway in the new variable.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment

Natasha Chhabra

Join Date: Jan 2025
Posts: 2

18 Jan 2025, 10:11

Hi all, I have a similar problem but this code didn't work for me. I am trying to create a new variable using a set of 17 dichotomous variables. In my questionnaire, the respondent could select multiple options. In my final categorical variable, I don't mind if they chose more than one option and so it seems pretty straightforward but my code with gen (), replace is not giving me the accurate numbers in the end and I understand this has to do with folks choosing more than one option. Is there a way to create a new categorical variable that reflects the 1s from each of these variables? The following is what my dataex output looks like :

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(YL_Q3_01 YL_Q3_02 YL_Q3_04 YL_Q3_05 YL_Q3_06 YL_Q3_07 YL_Q3_08 YL_Q3_09 YL_Q3_10 YL_Q3_11 YL_Q3_12 YL_Q3_13 YL_Q3_14 YL_Q3_15 YL_Q3_16 YL_Q3_17)
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
end
label values YL_Q3_01 LABV
label values YL_Q3_02 LABV
label values YL_Q3_04 LABV
label values YL_Q3_05 LABV
label values YL_Q3_06 LABV
label values YL_Q3_07 LABV
label values YL_Q3_08 LABV
label values YL_Q3_09 LABV
label values YL_Q3_10 LABV
label values YL_Q3_11 LABV
label values YL_Q3_12 LABV
label values YL_Q3_13 LABV
label values YL_Q3_14 LABV
label values YL_Q3_15 LABV
label values YL_Q3_16 LABV
label values YL_Q3_17 LABV
label def LABV 0 "No", modify
label def LABV 1 "Yes", modify

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35059
#5

18 Jan 2025, 14:35

#4 doesn't explain your rules for your composite variable, nor does it show any of your code attempts. If you're just looking for any variable being 1 across observations, you can use egen's rowmax() function. If that's not the answer please give us more detail.
Comment
Natasha Chhabra

Join Date: Jan 2025

Posts: 2
#6

18 Jan 2025, 19:50

Thanks so much for your response! My question in the survey was what kind of internet user are you and the respondent could select multiple options. My current code below doesn't give me the right composite variable I'm looking for because the dichotomous variable for YL_Q1_1 is 15,334 yes/1. When I run the following code, I get 10,123 for YL_Q1_1. I'm assuming this is because I'm replacing respondents as I go along if they chose more than one option:

gen engaging_online = .
replace engaging_online = 1 if YL_Q1_1 == 1
replace engaging_online = 2 if YL_Q3_02 == 1
replace engaging_online = 3 if YL_Q3_03 == 1
replace engaging_online = 4 if YL_Q3_04 == 1
replace engaging_online = 5 if YL_Q3_05 == 1
replace engaging_online = 6 if YL_Q3_06 == 1
replace engaging_online = 7 if YL_Q3_07 == 1
replace engaging_online = 8 if YL_Q3_08 == 1
replace engaging_online = 9 if YL_Q3_09 == 1
replace engaging_online = 10 if YL_Q3_10 == 1
replace engaging_online = 11 if YL_Q3_11 == 1
replace engaging_online = 12 if YL_Q3_12 == 1
replace engaging_online = 13 if YL_Q3_13 == 1
replace engaging_online = 14 if YL_Q3_14 == 1
replace engaging_online = 15 if YL_Q3_15 == 1
replace engaging_online = 16 if YL_Q3_16 == 1
replace engaging_online = 17 if YL_Q3_17 == 1

I'm looking for a way to code it so I don't replace those respondents who chose more than one option. Thanks!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35059

19 Jan 2025, 03:51

Note first that your code returns the identifier of the last variable met with a value of 1. If you wanted that, the repetitive code could be slimmed down to

Code:

gen engaging_online = .
forval j = 1/17 {
    local J : di %02.0f `j'
    replace engaging_online = `j' if YL_Q3_`J' == 1
}

You don't want that, but please note for what follows that I am guessing that

Code:

YL_Q1_1

is a typo for

Code:

YL_Q3_01

If not, then your code needs to be more complicated than what follows.

There's an entire (but short) paper on this topic.

SJ-7-4 dm0034 . . . Stata tip 52: Generating composite categorical variables
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/07 SJ 7(4):582--583 (no commands)
tip on how to generate categorical variables using
tostring and egen, group()

https://journals.sagepub.com/doi/epd...867X0800700407

The top recommendation from that is

Code:

egen wanted = group(YL_Q3_??), label

where the varlist would need to include

Code:

YL_Q1_1

if as said that really is a name you're using.

EDIT

Looking back at #4 I see that you really do have YL_Q3_01 but that you don't have YL_Q3_03 at least in your data example. So #4 and #6 are inconsistent in detail. No matter. I think you need something a little more complicated than just above.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(YL_Q3_01 YL_Q3_02 YL_Q3_04 YL_Q3_05 YL_Q3_06 YL_Q3_07 YL_Q3_08 YL_Q3_09 YL_Q3_10 YL_Q3_11 YL_Q3_12 YL_Q3_13 YL_Q3_14 YL_Q3_15 YL_Q3_16 YL_Q3_17)
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
end
label values YL_Q3_01 LABV
label values YL_Q3_02 LABV
label values YL_Q3_04 LABV
label values YL_Q3_05 LABV
label values YL_Q3_06 LABV
label values YL_Q3_07 LABV
label values YL_Q3_08 LABV
label values YL_Q3_09 LABV
label values YL_Q3_10 LABV
label values YL_Q3_11 LABV
label values YL_Q3_12 LABV
label values YL_Q3_13 LABV
label values YL_Q3_14 LABV
label values YL_Q3_15 LABV
label values YL_Q3_16 LABV
label values YL_Q3_17 LABV
label def LABV 0 "No", modify
label def LABV 1 "Yes", modify

gen wanted = "" 

forval j = 1/17 {
    local J : di %02.0f `j'
    capture replace wanted = wanted + " `j'" if YL_Q3_`J' == 1 
}

replace wanted = trim(wanted)

egen group = group(wanted), label 

tab group, nola 

tab group

The point is that all the information is carried by the categories with value 1.

Results:

Code:

. tab group, nola 

group(wante |
         d) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          5       50.00       50.00
          2 |          1       10.00       60.00
          3 |          2       20.00       80.00
          4 |          1       10.00       90.00
          5 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

. 
. tab group 

group(wante |
         d) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          5       50.00       50.00
       1 14 |          1       10.00       60.00
         17 |          2       20.00       80.00
          2 |          1       10.00       90.00
       6 15 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

Last edited by Nick Cox; 19 Jan 2025, 04:35.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35059
#8

19 Jan 2025, 06:53

Following on #7 it is naturally possible that

1. You would prefer to see something less cryptic than say 1 14 or 6 15. That might mean using variable labels, or shorter versions of them.

2. There is a more natural or at least more helpful order to the composite variables, say from most popular on. If you want something different, again it's important to spell out what that would be.
Comment

Announcement