Convert dummy variables into a categorical variable

Lorenna Munnecom

Join Date: Oct 2017
Posts: 3

Convert dummy variables into a categorical variable

04 Oct 2017, 11:28

Hello,

I am new with Stata and I cannot find a solution to convert these variables into a unique categorical variable "Nationality" :
Here is an example of the dataset:

USA	UK	France	Germany	Spain	Australia	Canada	Russia
No	No	No	Yes	No	No	Yes	No
No	Yes	No	No	No	No	No	No
Yes	No	No	No	No	No	No	No
Yes	No	No	No	No	No	No	No
No	Yes	No	No	No	No	No	No
No	No	No	No	Yes	No	No	No
No	No	No	No	No	Yes	No	No

Does one of you have any suggestion on how to convert these variables into categorical variable as follow:

Nationality
1-USA
2-UK
3- France
4- Germany
5- Spain
6- Australia
7- Canada
8- Russia

Tags: categorical, convert, dummy, stata, variables

Clyde Schechter

Join Date: Apr 2014

Posts: 29688
#2

04 Oct 2017, 12:18

Not possible without additional information. Look at the first row of your table: both Germany and Canada are specified yes. In order to create a single nationality variable out of this, you would have to specify some sort of hierarchy among the different nations to determine which one to use in the case of multiple nationalities. Or you could allow a category "Multiple" nationalities in your variable. If you clarify how you want to handle this situation, coding will not be a problem.

When posting back, please show an actual Stata data example, not an HTML table, by using the -dataex- command. Run -ssc install dataex- to get the command and then run -help dataex- to read the simple instructions for its use. By using -dataex- you will enable those who are inclined to help out to make a complete and faithful replica of your Stata example with a simple copy/paste operation. HTML tables of the kind you show are sometimes difficult to bring in to Stata. And even when they go in easily, they do not show, for an example relevant to your situation, whether the variables are string variables or value-labeled numeric variables. (The code could be different in these two cases.)
Comment

Lorenna Munnecom

Join Date: Oct 2017
Posts: 3

04 Oct 2017, 12:44

Hi,

Thank you for the advice.
Here is the Stata data example using -dataex-.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
5 0 0 0 0 1 0 0 0
5 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 1 0
5 0 0 0 0 0 0 1 0
4 0 0 0 0 1 0 0 0
end
label values USA USA2
label def USA2 0 "No", modify
label values UK UK2
label def UK2 0 "No", modify
label values France France2
label def France2 0 "No", modify
label values Russia Russia2
label def Russia2 0 "No", modify
label values Australia Australia2
label def Australia2 0 "No", modify
label def Australia2 1 "Yes", modify
label values Germany Germany2
label def Germany2 0 "No", modify
label values Spain Spain2
label def Spain2 0 "No", modify
label def Spain2 1 "Yes", modify
label values Canada Canada2
label def Canada2 0 "No", modify

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29688

04 Oct 2017, 12:52

This should do it:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
5 0 0 0 0 1 0 0 0
5 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 1 0
5 0 0 0 0 0 0 1 0
4 0 0 0 0 1 0 0 0
end
label values USA USA2
label def USA2 0 "No", modify
label values UK UK2
label def UK2 0 "No", modify
label values France France2
label def France2 0 "No", modify
label values Russia Russia2
label def Russia2 0 "No", modify
label values Australia Australia2
label def Australia2 0 "No", modify
label def Australia2 1 "Yes", modify
label values Germany Germany2
label def Germany2 0 "No", modify
label values Spain Spain2
label def Spain2 0 "No", modify
label def Spain2 1 "Yes", modify
label values Canada Canada2
label def Canada2 0 "No", modify

//    VERIFY ONLY ONE NATIONALITY SELECTED PER OBSERVATION
egen n_count = rowtotal(USA-Canada)
assert n_count <= 1
drop n_count

//    COMBINE TO A SINGLE NATIONALITY VARIABLE
gen nationality = .
local i = 1
foreach v of varlist USA-Canada {
    replace nationality = `i' if `v' == 1
    label define nationality `i' "`v'", add
    local ++i
}
label values nationality nationality
list, noobs clean

Comment

Lorenna Munnecom

Join Date: Oct 2017

Posts: 3
#5

04 Oct 2017, 12:59

Thank you so much. It was exactly what I was looking for !
1 like
Comment
Regina Bock

Join Date: Jul 2018

Posts: 4
#6

26 Jul 2018, 03:56

Dear community,
I am currently dealing with a somewhat problematic dataset (at least for a beginner like me). And one of my problems is exactly the one that Lorenna had, but somehow the command does not help in the right way.

So in my example they asked in the questionnaire about the income in categories: "What is your daily wage?"

V6: more than 150 (1: quoted, 0: not quoted)
V7: 101-150 (1: quoted, 0: not quoted)
V8: 51-100 (1: quoted, 0: not quoted)
V9: 1-50 (1: quoted, 0: not quoted)
V10: no payment (1: quoted, 0: not quoted)

Now I need to combine these 5 variables (V6-V10) into one variable (V11: "Daily wage") in the same way like the example above. Now there are cases where people have selected multiple variables to this question. So I want to select the upper variable selected from the person and discard all the others (with the theoretical assumption that the first statement is closest to the true value)? So for example someone who choose "more than 150" and "101-150", so we assume that the "more than 150" is closer to the true value.
When I examine the command from Clyde Schechter the number of obs go down from 234 (in the old variables V6-V10) to 159 in the new variable (V11).
Also the people, who choose the 1 (quoted) in V6 for example should be the same amount of people like the first category of V11, but that isn't like that also.
What could be the mistake I made?
This is what I let STATA do:

gen V11 = .
local i = 1
foreach v of varlist V6-V10 {
replace V11 = `i' if `v' == 1
label define V11 `i' "`v'", add
local ++i
}
label values V11 V11
list, noobs clean

The second problem I have is that I need to combine dichotomous variables to one categorical variable where a multiple responses was useful.
Example: The question in the questionnaire was: "Which rooms do you use?" and this is what I see in the dataset:
V1: Kitchen (1: quoted, 0: not quoted)
V2: WC (1: quoted, 0: not quoted)
V3: living room (1: quoted, 0: not quoted)
V4: corridor (1: quoted, 0: not quoted).

Now I want to make out of these 4 variables (V1-V4) one categorical variable (let's call it V5 "used rooms") with the values 1 = kitchen, 2 = WC, 3 = living room and 4 = hallway. So that every "1" in one variable is a category in the new variable V5. But I don't want to overwrite the responses the person made and pick just one of them. I want to have all the responses in the categories. Is this even possible??

Does anyone have any suggestions? I would be very happy about any help!
Comment
Regina Bock

Join Date: Jul 2018

Posts: 4
#7

26 Jul 2018, 04:06

Oh, I'm sorry. I solved the first problem (about the difference in the number of obs) myself. It comes from the merging of the variables, because STATA only considers one of the responses and "delete" the others. But now it takes the last made response - how can I manage it so STATA takes the first made response?
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#8

26 Jul 2018, 08:23

So if someone said "kitchen" (V1=1) and "living room" (V3=1), what do you want V5 to look like?

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Regina Bock

Join Date: Jul 2018

Posts: 4
#9

26 Jul 2018, 14:35

So I want, that someone who says "kitchen" (V1=1) is counted by kitchen but also by living room.
So that I have one variable V5 with 4 categories.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#10

26 Jul 2018, 14:47

I’m just not understanding this. A single observation has v1=1 & v3=1. You want V5 that takes values 1,2,3, or 4. So what value should this observation have for v5? Or do you want v5 to look like this “1010”.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment

Adam Ross Nelson

Join Date: Aug 2015
Posts: 37

#11

26 Nov 2020, 21:37

I'm giving this excellent post a bump. To ask: is there an instal that can do this automatically? It is a common task. And, it is essentially reversing the generate option following a tabulate command.

Thoughts?

Originally posted by Clyde Schechter View Post

This should do it:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
5 0 0 0 0 1 0 0 0
5 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 1 0
5 0 0 0 0 0 0 1 0
4 0 0 0 0 1 0 0 0
end
label values USA USA2
label def USA2 0 "No", modify
label values UK UK2
label def UK2 0 "No", modify
label values France France2
label def France2 0 "No", modify
label values Russia Russia2
label def Russia2 0 "No", modify
label values Australia Australia2
label def Australia2 0 "No", modify
label def Australia2 1 "Yes", modify
label values Germany Germany2
label def Germany2 0 "No", modify
label values Spain Spain2
label def Spain2 0 "No", modify
label def Spain2 1 "Yes", modify
label values Canada Canada2
label def Canada2 0 "No", modify

// VERIFY ONLY ONE NATIONALITY SELECTED PER OBSERVATION
egen n_count = rowtotal(USA-Canada)
assert n_count <= 1
drop n_count

// COMBINE TO A SINGLE NATIONALITY VARIABLE
gen nationality = .
local i = 1
foreach v of varlist USA-Canada {
replace nationality = `i' if `v' == 1
label define nationality `i' "`v'", add
local ++i
}
label values nationality nationality
list, noobs clean

Comment

Adam Ross Nelson

Join Date: Aug 2015
Posts: 37

#12

24 Apr 2022, 14:08

Another techniqe that may be useful on this (from Stata tech support folks)...

- The egen group() command can reverse this operation but it will not check for
rows with all zeros or rows with multiple entries:
- Note, it will assign new values to the categories so that they will not match
the original values.

Code:

. sysuse auto, clear
(1978 automobile data)

. keep rep

. tab rep78, gen(reps)

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

. egen x = group(reps*)
(5 missing values generated)

. list in 1/10 , clean

       rep78   reps1   reps2   reps3   reps4   reps5   x 
  1.       3       0       0       1       0       0   3 
  2.       3       0       0       1       0       0   3 
  3.       .       .       .       .       .       .   . 
  4.       3       0       0       1       0       0   3 
  5.       4       0       0       0       1       0   2 
  6.       3       0       0       1       0       0   3 
  7.       .       .       .       .       .       .   . 
  8.       3       0       0       1       0

Announcement

Convert dummy variables into a categorical variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment