Creating a dummy variable - new to Stata

William May

Join Date: Mar 2017

Posts: 5
#1


28 Mar 2017, 06:56

Hi everyone, Masters Economics student here, struggling with using Stata. Probably a very simple question to those that are competent in using Stata, but it's got me confused.

I have a dataset (British Household Panel Survey), with an independent variable "qmastat", which is an individual's self-reported marital status. The variable is numeric and the measurements are nominal. The values are as follows:

-9 = missing/wild
-8 = inapplicable
-2 = refused
-1 = not answered
0 = child under 16
1 = married
2 = living as couple
3 = widowed
4 = divorced
5 = separated
6 = never married
7 = civil partnership
8 = dissolved civil partnership

I am trying to create a new dummy variable called "married", which should take on the value of 1 if married (qmastat = 1), 0 if not married (qmastat = 2, 3, 4, 5, 6, 7 or 8), and . if missing (qmastat < 0).

I know that if I were to create a simpler dummy variable (such as gender), I would use the commands: generate female = 1 if qsex == 2 ; replace female = 0 if qsex == 1; replace female = . if qsex < 0.

Please could someone offer some advice with regards to how I would create a dummy variable for married? I'm guessing it would be something along the lines of generate married = 1 if qmastat = 1; replace married = 0 if qmastat > 1 ?

Also, if I wanted to create a similar variable for married or living as a couple (qmastat = 1 or 2), how would I go about programming that?

Many thanks in advance.

Will
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

28 Mar 2017, 06:59

-help recode-
Nick Cox

Join Date: Mar 2014

Posts: 35431
#3

28 Mar 2017, 07:02

Code:

gen married = qmastat == 1 if qmastat > 0

assigns 1 if married, 0 if known not to be married and missing otherwise.

Code:

gen married2 = qmastat == 1

lumps the unknowns in with the known non-marrieds.

You can do both of these with a single generate; no harm in the replace, but no need for it either.

Note that

Code:

search dummy variable

points to resources, e.g.

FAQ . . . . . . . . . . . . . . . . . . . . . . . Creating dummy variables
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Gould
7/16 How do I create dummy variables?
http://www.stata.com/support/faqs/data-management/creating-dummy-variables/
Guest
#4

28 Mar 2017, 07:06

Did you try:

gen d_married = 0 // Sets the dummy to 0 for every observation; appropriate if qmastat is known for all observations.
replace d_married = 1 if qmastat == 1 // Sets the dummy to 1 if an individual reported being married.

In general, Stata syntax needs == rather than = to test equality in an if condition.

Similar:

gen d_couple = 0
replace d_couple = 1 if qmastat == 1 | qmastat == 2

| reads as "or" in Stata syntax. Do not confuse it with & ("and"). For conditions with more values, also see the inlist() function via: help inlist
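
As a minimal sketch of the inlist() route mentioned above, run on made-up values of qmastat (the names d_couple_a and d_couple_b are invented for illustration):

```stata
* Hypothetical toy values of qmastat, not the BHPS data
clear
input float qmastat
1
2
4
-9
end

* inlist() returns 1 if the first argument equals any of the later ones, else 0
gen d_couple_a = inlist(qmastat, 1, 2)

* Variant that sets codes of 0 or below (non-response, child under 16) to missing
gen d_couple_b = inlist(qmastat, 1, 2) if qmastat > 0

list
```

The first version lumps non-response codes in with the zeros; the second keeps them apart as missing, which is usually safer when those codes mean "unknown".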
William May

Join Date: Mar 2017

Posts: 5
#5

28 Mar 2017, 08:37

Fantastic - thanks very much for the help everyone!

How would this change if I just wanted to create a dummy variable based on one measurement in the original variable? If I have a question in the dataset on highest academic qualification attained (numeric variable, nominal measurement), called qqfachi, which takes on the following measurements:

-9 = Missing/wild
-8 = Inapplicable
-7 = Proxy respondent
1 = Higher degree
2 = 1st degree
3 = Vocational
4 = Advanced level
5 = Ordinary level
6 = CSE
7 = No formal qualification

If I wanted to create a dummy called degree for those who hold a degree, would it be:

Code:
gen degree = 1 if qqfachi == 2
replace degree = 0 if qqfachi == 1 | > 2
replace degree = . if qqfachi < 0

Sorry for all the questions - only just beginning to get used to this. Many thanks in advance.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

28 Mar 2017, 09:08

Using the generate and replace approach, the following is clearest.

Code:

generate degree = 0
replace degree = 1 if qqfachi==2
replace degree = . if qqfachi<0

The code you provide will not work because your or condition is specified incorrectly. To correct your code,

Code:

gen degree = 1 if qqfachi == 2
replace degree = 0 if qqfachi == 1 | qqfachi > 2

And the last line is not needed because the generate command already sets degree to missing wherever qqfachi is not 2.
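
One caveat with a condition such as qqfachi > 2: Stata treats system missing (.) as larger than any number, so the condition is also true where qqfachi is missing. That does not matter if every non-response is stored as a negative code, but it does once such codes have been converted to missing. A minimal sketch on made-up values (degree2 is a name invented for illustration):

```stata
* Hypothetical toy values, including one negative code and one system missing
clear
input float qqfachi
1
2
7
-9
.
end

gen degree = 1 if qqfachi == 2
replace degree = 0 if qqfachi == 1 | qqfachi > 2
list   // the last observation gets degree = 0, because . > 2 is true

* Guarded one-line version: negative codes and system missing both stay missing
gen degree2 = qqfachi == 2 if qqfachi > 0 & !missing(qqfachi)
list
```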

Giulia Rap

Join Date: Mar 2018

Posts: 6
#7

18 Oct 2018, 06:53

Good afternoon to everyone,
I need your help with a similar problem. I have a panel dataset and I need to create a dummy variable for an economic shock. I cannot figure out how to create a dummy variable that takes the value 1 when current economic growth exceeds 3% lagged economic growth.

The following is an example of my dataset:

Code:

input str32 origin_name int year float growth_rate_origin str32 destination_name float(growth_rate_destination origin_gdp_growth destination_gdp_growth lgrowth_rate_origin)
"Afghanistan" 1980     . "Afghanistan"     . . .     .
"Afghanistan" 1981     . "Afghanistan"     . . .     .
"Afghanistan" 1982     . "Afghanistan"     . . .     .
"Afghanistan" 1983     . "Afghanistan"     . . .     .
"Afghanistan" 1984     . "Afghanistan"     . . .     .
"Afghanistan" 1985     . "Afghanistan"     . . .     .
"Afghanistan" 1986     . "Afghanistan"     . . .     .
"Afghanistan" 1987     . "Afghanistan"     . . .     .
"Afghanistan" 1988     . "Afghanistan"     . . .     .
"Afghanistan" 1989     . "Afghanistan"     . . .     .
"Afghanistan" 1990     . "Afghanistan"     . . .     .
"Afghanistan" 1991     . "Afghanistan"     . . .     .
"Afghanistan" 1992     . "Afghanistan"     . . .     .
"Afghanistan" 1993     . "Afghanistan"     . . .     .
"Afghanistan" 1994     . "Afghanistan"     . . .     .
"Afghanistan" 1995     . "Afghanistan"     . . .     .
"Afghanistan" 1996     . "Afghanistan"     . . .     .
"Afghanistan" 1997     . "Afghanistan"     . . .     .
"Afghanistan" 1998     . "Afghanistan"     . . .     .
"Afghanistan" 1999     . "Afghanistan"     . . .     .
"Afghanistan" 2000     . "Afghanistan"     . . .     .
"Afghanistan" 2001     . "Afghanistan"     . . .     .
"Afghanistan" 2002     . "Afghanistan"     . . .     .
"Afghanistan" 2003   8.7 "Afghanistan"   8.7 . .     .
"Afghanistan" 2004    .7 "Afghanistan"    .7 . .   8.7
"Afghanistan" 2005  11.8 "Afghanistan"  11.8 . .    .7
"Afghanistan" 2006   5.4 "Afghanistan"   5.4 . .  11.8
"Afghanistan" 2007  13.3 "Afghanistan"  13.3 . .   5.4
"Afghanistan" 2008   3.9 "Afghanistan"   3.9 . .  13.3
"Afghanistan" 2009  20.6 "Afghanistan"  20.6 . .   3.9
"Afghanistan" 2010   8.4 "Afghanistan"   8.4 . .  20.6
"Afghanistan" 2011   6.5 "Afghanistan"   6.5 . .   8.4
"Afghanistan" 2012    14 "Afghanistan"    14 . .   6.5
"Afghanistan" 2013   5.7 "Afghanistan"   5.7 . .    14
"Afghanistan" 2014   2.7 "Afghanistan"   2.7 . .   5.7
"Afghanistan" 2015     1 "Afghanistan"     1 . .   2.7
"Afghanistan" 2016   2.2 "Afghanistan"   2.2 . .     1
"Afghanistan" 2017   2.7 "Afghanistan"   2.7 . .   2.2
"Afghanistan" 2018   2.3 "Afghanistan"   2.3 . .   2.7
"Afghanistan" 2019     3 "Afghanistan"     3 . .   2.3
"Afghanistan" 2020   3.5 "Afghanistan"   3.5 . .     3
"Afghanistan" 2021     4 "Afghanistan"     4 . .   3.5
"Afghanistan" 2022   4.5 "Afghanistan"   4.5 . .     4
"Afghanistan" 2023     5 "Afghanistan"     5 . .   4.5
"Albania"     1980   2.7 "Albania"       2.7 . .     5
"Albania"     1981   5.7 "Albania"       5.7 . .   2.7
"Albania"     1982   2.9 "Albania"       2.9 . .   5.7
"Albania"     1983   1.1 "Albania"       1.1 . .   2.9
"Albania"     1984     2 "Albania"         2 . .   1.1
"Albania"     1985  -1.5 "Albania"      -1.5 . .     2
"Albania"     1986   5.6 "Albania"       5.6 . .  -1.5
"Albania"     1987   -.8 "Albania"       -.8 . .   5.6
"Albania"     1988  -1.4 "Albania"      -1.4 . .   -.8
"Albania"     1989   9.8 "Albania"       9.8 . .  -1.4
"Albania"     1990   -10 "Albania"       -10 . .   9.8
"Albania"     1991   -28 "Albania"       -28 . .   -10
"Albania"     1992  -7.2 "Albania"      -7.2 . .   -28
"Albania"     1993   9.6 "Albania"       9.6 . .  -7.2
"Albania"     1994   9.4 "Albania"       9.4 . .   9.6
"Albania"     1995   8.9 "Albania"       8.9 . .   9.4
"Albania"     1996   9.1 "Albania"       9.1 . .   8.9
"Albania"     1997 -10.9 "Albania"     -10.9 . .   9.1
"Albania"     1998   8.8 "Albania"       8.8 . . -10.9
"Albania"     1999  12.9 "Albania"      12.9 . .   8.8
"Albania"     2000   6.9 "Albania"       6.9 . .  12.9
"Albania"     2001   8.3 "Albania"       8.3 . .   6.9
"Albania"     2002   4.5 "Albania"       4.5 . .   8.3
"Albania"     2003   5.5 "Albania"       5.5 . .   4.5
end

I was thinking about

Code:

 gen lgdpgrowth=l.gdpgrowth

Code:

 forvalues i=1(1)8888{
gen dummy`i'=1 if lgrowth_rate_origin<=0,03*growth_rate_origin
replace dummy`i'=0 if lgrowth_rate_origin>0,03*growth_rate_origin
}

but it does not give me the right output.
Can someone help me, please?

Last edited by Giulia Rap; 18 Oct 2018, 06:56.


Andrew Musau

Join Date: Oct 2014

Posts: 10069
#8

18 Oct 2018, 07:55

For this kind of question, you are better off using Stata's panel data tools to account for missing years across panels. I assume that your absolute growth rate values are in percent (at least that is how it looks to me).

Code:

encode origin_name, gen(origin)
xtset origin year
gen wanted = growth_rate_origin >= (L.growth_rate_origin + 3) & !missing(L.growth_rate_origin)

ADDED IN EDIT: Your variables are stored as floats, so you may run into precision issues. You can thus also try

Code:

gen wanted= float(growth_rate_origin)>=(float(L.growth_rate_origin +3)) & !missing(L.growth_rate_origin)
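
One thing to watch with the expressions above: Stata treats a missing value as larger than any number, so an observation where growth_rate_origin itself is missing but its lag is not would be flagged as 1. A minimal sketch of a guarded variant on made-up data (the names id, growth and shock are invented for illustration), which returns missing whenever either value is unknown:

```stata
* Hypothetical toy panel to check the logic; not the poster's data
clear
input float(id year growth)
1 2000 1
1 2001 5
1 2002 .
1 2003 2
2 2000 4
2 2001 4.5
end
xtset id year

* 1 if growth rose by at least 3 points, 0 if it did not,
* and missing when the current or the lagged value is unknown
gen shock = growth >= L.growth + 3 if !missing(growth, L.growth)

list, sepby(id)
```

Whether an unknown lag should give 0, as in the code above, or missing, as in this sketch, is a judgment call that depends on the analysis.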

Last edited by Andrew Musau; 18 Oct 2018, 08:27.
Giulia Rap

Join Date: Mar 2018

Posts: 6
#9

18 Oct 2018, 08:47

Thank you, Andrew, for your quick reply and help.
Yes, I have annual percentage growth rates taken from the IMF database.

Thank you very much, it was much easier than I thought!
Imesh Waasala

Join Date: Nov 2018

Posts: 6
#10

15 Nov 2018, 13:19

Hi everyone,

Can you give me some advice on creating dummy variables?

This is a panel data set covering the years 2000-2010. The month in which the data were collected differs from year to year. I added the month of data collection to every year using

Code:

gen data_month=6 if Year==2010

and so on.

Now I want to create dummy variables and assign "0" to the months in which data were not collected.

Can you please advise?

Many thanks!!
Chris Boulis

Join Date: Feb 2019

Posts: 362
#11

15 Jan 2020, 20:11

Hi Statalist. I am trying to generate a dummy for a categorical variable which is similar to the original post in this thread in that it relates to a survey question which provides the respondent with many options from which to respond. In my example, the question asks if the respondent is religious or not, and if so, to state their religion. As such, there are very many options for different religions and one option for "no religion". Each response option is given values, for example "1000 Buddhist", "2010 Anglican" "2030 Baptist" ...... "2330 Uniting Church" .... "3000 Hinduism" "4000 Islam" "5000 Judaism". In all there are about 30 response options, of which one is "7000 No Religion". (I note that I only have positive values as I removed the negative values associated with non-response).

How could I generate a dummy that '==1' for Religion and '==0' for No Religion? In this respect there are many values associated with not being aligned with one of the many listed religions and even after reading all the posts I could find on Statalist, I did not find a solution to this specific problem (not to say there isn't one). I am confident my attempt is flawed (see below)

Code:

gen relig=0 if religb==7000
replace relig=1 if religb!=7000

Any help is appreciated. Regards, Chris
Melanie Boekholt

Join Date: Feb 2019

Posts: 7
#12

16 Jan 2020, 04:11

Hi Chris,

Code:

gen relig = 1
replace relig = 0 if religb == 7000

This should work for you if I understood your plan correctly and you dropped all non-responses.
Nick Cox

Join Date: Mar 2014

Posts: 35431
#13

16 Jan 2020, 06:25

We can't be clear what the answer to #11 is because there is no data example. Taken literally, the explanation is that "7000 No Religion" is a value of the variable, which if correct implies that the variable concerned is string, in which case

Code:

gen wanted = religb != "7000 No Religion"

is a way to do it. On the other hand, Melanie Boekholt is guessing that you don't mean what you say and that "7000 No Religion" is a value label, in which case her code will work if and only if 7000 is the corresponding numeric value.

Either way, note that

Code:

gen wanted = <true or false condition>

is a concise way to get values of 1 if the stated condition is true and 0 if it is false. For more, see

https://www.stata.com/support/faqs/d...rue-and-false/

https://www.stata.com/support/faqs/d...mmy-variables/

https://journals.sagepub.com/doi/ful...36867X19830921

For how to give a decent data example please see as always https://www.statalist.org/forums/help#stata
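
To make the string versus value-label distinction above concrete, here is a minimal sketch on made-up data. The variable religb_str, the label name religb_lbl and the names wanted_num and wanted_str are all invented for illustration; only religb and the code 7000 come from the thread.

```stata
* Hypothetical toy data illustrating the two cases discussed above
clear
input float religb str20 religb_str
1000 "1000 Buddhist"
7000 "7000 No Religion"
2010 "2010 Anglican"
end
label define religb_lbl 1000 "Buddhist" 2010 "Anglican" 7000 "No Religion"
label values religb religb_lbl

* Numeric variable with value labels: compare against the numeric code
gen wanted_num = religb != 7000

* String variable: compare against the exact text, including case and spaces
gen wanted_str = religb_str != "7000 No Religion"

describe religb religb_str   // the storage type shows which case applies
list, nolabel
```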
Chris Boulis

Join Date: Feb 2019

Posts: 362
#14

16 Jan 2020, 18:05

Thank you Melanie Boekholt that looks correct. Yes I dropped all non-responses (#11). Just the reverse of what I did - I guess I should have figured that out. ...with time and practice I guess....

Nick Cox, my apologies for not providing more information. I've tried to capture the list of response options in the attached .png file as shown to me when I tabulate this variable. As you can see the values I listed in #11 with each response option are value labels (my apologies for not clarifying). After reading your links in #13, I thought I should note this is a survey question from the Hilda survey and many respondents do not answer it. Their responses are recorded as non-responses and the value labels for the various non-response options take on negative values which I've removed and are therefore not listed below. I hope this clarifies my post in #11. Thank you again.
Chris Boulis

Join Date: Feb 2019

Posts: 362
#15

28 Jan 2020, 17:04

After creating a dummy based on #12, 'relig' had four times the observations (176,564) compared to when I tabulate 'religb' (41,031) on which 'relig' is based. This is in spite of removing the non-responses via

Code:

foreach var of varlist _all {
    replace `var' = . if `var' < 0
}

#14 shows the list of value labels associated with this variable of which there are 41,031 observations. There are 135,533 missing values (non-responses) and it seems that based on the code in #12 the dummy variable (relig) includes these. [added] I now realise that I didn't remove the non-responses, I just converted them to '.' - missings.
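
A side note on the loop over _all in this post: if the dataset contains any string variable, comparing it with 0 stops with a type mismatch error. A minimal sketch that restricts the loop to numeric variables, run on made-up data (the variable state is invented for illustration):

```stata
* Hypothetical toy data: one numeric and one string variable
clear
input float religb str10 state
1000 "NSW"
-4   "VIC"
7000 "QLD"
end

* ds leaves the names of the numeric variables in r(varlist)
ds, has(type numeric)
foreach var of varlist `r(varlist)' {
    replace `var' = . if `var' < 0
}

list
```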

So to remove the 'missings' I added the following code to #12 (below) and it seems to have addressed the problem.

Code:

gen relig = 1 if religb !=.
replace relig = 0 if religb == 7000 & religb !=.

Can someone please confirm that this is the correct approach?
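
One way to check this, sketched on made-up data rather than the HILDA file, is to compare the two-step code with the one-line form and let assert flag any disagreement. The name relig2 is invented for illustration.

```stata
* Hypothetical toy data; . stands for a converted non-response
clear
input float religb
1000
7000
.
2330
end

* Two-step version, as in this post (the extra & religb != . is not needed)
gen relig = 1 if religb != .
replace relig = 0 if religb == 7000

* One-line equivalent using a true-or-false condition
gen relig2 = religb != 7000 if !missing(religb)

* Stops with an error if the two versions differ in any observation
assert relig == relig2

* Cross-tabulation including missing values, to see how each code was mapped
tabulate religb relig, missing
```

On these toy values the two versions agree, and observations with missing religb stay missing in the dummy instead of being counted as religious.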

Regards, Chris