Generating a New Variable based on Conditional IF Statements

Pat Burns

Join Date: May 2018

Posts: 49
#1

Generating a New Variable based on Conditional IF Statements

26 Mar 2019, 07:15

Hello STATA Experts: I am trying to create a new variable based on the existence of certain conditions in two existing variables (see code below). It appears to be dropping most of the cases for "Fathers". Not sure what is wrong here. Also, how is this treating missing variables?
Any help would be appreciated. Pat

generate parentfigure=.
replace parentfigure=0 if X1HPAR1==2 | X1HPAR2==2
replace parentfigure=1 if X1HPAR1==1
label define parentfigurelab 0 "Father" 1 "Mother"
label values parentfigure parentfigurelab
label variable parentfigure "Parent Figure"
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35193
#2

26 Mar 2019, 08:07

Let's rewrite your definitions to answer the question. cond() undeservedly has a reputation of being a difficult function, but once you know how to read it and write it, all difficulties evaporate, and it has the advantage that you can make all your rules explicit.

Code:

gen parentfigure = cond(X1HPAR1 == 1, 1, cond(X1HPAR1 == 2 | X1HPAR2 == 2, 0, .))

So that

if X1HPAR1 equals 1, new variable is 1
otherwise if X1HPAR1 equals 2 or X1HPAR2 equals 2, new variable is 0
otherwise new variable is missing.

These rules imply that

* if X1HPAR1 is missing, then the new variable is missing.

* if X1HPAR2 is missing, then all depends on X1HPAR1 -- but that if that is also missing, or otherwise not 1 or 2, then the new variable is missing.

"dropping most of the cases" presumably doesn't mean in your mind dropping most of the observations as no generate or replace or label command has as a direct or indirect effect observations being dropped. I guess you mean that many resulting values are missing.

To see what is going on

Code:

tab X1HPAR1 X1HPAR2, missing

shows missings explicitly.
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#3

26 Mar 2019, 10:29

Hi Nick: Thank you so much. I produced the same results with my commands that you did. I think I am not asking the right question. I have two variables (shown below) X1HPAR1 and X1HPAR2. I want to create a new variable that consists of only the Biological Mothers and the Biological Fathers. So I want the values where X1HPAR1 == 1 for the Biological Mothers (12,544 cases) and where X1HPAR1==2 and X1HPAR2==2 for the Biological Fathers (201 + 9,136= 9,337 cases). I only get 327 cases for the Biological Father? I am not sure where the rest of them are. I hope I am making sense and as always, appreciate your help.
Best, Pat

. tab X1HPAR1
X1 RELATIONSHIP OF PRNT 1 TO FOCAL |
CHILD | Freq. Percent Cum.
--------------------------------------+-----------------------------------
1: BIOLOGICAL MOTHER | 12,544 93.62 93.62
2: BIOLOGICAL FATHER | 201 1.50 95.12
3: ADOPTIVE MOTHER | 231 1.72 96.84
4: ADOPTIVE FATHER | 4 0.03 96.87
5: STEP-MOTHER | 66 0.49 97.37
7: FOSTER MOTHER | 32 0.24 97.60
9: OTHER FEMALE PARENT OR GUARDIAN | 14 0.10 97.71
10: OTHER MALE PARENT OR GUARDIAN | 1 0.01 97.72
11: FEMALE PARTNER OF CHILD'S PARENT | 52 0.39 98.10
13: FEMALE PARENT FIGURE UNKNOWN TYPE | 7 0.05 98.16
14: MALE PARENT FIGURE UNKNOWN TYPE | 1 0.01 98.16
15: NO RESIDENT PARENT 1 | 246 1.84 100.00
--------------------------------------+-----------------------------------
Total | 13,399 100.00

. tab X1HPAR2

X1 RELATIONSHIP OF PRNT 2 TO FOCAL |
CHILD | Freq. Percent Cum.
--------------------------------------+-----------------------------------
-9: NOT ASCERTAINED | 1 0.01 0.01
-1: NOT APPLICABLE | 3,034 22.64 22.65
2: BIOLOGICAL FATHER | 9,136 68.18 90.84
3: ADOPTIVE MOTHER | 5 0.04 90.87
4: ADOPTIVE FATHER | 228 1.70 92.57
6: STEP-FATHER | 445 3.32 95.90
7: FOSTER MOTHER | 1 0.01 95.90
8: FOSTER FATHER | 18 0.13 96.04
9: OTHER FEMALE PARENT OR GUARDIAN | 2 0.01 96.05
10: OTHER MALE PARENT OR GUARDIAN | 16 0.12 96.17
11: FEMALE PARTNER OF CHILD'S PARENT | 11 0.08 96.25
12: MALE PARTNER OF CHILD'S PARENT | 338 2.52 98.78
14: MALE PARENT FIGURE UNKNOWN TYPE | 5 0.04 98.81
15: NO RESIDENT PARENT 2 | 159 1.19 100.00
--------------------------------------+-----------------------------------
Total | 13,399 100.00
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35193
#4

26 Mar 2019, 13:12

I am not familiar with this dataset at all and am not really clear what you seek. Rather than make guesses I would rather leave the field open to people who are.
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#5

26 Mar 2019, 15:28

Okay thank you. I am working on the ECLS-K:2011 dataset sponsored by the National Center for Ed Statistics.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9942
#6

26 Mar 2019, 16:40

. I have two variables (shown below) X1HPAR1 and X1HPAR2. I want to create a new variable that consists of only the Biological Mothers and the Biological Fathers. So I want the values where X1HPAR1 == 1 for the Biological Mothers (12,544 cases) and where X1HPAR1==2 and X1HPAR2==2 for the Biological Fathers (201 + 9,136= 9,337 cases). I only get 327 cases for the Biological Father? I am not sure where the rest of them are.

Your output from tabulating the variables suggests that your data is in wide layout and each observation is a child. Assuming that a child has two parents present, "X1HPAR1" tells you the relationship of parent 1 and "X1HPAR2" tells you the relationship of the second parent. To achieve what you want, you will need to reshape your data first. See

Code:

help reshape long

or run, copy and paste the result of

Code:

dataex in 1/5

for specific advice.
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#7

27 Mar 2019, 06:24

Hi Andrew:
I just checked and the data is in long form. Each child is an observation with over 1000 data (demographics, assessment results) per child. I know there are 9,337 cases of biological fathers, but when I try to create a new variable of just fathers, using the above STATA code, I get only 327 fathers. I will keep working on it. Thank you for your input.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9942

27 Mar 2019, 06:38

I think this will clear up the apparent mystery

Code:

expand 2, gen(x)
gen X1HPAR= cond(x==0, X1HPAR1, X1HPAR2)
gen parentfigure = cond(X1HPAR == 1, 1, cond(X1HPAR == 2, 0, .))
label define parentfigure 1 "BIOLOGICAL MOTHER" 0 "BIOLOGICAL FATHER"
label values parentfigure parentfigure
tab X1HPAR

Comment

Pat Burns

Join Date: May 2018

Posts: 49
#9

27 Mar 2019, 09:33

Andrew: First thank you; that worked! Can you explain the first two lines of your code? I am unfamiliar with "expand 2, gen(x)". Thanks again.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9942
#10

27 Mar 2019, 14:06

This is documented in

Code:

help expand

expand replaces each observation in the dataset with n copies of the
observation, where n is equal to the required expression rounded to the
nearest integer. If the expression is less than 1 or equal to missing,
it is interpreted as if it were 1, and the observation is retained but
not duplicated.

generate(newvar) creates new variable newvar containing 0 if the
observation originally appeared in the dataset and 1 if the
observation is a duplicate. For instance, after an expand, you could
revert to the original observations by typing keep if newvar==0.

So I am duplicating each observation (expand2) and at the same time creating an indicator equal to zero for the original set of observations. Then I can create a variable that combines "X1HPAR1" and "X1HPAR2", i.e., "X1HPAR". This is not the conventional way to do this, I still refer you again to post #6 and reshape long. However, it illustrates my point that your data is in wide layout.
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#11

30 Mar 2019, 06:27

Andrew:
Thank you for clearing this up for me. You are correct in that my data is in wide layout.... I now know what that means! I am new to STATA as you can tell and appreciate you sharing your expertise. Pat
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#12

06 May 2019, 14:44

Andrew: A while back, you gave me some advice regarding transforming data from wide to long format. I have done some work with the reshape long command and can now see its value.
In relation to post #10 above, I reshaped my data and now have two records for each CHILDID, reflecting Parent 1 and Parent 2
reshape long X1HPAR, i(CHILDID) j(parentfigure)
(note: j = 1 2)

Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 14235 -> 28470
Number of variables 181 -> 181
j variable (2 values) -> parentfigure
xij variables:
X1HPAR1 X1HPAR2 -> X1HPAR
-----------------------------------------------------------------------------

. tab X1HPAR

X1HPAR | Freq. Percent Cum.
--------------------------------------+-----------------------------------
-1: NOT APPLICABLE | 2,954 11.32 11.32
1: BIOLOGICAL MOTHER | 12,223 46.85 58.18
2: BIOLOGICAL FATHER | 9,088 34.84 93.01
3: ADOPTIVE MOTHER | 227 0.87 93.88
4: ADOPTIVE FATHER | 225 0.86 94.74
5: STEP-MOTHER | 64 0.25 94.99
6: STEP-FATHER | 432 1.66 96.65
7: FOSTER MOTHER | 32 0.12 96.77
8: FOSTER FATHER | 17 0.07 96.83
9: OTHER FEMALE PARENT OR GUARDIAN | 16 0.06 96.90
10: OTHER MALE PARENT OR GUARDIAN | 16 0.06 96.96
11: FEMALE PARTNER OF CHILD'S PARENT | 60 0.23 97.19
12: MALE PARTNER OF CHILD'S PARENT | 330 1.26 98.45
13: FEMALE PARENT FIGURE UNKNOWN TYPE | 6 0.02 98.47
14: MALE PARENT FIGURE UNKNOWN TYPE | 6 0.02 98.50
15: NO RESIDENT PARENT 2 | 392 1.50 100.00
--------------------------------------+-----------------------------------
Total | 26,088 100.00

.

I have since created a number of new variables based on the value of X1HPAR1. For example, BioMother Highest Education:
gen MOMED = X12PAR1ED_I if X1HPAR == 1
(16,247 missing values generated)

BioFather's Age, BioMother's Work status, etc. When I try to determine the "missing" cases for each of these variables, it is overstated because I have essentially two records for each child now... # observations is doubled. I don't want to revert back to the wide format because I then lose the variables I created. Is there way to get at missing cases for these newly created variables, as I would like to try to impute with max likelihood or multiple imputation.
Also, variables that already exist in the database are also doubled ... does that matter? Will it matter for regression analysis. Hoping that these questions make sense.
Many thanks, Pat
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9942
#13

06 May 2019, 15:09

You can use the if qualifier both in determining the missing cases and running the regressions, no need to revert to wide layout. For example,

Code:

*BIOLOGICAL MOTHERS list if missing(VARNAME) & parentfigure *BIOLOGICAL FATHERS list if missing(VARNAME2) & !parentfigure *BIOLOGICAL MOTHERS regress VARS if parentfigure
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#14

06 May 2019, 15:41

Andrew: Thanks... I get the first two. How does the last line of code work?
Comment
Pat Burns

Join Date: May 2018

Posts: 49
#15

06 May 2019, 15:43

Actually, do you mean to have the !parentfigure in line 1 as well?
Comment

Announcement

Generating a New Variable based on Conditional IF Statements

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment