Hi, sorry if this question has already been asked; I haven't been able to find an answer.
I am working on a dataset that has about 50 variables and over 30,000 respondents. Most of the variables are dichotomous yes/no questions. I need to create a series of dummy variables that are related to each other. I want to check that the code I have used does what I need and, if not, find out what I should be doing instead.
Variables I need (with code I used to obtain them):
Variable for Depression Symptom 1 (yes/no dummy variable created from 3 yes/no questions where you need a yes to any)
gen byte ds1 = 0
replace ds1= 1 if Q35==1 | Q36==1 | Q37==1
replace ds1 = 0 if Q35==0 & Q36==0 & Q37==0
Variable for Depression Symptom 2 (yes/no dummy variable created from 6 yes/no questions where you need a yes to any)
gen byte ds2= 0
replace ds2 = 1 if Q39==1 | Q40==1 | Q43==1 | Q44==1 | Q47==1 | Q51==1
replace ds2 = 0 if Q39==0 & Q40==0 & Q43==0 & Q44==0 & Q47==0 & Q51==0
Variable for Depression Symptom 3 (yes/no dummy variable created from 2 yes/no questions where you need a yes to either)
gen byte ds3= 0
replace ds3 = 1 if Q48==1 | Q52==1
replace ds3 = 0 if Q48==0 & Q52==0
Variable for Depression (yes/no dummy variable created from yes to any of the above depression symptoms)
gen byte dep = 0
replace dep = 1 if ds1==1 | ds2==1 | ds3==1
replace dep= 0 if ds1==0 & ds2==0 & ds3==0
Variable for Bipolar (yes/no dummy variable created from 13 yes/no questions where you need a yes to any)
gen byte bp = 0
replace bp = 1 if Q66_1==1 | Q66_2==1 | Q66_3==1 | Q66_4==1 | Q66_5==1 | ///
Q66_6==1 | Q66_7 ==1 | Q66_8==1 | Q66_9==1 | Q66_10==1 | Q66_11==1 | ///
Q66_12==1 | Q66_13==1
replace bp = 0 if Q66_1==0 & Q66_2==0 & Q66_3==0 & Q66_4==0 & Q66_5==0 & ///
Q66_6==0 & Q66_7==0 & Q66_8==0 & Q66_9==0 & Q66_10==0 & Q66_11==0 & ///
Q66_12==0 & Q66_13==0
Variable for Psychosis (this is already only 1 yes/no question)
variable name: psych
Variable for Mental Health (yes/no dummy variable created from yes to depression, bipolar, or psychosis)
gen MH = 0
replace MH = 1 if ds1==1 | ds2==1 | ds3==1 | bp==1 | psych==1
replace MH = 0 if ds1==0 & ds2==0 & ds3==0 & bp==0 & psych==0
From what I understand, the MH variable I created above double counts individuals who reported symptoms for more than one diagnosis (e.g., someone who reported both bipolar and psychosis). For this variable I get 12,560 observations for "yes," but I am having a hard time conceptualizing what it has actually calculated.
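One check I could run to see whether anyone is really counted twice, as a sketch only: it assumes dep, bp, and psych are coded 0/1, and n_dx is a hypothetical name for the number of diagnoses each respondent has.

```stata
* Sketch only: n_dx is a hypothetical name
* rowtotal() adds across variables within each row and treats missing as 0
egen n_dx = rowtotal(dep bp psych)
* Each respondent is one row, so MH can only be 0 or 1 for a person,
* however many diagnoses (n_dx) that person has
tabulate n_dx MH, missing
```

If every row with n_dx of 1, 2, or 3 falls under MH equal to 1, then MH would seem to be counting people with at least one diagnosis, once each.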
I need to create a variable that counts the unduplicated participants with a MH diagnosis. I have attempted to do this with the following code:
gen yn_MH= 0
replace yn_MH = 1 if dep==1 & bp==0 & psych==0
replace yn_MH= 2 if bp==1 & psych==0 & dep==0
replace yn_MH = 3 if psych==1 & bp==0 & dep==0
replace yn_MH = 4 if dep==1 & bp==1 & psych==0
replace yn_MH = 5 if dep==1 & bp==1 & psych==1
replace yn_MH= 6 if dep==1 & bp==0 & psych==1
replace yn_MH = 7 if dep==0 & bp==1 & psych==1
replace yn_MH = 0 if dep==0 & bp==0 & psych==0
gen MH_uniq=yn_MH
recode MH_uniq (7=1) (6=1) (5=1) (4=1) (3=1) (2=1)
For the MH_uniq variable I get 10,730 observations for "yes." I'm not sure whether this is the best way to do it or whether it calculates what I actually need. I have been using the yn_MH variable to report those who have only depression, those who have depression and bipolar, those who have all three, and so on for all the possible combinations. Essentially, I need to know how many people are in each group/combination of MH diagnoses.
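An alternative I could try for the combinations, as a sketch only: it assumes dep, bp, and psych are coded 0/1, and mh_combo and mh_any are hypothetical names. I suspect missing values in psych might explain why my two counts differ, but that is a guess, so the sketch tabulates missing values explicitly.

```stata
* Sketch only: mh_combo and mh_any are hypothetical names
* Each combination gets its own code:
* 1 = dep only, 2 = bp only, 3 = dep + bp, 4 = psych only,
* 5 = dep + psych, 6 = bp + psych, 7 = all three, 0 = none
* The result is missing if any of the three inputs is missing
gen byte mh_combo = dep + 2*bp + 4*psych
tabulate mh_combo, missing

* Any diagnosis, left missing where mh_combo is missing
* (without the if, missing would count as greater than 0)
gen byte mh_any = mh_combo > 0 if !missing(mh_combo)

* Cross-check against MH to see where the two counts diverge
tabulate MH mh_any, missing
```

Rows where MH is 1 but mh_any is missing would be the respondents I need to look at more closely.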
Finally, I also need a variable for those who have depression symptom 1 or 2 (or both) but do not have symptom 3. I'm not sure how to do this. (I tried a similar approach to the one above, but the sum of those with Symptom 1/2 and those with Symptom 3 did not equal the count I got for those with only depression and no bipolar or psychosis.)
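My best guess at this last variable, as a sketch only: it assumes ds1, ds2, and ds3 are the 0/1 variables created above, and ds12_no3 is a hypothetical name.

```stata
* Sketch only: ds12_no3 is a hypothetical name
* 1 if symptom 1 or symptom 2 (or both) is present and symptom 3 is absent
gen byte ds12_no3 = (ds1 == 1 | ds2 == 1) & ds3 == 0
* The ds3 == 1 column should contain no cases with ds12_no3 == 1
tabulate ds12_no3 ds3
```

As I understand it, this count plus the count with ds3 equal to 1 should equal everyone with dep equal to 1, regardless of bipolar or psychosis, which may be why my earlier sums did not match the depression-only group.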
Thanks!!