  • Question about combining variables and getting unique counts

    Hi, sorry if this question has already been asked, I haven't been able to find an answer.

    I am working on a dataset that has about 50 variables and over 30,000 respondents. Most of the variables are dichotomous yes/no questions. I need to create a series of dummy variables that are related to each other. I want to make sure the coding I have used is correct for what I need and if not, what I should be doing instead.

    Variables I need (with code I used to obtain them):

    Variable for Depression Symptom 1 (yes/no dummy variable created from 3 yes/no questions where you need a yes to any)
    gen byte ds1 = 0
    replace ds1= 1 if Q35==1 | Q36==1 | Q37==1
    replace ds1 = 0 if Q35==0 & Q36==0 & Q37==0

    Variable for Depression Symptom 2 (yes/no dummy variable created from 6 yes/no questions where you need a yes to any)
    gen byte ds2= 0
    replace ds2 = 1 if Q39==1 | Q40==1 | Q43==1 | Q44==1 | Q47==1 | Q51==1
    replace ds2 = 0 if Q39==0 & Q40==0 & Q43==0 & Q44==0 & Q47==0 & Q51==0

    Variable for Depression Symptom 3 (yes/no dummy variable created from 2 yes/no questions where you need a yes to either)
    gen byte ds3= 0
    replace ds3 = 1 if Q48==1 | Q52==1
    replace ds3 = 0 if Q48==0 & Q52==0

    Variable for Depression (yes/no dummy variable created from yes to any of the above depression symptoms)
    gen byte dep = 0
    replace dep = 1 if ds1==1 | ds2==1 | ds3==1
    replace dep= 0 if ds1==0 & ds2==0 & ds3==0

    Variable for Bipolar (yes/no dummy variable created from 13 yes/no questions where you need a yes to any)
    gen byte bp = 0
    replace bp = 1 if Q66_1==1 | Q66_2==1 | Q66_3==1 | Q66_4==1 | Q66_5==1 | ///
    Q66_6==1 | Q66_7 ==1 | Q66_8==1 | Q66_9==1 | Q66_10==1 | Q66_11==1 | ///
    Q66_12==1 | Q66_13==1
    replace bp = 0 if Q66_1==0 & Q66_2==0 & Q66_3==0 & Q66_4==0 & Q66_5==0 & ///
    Q66_6==0 & Q66_7==0 & Q66_8==0 & Q66_9==0 & Q66_10==0 & Q66_11==0 & ///
    Q66_12==0 & Q66_13==0

    Variable for Psychosis (this is already only 1 yes/no question)
    variable name: psych

    Variable for Mental Health (yes/no dummy variable created from yes to depression, bipolar, or psychosis)
    gen MH = 0
    replace MH = 1 if ds1==1 | ds2==1 | ds3==1 | bp==1 | psych==1
    replace MH = 0 if ds1==0 & ds2==0 & ds3==0 & bp==0 & psych==0

    From what I understand, the MH variable I created above double counts individuals that have reported symptoms for more than one diagnosis (i.e. could have reported bipolar and psychosis). For this variable I get 12,560 observations for "yes." But, I am having a hard time conceptualizing what this variable has actually calculated.

    I need to create a variable that counts the unduplicated participants with a MH diagnosis. I have attempted to do this with the following code:
    gen yn_MH= 0
    replace yn_MH = 1 if dep==1 & bp==0 & psych==0
    replace yn_MH= 2 if bp==1 & psych==0 & dep==0
    replace yn_MH = 3 if psych==1 & bp==0 & dep==0
    replace yn_MH = 4 if dep==1 & bp==1 & psych==0
    replace yn_MH = 5 if dep==1 & bp==1 & psych==1
    replace yn_MH= 6 if dep==1 & bp==0 & psych==1
    replace yn_MH = 7 if dep==0 & bp==1 & psych==1
    replace yn_MH = 0 if dep==0 & bp==0 & psych==0
    gen MH_uniq=yn_MH
    recode MH_uniq (7=1) (6=1) (5=1) (4=1) (3=1) (2=1)

    For the MH_uniq variable I get 10,730 observations for "yes." I'm not sure if this is the best way to do it or if it is calculating what I actually need. I have been using the yn_MH variable to report those that only have depression, have depression & bipolar, have all three, and etc. for all the possible combinations. Essentially, I need to know how many people are in each group/combination of MH diagnoses.

    Finally, I also need a variable for those that have depression symptoms 1 or 2 (or both) but do not have symptom 3. I'm not sure how to do this. I tried a similar approach to the one above, but the sum of those with Symptom 1/2 and those with Symptom 3 did not equal the count I got for those with only depression (and no bipolar or psychosis).

    Thanks!!

  • #2
    Your code looks correct to me, although it is far more complicated than it needs to be. If I can assume that your yes/no variables do not contain any missing values (or that you want to treat a missing value as if it were a "no" response), then these can be markedly simplified:

    Code:
    gen ds1 = inlist(1, Q35, Q36, Q37)
    
    gen ds2 = inlist(1, Q39, Q40, Q43, Q44, Q47, Q51)
    
    // etc.
    As for your MH variable, it, too, can be simplified as

    Code:
    gen MH = inlist(1, ds1, ds2, ds3, bp, psych)
    These shorter commands will produce the same results you are currently getting.

    I don't know why you think that your MH variable "double counts" people with more than one disease. It will be 1 if the person has any of the three depression symptoms, or bipolar disorder, or psychosis, and 0 if they have none of these. Those 12,560 "yes" results for MH are all the people who have at least one of these. (And each such person is counted only once, even if he or she has more than one.)
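
    A small illustration of that point, using made-up toy data rather than the real survey (the variable n_dx is a hypothetical helper, and the example assumes no missing values): each respondent is one row, so an "any of these" indicator can count a person at most once, while a separate row total shows how many conditions each person reported.

    ```stata
    * Toy data (hypothetical values): four respondents, no missing answers
    clear
    input dep bp psych
    1 1 0
    0 1 1
    1 1 1
    0 0 0
    end

    * One row per respondent, so each person contributes at most one "yes"
    gen byte MH = inlist(1, dep, bp, psych)
    count if MH == 1

    * Number of conditions per person; this is where overlap becomes visible
    egen byte n_dx = rowtotal(dep bp psych)
    tabulate n_dx MH
    ```

    Here three of the four respondents have MH equal to 1, even though they report seven "yes" answers between them.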

    I'm not sure what you're finally looking for, but at least as far as I understand it your yn_MH and MH_uniq variables are not needed.

    To get a variable indicating who has symptom ds1 or ds2, but not ds3, the code would be:

    Code:
    gen wanted = inlist(1, ds1, ds2) & ds3 != 1



    • #3
      Welcome to Statalist.

      One thing you don't tell us is whether any of your 0/1 indicator variables have missing values, and if so, how you want the derived variables to be treated. That might make a difference to what follows. For now, let's assume you do not have missing values.

      It is not clear to me if you understand the following. You are working with logical expressions, in which Stata treats 0 as false and 1 (actually, any nonzero value) as true, so the value of Q35==1 is 1 if Q35 is 1 and 0 otherwise. And the value of Q35==1 | Q36==1 | Q37==1 is true (1) if one or more of the three expressions are true, and 0 otherwise.
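
      As a quick check of how these expressions evaluate, the following lines can be run from a do-file (a minimal sketch; the last two lines show why missing values matter, since Stata treats missing as nonzero and therefore as true):

      ```stata
      display (1 == 1)    // a true comparison evaluates to 1
      display (1 == 2)    // a false comparison evaluates to 0
      display (5 | 0)     // any nonzero value counts as true
      display (. | 0)     // missing is nonzero, so it also counts as true
      display (. == 1)    // missing is not equal to 1, so this is 0
      ```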

      You can simplify your code to single commands like
      Code:
      gen byte ds1= Q35==1 | Q36==1 | Q37==1
      or further to
      Code:
      gen byte ds1= Q35 | Q36 | Q37
      Or, assuming the variables Q35, Q36, and Q37 follow each other in your dataset with no intervening variables
      Code:
      egen byte ds1= anymatch(Q35-Q37), values(1)
      This sort of technique will greatly simplify the readability of your code, which makes it easier to be sure you're not making any mistakes.

      From what I understand, the MH variable I created above double counts individuals that have reported symptoms for more than one diagnosis (i.e. could have reported bipolar and psychosis).
      That is incorrect. The MH variable will be either 1 or 0, how could MH "double-count" any more than any of your other 0/1 indicator variables?

      I think your yn_MH variable would be better constructed doing the following
      Code:
      gen byte yn_MH = dep + 2*bp + 4*psych
      so that 0 will be none of the three, 1 will be only dep, 2 only bp, 3 dep and bp, 4 only psych, 5 dep and psych, 6 bp and psych, and 7 dep and bp and psych.

      Then
      Code:
      gen MH_uniq = yn_MH>0
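
      To report how many people fall in each combination, value labels on yn_MH make the table easier to read. This is only a sketch on made-up toy data: the label name yn_MH_lbl and the variable MH_any are hypothetical, and it assumes dep, bp, and psych are 0/1 or missing. If missing values are present, note that Stata treats missing as larger than any number, so a comparison such as yn_MH > 0 is true for missing unless missing is excluded explicitly.

      ```stata
      * Toy data (hypothetical values); the fourth respondent skipped psych
      clear
      input dep bp psych
      1 0 0
      1 1 0
      0 0 1
      0 1 .
      0 0 0
      end

      gen byte yn_MH = dep + 2*bp + 4*psych

      * Label each combination so the table is self-explanatory
      label define yn_MH_lbl 0 "none" 1 "dep only" 2 "bp only" 3 "dep + bp" ///
          4 "psych only" 5 "dep + psych" 6 "bp + psych" 7 "all three"
      label values yn_MH yn_MH_lbl
      tabulate yn_MH, missing

      * Leave the indicator missing where the combination is unknown
      gen byte MH_any = yn_MH > 0 if !missing(yn_MH)
      ```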


      • #4
        Code:
         gen byte ds1= inlist(1, Q35, Q36, Q37)
        is yet another way to write simpler code than in #1 (with the same simplifying assumption as @William Lisowski).

        See also https://www.stata.com/support/faqs/d...rue-and-false/
        Last edited by Nick Cox; 01 Mar 2020, 00:49.


        • #5
          Hi all,

          Thank you so much! I knew there must have been a simpler way to go about my coding. I do have missing values, as respondents were allowed to skip questions or select "prefer not to respond." What is the best way to treat these missing values in the code you have provided if I do not want them to be treated as a "no" response?


          Also, the MH variable has 12,560 "yes" responses, but with the yn_MH variable (using the code provided by William) I get a total of 10,730 "yes" responses. What is the reason for the discrepancy?
          Last edited by Roma Shah; 01 Mar 2020, 16:28.


          • #6
            Handling missing values can be complicated in situations like yours. The simplest possibility is to take the code proposed by me, William, or Nick, and just add -if !missing(Q35, Q36, Q37)- (or whatever the list of variables is in a particular command) on the end.

            The drawback to that approach is this: suppose somebody responds yes to Q35, and omits questions 36 and 37. Then, really we already know that they should qualify as ds1 = 1 because of Q35. But the -if !missing(…)- approach will lead to a missing value here. (Note that exactly this same drawback applies to your original code in #1.)

            Here is an example of the code for ds1 that will deal with this kind of situation and draw the correct logical inference where possible.

            Code:
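            // temporarily recode missing to 0.5, which sorts between "no" (0) and "yes" (1)
            mvencode Q35 Q36 Q37, mv(0.5)
            egen ds1 = rowmax(Q35 Q36 Q37)
            // restore missing in the inputs, and set ds1 to missing where the answer is undetermined
            mvdecode Q35 Q36 Q37 ds1, mv(0.5)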
            The same approach will work for all of your variables because all of them are defined as disjunctions of yes responses. If you have other variables to define that involve using and instead of or to combine variables, then it would be -rowmin()- instead of -rowmax()-.
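
            An alternative sketch of the same rule that leaves the original variables untouched, shown on made-up toy data (any_yes and n_miss are hypothetical helper names). It assumes skipped and "prefer not to respond" answers are stored as Stata missing values; if they are stored as numeric codes, they would need to be recoded to missing first.

            ```stata
            * Toy data (hypothetical values) covering the cases discussed above
            clear
            input Q35 Q36 Q37
            1 . .
            0 0 0
            0 . 0
            . . .
            end

            * 1 if any of the items is a yes, 0 otherwise (never missing)
            egen byte any_yes = anymatch(Q35 Q36 Q37), values(1)
            * number of items with a missing answer
            egen byte n_miss = rowmiss(Q35 Q36 Q37)

            * yes if any item is yes; no only if every item was answered no
            gen byte ds1 = 1 if any_yes
            replace ds1 = 0 if !any_yes & n_miss == 0
            list
            ```

            Rows with no "yes" and at least one unanswered item are left as missing, because a skipped item could have been a "yes".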


            • #7
              Since this

              the MH variable has 12,560 "yes" responses but the yn_MH variable (using the code provided by William) I get a total of 10,730 "yes" responses
              was addressed to my code, I'll add to Clyde's response (which mine crossed with) by explaining that the MH variable will be "yes" if any one of the five variables is 1, even if all of the others are missing, while the yn_MH variable is non-missing only when all three of dep, bp, and psych are non-missing.

              Your variables are not yes/no indicator variables, they are yes/no/skipped/prefer not to respond questions taking four values, and you need to think about all four of these possibilities.
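
              A toy illustration of how the two counts can diverge (made-up data, not the survey; it assumes the unanswered items are stored as Stata missing values):

              ```stata
              * Toy data (hypothetical values); the first respondent skipped psych
              clear
              input dep bp psych
              1 0 .
              0 0 0
              0 1 1
              end

              gen byte MH = inlist(1, dep, bp, psych)
              gen byte yn_MH = dep + 2*bp + 4*psych
              list

              * MH counts the first respondent; yn_MH is missing for that row
              count if MH == 1
              count if yn_MH > 0 & !missing(yn_MH)
              ```

              Running tabulate with the missing option on each input variable is one way to see how many respondents fall into the skipped or "prefer not to respond" categories.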


              • #8
                Okay, thank you so much! I'm going to read up on missing values so I can better understand what my code is achieving and how to best get the output I need. Thanks again!
