Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a New Variable based on Conditional IF Statements

    Hello STATA Experts: I am trying to create a new variable based on the existence of certain conditions in two existing variables (see code below). It appears to be dropping most of the cases for "Fathers". Not sure what is wrong here. Also, how is this treating missing variables?
    Any help would be appreciated. Pat

    generate parentfigure=.
    replace parentfigure=0 if X1HPAR1==2 | X1HPAR2==2
    replace parentfigure=1 if X1HPAR1==1
    label define parentfigurelab 0 "Father" 1 "Mother"
    label values parentfigure parentfigurelab
    label variable parentfigure "Parent Figure"

  • #2
    Let's rewrite your definitions to answer the question. cond() undeservedly has a reputation of being a difficult function, but once you know how to read it and write it, all difficulties evaporate, and it has the advantage that you can make all your rules explicit.

    Code:
    gen parentfigure = cond(X1HPAR1 == 1, 1, cond(X1HPAR1 == 2 | X1HPAR2 == 2, 0, .))
    So that

    if X1HPAR1 equals 1, new variable is 1
    otherwise if X1HPAR1 equals 2 or X1HPAR2 equals 2, new variable is 0
    otherwise new variable is missing.

    These rules imply that

    * if X1HPAR1 is missing, then the new variable is missing.

    * if X1HPAR2 is missing, then all depends on X1HPAR1 -- but that if that is also missing, or otherwise not 1 or 2, then the new variable is missing.

    "dropping most of the cases" presumably doesn't mean in your mind dropping most of the observations as no generate or replace or label command has as a direct or indirect effect observations being dropped. I guess you mean that many resulting values are missing.

    To see what is going on

    Code:
    tab X1HPAR1 X1HPAR2, missing
    shows missings explicitly.



    Comment


    • #3
      Hi Nick: Thank you so much. I produced the same results with my commands that you did. I think I am not asking the right question. I have two variables (shown below) X1HPAR1 and X1HPAR2. I want to create a new variable that consists of only the Biological Mothers and the Biological Fathers. So I want the values where X1HPAR1 == 1 for the Biological Mothers (12,544 cases) and where X1HPAR1==2 and X1HPAR2==2 for the Biological Fathers (201 + 9,136= 9,337 cases). I only get 327 cases for the Biological Father? I am not sure where the rest of them are. I hope I am making sense and as always, appreciate your help.
      Best, Pat



      . tab X1HPAR1
      X1 RELATIONSHIP OF PRNT 1 TO FOCAL |
      CHILD | Freq. Percent Cum.
      --------------------------------------+-----------------------------------
      1: BIOLOGICAL MOTHER | 12,544 93.62 93.62
      2: BIOLOGICAL FATHER | 201 1.50 95.12
      3: ADOPTIVE MOTHER | 231 1.72 96.84
      4: ADOPTIVE FATHER | 4 0.03 96.87
      5: STEP-MOTHER | 66 0.49 97.37
      7: FOSTER MOTHER | 32 0.24 97.60
      9: OTHER FEMALE PARENT OR GUARDIAN | 14 0.10 97.71
      10: OTHER MALE PARENT OR GUARDIAN | 1 0.01 97.72
      11: FEMALE PARTNER OF CHILD'S PARENT | 52 0.39 98.10
      13: FEMALE PARENT FIGURE UNKNOWN TYPE | 7 0.05 98.16
      14: MALE PARENT FIGURE UNKNOWN TYPE | 1 0.01 98.16
      15: NO RESIDENT PARENT 1 | 246 1.84 100.00
      --------------------------------------+-----------------------------------
      Total | 13,399 100.00

      . tab X1HPAR2

      X1 RELATIONSHIP OF PRNT 2 TO FOCAL |
      CHILD | Freq. Percent Cum.
      --------------------------------------+-----------------------------------
      -9: NOT ASCERTAINED | 1 0.01 0.01
      -1: NOT APPLICABLE | 3,034 22.64 22.65
      2: BIOLOGICAL FATHER | 9,136 68.18 90.84
      3: ADOPTIVE MOTHER | 5 0.04 90.87
      4: ADOPTIVE FATHER | 228 1.70 92.57
      6: STEP-FATHER | 445 3.32 95.90
      7: FOSTER MOTHER | 1 0.01 95.90
      8: FOSTER FATHER | 18 0.13 96.04
      9: OTHER FEMALE PARENT OR GUARDIAN | 2 0.01 96.05
      10: OTHER MALE PARENT OR GUARDIAN | 16 0.12 96.17
      11: FEMALE PARTNER OF CHILD'S PARENT | 11 0.08 96.25
      12: MALE PARTNER OF CHILD'S PARENT | 338 2.52 98.78
      14: MALE PARENT FIGURE UNKNOWN TYPE | 5 0.04 98.81
      15: NO RESIDENT PARENT 2 | 159 1.19 100.00
      --------------------------------------+-----------------------------------
      Total | 13,399 100.00


      Comment


      • #4
        I am not familiar with this dataset at all and am not really clear what you seek. Rather than make guesses I would rather leave the field open to people who are.

        Comment


        • #5
          Okay thank you. I am working on the ECLS-K:2011 dataset sponsored by the National Center for Ed Statistics.

          Comment


          • #6
            . I have two variables (shown below) X1HPAR1 and X1HPAR2. I want to create a new variable that consists of only the Biological Mothers and the Biological Fathers. So I want the values where X1HPAR1 == 1 for the Biological Mothers (12,544 cases) and where X1HPAR1==2 and X1HPAR2==2 for the Biological Fathers (201 + 9,136= 9,337 cases). I only get 327 cases for the Biological Father? I am not sure where the rest of them are.
            Your output from tabulating the variables suggests that your data is in wide layout and each observation is a child. Assuming that a child has two parents present, "X1HPAR1" tells you the relationship of parent 1 and "X1HPAR2" tells you the relationship of the second parent. To achieve what you want, you will need to reshape your data first. See

            Code:
            help reshape long
            or run, copy and paste the result of

            Code:
            dataex in 1/5
            for specific advice.

            Comment


            • #7
              Hi Andrew:
              I just checked and the data is in long form. Each child is an observation with over 1000 data (demographics, assessment results) per child. I know there are 9,337 cases of biological fathers, but when I try to create a new variable of just fathers, using the above STATA code, I get only 327 fathers. I will keep working on it. Thank you for your input.

              Comment


              • #8
                I think this will clear up the apparent mystery

                Code:
                expand 2, gen(x)
                gen X1HPAR= cond(x==0, X1HPAR1, X1HPAR2)
                gen parentfigure = cond(X1HPAR == 1, 1, cond(X1HPAR == 2, 0, .))
                label define parentfigure 1 "BIOLOGICAL MOTHER" 0 "BIOLOGICAL FATHER"
                label values parentfigure parentfigure
                tab X1HPAR

                Comment


                • #9
                  Andrew: First thank you; that worked! Can you explain the first two lines of your code? I am unfamiliar with "expand 2, gen(x)". Thanks again.

                  Comment


                  • #10
                    This is documented in

                    Code:
                    help expand
                    expand replaces each observation in the dataset with n copies of the
                    observation, where n is equal to the required expression rounded to the
                    nearest integer. If the expression is less than 1 or equal to missing,
                    it is interpreted as if it were 1, and the observation is retained but
                    not duplicated.

                    generate(newvar) creates new variable newvar containing 0 if the
                    observation originally appeared in the dataset and 1 if the
                    observation is a duplicate. For instance, after an expand, you could
                    revert to the original observations by typing keep if newvar==0.
                    So I am duplicating each observation (expand2) and at the same time creating an indicator equal to zero for the original set of observations. Then I can create a variable that combines "X1HPAR1" and "X1HPAR2", i.e., "X1HPAR". This is not the conventional way to do this, I still refer you again to post #6 and reshape long. However, it illustrates my point that your data is in wide layout.

                    Comment


                    • #11
                      Andrew:
                      Thank you for clearing this up for me. You are correct in that my data is in wide layout.... I now know what that means! I am new to STATA as you can tell and appreciate you sharing your expertise. Pat

                      Comment


                      • #12
                        Andrew: A while back, you gave me some advice regarding transforming data from wide to long format. I have done some work with the reshape long command and can now see its value.
                        In relation to post #10 above, I reshaped my data and now have two records for each CHILDID, reflecting Parent 1 and Parent 2
                        reshape long X1HPAR, i(CHILDID) j(parentfigure)
                        (note: j = 1 2)

                        Data wide -> long
                        -----------------------------------------------------------------------------
                        Number of obs. 14235 -> 28470
                        Number of variables 181 -> 181
                        j variable (2 values) -> parentfigure
                        xij variables:
                        X1HPAR1 X1HPAR2 -> X1HPAR
                        -----------------------------------------------------------------------------

                        . tab X1HPAR

                        X1HPAR | Freq. Percent Cum.
                        --------------------------------------+-----------------------------------
                        -1: NOT APPLICABLE | 2,954 11.32 11.32
                        1: BIOLOGICAL MOTHER | 12,223 46.85 58.18
                        2: BIOLOGICAL FATHER | 9,088 34.84 93.01
                        3: ADOPTIVE MOTHER | 227 0.87 93.88
                        4: ADOPTIVE FATHER | 225 0.86 94.74
                        5: STEP-MOTHER | 64 0.25 94.99
                        6: STEP-FATHER | 432 1.66 96.65
                        7: FOSTER MOTHER | 32 0.12 96.77
                        8: FOSTER FATHER | 17 0.07 96.83
                        9: OTHER FEMALE PARENT OR GUARDIAN | 16 0.06 96.90
                        10: OTHER MALE PARENT OR GUARDIAN | 16 0.06 96.96
                        11: FEMALE PARTNER OF CHILD'S PARENT | 60 0.23 97.19
                        12: MALE PARTNER OF CHILD'S PARENT | 330 1.26 98.45
                        13: FEMALE PARENT FIGURE UNKNOWN TYPE | 6 0.02 98.47
                        14: MALE PARENT FIGURE UNKNOWN TYPE | 6 0.02 98.50
                        15: NO RESIDENT PARENT 2 | 392 1.50 100.00
                        --------------------------------------+-----------------------------------
                        Total | 26,088 100.00


                        .

                        I have since created a number of new variables based on the value of X1HPAR1. For example, BioMother Highest Education:
                        gen MOMED = X12PAR1ED_I if X1HPAR == 1
                        (16,247 missing values generated)


                        BioFather's Age, BioMother's Work status, etc. When I try to determine the "missing" cases for each of these variables, it is overstated because I have essentially two records for each child now... # observations is doubled. I don't want to revert back to the wide format because I then lose the variables I created. Is there way to get at missing cases for these newly created variables, as I would like to try to impute with max likelihood or multiple imputation.
                        Also, variables that already exist in the database are also doubled ... does that matter? Will it matter for regression analysis. Hoping that these questions make sense.
                        Many thanks, Pat

                        Comment


                        • #13
                          You can use the if qualifier both in determining the missing cases and running the regressions, no need to revert to wide layout. For example,

                          Code:
                          *BIOLOGICAL MOTHERS
                          list if missing(VARNAME) & parentfigure
                          
                          *BIOLOGICAL FATHERS
                          list if missing(VARNAME2) & !parentfigure
                          
                          *BIOLOGICAL MOTHERS
                          ​​​​​​​regress VARS if parentfigure

                          Comment


                          • #14
                            Andrew: Thanks... I get the first two. How does the last line of code work?

                            Comment


                            • #15
                              Actually, do you mean to have the !parentfigure in line 1 as well?

                              Comment

                              Working...
                              X