  • Generating parents' education variable for families with multiple subfamilies.

    Hi everyone,

    I'm currently writing a master's thesis and using a dataset that looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(fam_id mem_id relation gender marital educ)
    1 1 1 1 2 4
    1 2 2 0 2 3
    1 3 3 1 1 2
    2 1 1 1 2 4
    2 2 2 0 2 4
    2 3 3 0 1 3
    2 4 3 0 1 2
    2 5 3 0 1 0
    3 1 1 1 2 3
    3 2 2 0 2 4
    3 3 3 1 2 4
    3 4 5 0 2 4
    3 5 6 1 1 0
    4 1 1 0 2 3
    4 2 2 1 2 3
    4 3 3 1 2 4
    4 4 5 0 2 4
    4 5 6 1 1 0
    4 6 4 0 2 4
    4 7 5 1 2 3
    4 8 6 1 1 1
    4 9 6 1 1 0
    5 1 1 1 3 4
    5 2 3 0 2 3
    5 3 5 1 2 4
    5 4 6 1 1 0
    6 1 1 1 2 4
    6 2 2 0 2 4
    6 3 3 0 3 3
    6 4 6 1 1 0
    end
    The explanation of the variables is as follows:
    • Relationship: 1 "Head of household," 2 "Spouse," 3 "Child/stepchild," 4 "Foster child," 5 "Son/daughter-in-law," 6 "Grandchild"
    • Gender: 0 "Female," 1 "Male"
    • Marital Status: 1 "Not married," 2 "Married," 3 "Divorced"
    • Education: 0 "Uneducated," 1 "Elementary school," 2 "Middle school," 3 "High school," 4 "College"
    I'm trying to create parents' education variables for each child and grandchild, but it's not working quite as intended, especially for the subfamilies within the main families. If there is more than one subfamily, it uses the parents' education from the latest subfamily (as shown in the attached picture). As you can see, the parent's education in the red box is referencing the one in the green box.

    I'm using the code shown in this topic (modified with help from ChatGPT):
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1518828-generating-parents-education-variable

    Code:
    // Generate binary variables for child and parent relationships
    gen byte child1 = (relation == 3)
    gen byte child2 = (relation == 6)
    gen byte father1 = (relation == 1 | relation == 2) & (gender == 1) & (marital == 2 | marital == 3)
    gen byte father2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 1) & (marital == 2 | marital == 3)
    gen byte mother1 = (relation == 1 | relation == 2) & (gender == 0) & (marital == 2 | marital == 3)
    gen byte mother2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 0) & (marital == 2 | marital == 3)
    
    // Calculate minimum educational attainment for fathers and mothers
    egen father_educ1 = min(educ / father1), by(fam_id)
    egen mother_educ1 = min(educ / mother1), by(fam_id)
    
    // Recode educational attainment to missing for non-child cases
    replace father_educ1 = . if !child1
    replace mother_educ1 = . if !child1
    
    egen father_educ2 = min(educ / father2), by(fam_id)
    egen mother_educ2 = min(educ / mother2), by(fam_id)
    
    // Recode educational attainment to missing for non-child cases
    replace father_educ2 = . if !child2
    replace mother_educ2 = . if !child2
    Any kind of help would be greatly appreciated.


  • #2
    Thanks for the data example. Family 4 (fam_id == 4) is the error case, and I can reproduce the same result on my end after running your code.

    Code:
    . list fam_id mem_id relation gender marital educ father_educ1 mother_educ1 father_educ2 mother_educ2 if fam_id == 4, clean noobs
    
        fam_id   mem_id              relation   gender       marital                educ   father_ed~1   mother_ed~1   father_ed~2   mothe~c2  
             4        1     Head of household   Female       Married         High school             .             .             .          .  
             4        2                Spouse     Male       Married         High school             .             .             .          .  
             4        3       Child/stepchild     Male       Married             College   High school   High school             .          .  
             4        4   Son/daughter-in-law   Female       Married             College             .             .             .          .  
             4        5            Grandchild     Male   Not married          Uneducated             .             .   High school    College  
             4        6          Foster child   Female       Married             College             .             .             .          .  
             4        7   Son/daughter-in-law     Male       Married         High school             .             .             .          .  
             4        8            Grandchild     Male   Not married   Elementary school             .             .   High school    College  
             4        9            Grandchild     Male   Not married          Uneducated             .             .   High school    College
    The issue is that your code does not distinguish between the sets of parents within a household: it takes the minimum education across every member flagged as father2 (or mother2) in the family and assigns that single value to all grandchildren. The problem is in the following section of code. Note that the code for father_educ1 and mother_educ1 (the grandparents' education, from the grandchildren's point of view) uses the same logic, but it should work just fine assuming there is exactly one pair of grandparents per household.

    Code:
    egen father_educ2 = min(educ / father2), by(fam_id)
    egen mother_educ2 = min(educ / mother2), by(fam_id)
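    To make the household-wide minimum concrete, here is a minimal sketch restricted to family 4 from the data example. It reuses the father2 definition from the original post; nothing else is assumed.

    ```stata
    * Family 4 only, copied from the data example in post #1
    clear
    input float(fam_id mem_id relation gender marital educ)
    4 1 1 0 2 3
    4 2 2 1 2 3
    4 3 3 1 2 4
    4 4 5 0 2 4
    4 5 6 1 1 0
    4 6 4 0 2 4
    4 7 5 1 2 3
    4 8 6 1 1 1
    4 9 6 1 1 0
    end

    * Same indicator as in the original post
    gen byte father2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 1) & (marital == 2 | marital == 3)

    * Two fathers qualify: mem_id 3 (educ 4) and mem_id 7 (educ 3)
    list mem_id educ if father2, clean noobs

    * by(fam_id) returns one value per household, so the minimum (3)
    * is attached to every grandchild regardless of subfamily
    egen father_educ2 = min(educ / father2), by(fam_id)
    replace father_educ2 = . if relation != 6
    list mem_id relation educ father_educ2, clean noobs
    ```

    All three grandchildren receive 3 (High school), including mem_id 5, whose father is mem_id 3 with educ 4 (College).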
    How can we know which parents certain grandchildren belong to? I'm going to assume that mem_ids are ordered in such a way that grandchildren are always listed directly after their parents. That seems to be the case for your data example. I start by grouping family members by parents to create subfamily ids.

    Code:
    bysort fam_id (mem_id): gen father_subfam = sum(father2)
    bysort fam_id (mem_id): gen mother_subfam = sum(mother2)
    The code above produces the running sum of the father2 and mother2 indicator variables, respectively, within families. Each time a new parent row is reached the sum increases by one, so the parent is always the first row of its subgroup when sorted by mem_id. If we haven't found a parent yet (these should be the rows for grandparents), the respective subgroup id equals zero. We take advantage of this property in the next two lines. We assign the appropriate parent's education (located in the first row of each family/subfamily group sorted by mem_id) to any rows with grandchildren, ignoring cases where the subgroup id equals zero. The grandchild condition (child2) alone is not sufficient: in a family with no qualifying father, such as family 6, father_subfam is zero on every row, so educ[1] would be the head of household's education.

    Code:
    bysort fam_id father_subfam (mem_id): gen father_educ2 = educ[1] if father_subfam & child2
    bysort fam_id mother_subfam (mem_id): gen mother_educ2 = educ[1] if mother_subfam & child2
    It looks to me like the four new lines above give the correct result for the edge case in family 4.

    Code:
    . list fam_id mem_id relation gender marital educ father_educ1 mother_educ1 father_educ2 mother_educ2 if fam_id == 4, clean noobs
    
        fam_id   mem_id              relation   gender       marital                educ   father_ed~1   mother_ed~1   father_ed~2   mothe~c2  
             4        1     Head of household   Female       Married         High school             .             .             .          .  
             4        2                Spouse     Male       Married         High school             .             .             .          .  
             4        3       Child/stepchild     Male       Married             College   High school   High school             .          .  
             4        4   Son/daughter-in-law   Female       Married             College             .             .             .          .  
             4        5            Grandchild     Male   Not married          Uneducated             .             .       College    College  
             4        6          Foster child   Female       Married             College             .             .             .          .  
             4        7   Son/daughter-in-law     Male       Married         High school             .             .             .          .  
             4        8            Grandchild     Male   Not married   Elementary school             .             .   High school    College  
             4        9            Grandchild     Male   Not married          Uneducated             .             .   High school    College
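    As a side check on the father_subfam condition in the if qualifier, here is a small sketch using family 6, where no member qualifies as father2. The variable name unguarded is hypothetical and only there for comparison.

    ```stata
    * Family 6 only, copied from the data example in post #1
    clear
    input float(fam_id mem_id relation gender marital educ)
    6 1 1 1 2 4
    6 2 2 0 2 4
    6 3 3 0 3 3
    6 4 6 1 1 0
    end

    gen byte child2  = (relation == 6)
    gen byte father2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 1) & (marital == 2 | marital == 3)

    * No row qualifies as father2, so the running sum is 0 for the whole family
    bysort fam_id (mem_id): gen father_subfam = sum(father2)

    * Hypothetical variant without the guard: educ[1] is the head's education (4)
    bysort fam_id father_subfam (mem_id): gen unguarded = educ[1] if child2

    * Variant from this post: the grandchild stays missing because father_subfam == 0
    bysort fam_id father_subfam (mem_id): gen father_educ2 = educ[1] if father_subfam & child2

    list mem_id relation educ father_subfam unguarded father_educ2, clean noobs
    ```

    Without the guard, the grandchild would pick up the head of household's education; with it, father_educ2 is left missing because no father was found.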
    I also hand-checked the rest against your original solution, and it seems to give the same results everywhere else. You can see all of my code from start to finish below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(fam_id mem_id relation gender marital educ)
    1 1 1 1 2 4
    1 2 2 0 2 3
    1 3 3 1 1 2
    2 1 1 1 2 4
    2 2 2 0 2 4
    2 3 3 0 1 3
    2 4 3 0 1 2
    2 5 3 0 1 0
    3 1 1 1 2 3
    3 2 2 0 2 4
    3 3 3 1 2 4
    3 4 5 0 2 4
    3 5 6 1 1 0
    4 1 1 0 2 3
    4 2 2 1 2 3
    4 3 3 1 2 4
    4 4 5 0 2 4
    4 5 6 1 1 0
    4 6 4 0 2 4
    4 7 5 1 2 3
    4 8 6 1 1 1
    4 9 6 1 1 0
    5 1 1 1 3 4
    5 2 3 0 2 3
    5 3 5 1 2 4
    5 4 6 1 1 0
    6 1 1 1 2 4
    6 2 2 0 2 4
    6 3 3 0 3 3
    6 4 6 1 1 0
    end
    
    label define rlab 1 "Head of household" 2 "Spouse" 3 "Child/stepchild" 4 "Foster child" 5 "Son/daughter-in-law" 6 "Grandchild"
    label values relation rlab
    label define glab 0 "Female" 1 "Male"
    label values gender glab
    label define mslab 1 "Not married" 2 "Married" 3 "Divorced"
    label values marital mslab
    label define elab 0 "Uneducated" 1 "Elementary school" 2 "Middle school" 3 "High school" 4 "College"
    label values educ elab
    
    // Generate binary variables for child and parent relationships
    gen byte child1 = (relation == 3)
    gen byte child2 = (relation == 6)
    gen byte father1 = (relation == 1 | relation == 2) & (gender == 1) & (marital == 2 | marital == 3)
    gen byte father2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 1) & (marital == 2 | marital == 3)
    gen byte mother1 = (relation == 1 | relation == 2) & (gender == 0) & (marital == 2 | marital == 3)
    gen byte mother2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 0) & (marital == 2 | marital == 3)
    
    // Calculate minimum educational attainment for fathers and mothers
    egen father_educ1 = min(educ / father1), by(fam_id)
    egen mother_educ1 = min(educ / mother1), by(fam_id)
    
    // Recode educational attainment to missing for non-child cases
    replace father_educ1 = . if !child1
    replace mother_educ1 = . if !child1
    
    /*
    egen father_educ2 = min(educ / father2), by(fam_id)
    egen mother_educ2 = min(educ / mother2), by(fam_id)
    
    // Recode educational attainment to missing for non-child cases
    replace father_educ2 = . if !child2
    replace mother_educ2 = . if !child2
    */
    
    bysort fam_id (mem_id): gen father_subfam = sum(father2)
    bysort fam_id (mem_id): gen mother_subfam = sum(mother2)
    
    bysort fam_id father_subfam (mem_id): gen father_educ2 = educ[1] if father_subfam & child2
    bysort fam_id mother_subfam (mem_id): gen mother_educ2 = educ[1] if mother_subfam & child2
    
    label values father_educ1 elab
    label values father_educ2 elab
    label values mother_educ1 elab
    label values mother_educ2 elab
    list fam_id mem_id relation gender marital educ father_educ1 mother_educ1 father_educ2 mother_educ2 if fam_id == 4, clean noobs
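
    The comparison with the original solution can also be done programmatically rather than by hand. The sketch below computes both versions on the full data example and lists the rows where they disagree; the old_* variable names are illustrative and not part of the original code.

    ```stata
    clear
    input float(fam_id mem_id relation gender marital educ)
    1 1 1 1 2 4
    1 2 2 0 2 3
    1 3 3 1 1 2
    2 1 1 1 2 4
    2 2 2 0 2 4
    2 3 3 0 1 3
    2 4 3 0 1 2
    2 5 3 0 1 0
    3 1 1 1 2 3
    3 2 2 0 2 4
    3 3 3 1 2 4
    3 4 5 0 2 4
    3 5 6 1 1 0
    4 1 1 0 2 3
    4 2 2 1 2 3
    4 3 3 1 2 4
    4 4 5 0 2 4
    4 5 6 1 1 0
    4 6 4 0 2 4
    4 7 5 1 2 3
    4 8 6 1 1 1
    4 9 6 1 1 0
    5 1 1 1 3 4
    5 2 3 0 2 3
    5 3 5 1 2 4
    5 4 6 1 1 0
    6 1 1 1 2 4
    6 2 2 0 2 4
    6 3 3 0 3 3
    6 4 6 1 1 0
    end

    gen byte child2  = (relation == 6)
    gen byte father2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 1) & (marital == 2 | marital == 3)
    gen byte mother2 = (relation == 3 | relation == 4 | relation == 5) & (gender == 0) & (marital == 2 | marital == 3)

    * Original approach: household-wide minimum
    egen old_father_educ2 = min(educ / father2), by(fam_id)
    egen old_mother_educ2 = min(educ / mother2), by(fam_id)
    replace old_father_educ2 = . if !child2
    replace old_mother_educ2 = . if !child2

    * Running-sum subfamily approach
    bysort fam_id (mem_id): gen father_subfam = sum(father2)
    bysort fam_id (mem_id): gen mother_subfam = sum(mother2)
    bysort fam_id father_subfam (mem_id): gen father_educ2 = educ[1] if father_subfam & child2
    bysort fam_id mother_subfam (mem_id): gen mother_educ2 = educ[1] if mother_subfam & child2

    * Rows where the two approaches disagree (two missing values compare as equal)
    list fam_id mem_id old_father_educ2 father_educ2 old_mother_educ2 mother_educ2 ///
        if old_father_educ2 != father_educ2 | old_mother_educ2 != mother_educ2, clean noobs
    ```

    On the example data, the only row listed should be fam_id 4, mem_id 5, where father_educ2 changes from 3 to 4.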



    • #3
      By the way, shout out to Mike Lacy, who divides by the father/mother indicator variable in the original post because dividing by zero transforms the result to missing in Stata and dividing by 1 has no effect. I'm going to file that away into my mental repository of nice Stata/indicator techniques. From the original thread:

      Code:
      egen FaDip = min(diploma/father), by(famID)
      egen MoDip = min(diploma/mother), by(famID)



      • #4
        Re #3: the use of this trick of dividing by the indicator (or even an expression that evaluates to an indicator for a condition) is controversial. I used to use it myself, but have abandoned it. Yes, it works, and it has the virtue of concision. The problem is that it makes the code less readable. To interpret those commands you have to know that father is in fact an indicator, rather than, say, the id of another observation containing information about the current observation's father, or something else. So this makes the code somewhat opaque to somebody who is not familiar with the details of the data set. And, in fact, if you have been away from that data set for a few months and now have to return to it, the code may be opaque to you, depending on how good your long-term memory of the details is.

        You also have to be a Stata aficionado to recognize that the intent is that division by a zero value of the denominator will lead to a missing value of the expression, thereby filtering the expression to those cases where the denominator is non-zero. It's not intuitive. And if you review your code with somebody who is not steeped in Stata, they will likely be puzzled by it.

        There is no single right way to code. And there are always tradeoffs. But I have come down on the side of:
        Code:
        egen FaDip = min(cond(father, diploma, .)), by(famID)
        favoring transparency/readability over concision for this computation. The code means exactly what it explicitly says.
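
        For anyone who wants to confirm that the two forms agree, here is a minimal sketch on a small made-up dataset. The values are illustrative only; the variable names follow the original thread, and father is assumed to be a strict 0/1 indicator.

        ```stata
        * Illustrative data: two families, one father each
        clear
        input float(famID father diploma)
        1 1 3
        1 0 1
        1 0 2
        2 0 4
        2 1 2
        2 0 0
        end

        * Division trick: diploma/0 is missing, and egen's min() ignores missing values
        egen FaDip_div  = min(diploma / father), by(famID)

        * Explicit version: keep diploma only where father is nonzero
        egen FaDip_cond = min(cond(father, diploma, .)), by(famID)

        * The two results should be identical on every row
        assert FaDip_div == FaDip_cond
        list, sepby(famID) noobs
        ```

        Both variables hold the father's diploma (3 in family 1, 2 in family 2) rather than the family-wide minimum, which would be 1 and 0.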



        • #5
          Shortly after I became familiar enough with this trick to suggest it as a solution, I came around to Clyde's view, namely that it's probably not a good idea. Unless there's some huge savings in speed or concision, I'd choose transparency over brevity.



          • #6
            First of all, I’d like to thank everyone who responded to my question. Your input has greatly helped me progress in my thesis writing.

            Daniel Schaefer: Thank you for your solution. I tested the code on my dataset, and it works as intended, providing the results I wanted.

            Clyde Schechter and Mike Lacy: Thank you for your insights. They have helped me maintain the mindset to understand each line of code I use, rather than just copying and pasting anything that works without fully grasping its intent and purpose. I'm still a novice in Stata and coding, but I intend to pursue a PhD in Economics in the future, and having the right mindset will certainly help me along the way.
