  • Generating family identifier in dataset of siblings

    I have a dataset with age-30 earnings for siblings. I want to generate a "family_id" variable that is stable over time. The dataset already has a family ID variable that identifies siblings within a given year, but its value changes every year. I can only observe people in this dataset between ages 6 and 17.

    The dataset looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 name str9 id int year str8 family_id_in_year byte age long age30_income
    "Mike Johnson"  "929371031" 1993 "10392842" 15 30000
    "Mike Johnson"  "929371031" 1994 "13928401" 16 30000
    "Sally Johnson" "918302912" 1994 "13928401"  6 50000
    "Mike Johnson"  "929371031" 1995 "38103927" 17 30000
    "Sally Johnson" "918302912" 1995 "38103927"  7 50000
    "Jane Johnson"  "917374820" 1995 "38103927"  6 40000
    "Sally Johnson" "918302912" 1996 "23145565"  8 50000
    "Jane Johnson"  "917374820" 1996 "23145565"  7 40000
    "Tyler Bates"   "910393920" 1996 "10392911" 10 23000
    end

    I want the dataset to end up looking like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 name str9 id byte sibling_position str9(sibling1_id sibling2_id sibling3_id) str29 family long age30_income
    "Mike Johnson"  "929371031" 1 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 30000
    "Sally Johnson" "918302912" 2 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 50000
    "Jane Johnson"  "917374820" 3 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 40000
    "Tyler Bates"   "910393920" 1 "910393920" ""          ""          "910393920"                     23000
    end
    That is, I want to create a "family" variable that is a hyphen-separated string of sibling IDs, ordered from oldest to youngest.

    Does anyone know how I could do this? If so, it would be much appreciated!

  • #2
    Does this help?

    Code:
    . ssc describe group_id
    
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------
    package group_id from http://fmwww.bc.edu/repec/bocode/g
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    TITLE
          'GROUP_ID': module to group identifiers when values for specified variables match
    
    DESCRIPTION/AUTHOR(S)
          
          group_id consolidates values of an identifier variable when
          observations are matched using other variables in the dataset.
          When a match is found between two observations with different
          identifier values, all records that share the same identifier
          values are updated to the new consolidated value, even if
          they do not match by themselves.
          
          KW: group
          KW: identifier
          KW: match
          
          Requires: Stata version 9.2
          
          Distribution-Date: 20100423
          
          Author: Robert Picard
          Support: email [email protected]
          
    
    INSTALLATION FILES                             (type net install group_id)
          group_id.ado
          group_id.hlp
          group_id.dlg
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------
    (type ssc install group_id to install)



    • #3
      Yes, thank you!

      In case it is of interest to anyone, here is my solution with -group_id-:

      Code:
      
      * Load data
      clear
      input str13 name str9 id int year str8 family_id_in_year byte age long age30_income
      "Mike Johnson"  "929371031" 1993 "10392842" 15 30000
      "Mike Johnson"  "929371031" 1994 "13928401" 16 30000
      "Sally Johnson" "918302912" 1994 "13928401"  6 50000
      "Mike Johnson"  "929371031" 1995 "38103927" 17 30000
      "Sally Johnson" "918302912" 1995 "38103927"  7 50000
      "Jane Johnson"  "917374820" 1995 "38103927"  6 40000
      "Sally Johnson" "918302912" 1996 "23145565"  8 50000
      "Jane Johnson"  "917374820" 1996 "23145565"  7 40000
      "Tyler Bates"   "910393920" 1996 "10392911" 10 23000
      end
      
      
      * Get year of birth (for ordering sibling position)
      gen yob = year - age
      
      * Use -group_id- to get stable family identifiers
      gen new_id = id
      group_id new_id, matchby(family_id_in_year year)
      rename new_id family_id
      
      * De-annualize data
      drop year family_id_in_year age
      duplicates drop
      
      * Create columns with sibling information (oldest sibling first;
      * id breaks ties in yob so the order is reproducible)
      bysort family_id (yob id): gen sibling_num = _n
      by family_id: gen sibling1 = id[1]
      by family_id: gen sibling2 = id[2]
      by family_id: gen sibling3 = id[3]
      
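      * Remake family ID as string of siblings, skipping empty sibling slots
      * (-egen, concat() punct("-")- would append a hyphen for empty slots too)
      drop family_id
      gen family = sibling1
      replace family = family + "-" + sibling2 if sibling2 != ""
      replace family = family + "-" + sibling3 if sibling3 != ""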
      
      * Reorder
      order family age30_income, after(sibling3)
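
      A quick sanity check on the grouping step may also be useful. The sketch below is self-contained and reuses the example data; it assumes -group_id- is installed from SSC, and the variable name family_id is only the one chosen for illustration here. It checks that every household-year maps to a single stable family and that no person ends up in more than one family.

      ```stata
      * Sketch only: requires the user-written -group_id- (ssc install group_id)
      clear
      input str9 id int year str8 family_id_in_year
      "929371031" 1993 "10392842"
      "929371031" 1994 "13928401"
      "918302912" 1994 "13928401"
      "929371031" 1995 "38103927"
      "918302912" 1995 "38103927"
      "917374820" 1995 "38103927"
      "918302912" 1996 "23145565"
      "917374820" 1996 "23145565"
      "910393920" 1996 "10392911"
      end

      * Start from the person ID and let -group_id- consolidate it within
      * each household-year, as in the solution above
      gen family_id = id
      group_id family_id, matchby(family_id_in_year year)

      * Every household-year should carry exactly one stable family_id
      bysort family_id_in_year year (family_id): assert family_id[1] == family_id[_N]

      * Every person should belong to exactly one family across all years
      bysort id (family_id): assert family_id[1] == family_id[_N]
      ```

      If either -assert- fails, Stata stops with an error, which would suggest the matching variables do not link the yearly households as expected.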
