Generating family identifier in dataset of siblings

Noah Spencer

Join Date: Jan 2019
Posts: 125

07 Jun 2024, 13:47

I have a dataset with information on age-30 earnings among siblings. I want to generate a "family_id" variable that is stable over time. The dataset already has a family ID variable (family_id_in_year) that identifies siblings within a given year, but its value changes every year. I can only observe people in this dataset between ages 6 and 17, so siblings who are far apart in age may never appear in the same year.

The dataset looks like this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 name str9 id int year str8 family_id_in_year byte age long age30_income
"Mike Johnson"  "929371031" 1993 "10392842" 15 30000
"Mike Johnson"  "929371031" 1994 "13928401" 16 30000
"Sally Johnson" "918302912" 1994 "13928401"  6 50000
"Mike Johnson"  "929371031" 1995 "38103927" 17 30000
"Sally Johnson" "918302912" 1995 "38103927"  7 50000
"Jane Johnson"  "917374820" 1995 "38103927"  6 40000
"Sally Johnson" "918302912" 1996 "23145565"  8 50000
"Jane Johnson"  "917374820" 1996 "23145565"  7 40000
"Tyler Bates"   "910393920" 1996 "10392911" 10 23000
end

I want the dataset to end up looking like this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str13 name str9 id byte sibling_position str9(sibling1_id sibling2_id sibling3_id) str29 family long age30_income
"Mike Johnson"  "929371031" 1 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 30000
"Sally Johnson" "918302912" 2 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 50000
"Jane Johnson"  "917374820" 3 "929371031" "918302912" "917374820" "929371031-918302912-917374820" 40000
"Tyler Bates"   "910393920" 1 "910393920" ""          ""          "910393920"                     23000
end

That is, I want one row per person and a "family" variable that joins the siblings' IDs with hyphens, ordered from oldest to youngest.

Does anyone know how I could do this? If so, it would be much appreciated!

Nick Cox

Join Date: Mar 2014
Posts: 35698

07 Jun 2024, 14:43

Does this help?

Code:

. ssc describe group_id

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
package group_id from http://fmwww.bc.edu/repec/bocode/g
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

TITLE
      'GROUP_ID': module to group identifiers when values for specified variables match

DESCRIPTION/AUTHOR(S)
      
       group_id consolidates values of an identifier variable when
      observations are matched using other variables in the dataset.
      When a     match is found between two observations with different
      identifier  values, all records that share the same identifier
      values are updated  to the new consolidated value, even if
      they do not match by    themselves.
      
      KW: group
      KW: identifier
      KW: match
      
      Requires: Stata version 9.2
      
      Distribution-Date: 20100423
      
      Author: Robert Picard
      Support: email [email protected]
      

INSTALLATION FILES                             (type net install group_id)
      group_id.ado
      group_id.hlp
      group_id.dlg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
(type ssc install group_id to install)
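
For intuition about what "consolidates ... even if they do not match by themselves" means here, the following is a minimal sketch in base Stata. It is not the code of -group_id-, only an illustration of the same idea under my own assumptions: the variable name fam and the loop are hypothetical, and the data are the id, year and family_id_in_year columns from the example in the first post. A label is passed back and forth between household-years and persons until nothing changes, so Mike and Jane end up together through Sally.

```stata
* Sketch only: chain household-years into stable families without -group_id-
clear
input str9 id int year str8 family_id_in_year
"929371031" 1993 "10392842"
"929371031" 1994 "13928401"
"918302912" 1994 "13928401"
"929371031" 1995 "38103927"
"918302912" 1995 "38103927"
"917374820" 1995 "38103927"
"918302912" 1996 "23145565"
"917374820" 1996 "23145565"
"910393920" 1996 "10392911"
end

* Start with one numeric label per person (fam is a hypothetical name)
egen long fam = group(id)

* Propagate the smallest label until no observation changes
local changed = 1
while `changed' {
    local changed = 0

    * Everyone in the same household-year takes the smallest label present
    quietly bysort year family_id_in_year (fam): gen long newfam = fam[1]
    quietly count if newfam != fam
    local changed = `changed' + r(N)
    quietly replace fam = newfam
    drop newfam

    * All records of the same person take that person's smallest label
    quietly bysort id (fam): gen long newfam = fam[1]
    quietly count if newfam != fam
    local changed = `changed' + r(N)
    quietly replace fam = newfam
    drop newfam
}

* One row per person to inspect the result
sort fam id year
list id fam if id != id[_n-1], sepby(fam)
```

On large datasets the loop may need many passes when families are linked through long chains, which is one reason to prefer a purpose-built command.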

Noah Spencer

Join Date: Jan 2019
Posts: 125

08 Jun 2024, 08:51

Yes, thank you!

In case it is of interest to anyone, here is my solution with -group_id-:

Code:


* Load data
clear
input str13 name str9 id int year str8 family_id_in_year byte age long age30_income
"Mike Johnson"  "929371031" 1993 "10392842" 15 30000
"Mike Johnson"  "929371031" 1994 "13928401" 16 30000
"Sally Johnson" "918302912" 1994 "13928401"  6 50000
"Mike Johnson"  "929371031" 1995 "38103927" 17 30000
"Sally Johnson" "918302912" 1995 "38103927"  7 50000
"Jane Johnson"  "917374820" 1995 "38103927"  6 40000
"Sally Johnson" "918302912" 1996 "23145565"  8 50000
"Jane Johnson"  "917374820" 1996 "23145565"  7 40000
"Tyler Bates"   "910393920" 1996 "10392911" 10 23000
end


* Get year of birth (for ordering sibling position)
gen yob = year - age

* Use -group_id- to get stable family identifiers
gen new_id = id
* (variable name spelled out in full rather than abbreviated as "family")
group_id new_id, matchby(family_id_in_year year)
rename new_id family_id

* De-annualize data
drop year family_id_in_year age
duplicates drop

* Create columns with sibling information
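* Putting yob in parentheses sorts on it within family without making it a
* by-variable, so the oldest sibling is reliably first in each family
bysort family_id (yob): gen sibling_num = _n
bysort family_id (yob): gen sibling1 = id[1]
bysort family_id (yob): gen sibling2 = id[2]
bysort family_id (yob): gen sibling3 = id[3]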

* Remake family ID as string of siblings
drop family_id
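egen family = concat(sibling1 sibling2 sibling3), punct("-")
* concat() inserts the hyphen even when a sibling slot is empty,
* so strip any trailing hyphens (e.g. for only children)
replace family = regexr(family, "-+$", "")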

* Reorder
order family age30_income, after(sibling3)
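
If some families have more than three siblings, the sibling columns need not be hard-coded. Below is a rough sketch of one way to generalize the last two steps. It assumes the data are already one row per person with a stable family_id and yob (typed in by hand here so that the block runs on its own), and the names n_sibs and maxsibs are my own. Ties in yob, such as twins, are broken by id, which is an arbitrary choice.

```stata
* Sketch only: sibling columns and family string for any family size
clear
input str9 id long family_id int yob
"929371031" 2 1978
"918302912" 2 1988
"917374820" 2 1989
"910393920" 1 1986
end

* Position within family, oldest first (ties broken by id)
bysort family_id (yob id): gen sibling_num = _n

* Largest family size determines how many sibling columns are needed
bysort family_id: gen n_sibs = _N
summarize n_sibs, meanonly
local maxsibs = r(max)

* Build sibling1, sibling2, ... and the hyphen-joined family string together
gen family = ""
forvalues j = 1/`maxsibs' {
    bysort family_id (sibling_num): gen sibling`j' = id[`j']
    replace family = family + cond(`j' > 1, "-", "") + sibling`j' if sibling`j' != ""
}

drop n_sibs
list id sibling_num family, sepby(family_id)
```

Because the hyphen is only added when the sibling slot is non-empty, only children get their own ID with no trailing punctuation.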
