First, my verbal description:
I am matching intergenerational data in a file where I have rows with unique IDs for women in column "profileid" and also unique IDs for their mothers in column "momid." All mothers appearing in "momid" also appear somewhere else as a "profileid." This data covers many generations, with about 8 million rows. I want to do several things. 1) I want to identify grantmothers, i.e. pull the "momid" from the row where that row's "profileid" matches a reference row's "momid," and have the grandmother's profileid populated into a variable called something like "grandmomid." 2) I want to grab traits of mothers/grandmothers and match them to daughters/granddaughters; i.e. if a grandmother was had a completed fertility of 6 kids, I want to be able to assign "grandmother had 6 children" to the granddaughter's row in a column called something like "grandmacebt."
A constraint here is I have about 8 million rows and 16 GB memory to work with. I can let something run overnight easily enough, but with 16 GB, sometimes things just crap out. I am using Stata (MP) 17.
So, here's a dataex of a dummy dataset illustrating the problem. The challenge is, in my actual data, it's not so tidily organized by generation, and I have millions of rows, so I need a fairly computationally efficient command.
Basically, I want to fill in the "grandmaid" by getting the "momid of the momid" etc, and then grab the mom's and grandma's children ever born too.
I'm aware of the ssc "vlookup" command but haven't been able to make it work, though perhaps I haven't tried hard enough. My intuition is there's a simple way to do this using some kind of list or egen-related function, but I just haven't been able to crack it. Any help would be enormously appreciated!
I am matching intergenerational data in a file where I have rows with unique IDs for women in column "profileid" and also unique IDs for their mothers in column "momid." All mothers appearing in "momid" also appear somewhere else as a "profileid." This data covers many generations, with about 8 million rows. I want to do several things. 1) I want to identify grantmothers, i.e. pull the "momid" from the row where that row's "profileid" matches a reference row's "momid," and have the grandmother's profileid populated into a variable called something like "grandmomid." 2) I want to grab traits of mothers/grandmothers and match them to daughters/granddaughters; i.e. if a grandmother was had a completed fertility of 6 kids, I want to be able to assign "grandmother had 6 children" to the granddaughter's row in a column called something like "grandmacebt."
A constraint here is I have about 8 million rows and 16 GB memory to work with. I can let something run overnight easily enough, but with 16 GB, sometimes things just crap out. I am using Stata (MP) 17.
So, here's a dataex of a dummy dataset illustrating the problem. The challenge is, in my actual data, it's not so tidily organized by generation, and I have millions of rows, so I need a fairly computationally efficient command.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input byte(profileid momid ceb grandmaid momidceb grandmaidceb) 1 2 0 3 1 1 2 3 1 4 1 1 3 4 1 5 1 1 4 5 1 . 1 . 5 . 1 . . . 6 8 0 . . . 7 8 0 . . . 8 10 2 . . . 9 10 0 . . . 10 12 2 . . . 11 12 0 . . . 12 14 2 . . . 13 14 0 . . . 14 16 2 . . . 15 16 0 . . . 16 . 2 . . . 17 20 0 . . . 18 20 0 . . . 19 20 0 . . . 20 23 3 . . . 21 23 0 . . . 22 23 0 . . . 23 26 3 . . . 24 26 0 . . . 25 26 0 . . . 26 30 3 . . . 27 30 0 . . . 28 30 0 . . . 29 30 0 . . . 30 . 4 . . . end
Basically, I want to fill in the "grandmaid" by getting the "momid of the momid" etc, and then grab the mom's and grandma's children ever born too.
I'm aware of the ssc "vlookup" command but haven't been able to make it work, though perhaps I haven't tried hard enough. My intuition is there's a simple way to do this using some kind of list or egen-related function, but I just haven't been able to crack it. Any help would be enormously appreciated!
Comment