Complicated Vlookup-type problem in a large dataset

Lyman Stone

Join Date: Mar 2022

Posts: 4
#1

Complicated Vlookup-type problem in a large dataset

21 Mar 2022, 23:07

First, my verbal description:

I am matching intergenerational data in a file where I have rows with unique IDs for women in column "profileid" and also unique IDs for their mothers in column "momid." All mothers appearing in "momid" also appear somewhere else as a "profileid." This data covers many generations, with about 8 million rows. I want to do several things. 1) I want to identify grantmothers, i.e. pull the "momid" from the row where that row's "profileid" matches a reference row's "momid," and have the grandmother's profileid populated into a variable called something like "grandmomid." 2) I want to grab traits of mothers/grandmothers and match them to daughters/granddaughters; i.e. if a grandmother was had a completed fertility of 6 kids, I want to be able to assign "grandmother had 6 children" to the granddaughter's row in a column called something like "grandmacebt."

A constraint here is I have about 8 million rows and 16 GB memory to work with. I can let something run overnight easily enough, but with 16 GB, sometimes things just crap out. I am using Stata (MP) 17.

So, here's a dataex of a dummy dataset illustrating the problem. The challenge is, in my actual data, it's not so tidily organized by generation, and I have millions of rows, so I need a fairly computationally efficient command.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(profileid momid ceb grandmaid momidceb grandmaidceb) 1 2 0 3 1 1 2 3 1 4 1 1 3 4 1 5 1 1 4 5 1 . 1 . 5 . 1 . . . 6 8 0 . . . 7 8 0 . . . 8 10 2 . . . 9 10 0 . . . 10 12 2 . . . 11 12 0 . . . 12 14 2 . . . 13 14 0 . . . 14 16 2 . . . 15 16 0 . . . 16 . 2 . . . 17 20 0 . . . 18 20 0 . . . 19 20 0 . . . 20 23 3 . . . 21 23 0 . . . 22 23 0 . . . 23 26 3 . . . 24 26 0 . . . 25 26 0 . . . 26 30 3 . . . 27 30 0 . . . 28 30 0 . . . 29 30 0 . . . 30 . 4 . . . end

Basically, I want to fill in the "grandmaid" by getting the "momid of the momid" etc, and then grab the mom's and grandma's children ever born too.

I'm aware of the ssc "vlookup" command but haven't been able to make it work, though perhaps I haven't tried hard enough. My intuition is there's a simple way to do this using some kind of list or egen-related function, but I just haven't been able to crack it. Any help would be enormously appreciated!
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

22 Mar 2022, 01:50

So with the current dataset we have, what would you want to happen here? What would be created or changed?

Also you say you haven't gotten a presumably user written command to work. Well why not?

It isn't that you're not trying hard enough, the command is doing what it's intended to do unless there's a bug. So could you provide the syntax you've tried with vlookup and say what Stata says when you use it?
Comment
Lyman Stone

Join Date: Mar 2022

Posts: 4
#3

22 Mar 2022, 11:06

What I want is for the "grandmaid" variable to be filled in correctly. So for example, for profileid 17, her mother is profileid 20 (as shown in momid), and profileid 20's mother is profileid 23. So I want the grandmaid for profileid17 to be filled in "23." Once I can do that, I think I can fill in the rest.

If I use:
vlookup momid, gen(grandmaid) key(profileid) value(blankvar)

where "blankvar" is a duplicate of "profileid" I at least get a non-error result, but it just duplicates the momid for each row. I've tried swapping each variable around in places, but I get various errors, including:

(when momid is in the "value" field):

momid is unique within profileid;
there are multiple observations with different momid within profileid.

and

(whenever I try to have a variable in two fields)

(varname) is already defined
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10195

22 Mar 2022, 12:30

Something like this should do what you want:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(profileid momid ceb grandmaid momidceb grandmaidceb)
 1  2 0 3 1 1
 2  3 1 4 1 1
 3  4 1 5 1 1
 4  5 1 . 1 .
 5  . 1 . . .
 6  8 0 . . .
 7  8 0 . . .
 8 10 2 . . .
 9 10 0 . . .
10 12 2 . . .
11 12 0 . . .
12 14 2 . . .
13 14 0 . . .
14 16 2 . . .
15 16 0 . . .
16  . 2 . . .
17 20 0 . . .
18 20 0 . . .
19 20 0 . . .
20 23 3 . . .
21 23 0 . . .
22 23 0 . . .
23 26 3 . . .
24 26 0 . . .
25 26 0 . . .
26 30 3 . . .
27 30 0 . . .
28 30 0 . . .
29 30 0 . . .
30  . 4 . . .
end


frame put profileid momid, into(grand)
frame grand: rename (profileid momid)(momid wanted)
frlink m:1 momid, frame(grand)
frget wanted, from(grand)
drop grand
frame drop grand

Res.:

Code:

. l, sep(0)

     +------------------------------------------------------------------+
     | profil~d   momid   ceb   grandm~d   momidceb   grandm~b   wanted |
     |------------------------------------------------------------------|
  1. |        1       2     0          3          1          1        3 |
  2. |        2       3     1          4          1          1        4 |
  3. |        3       4     1          5          1          1        5 |
  4. |        4       5     1          .          1          .        . |
  5. |        5       .     1          .          .          .        . |
  6. |        6       8     0          .          .          .       10 |
  7. |        7       8     0          .          .          .       10 |
  8. |        8      10     2          .          .          .       12 |
  9. |        9      10     0          .          .          .       12 |
 10. |       10      12     2          .          .          .       14 |
 11. |       11      12     0          .          .          .       14 |
 12. |       12      14     2          .          .          .       16 |
 13. |       13      14     0          .          .          .       16 |
 14. |       14      16     2          .          .          .        . |
 15. |       15      16     0          .          .          .        . |
 16. |       16       .     2          .          .          .        . |
 17. |       17      20     0          .          .          .       23 |
 18. |       18      20     0          .          .          .       23 |
 19. |       19      20     0          .          .          .       23 |
 20. |       20      23     3          .          .          .       26 |
 21. |       21      23     0          .          .          .       26 |
 22. |       22      23     0          .          .          .       26 |
 23. |       23      26     3          .          .          .       30 |
 24. |       24      26     0          .          .          .       30 |
 25. |       25      26     0          .          .          .       30 |
 26. |       26      30     3          .          .          .        . |
 27. |       27      30     0          .          .          .        . |
 28. |       28      30     0          .          .          .        . |
 29. |       29      30     0          .          .          .        . |
 30. |       30       .     4          .          .          .        . |
     +------------------------------------------------------------------+

.

Comment

Lyman Stone

Join Date: Mar 2022

Posts: 4
#5

22 Mar 2022, 12:45

Just for my clarification... does this *require* me to have the "grand" variable partly filled out?

This dataset is a "toy" dataset. My "real" dataset does not have profileid matched to grandmaid for any individuals. Will this code work for it or do I need to have a few linkages already made?
Comment
Lyman Stone

Join Date: Mar 2022

Posts: 4
#6

22 Mar 2022, 15:48

So, I ran the code Andrew provided above. It does run and yield values. However, they are incorrect. When I hand-check a few of them, the assigned "wanted" (which I assume is the identified grandma?) is routinely someone who when I go and check the "momid" values has no linkage to the supposed child.

Perhaps this matter: in my real dataset, there is no organization. Generations are scattered fairly randomly around the database, with lineages routinely having thousands or tens of thousands of rows between them.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

22 Mar 2022, 16:49

I think this works correctly:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear*
input byte(profileid momid ceb momidceb)
 1  2 0 1
 2  3 1 1
 3  4 1 1
 4  5 1 1
 5  . 1 .
 6  8 0 .
 7  8 0 .
 8 10 2 .
 9 10 0 .
10 12 2 .
11 12 0 .
12 14 2 .
13 14 0 .
14 16 2 .
15 16 0 .
16  . 2 .
17 20 0 .
18 20 0 .
19 20 0 .
20 23 3 .
21 23 0 .
22 23 0 .
23 26 3 .
24 26 0 .
25 26 0 .
26 30 3 .
27 30 0 .
28 30 0 .
29 30 0 .
30  . 4 .
end


frame put _all, into(mothers)
frame mothers {
    frame put _all, into(grandmothers)
    frlink m:1 momid, frame(grandmothers profileid)
    frget grandmaidceb = ceb, from(grandmothers)
}
frlink m:1 momid, frame(mothers profileid)
frget grandmaid = momid grandmaidceb = grandmaidceb , from(mothers)

And, no, as the above shows, you do not have to start out with some of the grandma data filled in.

Added: There is nothing in this code that would go wrong with a large data set and the observations in scrambled order. It might take a very long time to run, especially the -frlink- commands, and you could find yourself breaching memory limits. If the latter happens, you can spare memory by doing this with tempfiles instead of frames: that pushes the memory burden of to disk, at the expense of being even slower.

Last edited by Clyde Schechter; 22 Mar 2022, 16:53.

Announcement