Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Complicated Vlookup-type problem in a large dataset

    First, my verbal description:

    I am matching intergenerational data in a file where I have rows with unique IDs for women in column "profileid" and also unique IDs for their mothers in column "momid." All mothers appearing in "momid" also appear somewhere else as a "profileid." This data covers many generations, with about 8 million rows. I want to do several things. 1) I want to identify grantmothers, i.e. pull the "momid" from the row where that row's "profileid" matches a reference row's "momid," and have the grandmother's profileid populated into a variable called something like "grandmomid." 2) I want to grab traits of mothers/grandmothers and match them to daughters/granddaughters; i.e. if a grandmother was had a completed fertility of 6 kids, I want to be able to assign "grandmother had 6 children" to the granddaughter's row in a column called something like "grandmacebt."

    A constraint here is I have about 8 million rows and 16 GB memory to work with. I can let something run overnight easily enough, but with 16 GB, sometimes things just crap out. I am using Stata (MP) 17.

    So, here's a dataex of a dummy dataset illustrating the problem. The challenge is, in my actual data, it's not so tidily organized by generation, and I have millions of rows, so I need a fairly computationally efficient command.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(profileid momid ceb grandmaid momidceb grandmaidceb)
     1  2 0 3 1 1
     2  3 1 4 1 1
     3  4 1 5 1 1
     4  5 1 . 1 .
     5  . 1 . . .
     6  8 0 . . .
     7  8 0 . . .
     8 10 2 . . .
     9 10 0 . . .
    10 12 2 . . .
    11 12 0 . . .
    12 14 2 . . .
    13 14 0 . . .
    14 16 2 . . .
    15 16 0 . . .
    16  . 2 . . .
    17 20 0 . . .
    18 20 0 . . .
    19 20 0 . . .
    20 23 3 . . .
    21 23 0 . . .
    22 23 0 . . .
    23 26 3 . . .
    24 26 0 . . .
    25 26 0 . . .
    26 30 3 . . .
    27 30 0 . . .
    28 30 0 . . .
    29 30 0 . . .
    30  . 4 . . .
    end

    Basically, I want to fill in the "grandmaid" by getting the "momid of the momid" etc, and then grab the mom's and grandma's children ever born too.

    I'm aware of the ssc "vlookup" command but haven't been able to make it work, though perhaps I haven't tried hard enough. My intuition is there's a simple way to do this using some kind of list or egen-related function, but I just haven't been able to crack it. Any help would be enormously appreciated!

  • #2
    So with the current dataset we have, what would you want to happen here? What would be created or changed?

    Also you say you haven't gotten a presumably user written command to work. Well why not?

    It isn't that you're not trying hard enough, the command is doing what it's intended to do unless there's a bug. So could you provide the syntax you've tried with vlookup and say what Stata says when you use it?

    Comment


    • #3
      What I want is for the "grandmaid" variable to be filled in correctly. So for example, for profileid 17, her mother is profileid 20 (as shown in momid), and profileid 20's mother is profileid 23. So I want the grandmaid for profileid17 to be filled in "23." Once I can do that, I think I can fill in the rest.

      If I use:
      vlookup momid, gen(grandmaid) key(profileid) value(blankvar)

      where "blankvar" is a duplicate of "profileid" I at least get a non-error result, but it just duplicates the momid for each row. I've tried swapping each variable around in places, but I get various errors, including:

      (when momid is in the "value" field):

      momid is unique within profileid;
      there are multiple observations with different momid within profileid.


      and


      (whenever I try to have a variable in two fields)

      (varname) is already defined





      Comment


      • #4
        Something like this should do what you want:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(profileid momid ceb grandmaid momidceb grandmaidceb)
         1  2 0 3 1 1
         2  3 1 4 1 1
         3  4 1 5 1 1
         4  5 1 . 1 .
         5  . 1 . . .
         6  8 0 . . .
         7  8 0 . . .
         8 10 2 . . .
         9 10 0 . . .
        10 12 2 . . .
        11 12 0 . . .
        12 14 2 . . .
        13 14 0 . . .
        14 16 2 . . .
        15 16 0 . . .
        16  . 2 . . .
        17 20 0 . . .
        18 20 0 . . .
        19 20 0 . . .
        20 23 3 . . .
        21 23 0 . . .
        22 23 0 . . .
        23 26 3 . . .
        24 26 0 . . .
        25 26 0 . . .
        26 30 3 . . .
        27 30 0 . . .
        28 30 0 . . .
        29 30 0 . . .
        30  . 4 . . .
        end
        
        
        frame put profileid momid, into(grand)
        frame grand: rename (profileid momid)(momid wanted)
        frlink m:1 momid, frame(grand)
        frget wanted, from(grand)
        drop grand
        frame drop grand
        Res.:

        Code:
        . l, sep(0)
        
             +------------------------------------------------------------------+
             | profil~d   momid   ceb   grandm~d   momidceb   grandm~b   wanted |
             |------------------------------------------------------------------|
          1. |        1       2     0          3          1          1        3 |
          2. |        2       3     1          4          1          1        4 |
          3. |        3       4     1          5          1          1        5 |
          4. |        4       5     1          .          1          .        . |
          5. |        5       .     1          .          .          .        . |
          6. |        6       8     0          .          .          .       10 |
          7. |        7       8     0          .          .          .       10 |
          8. |        8      10     2          .          .          .       12 |
          9. |        9      10     0          .          .          .       12 |
         10. |       10      12     2          .          .          .       14 |
         11. |       11      12     0          .          .          .       14 |
         12. |       12      14     2          .          .          .       16 |
         13. |       13      14     0          .          .          .       16 |
         14. |       14      16     2          .          .          .        . |
         15. |       15      16     0          .          .          .        . |
         16. |       16       .     2          .          .          .        . |
         17. |       17      20     0          .          .          .       23 |
         18. |       18      20     0          .          .          .       23 |
         19. |       19      20     0          .          .          .       23 |
         20. |       20      23     3          .          .          .       26 |
         21. |       21      23     0          .          .          .       26 |
         22. |       22      23     0          .          .          .       26 |
         23. |       23      26     3          .          .          .       30 |
         24. |       24      26     0          .          .          .       30 |
         25. |       25      26     0          .          .          .       30 |
         26. |       26      30     3          .          .          .        . |
         27. |       27      30     0          .          .          .        . |
         28. |       28      30     0          .          .          .        . |
         29. |       29      30     0          .          .          .        . |
         30. |       30       .     4          .          .          .        . |
             +------------------------------------------------------------------+
        
        .

        Comment


        • #5
          Just for my clarification... does this *require* me to have the "grand" variable partly filled out?

          This dataset is a "toy" dataset. My "real" dataset does not have profileid matched to grandmaid for any individuals. Will this code work for it or do I need to have a few linkages already made?

          Comment


          • #6
            So, I ran the code Andrew provided above. It does run and yield values. However, they are incorrect. When I hand-check a few of them, the assigned "wanted" (which I assume is the identified grandma?) is routinely someone who when I go and check the "momid" values has no linkage to the supposed child.

            Perhaps this matter: in my real dataset, there is no organization. Generations are scattered fairly randomly around the database, with lineages routinely having thousands or tens of thousands of rows between them.

            Comment


            • #7
              I think this works correctly:
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear*
              input byte(profileid momid ceb momidceb)
               1  2 0 1
               2  3 1 1
               3  4 1 1
               4  5 1 1
               5  . 1 .
               6  8 0 .
               7  8 0 .
               8 10 2 .
               9 10 0 .
              10 12 2 .
              11 12 0 .
              12 14 2 .
              13 14 0 .
              14 16 2 .
              15 16 0 .
              16  . 2 .
              17 20 0 .
              18 20 0 .
              19 20 0 .
              20 23 3 .
              21 23 0 .
              22 23 0 .
              23 26 3 .
              24 26 0 .
              25 26 0 .
              26 30 3 .
              27 30 0 .
              28 30 0 .
              29 30 0 .
              30  . 4 .
              end
              
              
              frame put _all, into(mothers)
              frame mothers {
                  frame put _all, into(grandmothers)
                  frlink m:1 momid, frame(grandmothers profileid)
                  frget grandmaidceb = ceb, from(grandmothers)
              }
              frlink m:1 momid, frame(mothers profileid)
              frget grandmaid = momid grandmaidceb = grandmaidceb , from(mothers)
              And, no, as the above shows, you do not have to start out with some of the grandma data filled in.

              Added: There is nothing in this code that would go wrong with a large data set and the observations in scrambled order. It might take a very long time to run, especially the -frlink- commands, and you could find yourself breaching memory limits. If the latter happens, you can spare memory by doing this with tempfiles instead of frames: that pushes the memory burden of to disk, at the expense of being even slower.
              Last edited by Clyde Schechter; 22 Mar 2022, 17:53.

              Comment

              Working...
              X