Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge when value of one variable in dataset 1 matches either value of two variables in dataset 2

    Hi,
    I'm trying to merge two datasets by phone number. In dataset 1, each observation only has one phone number (so just one phone number variable). In dataset 2, each observation has a primary phone number and a secondary phone number (so two different phone number variables). I'd like to merge these two data sets by merging observations where the one phone number in the first dataset matches at least one of the phone numbers in the second dataset. I've looked through the merge documentation but it doesn't seem like there is an option to do this.

    Any help would be appreciated!

  • #2
    You can't do it with a single -merge- command, but you can do it with two:

    Code:
    use dataset1
    clonevar primary_phone_number = phone_number
    merge 1:1 primary_phone_number using dataset_2, nogenerate
    drop secondary_phone_number
    clonevar secondary_phone_number = phone_number
    merge 1:1 secondary_phoneNumber using dataset_2, update
    Notes:
    1. I have written these as -merge 1:1-, but depending on the organization of these data sets, m:1 or 1:m may be needed. You didn't show any example data, so I'm imagining that all of the phone numbers in both data sets uniquely identify observations. (Don't even think about using m:m!)
    2. This code will match the phone number in dataset1 with the primary phone number in dataset2 if possible. If no match is found with the primary phone number, it will match with the secondary phone number if possible. If a phone number matches with both a primary and a secondary phone number, only the match with the primary phone number applies.
    3. Depending on the organization of your data set, there may be better solutions. But without example data, nothing more specific can be said.
    4. This code also requires exact text match between phone numbers, so the formatting of the phone numbers must be consistent within and across data sets for this code to work. (987) 654-3210 will not match with 9876543210 or any other formatting of the same digits. If the formatting of the phone numbers in your real data is irregular, it probably makes sense to use regular expressions to reduce them to purely digit strings, removing any parens, blanks, hyphens or other non-digit characters.
    5. In the code, replace phone_number, primary_phone_number, and secondary_phone_number by the actual names of the corresponding variables in your data sets.
    6. Also because there was no example data, this code is untested. It may contain errors.

    In the future when asking for help with code, show example data, and please use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.


    Comment


    • #3
      Thanks for the fast and thorough response! And thanks also for the tip on example data and dataex, I will definitely keep it in mind for the future.

      Comment


      • #4
        Perhaps this could also be done with just one merge, by duplicating the observations in dataset2.
        Code:
        use dataset2
        expand 2, generate(second)
        generate phone_number = cond(second==0,primary_phone_number,secondary_phone_number)
        merge 1:1 phone_number using dataset1

        Comment

        Working...
        X