Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create identifier for a variable that is in two data sets

    Hello!

    I have to merge two data sets. They both have the 'name' variable in common. But I need an identifier that will facilitate the matching.

    I have two issues with this:
    1. The names across data sets are not consistent. In one data set, the name variable might include a middle name and in the other one, it might or not include it. I hope the table below helps illustrate the inconsistency I find between data sets.
    data set 1 data set 2
    Name Name
    AA BB CC AA CC
    DD EE DD EE FF
    *I am thinking that maybe I could extract each name and create three or more variables for each name. And then create an identifier that matches at least two names together (for both data sets)

    2. Even if I find a solution to first problem. How can I ensure that the identifiers I create will be consistent across data sets?

    Thank you in advance for any tipp on how to solve any of my issues.
    Last edited by Lisa Alejandra Kobrich; 20 Jan 2024, 05:59.

  • #2
    You can remove middle names like this

    Code:
    clonevar name_original = name 
    replace name = word(name, 1) + " " + word(name, -1)
    which should work for anything matching the pattern you show (assuming that DD EE FF really means DD FF EE).

    My guess is that you may have other problems too. A variant of Murphy's Law is that large datasets based on real people are usually messy.

    There is no simple good news here. You may need to iterate around

    1. Try a merge with your revised name variables.

    2. See which values don't match.

    3. Realise that your names need more surgery and revise your name variable.

    Never lose the original name data.

    The more detail you lose, the more likely false matches will be. Clearly, but crucially, "Lisa Kobrich" (say) could match several people with different middle names, or none.

    Comment


    • #3
      Hi Nick,

      Thank you for your quick answer!

      Actually, I do not know whether the name is actually DD EE FF or DD FF EE. I am not familiar with those names so I do not know whether they have several middle names or family names.

      I just noticed that there was one observation that had as name DD EE in one data set and DD EE FF in the other data set. It could be that they have two family names and in one data set, they included only one, and in the other both. That is why I thought about extracting the first, second (if there were more than 2 words), and last word of each name. And then create an identifier that merges one data set with the other if at least two words match.

      Not sure if this would be the best solution or if I am just complicating myself but that is what I thought so far.

      Comment


      • #4
        You can try fuzzy matching. However, you need to examine the output carefully and also come up with an appropriate cutoff. Below, I use matchit from SSC. Alternatively, after crossing the datasets, you can break up the names and compare combinations of pairs.

        Code:
        clear
        input str29(Name2)    
        "AA CC"
        "DD EE FF"
        end
        tempfile dataset2
        save `dataset2'
        
        clear
        input str29(Name1)    
        "AA BB CC"    
        "DD EE"
        end
        
        cross using `dataset2'
        matchit Name1 Name2
        list
        
        keep if similscore>.7    
        list
        Res.:

        Code:
        . matchit Name1 Name2
        Matching columns Name1 and Name2
        Similarity function: bigram
        0%
        20%
        60%
        80%
        100%
        Done!
        
        . 
        . list
        
             +---------------------------------+
             |    Name1      Name2   similsc~e |
             |---------------------------------|
          1. | AA BB CC      AA CC   .75592895 |
          2. |    DD EE      AA CC           0 |
          3. | AA BB CC   DD EE FF           0 |
          4. |    DD EE   DD EE FF   .75592895 |
             +---------------------------------+
        
        . 
        . 
        . 
        . keep if similscore>.7    
        (2 observations deleted)
        
        . 
        . list
        
             +---------------------------------+
             |    Name1      Name2   similsc~e |
             |---------------------------------|
          1. | AA BB CC      AA CC   .75592895 |
          2. |    DD EE   DD EE FF   .75592895 |
             +---------------------------------+

        Comment


        • #5
          Thank you Andrew! This helps a lot!!

          Comment

          Working...
          X