  • Grouping similar strings

    Hello. I have a string variable and need a way to find pairs of similar but not identical values, for example through some kind of identifier or categorical variable that flags them. In the example below, the "similar" variable indicates that observations 1 and 9 are quite similar, as are observations 5 and 6.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str18 var float similar
     1 "JUAN PEREZ"         1
     2 "JUAN GONZALEZ"      .
     3 "ARTURO PRAT"        .
     4 "ARTURO VIDAL"       .
     5 "SALVADOR ALLENDE"   2
     6 "SALVADOR ALLENDE G" 2
     7 "DON MATIAS"         .
     8 "DIEGO SOTO"         .
     9 "JUAN A PEREZ"       1
    10 "JUAN PABLO SASA"    .
    end

    My dataset has hundreds of thousands of observations, so obviously I can't do this by eye. I don't need a perfect method, just something that gives an approximate first result that I can then review visually. Any ideas? Thanks

  • #2
    Julio Raffo's -matchit-, available from SSC, will do this. It's a complicated program, so you'll need to invest some time reading the help file to see how to use it. But it will do what you ask here. One hint: you will have to make a second variable that is a copy of the first, and then you will get matches between those variables.
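
    As a rough illustration of that hint, one possible reading is to save a renamed copy of the data and match the original against it. This is only a sketch, not a tested recipe: it assumes -matchit- is installed (ssc install matchit) and that the example data from #1 are in memory. The names id2 and var2, the tempfile, and the 0.6 cutoff are illustrative assumptions, and the help file should be checked for the options that suit your data.

    ```stata
    * Sketch only: id and var come from the example in #1;
    * id2, var2 and the 0.6 cutoff are assumptions made for illustration.
    preserve
    rename (id var) (id2 var2)
    tempfile copy
    save `copy'
    restore

    * Compare every string in memory with every string in the copy.
    * -override- tells matchit to proceed even though the data in memory are unsaved.
    matchit id var using `copy', idusing(id2) txtusing(var2) override

    * Keep each candidate pair once and drop matches of an observation with itself
    keep if id < id2

    * Look at the most similar pairs first
    gsort -similscore
    list id var id2 var2 similscore if similscore > 0.6, noobs
    ```

    The result is one row per candidate pair with a similarity score, which can then be reviewed by eye or turned into a group identifier.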

    One other thing: with hundreds of thousands of observations, you may run into memory issues. If that happens, you may need to first partition the data set into pieces that are unlikely to contain matches with each other, and then do the matching separately within each piece.
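
    The partitioning idea could be sketched as follows. It is hedged in the same way as any untested outline: it assumes -matchit- is installed and that id and var are in memory, and it uses the first letter of the string as the partition, which is purely an illustrative choice. Pairs whose first letters differ would never be compared, so the blocking variable has to be chosen with the data in mind.

    ```stata
    * Sketch only: blocking on the first letter is an assumption for illustration.
    * block, id2, var2 and the tempfiles are hypothetical names.
    gen str1 block = substr(var, 1, 1)
    tempfile full results
    save `full'

    levelsof block, local(blocks)
    local first = 1
    foreach b of local blocks {
        * Build a renamed copy of this block to match against
        use `full' if block == "`b'", clear
        rename (id var) (id2 var2)
        tempfile copy
        save `copy', replace

        * Match the block against its own copy
        use `full' if block == "`b'", clear
        matchit id var using `copy', idusing(id2) txtusing(var2) override
        keep if id < id2

        * Stack the candidate pairs from all blocks
        if `first' == 0 append using `results'
        save `results', replace
        local first = 0
    }

    use `results', clear
    gsort -similscore
    ```

    Each pass holds only one block in memory, so the number of comparisons grows with the size of the largest block rather than with the whole data set.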


    • #3
      Excellent! Thank you very much Clyde
