Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How can I Realize Fuzzy Replacement of Strings that share a Stem Using Stata?

    I have a data set in which there are several thousand hundred strings under the variable "birth_city" to clean.
    Although each cell is different from each other, in fact, the stem is the same city--Houston.
    So, the expected result is to replace all strings under "birth_city" with "Houston,TX,USA"

    *The dataset is listed as below for the purpose of illustration,
    clear
    input int uid str13 birth_city
    11 "Houston"
    14 "HOUSTON"
    12 "huoston"
    16 "h ous t o n"
    17 "h_u_ostno"
    19 "houst"
    24 "houston harris"
    18 "hisuton"
    15 "harris_ohuston"
    10 "houston,texas"
    25 "houston,tx"
    20 "houston,HaRris"
    end

    Thanks for your kindly help!
    Last edited by smith Jason; 15 Mar 2022, 00:51.

  • #2
    With fuzzy matching, you have to make a judgement call as to how similar is similar enough. The following uses matchit from SSC. For the initial strings ignoring capitalization, 14% captures all strings. Eliminating all non-alphabet characters further increases the scores.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int uid str13 birth_city
    11 "Houston"      
    14 "HOUSTON"      
    12 "huoston"      
    16 "h ous t o n"  
    17 "h_u_ostno"    
    19 "houst"        
    24 "houston harri"
    18 "hisuton"      
    15 "harris_ohusto"
    10 "houston,texas"
    25 "houston,tx"  
    20 "houston,HaRri"
    end
    
    *LOWER-CASE STRINGS
    gen city= lower(birth_city)
    *MATCHING KEYWORD
    gen city2= "houston"
    *SIMILARITY 1
    matchit city city2
    *ELIMINATE NON-ALPHABET CHARACTERS
    gen city3= ustrregexra(lower(birth_city), "[^a-z]", "")
    *SIMILARITY 2
    matchit city3 city2, g(similscore2)
    l, sep(0)
    Res.:

    Code:
     
    . l, sep(0)
    
         +--------------------------------------------------------------------------------------+
         | uid      birth_city            city     city2   similsc~e          city3   similsc~2 |
         |--------------------------------------------------------------------------------------|
      1. |  11         Houston         houston   houston           1        houston           1 |
      2. |  14         HOUSTON         houston   houston           1        houston           1 |
      3. |  12         huoston         huoston   houston          .5        huoston          .5 |
      4. |  16     h ous t o n     h ous t o n   houston   .23570226        houston           1 |
      5. |  17       h_u_ostno       h_u_ostno   houston   .14433757        huostno   .16666667 |
      6. |  19           houst           houst   houston   .81649658          houst   .81649658 |
      7. |  24   houston harri   houston harri   houston   .70710678   houstonharri   .73854895 |
      8. |  18         hisuton         hisuton   houston   .33333333        hisuton   .33333333 |
      9. |  15   harris_ohusto   harris_ohusto   houston   .35355339   harrisohusto   .36927447 |
     10. |  10   houston,texas   houston,texas   houston   .70710678   houstontexas   .73854895 |
     11. |  25      houston,tx      houston,tx   houston   .81649658      houstontx    .8660254 |
     12. |  20   houston,HaRri   houston,harri   houston   .70710678   houstonharri   .73854895 |
         +--------------------------------------------------------------------------------------+
    The final step is to choose a cutoff similar score and do the replacement, e.g.,

    Code:
    replace birth_city= "Houston,TX,USA" if similscore2>=0.15

    Comment


    • #3
      Thank you!
      If the dataset just has 10 records like this, it is great. However, the real dataset has almost several thousand hundred cities (municipalities), for example, like New York, Baltimore, and Beijing etc. How could I mimic the code you provided to address this issue? Can I write code to automatically detect the stems of the string under this variable "birth_city" and then mimic your thoughts to calculate the similarity score among the strings? For ease of communication, I listed the dataset with several string stems below,

      * Example generated by -dataex-.
      clear
      input int uid str33 birth_city
      11 "Houston"
      14 "HOUSTON"
      12 "huoston"
      16 "h ous t o n"
      17 "h_u_ostno"
      19 "houst"
      24 "houston harri"
      18 "hisuton"
      15 "harris_ohusto"
      10 "houston,texas"
      25 "houston,tx"
      20 "houston,HaRri"
      31 "Brooklyn,neW York"
      34 "NewYoRK,NY,UAS"
      32 "NWE YOKR City"
      36 "newyork,canada"
      37 "York,Canada"
      39 "New York,NY,USA"
      44 "NEWYORK"
      38 "Newkyor,ny,uas"
      35 "ne york,usa,ny"
      30 "new yokr,u s a"
      35 "new new york,uas,ny"
      40 "Baltimore,MD,USA"
      41 "BALTIMORE Maryland"
      44 "Batiomore,maryldna,uas"
      42 "bal ti more,md,usa"
      46 "baltimore,md,USA"
      47 "Balti more,usa,MD"
      49 "bALITIMORE CITY, MD, USA"
      54 "baltimore"
      58 "balmotity,md,usa"
      55 "BALTIMORE CITY,US,MD"
      50 "Baltimore,u s a, md"
      55 "usa,md,baltimore"
      60 "usa,BaLTIMORE,CITY,Maryland State"
      61 "B E I ng,"
      64 "Beijing chian"
      62 "beijing"
      76 "Beijing beijing"
      77 "Bei Ji n g"
      99 "BEIJING,CHN"
      84 "Beijing"
      88 "Beijing ch i n a"
      95 "BEIJING CHINA"
      90 "Beijing,China"
      91 "Peking,people's republic of china"
      92 "Beijing,CHINA"
      end
      It is evident there are many typos in the dataset above. We see that most of the persons were born from four cities: Houston, TX, USA; New York, NY, USA, Baltimore, MD, USA, and Beijing, China. Also, it is found that person with id=37 is the correct data record, and it should NOT be replaced. Can anybody help me to develop the code to solve this problem?
      Thank you!
      Last edited by smith Jason; 15 Mar 2022, 10:22.

      Comment


      • #4
        Yes, there are many applications of this kind. See, e.g.,

        https://www.statalist.org/forums/for...and-one-column
        https://www.statalist.org/forums/forum/general-stata-discussion/general/1650108-unifying-the-names-of-companies


        Comment


        • #5
          Thank you! The cutoff score is 0.15. Is it an arbitrary value?

          Comment


          • #6
            Yes, you need to do a visual inspection to determine what cutoff is optimal.

            Comment


            • #7
              Thank you!

              Comment


              • #8
                * Example generated by -dataex-.
                clear
                input int uid str33 birth_city
                11 "Houston"
                14 "HOUSTON"
                12 "huoston"
                16 "h ous t o n"
                17 "h_u_ostno"
                19 "houst"
                24 "houston harri"
                18 "hisuton"
                15 "harris_ohusto"
                10 "houston,texas"
                25 "houston,tx"
                20 "houston,HaRri"
                31 "Brooklyn,neW York"
                34 "NewYoRK,NY,UAS"
                32 "NWE YOKR City"
                36 "newyork,canada"
                37 "York,Canada"
                39 "New York,NY,USA"
                44 "NEWYORK"
                38 "Newkyor,ny,uas"
                35 "ne york,usa,ny"
                30 "new yokr,u s a"
                35 "new new york,uas,ny"
                40 "Baltimore,MD,USA"
                41 "BALTIMORE Maryland"
                44 "Batiomore,maryldna,uas"
                42 "bal ti more,md,usa"
                46 "baltimore,md,USA"
                47 "Balti more,usa,MD"
                49 "bALITIMORE CITY, MD, USA"
                54 "baltimore"
                58 "balmotity,md,usa"
                55 "BALTIMORE CITY,US,MD"
                50 "Baltimore,u s a, md"
                55 "usa,md,baltimore"
                60 "usa,BaLTIMORE,CITY,Maryland State"
                61 "B E I ng,"
                64 "Beijing chian"
                62 "beijing"
                76 "Beijing beijing"
                77 "Bei Ji n g"
                99 "BEIJING,CHN"
                84 "Beijing"
                88 "Beijing ch i n a"
                95 "BEIJING CHINA"
                90 "Beijing,China"
                91 "Peking,people's republic of china"
                92 "Beijing,CHINA"
                end
                loc y newyork beijing houston baltimore
                loc x 0.38 0.22 0.15 0.39
                foreach a of local y {
                gettoken b x : x

                *LOWER-CASE STRINGS
                gen city= lower(birth_city)
                *MATCHING KEYWORD
                gen city2= "`a'"

                *SIMILARITY 1
                matchit city city2

                *ELIMINATE NON-ALPHABET CHARACTERS
                gen city3= ustrregexra(lower(birth_city), "[^a-z]", "")
                *SIMILARITY 2
                matchit city3 city2, g(similscore2)
                l, sep(0)

                replace birth_city= "`a'" if similscore2>=`b'

                drop city-similscore2
                }

                Although the code can work, the logic is not that correct. Can anybody help to change the code?
                The 4 values of 0.38 0.22 0.15 0.39 are the cutoff values to replace the birth_city with the corresponding strings.
                Last edited by smith Jason; 15 Mar 2022, 23:51.

                Comment


                • #9
                  How can I automate the whole process instead of visual inspection of the cutoff value every time at replace birth_city= "`a'" if similscore2>=`b'?
                  Thank you for your guidance!

                  Comment


                  • #10
                    Your strings are very messy. Perhaps browse the data and specify a list of irrelevant keywords to delete, such as country names. Fuzzy matching will not give you perfect results, so you will have to do some manual sorting.

                    Code:
                    * Example generated by -dataex-.
                    clear
                    input int uid str33 birth_city
                    11 "Houston"
                    14 "HOUSTON"
                    12 "huoston"
                    16 "h ous t o n"
                    17 "h_u_ostno"
                    19 "houst"
                    24 "houston harri"
                    18 "hisuton"
                    15 "harris_ohusto"
                    10 "houston,texas"
                    25 "houston,tx"
                    20 "houston,HaRri"
                    31 "Brooklyn,neW York"
                    34 "NewYoRK,NY,UAS"
                    32 "NWE YOKR City"
                    36 "newyork,canada"
                    37 "York,Canada"
                    39 "New York,NY,USA"
                    44 "NEWYORK"
                    38 "Newkyor,ny,uas"
                    35 "ne york,usa,ny"
                    30 "new yokr,u s a"
                    35 "new new york,uas,ny"
                    40 "Baltimore,MD,USA"
                    41 "BALTIMORE Maryland"
                    44 "Batiomore,maryldna,uas"
                    42 "bal ti more,md,usa"
                    46 "baltimore,md,USA"
                    47 "Balti more,usa,MD"
                    49 "bALITIMORE CITY, MD, USA"
                    54 "baltimore"
                    58 "balmotity,md,usa"
                    55 "BALTIMORE CITY,US,MD"
                    50 "Baltimore,u s a, md"
                    55 "usa,md,baltimore"
                    60 "usa,BaLTIMORE,CITY,Maryland State"
                    61 "B E I ng,"
                    64 "Beijing chian"
                    62 "beijing"
                    76 "Beijing beijing"
                    77 "Bei Ji n g"
                    99 "BEIJING,CHN"
                    84 "Beijing"
                    88 "Beijing ch i n a"
                    95 "BEIJING CHINA"
                    90 "Beijing,China"
                    91 "Peking,people's republic of china"
                    92 "Beijing,CHINA"
                    end
                    
                    
                    gen bc= birth_city
                    local todelete usa|china|republic|people|city|texas|canada
                    tempfile data
                    replace birth_city=ustrregexra(" " + lower(birth_city) + " ", "['!?,\. ](`todelete')['!?,\. ]", "")
                    replace birth_city=ustrregexra(lower(birth_city), "[^a-z]", "")
                    save `data'
                    rename birth_city birth_city2
                    cross using `data'
                    matchit birth_city birth_city2
                    keep if similscore>.30
                    contract birth_city birth_city2 
                    replace _freq=_n
                    rename birth_city birth_city1
                    reshape long birth_city, i(_freq) j(which)
                    bys _freq (which): gen birth_city2=birth_city[1]
                    keep birth_city birth_city2
                    gen order=_n
                    bys birth_city (order): keep if _n==_N
                    tempfile list
                    save `list', replace
                    use `data', clear
                    merge m:1 birth_city using `list',keep (master match) nogen
                    sort birth_city2
                    l, sepby(birth_city2)
                    Res.:

                    Code:
                    . sort birth_city2
                    
                    . 
                    . l, sepby(birth_city2)
                    
                         +----------------------------------------------------------------------------------------+
                         | uid               birth_city                                  bc   birth_city2   order |
                         |----------------------------------------------------------------------------------------|
                      1. |  76           beijingbeijing                     Beijing beijing         being     236 |
                      2. |  64             beijingchian                       Beijing chian         being     238 |
                      3. |  84                  beijing                             Beijing         being     234 |
                      4. |  99               beijingchn                         BEIJING,CHN         being     242 |
                      5. |  88             beijingchina                    Beijing ch i n a         being     240 |
                      6. |  62                  beijing                             beijing         being     234 |
                      7. |  90                  beijing                       Beijing,China         being     234 |
                      8. |  92                  beijing                       Beijing,CHINA         being     234 |
                      9. |  95                  beijing                       BEIJING CHINA         being     234 |
                     10. |  77                  beijing                          Bei Ji n g         being     234 |
                         |----------------------------------------------------------------------------------------|
                     11. |  19                    houst                               houst     houstontx     322 |
                         |----------------------------------------------------------------------------------------|
                     12. |  15             harrisohusto                       harris_ohusto       huoston     336 |
                     13. |  25                houstontx                          houston,tx       huoston     344 |
                     14. |  17                  huostno                           h_u_ostno       huoston     346 |
                     15. |  12                  huoston                             huoston       huoston     348 |
                     16. |  24             houstonharri                       houston harri       huoston     342 |
                     17. |  14                  houston                             HOUSTON       huoston     340 |
                     18. |  16                  houston                         h ous t o n       huoston     340 |
                     19. |  18                  hisuton                             hisuton       huoston     338 |
                     20. |  20             houstonharri                       houston,HaRri       huoston     342 |
                     21. |  10                  houston                       houston,texas       huoston     340 |
                     22. |  11                  houston                             Houston       huoston     340 |
                         |----------------------------------------------------------------------------------------|
                     23. |  44     batiomoremaryldnauas              Batiomore,maryldna,uas   mdbaltimore     366 |
                     24. |  55              mdbaltimore                    usa,md,baltimore   mdbaltimore     368 |
                     25. |  55            baltimoreusmd                BALTIMORE CITY,US,MD   mdbaltimore     364 |
                     26. |  50           baltimoreusamd                 Baltimore,u s a, md   mdbaltimore     362 |
                     27. |  42              baltimoremd                  bal ti more,md,usa   mdbaltimore     360 |
                     28. |  47              baltimoremd                   Balti more,usa,MD   mdbaltimore     360 |
                     29. |  58              balmotitymd                    balmotity,md,usa   mdbaltimore     352 |
                     30. |  60   baltimoremarylandstate   usa,BaLTIMORE,CITY,Maryland State   mdbaltimore     358 |
                     31. |  41        baltimoremaryland                  BALTIMORE Maryland   mdbaltimore     356 |
                     32. |  40              baltimoremd                    Baltimore,MD,USA   mdbaltimore     360 |
                     33. |  46              baltimoremd                    baltimore,md,USA   mdbaltimore     360 |
                     34. |  54                baltimore                           baltimore   mdbaltimore     354 |
                     35. |  49             balitimoremd            bALITIMORE CITY, MD, USA   mdbaltimore     350 |
                         |----------------------------------------------------------------------------------------|
                     36. |  32                  nweyokr                       NWE YOKR City       nweyokr     498 |
                     37. |  30               newyokrusa                      new yokr,u s a       nweyokr     494 |
                         |----------------------------------------------------------------------------------------|
                     38. |  91                pekingsof   Peking,people's republic of china     pekingsof     502 |
                     39. |  61                    being                           B E I ng,     pekingsof     500 |
                         |----------------------------------------------------------------------------------------|
                     40. |  34             newyorknyuas                      NewYoRK,NY,UAS          york     514 |
                     41. |  35          newnewyorkuasny                 new new york,uas,ny          york     508 |
                     42. |  38             newkyornyuas                      Newkyor,ny,uas          york     506 |
                     43. |  31          brooklynnewyork                   Brooklyn,neW York          york     504 |
                     44. |  39                newyorkny                     New York,NY,USA          york     512 |
                     45. |  44                  newyork                             NEWYORK          york     510 |
                     46. |  37                     york                         York,Canada          york     518 |
                     47. |  36                  newyork                      newyork,canada          york     510 |
                     48. |  35                 neyorkny                      ne york,usa,ny          york     516 |
                         +----------------------------------------------------------------------------------------+
                    
                    .

                    Comment


                    • #11
                      Thank you very much!

                      Comment

                      Working...
                      X