Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matchit in one dataset and one column

    Hello all,
    I need some help with using the matchit function. I have one dataset and need to rename the Name variable observations so they are the same as others which are written similarly but differently.
    I have included the example data below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str16 Name str4 ID
    "Alpha"            "7501"
    "Alpha Team"       "7501"
    "Alpha Team Inc"   "7501"
    "Alpha Alpha Team" "7501"
    "Beta"             "6498"
    "Beta Team"        "6498"
    "Beta Team Inc"    "6498"
    "Beta Beta Team"   "6498"
    "Theta"            "6320"
    "Theta Team"       "6320"
    "Theta Team Inc"   "6320"
    "Theta Theta Team" "6320"
    end
    There are slightly varying names for the observations but the have common ID codes. I would like to use matchit to change the names to one uniform name i.e. All those with ID "7501" and contain "Alpha" in the Name variable to become "Alpha Team Inc", those with ID "6498" and contain "Beta" to become "Beta Team Inc" etc, for all my observations (~3,000 observations). Any help is much appreciated!

  • #2
    Does ID uniquely identify name? For example, can you have a name different from "Alpha ???" for ID 7501 that you will not want changed? Otherwise, I do not see what the use of matching is here.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str16 Name str4 ID
    "Theta Theta Team" "6320"
    "Theta Team"       "6320"
    "Theta"            "6320"
    "Theta Team Inc"   "6320"
    "Beta"             "6498"
    "Beta Team"        "6498"
    "Beta Beta Team"   "6498"
    "Beta Team Inc"    "6498"
    "Alpha Team"       "7501"
    "Alpha Alpha Team" "7501"
    "Alpha"            "7501"
    "Alpha Team Inc"   "7501"
    end
    
    gen Name2= Name if regexm(lower(Name), "team inc")
    bys ID (Name2): replace Name2= Name2[_N]
    Res.:

    Code:
    . l,sepby(ID)
    
         +------------------------------------------+
         |             Name     ID            Name2 |
         |------------------------------------------|
      1. | Theta Theta Team   6320   Theta Team Inc |
      2. |       Theta Team   6320   Theta Team Inc |
      3. |            Theta   6320   Theta Team Inc |
      4. |   Theta Team Inc   6320   Theta Team Inc |
         |------------------------------------------|
      5. |             Beta   6498    Beta Team Inc |
      6. |        Beta Team   6498    Beta Team Inc |
      7. |   Beta Beta Team   6498    Beta Team Inc |
      8. |    Beta Team Inc   6498    Beta Team Inc |
         |------------------------------------------|
      9. |       Alpha Team   7501   Alpha Team Inc |
     10. | Alpha Alpha Team   7501   Alpha Team Inc |
     11. |            Alpha   7501   Alpha Team Inc |
     12. |   Alpha Team Inc   7501   Alpha Team Inc |
         +------------------------------------------+

    Comment


    • #3
      Hi Andrew,
      Thank you for the response! The thing is sometimes the ID will represent other names e.g. 7501 may be Kings Hotel as they are grouped under one ID. Also not all the observations are as the example, so for instance there could be Kings, Kings' Hotel, Kings Hotel group with the same ID 7501.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str17 Name str4 ID
      "Alpha"             "7501"
      "Alpha Team"        "7501"
      "Alpha Team Inc"    "7501"
      "Alpha Alpha Team"  "7501"
      "Beta"              "6498"
      "Beta Team"         "6498"
      "Beta Team Inc"     "6498"
      "Beta Beta Team"    "6498"
      "Theta"             "6320"
      "Theta Team"        "6320"
      "Theta Team Inc"    "6320"
      "Theta Theta Team"  "6320"
      "Kings"             "7501"
      "Kings Hotel Group" "7501"
      "Kings Hotel"       "7501"
      "The Kings Hotel"   "7501"
      end
      Last edited by Rachael Finway; 12 Apr 2021, 10:11.

      Comment


      • #4
        Got it, thanks. matchit is from SSC. Note that results may vary depending on what threshold you choose for the similarity index. You might want to experiment with different values and find which one works for you. Below, I use a similarity score of 0.2 which works well for the example. I find that values from 0.2 upwards are potentially good matches.


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str17 Name str4 ID
        "Alpha"             "7501"
        "Alpha Team"        "7501"
        "Alpha Team Inc"    "7501"
        "Alpha Alpha Team"  "7501"
        "Beta"              "6498"
        "Beta Team"         "6498"
        "Beta Team Inc"     "6498"
        "Beta Beta Team"    "6498"
        "Theta"             "6320"
        "Theta Team"        "6320"
        "Theta Team Inc"    "6320"
        "Theta Theta Team"  "6320"
        "Kings"             "7501"
        "Kings Hotel Group" "7501"
        "Kings Hotel"       "7501"
        "The Kings Hotel"   "7501"
        end
        
        preserve
        rename Name Name2 
        tempfile names list
        save `names'
        restore, preserve
        joinby ID using `names'
        *ssc install matchit
        *ssc install freqindex
        matchit Name Name2
        keep if similscore > 0.2
        contract Name Name2
        replace _freq=_n
        rename Name Name1
        reshape long Name, i(_freq) j(which)
        bys _freq (which): gen Name2=Name[1]
        keep Name Name2
        gen order=_n
        bys Name (order): keep if _n==_N
        save `list', replace
        restore
        merge m:1 Name using `list',keep (master match) nogen
        replace Name2=Name if missing(Name2)
        sort Name2
        drop order
        l, sepby(Name2)

        Res.:

        Code:
        . l, sepby(Name2)
        
             +-----------------------------------------------------+
             |              Name     ID   entry              Name2 |
             |-----------------------------------------------------|
          1. |             Alpha   7501       5     Alpha Team Inc |
          2. |  Alpha Alpha Team   7501       3     Alpha Team Inc |
          3. |        Alpha Team   7501       7     Alpha Team Inc |
          4. |    Alpha Team Inc   7501       8     Alpha Team Inc |
             |-----------------------------------------------------|
          5. |              Beta   6498       1      Beta Team Inc |
          6. |    Beta Beta Team   6498       4      Beta Team Inc |
          7. |         Beta Team   6498       3      Beta Team Inc |
          8. |     Beta Team Inc   6498       2      Beta Team Inc |
             |-----------------------------------------------------|
          9. |             Kings   7501       4    The Kings Hotel |
         10. |       Kings Hotel   7501       1    The Kings Hotel |
         11. | Kings Hotel Group   7501       6    The Kings Hotel |
         12. |   The Kings Hotel   7501       2    The Kings Hotel |
             |-----------------------------------------------------|
         13. |             Theta   6320       2   Theta Theta Team |
         14. |        Theta Team   6320       4   Theta Theta Team |
         15. |    Theta Team Inc   6320       3   Theta Theta Team |
         16. |  Theta Theta Team   6320       1   Theta Theta Team |
             +-----------------------------------------------------+
        Last edited by Andrew Musau; 12 Apr 2021, 11:15.

        Comment


        • #5
          Andrew thank you so much! I'm just working my way through the code, learning as I go along, I will update you on the progress. Thanks again!

          Comment

          Working...
          X