Matchit in one dataset and one column

Rachael Finway

Join Date: Apr 2021

Posts: 6
#1

Matchit in one dataset and one column

12 Apr 2021, 09:43

Hello all,
I need some help with using the matchit function. I have one dataset and need to rename the Name variable observations so they are the same as others which are written similarly but differently.
I have included the example data below.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str16 Name str4 ID "Alpha" "7501" "Alpha Team" "7501" "Alpha Team Inc" "7501" "Alpha Alpha Team" "7501" "Beta" "6498" "Beta Team" "6498" "Beta Team Inc" "6498" "Beta Beta Team" "6498" "Theta" "6320" "Theta Team" "6320" "Theta Team Inc" "6320" "Theta Theta Team" "6320" end

There are slightly varying names for the observations but the have common ID codes. I would like to use matchit to change the names to one uniform name i.e. All those with ID "7501" and contain "Alpha" in the Name variable to become "Alpha Team Inc", those with ID "6498" and contain "Beta" to become "Beta Team Inc" etc, for all my observations (~3,000 observations). Any help is much appreciated!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 9956

12 Apr 2021, 10:00

Does ID uniquely identify name? For example, can you have a name different from "Alpha ???" for ID 7501 that you will not want changed? Otherwise, I do not see what the use of matching is here.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str16 Name str4 ID
"Theta Theta Team" "6320"
"Theta Team"       "6320"
"Theta"            "6320"
"Theta Team Inc"   "6320"
"Beta"             "6498"
"Beta Team"        "6498"
"Beta Beta Team"   "6498"
"Beta Team Inc"    "6498"
"Alpha Team"       "7501"
"Alpha Alpha Team" "7501"
"Alpha"            "7501"
"Alpha Team Inc"   "7501"
end

gen Name2= Name if regexm(lower(Name), "team inc")
bys ID (Name2): replace Name2= Name2[_N]

Res.:

Code:

. l,sepby(ID)

     +------------------------------------------+
     |             Name     ID            Name2 |
     |------------------------------------------|
  1. | Theta Theta Team   6320   Theta Team Inc |
  2. |       Theta Team   6320   Theta Team Inc |
  3. |            Theta   6320   Theta Team Inc |
  4. |   Theta Team Inc   6320   Theta Team Inc |
     |------------------------------------------|
  5. |             Beta   6498    Beta Team Inc |
  6. |        Beta Team   6498    Beta Team Inc |
  7. |   Beta Beta Team   6498    Beta Team Inc |
  8. |    Beta Team Inc   6498    Beta Team Inc |
     |------------------------------------------|
  9. |       Alpha Team   7501   Alpha Team Inc |
 10. | Alpha Alpha Team   7501   Alpha Team Inc |
 11. |            Alpha   7501   Alpha Team Inc |
 12. |   Alpha Team Inc   7501   Alpha Team Inc |
     +------------------------------------------+

Comment

Rachael Finway

Join Date: Apr 2021
Posts: 6

12 Apr 2021, 10:07

Hi Andrew,
Thank you for the response! The thing is sometimes the ID will represent other names e.g. 7501 may be Kings Hotel as they are grouped under one ID. Also not all the observations are as the example, so for instance there could be Kings, Kings' Hotel, Kings Hotel group with the same ID 7501.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str17 Name str4 ID
"Alpha"             "7501"
"Alpha Team"        "7501"
"Alpha Team Inc"    "7501"
"Alpha Alpha Team"  "7501"
"Beta"              "6498"
"Beta Team"         "6498"
"Beta Team Inc"     "6498"
"Beta Beta Team"    "6498"
"Theta"             "6320"
"Theta Team"        "6320"
"Theta Team Inc"    "6320"
"Theta Theta Team"  "6320"
"Kings"             "7501"
"Kings Hotel Group" "7501"
"Kings Hotel"       "7501"
"The Kings Hotel"   "7501"
end

Last edited by Rachael Finway; 12 Apr 2021, 10:11.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9956

12 Apr 2021, 11:00

Got it, thanks. matchit is from SSC. Note that results may vary depending on what threshold you choose for the similarity index. You might want to experiment with different values and find which one works for you. Below, I use a similarity score of 0.2 which works well for the example. I find that values from 0.2 upwards are potentially good matches.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str17 Name str4 ID
"Alpha"             "7501"
"Alpha Team"        "7501"
"Alpha Team Inc"    "7501"
"Alpha Alpha Team"  "7501"
"Beta"              "6498"
"Beta Team"         "6498"
"Beta Team Inc"     "6498"
"Beta Beta Team"    "6498"
"Theta"             "6320"
"Theta Team"        "6320"
"Theta Team Inc"    "6320"
"Theta Theta Team"  "6320"
"Kings"             "7501"
"Kings Hotel Group" "7501"
"Kings Hotel"       "7501"
"The Kings Hotel"   "7501"
end

preserve
rename Name Name2 
tempfile names list
save `names'
restore, preserve
joinby ID using `names'
*ssc install matchit
*ssc install freqindex
matchit Name Name2
keep if similscore > 0.2
contract Name Name2
replace _freq=_n
rename Name Name1
reshape long Name, i(_freq) j(which)
bys _freq (which): gen Name2=Name[1]
keep Name Name2
gen order=_n
bys Name (order): keep if _n==_N
save `list', replace
restore
merge m:1 Name using `list',keep (master match) nogen
replace Name2=Name if missing(Name2)
sort Name2
drop order
l, sepby(Name2)

Res.:

Code:

. l, sepby(Name2)

     +-----------------------------------------------------+
     |              Name     ID   entry              Name2 |
     |-----------------------------------------------------|
  1. |             Alpha   7501       5     Alpha Team Inc |
  2. |  Alpha Alpha Team   7501       3     Alpha Team Inc |
  3. |        Alpha Team   7501       7     Alpha Team Inc |
  4. |    Alpha Team Inc   7501       8     Alpha Team Inc |
     |-----------------------------------------------------|
  5. |              Beta   6498       1      Beta Team Inc |
  6. |    Beta Beta Team   6498       4      Beta Team Inc |
  7. |         Beta Team   6498       3      Beta Team Inc |
  8. |     Beta Team Inc   6498       2      Beta Team Inc |
     |-----------------------------------------------------|
  9. |             Kings   7501       4    The Kings Hotel |
 10. |       Kings Hotel   7501       1    The Kings Hotel |
 11. | Kings Hotel Group   7501       6    The Kings Hotel |
 12. |   The Kings Hotel   7501       2    The Kings Hotel |
     |-----------------------------------------------------|
 13. |             Theta   6320       2   Theta Theta Team |
 14. |        Theta Team   6320       4   Theta Theta Team |
 15. |    Theta Team Inc   6320       3   Theta Theta Team |
 16. |  Theta Theta Team   6320       1   Theta Theta Team |
     +-----------------------------------------------------+

Last edited by Andrew Musau; 12 Apr 2021, 11:15.

Comment

Rachael Finway

Join Date: Apr 2021

Posts: 6
#5

13 Apr 2021, 07:12

Andrew thank you so much! I'm just working my way through the code, learning as I go along, I will update you on the progress. Thanks again!
Comment

Announcement