Matching a part of a string variable

Scott Rick

Join Date: May 2021
Posts: 242

04 Jun 2021, 04:50

Hi. I have two columns of drug names, “rxname” and “Product”, as shown below. I want to do the following:

1. Match observations if
   a. rxname is contained in Product, OR
   b. Product is contained in rxname (e.g. this would match TRICOR with TRICOR / TRILIPIX)
2. Match on the first word of rxname and Product (e.g. this would match DIOVAN W/HCTZ with DIOVAN / HCT)

I'm new to strings, so any help would be much appreciated.

Code:

clear
input long id2 str49 rxname int id1 str30 Product
 2 "TOPROL XL"              745 "PROTOPIC"       
 2 "TOPROL XL"              915 "TOPROL-XL"     
 7 "SEROQUEL (FILM-COATED)" 838 "SEROQUEL XR"   
 7 "SEROQUEL (FILM-COATED)" 834 "SEROQUEL IR"   
15 "LYRICA"                 600 "LYRICA"        
15 "LYRICA"                 602 "LYRICA"             
23 "VERAPAMIL"              974 "VERAMYST"      
23 "VERAPAMIL"              973 "VERAMYST"       
23 "VERAPAMIL"              774 "RAPAMUNE"        
25 "VERAPAMIL"              778 "RAPAMUNE"      
25 "VERAPAMIL"              774 "RAPAMUNE"      
29 "AMBIEN"                  70 "AMBIEN / CR"    
30 "ATENOLOL"                 5 "ACCOLATE"      
30 "ATENOLOL"                 8 "ACCOLATE"       
35 "ALPRAZOLAM"             991 "VIRAZOLE"      
39 "AMBIEN"                  70 "AMBIEN / CR"   
39 "AMBIEN"                  68 "AMBIEN / CR"        
40 "LEXAPRO"                263 "CELEXA"        
40 "LEXAPRO"                266 "CELEXA"        
40 "LEXAPRO"                264 "CELEXA"        
40 "LEXAPRO"                552 "LEXAPRO"       
40 "LEXAPRO"                262 "CELEXA"
end

Scott Rick

Join Date: May 2021

Posts: 242
#2

04 Jun 2021, 08:07

-matchit- does not work here, since the names are definitely not exactly the same. And I have about 14,000 observations, so checking manually is infeasible.
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#3

04 Jun 2021, 10:15

Regarding your first question:

Code:

gen byte either_contains = (strpos(Product, rxname) > 0) | (strpos(rxname, Product) > 0)
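
As a quick illustration (a sketch added for clarity, using literal values taken from the example data rather than the variables themselves): strpos(s1, s2) returns the position at which s2 first appears in s1, and 0 if it does not appear at all.

```stata
* Found at the start of the string, so this displays 1
display strpos("AMBIEN / CR", "AMBIEN")

* Hyphen versus space: no exact substring, so this displays 0
display strpos("TOPROL-XL", "TOPROL XL")

* Neither name contains the other, so this displays 0
display strpos("CELEXA", "LEXAPRO")
```

Note that the containment test is literal, so pairs that differ only in punctuation (such as TOPROL XL and TOPROL-XL) are not flagged by it.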

I don't quite know what your second question means: Match *what* on the first word? Perhaps you mean "detect whether the first word of rxname matches the first word of Product?"

Code:

gen byte firstsame = word(rxname,1) == word(Product, 1)
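
One caveat: word() splits on spaces only, so the first word of "TOPROL-XL" is "TOPROL-XL", which does not equal "TOPROL". A minimal sketch of one possible workaround follows, assuming that hyphens and slashes should be treated as word separators in your data. The variables rx_clean, prod_clean, and firstsame_clean are hypothetical helper names, not anything from the original question.

```stata
* Assumption: "-" and "/" should count as word separators.
* The "." as the last argument of subinstr() replaces all occurrences.
generate rx_clean   = subinstr(subinstr(rxname,  "-", " ", .), "/", " ", .)
generate prod_clean = subinstr(subinstr(Product, "-", " ", .), "/", " ", .)

* Compare the first words of the cleaned names
generate byte firstsame_clean = word(rx_clean, 1) == word(prod_clean, 1)
```

Whether other characters (parentheses, periods) also need this treatment depends on what appears in the full data set.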

For simplicity, I have ignored the fact that string comparisons are case-sensitive in Stata. Before doing either of the preceding, I'd make both variables lowercase (or uppercase, your preference):

Code:

replace rxname = lower(rxname)
replace Product = lower(Product)

Alternatively, you might want to do the conversion on the fly:

Code:

gen byte firstsame = lower(word(rxname,1)) == lower(word(Product, 1))
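
Putting the pieces together, here is a rough sketch of how both checks could be made case-insensitive and combined into a single flag while leaving the original variables untouched. It assumes that a pair should count as a match when either check succeeds; rx_lc, prod_lc, contains_lc, first_lc, and any_match are hypothetical names chosen for illustration.

```stata
* Lowercase copies, so the original variables are not modified
generate rx_lc   = lower(rxname)
generate prod_lc = lower(Product)

* Check 1: either name contains the other
generate byte contains_lc = (strpos(prod_lc, rx_lc) > 0) | (strpos(rx_lc, prod_lc) > 0)

* Check 2: the first words are identical
generate byte first_lc = word(rx_lc, 1) == word(prod_lc, 1)

* Assumption: a pair matches if either check succeeds
generate byte any_match = contains_lc | first_lc

* Inspect the flagged pairs
list rxname Product if any_match
```

With roughly 14,000 observations, listing the flagged pairs (or a random sample of them) is a reasonable way to spot false matches before relying on the flag.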

You can learn about all of these things and more "string stuff" at:

Code:

-help string functions-

Now, it's possible that you had in mind something about matching different observations to one another. If that's true, please explain some more.
Scott Rick

Join Date: May 2021

Posts: 242
#4

04 Jun 2021, 19:38

Mike Lacy, thank you. That was exactly what I was looking for, and it worked perfectly.
