  • Matching a part of a string variable

    Hi. I have two columns of drug names, “rxname” and “Product”, as shown below. I want to do the following:
    1. Match observations if
      1. rxname is contained in Product, OR
      2. Product is contained in rxname (e.g. this would match TRICOR with TRICOR / TRILIPIX)
    2. Match on the first word of rxname and Product (e.g. this would match DIOVAN W/HCTZ with DIOVAN / HCT)
    I'm new to strings, so any help would be much appreciated.

    Code:
    clear
    input long id2 str49 rxname int id1 str30 Product
     2 "TOPROL XL"              745 "PROTOPIC"
     2 "TOPROL XL"              915 "TOPROL-XL"
     7 "SEROQUEL (FILM-COATED)" 838 "SEROQUEL XR"
     7 "SEROQUEL (FILM-COATED)" 834 "SEROQUEL IR"
    15 "LYRICA"                 600 "LYRICA"
    15 "LYRICA"                 602 "LYRICA"
    23 "VERAPAMIL"              974 "VERAMYST"
    23 "VERAPAMIL"              973 "VERAMYST"
    23 "VERAPAMIL"              774 "RAPAMUNE"
    25 "VERAPAMIL"              778 "RAPAMUNE"
    25 "VERAPAMIL"              774 "RAPAMUNE"
    29 "AMBIEN"                  70 "AMBIEN / CR"
    30 "ATENOLOL"                 5 "ACCOLATE"
    30 "ATENOLOL"                 8 "ACCOLATE"
    35 "ALPRAZOLAM"             991 "VIRAZOLE"
    39 "AMBIEN"                  70 "AMBIEN / CR"
    39 "AMBIEN"                  68 "AMBIEN / CR"
    40 "LEXAPRO"                263 "CELEXA"
    40 "LEXAPRO"                266 "CELEXA"
    40 "LEXAPRO"                264 "CELEXA"
    40 "LEXAPRO"                552 "LEXAPRO"
    40 "LEXAPRO"                262 "CELEXA"
    end

  • #2
    -matchit- does not work here, since the names are definitely not exactly the same. And I have about 14,000 observations, so checking manually is infeasible.


    • #3
      Regarding your first question:
      Code:
      gen byte either_contains = (strpos(Product, rxname) > 0) | (strpos(rxname, Product) > 0)
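      To see which direction each strpos() call tests, a quick check on literal values taken from the sample data may help. This is only an illustrative sketch; the results are noted in the comment lines:

      ```stata
      * strpos(s1, s2) returns the position of s2 within s1, or 0 if s2 is absent.

      * "AMBIEN" sits at the start of "AMBIEN / CR", so this displays 1
      display strpos("AMBIEN / CR", "AMBIEN")

      * The longer string cannot be inside the shorter one, so this displays 0
      display strpos("AMBIEN", "AMBIEN / CR")

      * A hyphen is not a space, so this displays 0
      display strpos("TOPROL-XL", "TOPROL XL")
      ```

      The last case suggests that pairs differing only in punctuation will not be flagged by the containment test as written.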
      I don't quite know what your second question means: Match *what* on the first word? Perhaps you mean "detect whether the first word of rxname matches the first word of Product?"
      Code:
      gen byte firstsame = word(rxname,1) == word(Product, 1)
      For simplicity, I have ignored that string comparisons are case-sensitive in Stata. Before doing either of the preceding, I'd make both variables lowercase (or upper, your preference):
      Code:
      replace rxname = lower(rxname)
      replace Product = lower(Product)
      or you might prefer to do the conversion on the fly:
      Code:
      gen byte firstsame = lower(word(rxname,1)) == lower(word(Product, 1))
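      One limitation worth noting: word() splits on spaces only, so in the sample data "TOPROL XL" and "TOPROL-XL" have different first words. A possible refinement, sketched here on the assumption of Stata 14 or later (for ustrregexra() and strtrim()), is to compare cleaned copies in which punctuation has been turned into spaces. The variable names rx_clean, prod_clean, and firstsame_clean are hypothetical and not from the thread:

      ```stata
      * Sketch: lowercase each name, then turn every run of characters
      * other than a-z and 0-9 into a single space, and trim the ends.
      gen rx_clean   = strtrim(ustrregexra(lower(rxname),  "[^a-z0-9]+", " "))
      gen prod_clean = strtrim(ustrregexra(lower(Product), "[^a-z0-9]+", " "))

      * Compare the first space-delimited word of the cleaned copies.
      * "TOPROL XL" and "TOPROL-XL" both become "toprol xl" and now agree.
      gen byte firstsame_clean = word(rx_clean, 1) == word(prod_clean, 1)

      * Inspect the pairs where cleaning changed the result
      list rxname Product firstsame_clean if rx_clean != lower(rxname) | prod_clean != lower(Product)
      ```

      Working on copies leaves the original variables untouched, so the stricter and looser comparisons can be checked side by side.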
      You can learn about all of these things and more "string stuff" at:
      Code:
      help string functions
      Now, it's possible that you had in mind something about matching different observations to one another. If that's true, please explain some more.


      • #4
        Mike Lacy, thank you. That was exactly what I was looking for, and it worked perfectly.
