Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Flagging values in a variable that contain certain words in a string

    Hi,

    Hope someone can help. I am trying to flag values that contain one of a number of key words I have in a list. I could ordinarily use something like this:

    Code:
    clear
    input str10 corp
    "INC       "
    "INC.      "
    "INCOME    "
    " INC      "
    " INC.     "
    "ZINC      "
    " INCOME   "
    "       INC"
    "      INC."
    "      ZINC"
    end
    generate m = regexm(corp,"^INC[. ]| INC[. ]| INC[.]?$")
    list, clean
    however, my list of keywords is over 150 and so this doesn't seem the most efficient way of doing this - also not sure if regexm will support that many different values.

    What I would like to do is generate a lookup file with my key words in it and then merge this with my base data in order to generate a flag using something like

    Code:
    merge m:m regexm(lookupname) using lookupfile
    tab _m
    but am guessing this isn't feasible. Any suggestions appreciated.

    Thanks
    Tim

  • #2
    That's a long way from what merge can do. merge won't merge on expressions.

    More concretely, I admire what regular expressions can do but usually consider more basic string functions first. Here it seems that

    Code:
    gen found = strpos(corp, "INC") > 0
    would be a start on your example, although there is always a risk that it finds too much. For a longer list, I still wouldn't reach for regular expressions or merging but try some variant on

    Code:
    gen found = 0
    
    quietly foreach word in INC foo bar {
          replace found = 1 if strpos(corp, "`word'") & found == 0
    }
    Usually I'd trim(itrim()) variables like yours and consider making all lower().

    I've never seen merge m:m suggested in this forum without someone explaining that it is a bad idea. I've never used it myself.

    Comment


    • #3
      Note the following trick of sticking a space before and after the string being searched can simplify matching by eliminating the need to treat the start and end of the string as special cases.
      Code:
      . generate m1 = regexm(corp,"^INC[. ]| INC[. ]| INC[.]?$")
      
      . generate m2 = regexm(" "+corp+" "," INC[. ]")
      
      . list, clean
      
                   corp   m1   m2  
        1.   INC           1    1  
        2.   INC.          1    1  
        3.   INCOME        0    0  
        4.    INC          1    1  
        5.    INC.         1    1  
        6.   ZINC          0    0  
        7.    INCOME       0    0  
        8.          INC    1    1  
        9.         INC.    1    1  
       10.         ZINC    0    0
      With regard to having 150 keywords to search for, I would be tempted for readability alone to look for them one-at-a-time
      Code:
      generate m = 0
      generate source = " "+corp+" "
      replace m = m | regexm(source, " INC[. ]")
      replace m = m | regexm(source, " LLC ")

      Comment


      • #4
        I've had a play with the suggestion above, personally I think there are too many lines of code and its quite time consuming to edit, but this is what I have at the moment (snippet of 100 GP practices out of 7,500):


        Code:
        clear
        input str6 PracticeCode str52 PracticeName
        "N81002" "KENMORE MEDICAL CENTRE"              
        "N81013" "HIGH STREET SURGERY MACCLESFIELD"    
        "N81021" "MCILVRIDE MEDICAL PRACTICE"          
        "N81022" "BOLLINGTON MEDICAL CENTRE"           
        "N81026" "TOFT ROAD SURGERY"                   
        "N81027" "READESMOOR GROUP PRACTICE"           
        "N81029" "SOUTH PARK SURGERY"                  
        "N81033" "GEORGE STREET PRACTICE"              
        "N81042" "MANCHESTER ROAD MEDICAL CENTRE"      
        "N81049" "ANNANDALE MEDICAL CENTRE"            
        "N81052" "LAWTON HOUSE SURGERY"                
        "N81062" "CUMBERLAND HOUSE"                    
        "N81069" "CHELFORD SURGERY"                    
        "N81070" "HANDFORTH HEALTH CENTRE"             
        "N81073" "PRIORSLEGH MEDICAL CENTRE"           
        "N81077" "LONDON ROAD HEALTH CENTRE"           
        "N81085" "PARK LANE MEDICAL CENTRE"            
        "N81086" "WILMSLOW HEALTH CENTRE"              
        "N81088" "PARK GREEN HOUSE"                    
        "N81112" "THE SCHOOL HOUSE SURGERY"            
        "N81118" "MEADOWSIDE MEDICAL CENTRE"           
        "N81632" "BROKEN CROSS SURGERY"                
        "N81001" "AUDLEM MEDICAL PRACTICE"             
        "N81008" "CEDARS MEDICAL CENTRE"               
        "N81010" "NANTWICH HEALTH CENTRE"              
        "N81015" "DELAMERE HEALTH CENTRE"              
        "N81016" "MILLCROFT MEDICAL CENTRE"            
        "N81032" "ASHFIELDS PRIMARY CARE CENTRE"       
        "N81039" "MIDDLEWICH MEDICAL CENTRE"           
        "N81043" "HASLINGTON SURGERY"                  
        "N81044" "HUNGERFORD MEDICAL CENTRE"           
        "N81047" "KILTEARN MEDICAL CENTRE"             
        "N81053" "EARNSWOOD MEDICAL CENTRE"            
        "N81068" "GROSVENOR MEDICAL CENTRE"            
        "N81071" "CINDERHILL LANE SURGERY"             
        "N81084" "ROPE GREEN MEDICAL CENTRE"           
        "N81090" "TUDOR SURGERY"                       
        "N81111" "38 CREWE ROAD"                       
        "N81614" "WRENBURY MEDICAL CENTRE"             
        "N81642" "THE ACORNS SURGERY"                  
        "N81024" "SWANLOW MEDICAL CENTRE"              
        "N81025" "FIRDALE MEDICAL CENTRE"              
        "N81040" "HIGH STREET MEDICAL PRACTICE WINSFORD"
        "N81051" "THE WEAVERHAM SURGERY"               
        "N81055" "WATLING MEDICAL PRACTICE"            
        "N81061" "WITTON STREET SURGERY"               
        "N81067" "OAKWOOD MEDICAL CENTRE"              
        "N81074" "9-10 LAUNCESTON CLOSE"               
        "N81087" "DANEBRIDGE MEDICAL CENTRE"           
        "N81113" "MIDDLEWICH ROAD SURGERY"             
        "N81123" "WHARTON HEALTH CENTRE"               
        "N81127" "WEAVER VALE SURGERY"                 
        "N81007" "HOLES LANE SURGERY"                  
        "N81012" "GUARDIAN MEDICAL CENTRE"             
        "N81014" "BROOKFIELD SURGERY"                  
        "N81020" "PENKETH MEDICAL CENTRE"              
        "N81028" "CAUSEWAY MEDICAL CENTRE"             
        "N81036" "SPRINGFIELDS MEDICAL CENTRE"         
        "N81041" "HELSBY STREET SURGERY"               
        "N81048" "FEARNHEAD CROSS MEDICAL CENTRE"      
        "N81056" "FOLLY LANE MEDICAL CENTRE"           
        "N81059" "CULCHETH MEDICAL CENTRE"             
        "N81065" "LATCHFORD MEDICAL CENTRE"            
        "N81075" "STOCKTON HEATH MEDICAL CENTRE"       
        "N81083" "PARKVIEW MEDICAL PRACTICE"           
        "N81089" "MANCHESTER ROAD PRACTICE"            
        "N81097" "DALLAM LANE MEDICAL CENTRE"          
        "N81107" "MANCHESTER ROAD MEDICAL CENTRE"      
        "N81108" "LAKESIDE SURGERY"                    
        "N81109" "PADGATE MEDICAL CENTRE"              
        "N81114" "BIRCHWOOD MEDICAL CENTRE"            
        "N81122" "WESTBROOK MEDICAL CENTRE"            
        "N81623" "STRETTON MEDICAL CENTRE"             
        "N81628" "ERIC MOORE PARTNERSHIP"              
        "N81634" "CCA CARE PARTNERSHIP"                
        "N81637" "COCKHEDGE MEDICAL CENTRE"            
        "N81645" "4 SEASONS MEDICAL CENTRE"            
        "Y01108" "FAIRFIELD SURGERY"                   
        "N81005" "HELSBY HEALTH CENTRE"                
        "N81006" "BUNBURY MEDICAL PRACTICE"            
        "N81009" "HEATH LANE MEDICAL CENTRE"           
        "N81017" "FRODSHAM MEDICAL PRACTICE"           
        "N81018" "PARK ROAD HEALTH CENTRE"             
        "N81023" "WHITBY GROUP PRACTICE"               
        "N81030" "THE KNOLL SURGERY"                   
        "N81031" "TARPORLEY HEALTH CENTRE"             
        "N81034" "BOUGHTON HEALTH CENTRE"              
        "N81038" "LAUREL BANK SURGERY"                 
        "N81046" "PARK MEDICAL CENTRE"                 
        "N81050" "GREAT SUTTON MEDICAL CENTRE"         
        "N81060" "NESTON SURGERY"                      
        "N81063" "YORK ROAD GROUP PRACTICE"            
        "N81079" "THE ELMS MEDICAL CENTRE"             
        "N81080" "UPPER NORTHGATE STREET"              
        "N81081" "GARDEN LANE MEDICAL CENTRE"          
        "N81082" "CITY WALLS MEDICAL CENTRE"           
        "N81091" "GROUP PRACTICE SURGERY"              
        "N81092" "HOPE FARM MEDICAL CENTRE"            
        "N81093" "GROUP PRACTICE SURGERY"              
        "N81094" "GREAT SUTTON HEALTH CENTRE"          
        end
        
        generate m = 0
        generate source = " "+PracticeName+" "
        replace m = m | regexm(source,"LOCUM")
        replace m = m | regexm(source,"DRUG")
        replace m = m | regexm(source,"DTTO")
        replace m = m | regexm(source,"TRAVELLERS")
        replace m = m | regexm(source,"HOSPICE")
        replace m = m | regexm(source,"CLINIC")
        replace m = m | regexm(source,"DENTISTRY")
        replace m = m | regexm(source,"CHILDREN")
        replace m = m | regexm(source,"PRIMAY CARE TRUST")
        replace m = m | regexm(source,"GPWSI")
        replace m = m | regexm(source,"PALLIATIVE")
        replace m = m | regexm(source,"PCT")
        replace m = m | regexm(source,"HEALTH AUTHORITY")
        *replace m = m | regexm(source,"PRACTICE")
        replace m = m | regexm(source,"PMS")
        replace m = m | regexm(source,"INTERMEDIATE")
        replace m = m | regexm(source,"COMMUNITY")
        replace m = m | regexm(source,"CARDIOLOGY")
        replace m = m | regexm(source,"DERMATOLOGY")
        replace m = m | regexm(source,"SCHEME")
        replace m = m | regexm(source,"OUT OF HOURS")
        replace m = m | regexm(source,"WALK IN")
        replace m = m | regexm(source,"DIABETES")
        replace m = m | regexm(source,"HEART FAILURE")
        replace m = m | regexm(source,"SMOKING")
        replace m = m | regexm(source,"RENAL TRANSPLANT")
        replace m = m | regexm(source,"DIABETIC")
        replace m = m | regexm(source,"OPHTHALINOLOGY")
        replace m = m | regexm(source,"PROJECT")
        replace m = m | regexm(source,"ACTION")
        replace m = m | regexm(source,"SERVICE")
        replace m = m | regexm(source,"FAMILY PLANNING")
        replace m = m | regexm(source,"FPC")
        replace m = m | regexm(source,"DAY CENTRE")
        *replace m = m | regexm(source,"CENTRE")
        replace m = m | regexm(source,"NURSING")
        replace m = m | regexm(source,"CONSULTANT")
        replace m = m | regexm(source,"REHABILITATION")
        replace m = m | regexm(source,"REHAB")
        *replace m = m | regexm(source,"PRIMARY")
        replace m = m | regexm(source,"SUPPORT")
        replace m = m | regexm(source,"SEXUAL")
        replace m = m | regexm(source,"DEPARTMENT")
        replace m = m | regexm(source,"ACCIDENT")
        replace m = m | regexm(source,"EMERGENCY")
        replace m = m | regexm(source,"HOSPITAL")
        replace m = m | regexm(source,"INJURIES")
        replace m = m | regexm(source,"INJURY")
        *replace m = m | regexm(source,"SURGERY")
        replace m = m | regexm(source,"HOMELESS")
        replace m = m | regexm(source,"DROP-IN")
        replace m = m | regexm(source,"DROP IN")
        replace m = m | regexm(source,"OPHTHALMOLOGY")
        replace m = m | regexm(source,"ORTHOPAEDIC")
        replace m = m | regexm(source,"TRANS.UNIT")
        *replace m = m | regexm(source,"DOCTORS")
        *replace m = m | regexm(source,"MEDICAL")
        *replace m = m | regexm(source,"HEALTHCARE")
        replace m = m | regexm(source,"RESIDENTIAL")
        replace m = m | regexm(source,"TRANSPLANT")
        replace m = m | regexm(source,"CLINICS")
        replace m = m | regexm(source,"MEDICINE")
        replace m = m | regexm(source,"PROJECTS")
        replace m = m | regexm(source,"REGISTERED")
        replace m = m | regexm(source,"SUBSTANCE")
        replace m = m | regexm(source,"TREATMENT")
        replace m = m | regexm(source,"TEAM")
        replace m = m | regexm(source,"FORMS")
        replace m = m | regexm(source,"REFERRAL")
        replace m = m | regexm(source,"DAYCENTRE")
        replace m = m | regexm(source,"LIAISON")
        replace m = m | regexm(source,"ASSESSMENT")
        replace m = m | regexm(source,"DOCTOR")
        replace m = m | regexm(source,"PATIENT")
        replace m = m | regexm(source,"HOME")
        replace m = m | regexm(source,"HOUSE")
        replace m = m | regexm(source,"NURSE")
        *replace m = m | regexm(source,"CARE")
        replace m = m | regexm(source,"ADVICE")
        replace m = m | regexm(source,"INTERVENTION")
        replace m = m | regexm(source,"MISUSE")
        replace m = m | regexm(source,"TEAM")
        replace m = m | regexm(source,"COMMITTEES")
        replace m = m | regexm(source,"PAEDIATRICIAN")
        replace m = m | regexm(source,"PAEDIATRIC")
        replace m = m | regexm(source,"PILOT")
        replace m = m | regexm(source,"CASUALTY")
        replace m = m | regexm(source,"SPECIALIST")
        replace m = m | regexm(source,"DEPUTY")
        replace m = m | regexm(source,"FPC")
        replace m = m | regexm(source,"COMBINED")
        replace m = m | regexm(source,"GPDERM")
        replace m = m | regexm(source,"HOME")
        replace m = m | regexm(source,"HHAT")
        replace m = m | regexm(source,"RESIDENTS")
        replace m = m | regexm(source,"DISABILITY")
        replace m = m | regexm(source,"PSYCHIATRY")
        replace m = m | regexm(source,"MUSCULOSKELETAL")
        replace m = m | regexm(source,"PRISON")
        replace m = m | regexm(source,"NHS")
        **ROAD
        replace m = m | regexm(source,"NHS WIC")
        replace m = m | regexm(source,"CANCER")
        replace m = m | regexm(source,"TRANSITIONAL")
        replace m = m | regexm(source,"FIXED")
        replace m = m | regexm(source,"ON-CALL")
        replace m = m | regexm(source,"HEARTLANDS")
        replace m = m | regexm(source,"RHEUMATOLOGY")
        replace m = m | regexm(source,"PCU")
        replace m = m | regexm(source,"PCHP")
        replace m = m | regexm(source,"PCAC")
        replace m = m | regexm(source,"PCAS")
        replace m = m | regexm(source,"POOLED LIS")
        replace m = m | regexm(source,"VILLAGE")
        replace m = m | regexm(source,"PROGRAMME")
        replace m = m | regexm(source,"RHEUMATOLOG")
        replace m = m | regexm(source,"SAFEHAVEN")
        replace m = m | regexm(source,"WIC")
        replace m = m | regexm(source,"OOH")
        replace m = m | regexm(source,"P/CARE")
        replace m = m | regexm(source,"TRUST")
        replace m = m | regexm(source,"ORGANISATION")
        replace m = m | regexm(source,"THE ZONE")
        replace m = m | regexm(source,"PCC")
        replace m = m | regexm(source,"NEUROLOGY")
        replace m = m | regexm(source,"CO OP")
        replace m = m | regexm(source,"LODGE")
        replace m = m | regexm(source,"NORTH WALES")
        replace m = m | regexm(source,"RESPONSE")
        replace m = m | regexm(source,"YOUNG PERSONS")
        replace m = m | regexm(source,"PRIMECARE")
        replace m = m | regexm(source,"DAY-CARE")
        replace m = m | regexm(source,"UNIT")
        replace m = m | regexm(source,"UNIDENTIFIABLE")
        replace m = m | regexm(source,"ADULT SICKLE CELL")
        replace m = m | regexm(source,"OLD SWAN WIC")
        replace m = m | regexm(source,"NE LINCOLNSHIRE DAT")
        replace m = m | regexm(source,"RAF STRIKE COMMAND")
        replace m = m | regexm(source,"COOPERATIVE")
        replace m = m | regexm(source,"DDC")
        replace m = m | regexm(source,"WIC")
        replace m = m | regexm(source,"WS 2001")
        replace m = m | regexm(source,"THE PARALLEL")
        replace m = m | regexm(source,"-OOH")
        replace m = m | regexm(source,"4 US")
        replace m = m | regexm(source,"CONTRACEPTION")
        replace m = m | regexm(source,"THE PLACE")
        replace m = m | regexm(source,"HEALTH")
        replace m = m | regexm(source,"BARNDOC 2")
        replace m = m | regexm(source,"BASE 51")
        replace m = m | regexm(source,"CROYDOC")
        replace m = m | regexm(source,"STARDOC")
        replace m = m | regexm(source,"DASHLINE")
        replace m = m | regexm(source,"BARNDOC")
        replace m = m | regexm(source,"CAMIDOC")
        replace m = m | regexm(source,"C.A.D.S.")
        replace m = m | regexm(source,"C.A.M.H.S.")
        replace m = m | regexm(source,"MRHDOC")
        replace m = m | regexm(source,"MAIDDOC")
        replace m = m | regexm(source,"THAMESDOC")
        replace m = m | regexm(source,"REDDOC")
        replace m = m | regexm(source,"AYDDOC")
        replace m = m | regexm(source,"GPPAIN")
        replace m = m | regexm(source,"SORTED")
        replace m = m | regexm(source,"D.S.M.P.")
        replace m = m | regexm(source,"EKDOC")
        replace m = m | regexm(source,"SURESTART")
        replace m = m | regexm(source,"TTDOC")
        replace m = m | regexm(source,"NEWDOCS")
        replace m = m | regexm(source,"YARDOC")
        replace m = m | regexm(source,"S.S.M.P.")
        replace m = m | regexm(source,"EKDOC")
        replace m = m | regexm(source,"FRENDOC")
        replace m = m | regexm(source,"TRAVELLING FAMILIES")
        replace m = m | regexm(source,"CO-OPERATIVE")
        replace m = m | regexm(source,"BRIDGE VIEW D.T.P.")
        replace m = m | regexm(source,"WEST HULL PTL")
        replace m = m | regexm(source,"THE PLACE")

        Any improvements appreciated!

        Comment


        • #5
          Here's an approach that may help. It does make the reasonable assumption that you have fewer keywords than observations. Note that it also adds the space before and after the text being matched, as in post #3, to prevent partial-word matches (MORE does not match KENMORE).
          Code:
          clear
          input str40 find
          "HOUSE"
          "SURGERY"
          "LONDON ROAD"
          "MORE"
          end
          replace find = " "+find+" "
          tempfile tofind
          save `tofind'
          
          clear
          input str6 PracticeCode str52 PracticeName
          "N81002" "KENMORE MEDICAL CENTRE"              
          "N81013" "HIGH STREET SURGERY MACCLESFIELD"    
          "N81021" "MCILVRIDE MEDICAL PRACTICE"          
          "N81022" "BOLLINGTON MEDICAL CENTRE"           
          "N81026" "TOFT ROAD SURGERY"                   
          "N81027" "READESMOOR GROUP PRACTICE"           
          "N81029" "SOUTH PARK SURGERY"                  
          "N81033" "GEORGE STREET PRACTICE"              
          "N81042" "MANCHESTER ROAD MEDICAL CENTRE"      
          "N81049" "ANNANDALE MEDICAL CENTRE"            
          "N81052" "LAWTON HOUSE SURGERY"                
          "N81062" "CUMBERLAND HOUSE"                    
          "N81069" "CHELFORD SURGERY"                    
          "N81070" "HANDFORTH HEALTH CENTRE"             
          "N81073" "PRIORSLEGH MEDICAL CENTRE"           
          "N81077" "LONDON ROAD HEALTH CENTRE"           
          "N81085" "PARK LANE MEDICAL CENTRE"            
          "N81086" "WILMSLOW HEALTH CENTRE"              
          "N81088" "PARK GREEN HOUSE"                    
          "N81112" "THE SCHOOL HOUSE SURGERY"            
          end
          generate m = 0
          generate source = " "+PracticeName+" "
          
          merge 1:1 _n using `tofind'
          count if _merge==3
          local nf = r(N)
          forvalues f=1/`nf' {
              replace m = m | regexm(source,find[`f'])
              }
          
          drop source find _merge
          list, clean
          Code:
          . list, clean 
          
                 Pract~de                       PracticeName   m  
            1.     N81002             KENMORE MEDICAL CENTRE   0  
            2.     N81013   HIGH STREET SURGERY MACCLESFIELD   1  
            3.     N81021         MCILVRIDE MEDICAL PRACTICE   0  
            4.     N81022          BOLLINGTON MEDICAL CENTRE   0  
            5.     N81026                  TOFT ROAD SURGERY   1  
            6.     N81027          READESMOOR GROUP PRACTICE   0  
            7.     N81029                 SOUTH PARK SURGERY   1  
            8.     N81033             GEORGE STREET PRACTICE   0  
            9.     N81042     MANCHESTER ROAD MEDICAL CENTRE   0  
           10.     N81049           ANNANDALE MEDICAL CENTRE   0  
           11.     N81052               LAWTON HOUSE SURGERY   1  
           12.     N81062                   CUMBERLAND HOUSE   1  
           13.     N81069                   CHELFORD SURGERY   1  
           14.     N81070            HANDFORTH HEALTH CENTRE   0  
           15.     N81073          PRIORSLEGH MEDICAL CENTRE   0  
           16.     N81077          LONDON ROAD HEALTH CENTRE   1  
           17.     N81085           PARK LANE MEDICAL CENTRE   0  
           18.     N81086             WILMSLOW HEALTH CENTRE   0  
           19.     N81088                   PARK GREEN HOUSE   1  
           20.     N81112           THE SCHOOL HOUSE SURGERY   1

          Comment

          Working...
          X