  • an efficient approach to searching and keeping observations

    I am hoping to apply an efficient approach to searching and keeping observations. The following code finds key words in variables and keeps the observations with the key attributes. I am hoping to apply the same idea to a larger set of search terms and across a larger number of variables. Please let me know if you can see obvious improvements I could make or suggest alternative approaches. Thank you very much, Dan
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str63 Position strL(FirstName LastName) str36 Company
    "Managing Director"                                               "Adam"  "Angus" "ETF incorporated"                    
    "Chairman and CEO"                                                "Bert"  "Byron" "Invest Co.,LTD"                      
    "Senior Portfolio Manager / Head of Social Responsible Investing" "Chris" "Congo" "West ethical Private Bank"           
    "Head of ESG Risk"                                                "Dave"  "Dove"  "Responsible Investment and Insurance"
    "Marketing Consultant, Coach and CMO"                             "Edan"  "Elad"  "ESG advisors"                        
    end
    Code:
    gen terms_present1 =  strpos(lower(Position), "esg" "environment" "environmental" "environmental, social and governance" "eco" "sustain" "sustainable" "sustainability" ) > 0
    gen terms_present2 =  strpos(lower(Position), "sri" "socially responsible investing" "responsible" "responsibly" "ethic" "ethics" "ethical") > 0
    gen terms_present3 =  strpos(lower(Company), "esg" "environment" "environmental" "environmental, social and governance" "eco" "sustain" "sustainable" "sustainability" ) > 0
    gen terms_present4 =  strpos(lower(Company), "sri" "socially responsible investing" "responsible" "responsibly" "ethic" "ethics" "ethical") > 0
    count if terms_present1 > 0
    count if terms_present2 > 0
    count if terms_present3 > 0
    count if terms_present4 > 0
    keep if (terms_present1 + terms_present2 + terms_present3 + terms_present4) > 0

  • #2
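    I do not believe your code works as you expect. The strpos() function takes two arguments and searches the first argument for the text in the second. In the following example, two quoted strings are supplied where the second argument belongs, and the results suggest that only the first of them is actually searched for.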
    Code:
    . clear
    
    . set obs 1
    number of observations (_N) was 0, now 1
    
    . generate str text = "the quick brown fox"
    
    . generate ff1 = strpos(text, "quick" "slow")
    
    . generate ff2 = strpos(text, "slow" "quick")
    
    . list, clean noobs
    
                       text   ff1   ff2  
        the quick brown fox     5     0
    Note also that searching Position for "eco" will find "Director of Disaster Recovery" should one exist. Since you seem to be looking for "eco as in ecology" this is probably not a match you want.
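
    One possible way to run the intended any-of-several-terms search while still using strpos() is to loop, so that each call receives exactly one search string. The following is only a minimal sketch on a trimmed copy of the example data from #1; the variable name terms_present and the shortened term list are illustrative assumptions, and the substring caveat about "eco" applies equally to every term in the list.

    ```stata
    clear
    input str63 Position str36 Company
    "Managing Director" "ETF incorporated"
    "Chairman and CEO" "Invest Co.,LTD"
    "Senior Portfolio Manager / Head of Social Responsible Investing" "West ethical Private Bank"
    "Head of ESG Risk" "Responsible Investment and Insurance"
    "Marketing Consultant, Coach and CMO" "ESG advisors"
    end

    * Start with no matches, then switch the flag on whenever a term is found
    generate byte terms_present = 0
    foreach v of varlist Position Company {
        foreach term in "esg" "environment" "sustain" "sri" "responsible" "ethic" {
            * Exactly one search string per strpos() call
            replace terms_present = 1 if strpos(lower(`v'), "`term'") > 0
        }
    }
    count if terms_present
    keep if terms_present
    ```

    Because this is substring matching, "sustain" already catches "sustainable" and "sustainability", so the longer forms need not be listed separately.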

    With luck another respondent will be able to suggest a better technique for the sort of search you are attempting; I regret that all I have are counterexamples to your current technique.
    Last edited by William Lisowski; 05 Oct 2018, 19:31.

    • #3
      I have previously used the following, derived from Robert Picard's reply in #3 of the following link. It matches whole words, which may be delimited by spaces or punctuation characters. Therefore, regarding William's point in #2, to match the word "ecology" you have to spell it out in full.

      Code:
      input strL tagline
      "Cable-free live TV is here. You Tube TV"
      "Join a better network! Because better matters. Verizon"
      "COLONEL QUALITY GUARANTEED. KFC"
      "Goodyear, more driven."
      "15 minutes could save you 15% or more on car insurance.GEICO"
      "All the News That's Fit to Print. NYT"
      "America Runs on Dunkin'. Dunkin' Donuts"
      "Imagination at Work. GE"
      "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
      end
      
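      * Pad with spaces so that words at the start or end of the string are delimited too,
      * then flag any listed word bounded on both sides by a space or punctuation character
      gen match = regexm(" " + lower(tagline) + " ", "['!?,\. ](tv|network|quality|goodyear|dunkin|wheat)['!?,\. ]")
      l, clean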
      And the result:

      Code:
      . l, clean
                                                                  tagline   match  
        1.                        Cable-free live TV is here. You Tube TV       1  
        2.         Join a better network! Because better matters. Verizon       1  
        3.                                COLONEL QUALITY GUARANTEED. KFC       1  
        4.                                         Goodyear, more driven.       1  
        5.   15 minutes could save you 15% or more on car insurance.GEICO       0  
        6.                          All the News That's Fit to Print. NYT       0  
        7.                        America Runs on Dunkin'. Dunkin' Donuts       1  
        8.                                        Imagination at Work. GE       0  
        9.                    CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco       1
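
      Applied to the example data in #1, the same idea might look like the sketch below. It is an adaptation rather than the exact code above: the delimiter class is swapped for [^a-z0-9], which treats any character other than a lower-case letter or digit as a word boundary (so "/" also delimits), and the term list and the names terms and match are illustrative assumptions.

      ```stata
      clear
      input str63 Position str36 Company
      "Managing Director" "ETF incorporated"
      "Chairman and CEO" "Invest Co.,LTD"
      "Senior Portfolio Manager / Head of Social Responsible Investing" "West ethical Private Bank"
      "Head of ESG Risk" "Responsible Investment and Insurance"
      "Marketing Consultant, Coach and CMO" "ESG advisors"
      end

      * Alternatives separated by "|"; each must appear as a whole word
      local terms "esg|sri|environment|environmental|sustainable|sustainability|responsible|ethical|ethics"

      generate byte match = 0
      foreach v of varlist Position Company {
          * Pad with spaces so words at either end of the string are delimited too
          replace match = 1 if regexm(" " + lower(`v') + " ", "[^a-z0-9](`terms')[^a-z0-9]")
      }
      keep if match
      ```

      Adding a variable to the varlist or a word to the terms macro extends the search without further changes to the loop.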

      • #4
        Thank you for your observations and illustration William - you are correct in that I have misunderstood what strpos was achieving.
        Thank you Andrew, yes, regexm looks like it is a more appropriate command. Thank you also for the "|" separator approach.
        much appreciated, Dan
