  • Grouping in Stata if it does not include a specific word

    Hello,

    Is there a way to group in Stata with specific requirements? For example, I want to create a binary variable that is true if the word "agriculture" is in the text, but I don't want it to include "agriculture engineering".

    Here is a method I tried that does not work. The error says it's invalid.

    Code:
    gen agriculture = 0 
    replace agriculture = 1 if strpos(department, "Agriculture" ) & != "Engineering"

  • #2
    Code:
    input str29 text
    "corporate agriculture"
    "horticulture"
    "agriculture engineering"
    "livestock and agriculture"
    end
    
    gen wanted = regexm(lower(text), "agriculture") & !regexm(lower(text), "agriculture engineering")
    Similar logic with -strpos()-

    Results:

    Code:
    . l
    
         +------------------------------------+
         |                      text   wanted |
         |------------------------------------|
      1. |     corporate agriculture        1 |
      2. |              horticulture        0 |
      3. |   agriculture engineering        0 |
      4. | livestock and agriculture        1 |
         +------------------------------------+



    • #3
      Code:
       
       gen agriculture = strpos(department, "Agriculture" ) > 0 & strpos(department, "Engineering") == 0
      or

      Code:
       
       gen agriculture = strpos(department, "Agriculture" ) & !strpos(department, "Engineering")
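Both forms rely on strpos() returning the position of the first match, or 0 when there is none, so any nonzero result counts as true. Note that strpos() is case-sensitive. If department can contain mixed case, one option is to search a lowercased copy. A minimal sketch, using made-up department values rather than the original poster's data:

```stata
clear
input str40 department
"Agriculture"
"agriculture economics"
"Agriculture Engineering"
"Mechanical Engineering"
end

* strpos() is case-sensitive, so search a lowercased copy of the text
generate byte agriculture = strpos(lower(department), "agriculture") > 0 ///
    & strpos(lower(department), "engineering") == 0

list, clean
* agriculture is 1 for observations 1-2 and 0 for observations 3-4
```

Like the code above, this excludes any department that mentions engineering at all, not only "agriculture engineering".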



      • #4
        I couldn't resist extending Andrew's code to Stata's Unicode regular expression matching function. Unlike Andrew's post, it's not aimed at expanding the knowledge of novices: the target audience is experienced users of regular expressions who might find some interesting ideas here.
        Code:
        input str35 text
        "corporate Agriculture"
        "horticulture"
        "Agriculture Engineering"
        "livestock and agriculture"
        "Agriculture and Civil Engineering"
        "agricultural forensics"
        end
        generate wanted = ustrregexm(text,"(?i)\bagriculture\b(?! engineering\b)")
        list, clean
        Code:
        . list, clean
        
                                            text   wanted  
          1.               corporate agriculture        1  
          2.                        horticulture        0  
          3.             Agriculture Engineering        0  
          4.           livestock and agriculture        1  
          5.   Agriculture and Civil Engineering        1  
          6.              agricultural forensics        0
        Key ideas:
        • (?i) triggers case-insensitive matching
        • \bagriculture\b matches the string "agriculture" with "word breaks" before and after, so not "agricultural", and it leaves the "current position" at the first character after "agriculture"
        • (?! engineering\b) makes the match fail if "agriculture" is immediately followed by a space, "engineering", and a word break
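The key ideas can be checked one at a time on a few edge cases. This is only a sketch: the test strings below are invented for illustration and are not part of the data above.

```stata
clear
input str30 text
"agriculture engineering"
"agriculture engineers"
"agriculture, engineering"
"AGRICULTURE"
"agricultures"
end

generate byte wanted = ustrregexm(text, "(?i)\bagriculture\b(?! engineering\b)")
list, clean

* 1. wanted = 0: "agriculture" is followed by " engineering" and a word break
* 2. wanted = 1: "engineers" is not "engineering", so the lookahead does not block
* 3. wanted = 1: a comma, not a space, follows "agriculture"
* 4. wanted = 1: (?i) ignores case; the end of the string counts as a word break
* 5. wanted = 0: no word break between "agriculture" and the trailing "s"
```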



        • #5
          Code:
          gen byte wanted = ustrregexm(text,"(?i)agriculture(?! engineering)")



          • #6
            Bjarte is correct that "agricultural" does not match "agriculture" - I should have included "pseudoagriculture" in the data to demonstrate searching for whole words.

            When I'm looking for a precise whole word I always match the word surrounded by word breaks. Not always perfect - agriculture.com will match because most punctuation is a word break - but it still helps ensure I find Robert and not Roberta.
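A small sketch of both points, with invented values: word breaks separate Robert from Roberta, but they also let agriculture.com through, because the period counts as a word break.

```stata
clear
input str20 text
"Robert"
"Roberta"
"robert"
"agriculture.com"
end

* whole-word, case-sensitive search for Robert
generate byte is_robert = ustrregexm(text, "\bRobert\b")

* whole-word, case-insensitive search for agriculture
generate byte is_agri = ustrregexm(text, "(?i)\bagriculture\b")

list, clean
* is_robert is 1 only for observation 1 ("robert" needs (?i) to match)
* is_agri is 1 only for observation 4, where "." acts as a word break
```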



            • #7
              Thanks for the example code in #4, with its explanation of the \b regex word boundary anchor. Below I add "pseudoagriculture" and test two variants using \b. As mentioned above, the code in #5, lacking \b, will then match "pseudoagriculture", in contrast to variants 1 and 2 and to the last, non-regex variant.
              Code:
              ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  
              
              ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      
              
              ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          
              
              strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"  
              Results of timing:
              Code:
              1:  ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  
              
              2:  ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      
              
              3:  ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          
              
              4:  strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"  
              
              . timer list
                 1:     11.15 /      100 =       0.1115
                 2:     10.55 /      100 =       0.1055
                 3:      9.72 /      100 =       0.0972
                 4:      1.70 /      100 =       0.0170
              Code:
              clear all
              
              input str35 text
              "corporate Agriculture"
              "horticulture"
              "Agriculture Engineering"
              "livestock and agriculture"
              "Agriculture and Civil Engineering"
              "agricultural forensics"
              "pseudoagriculture"
              end
              
              local expand 10000
              local reps 100
              
              expand `expand'
              
              #delim;
              tokenize // read regex into locals 1, 2, ...  
              `"
              `"  ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  "'                                  
              `"  ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      "'      
              `"  ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          "'      
              `"  strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"  "'      
              "'
              ; #delim cr  
              
              qui forvalues i = 1/`reps' {
               
                  local x 1
               
                  while ( `"``x''"' != "" ) {
                
                    timer on `x'
                    gen byte wanted`x' = ``x''
                    timer off `x'
                    loc ++x
                  }
                       
                  while ( `x' > 1 ) {
              
                    loc --x
                    capture assert wanted`x' == wanted1
              
                    if ( _rc == 9 ) {
              
                        assert `x' == 3 // will "fail" matching "pseudoagriculture" when not using word boundary \b
                      }
                  }
               
                  drop wanted?
              }    
              
              qui while ( `"``x''"' != "" ) {
              
                 noi di _n as text "`x':" as res `"``x''"'
                 loc ++x
              }
              
              timer list
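One caveat on variant 4: the condition lower(text) != "agriculture engineering" excludes only values that are exactly that phrase, so it agrees with the regex variants on this test data but not on longer strings that contain the phrase. A possible alternative, sketched on invented values (v4b and padded are hypothetical names, and this version was not part of the timings), is to search the padded text for the phrase as well:

```stata
clear
input str45 text
"Agriculture Engineering"
"Department of Agriculture Engineering"
"livestock and agriculture"
"pseudoagriculture"
end

generate padded = " " + lower(text) + " "

* variant 4 as posted: excludes only an exact "agriculture engineering"
generate byte v4 = strpos(padded, " agriculture ") & lower(text) != "agriculture engineering"

* hypothetical alternative: exclude the phrase wherever it appears
generate byte v4b = strpos(padded, " agriculture ") & !strpos(padded, " agriculture engineering ")

list text v4 v4b, clean
* v4  is 0, 1, 1, 0: observation 2 slips through
* v4b is 0, 0, 1, 0
```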



              • #8
                When looking for whole words in relatively free text using strpos, Bjarte's
                Code:
                strpos(" " + lower(text) + " ", " agriculture ")
                is a helpful approach that apparently escapes most programmers new to searching strings for text.

                The important lesson from this discussion has nothing to do with the choice between regular expressions and other techniques.

                The important lesson is that you can't just plow into your text search without completely understanding the text within which you are searching, or you risk counting Roberta as Robert and ignoring robert. If you have the possibility of upper- and lower-case characters, and they make no difference to you, then apply lower() or use case-insensitive regular expressions. If you're looking for a word, be sure to at least surround the search string with spaces, and add leading and trailing spaces to the text within which you are searching, or use regular expressions surrounded by word breaks. And if you're unlucky enough to have, for example, text containing comma-separated items (e.g. "agriculture, mining, and fishing") then you have to do something to handle that - perhaps replacing commas with spaces in the text you are searching, or using word break regular expressions. And if you have URLs (e.g. agriculture.com) that you want treated as a single element, then simple word break regular expressions apparently won't do.
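For the comma-separated case, one way the "replace commas with spaces" idea might look is sketched below. The values are invented, and padded is just an illustrative variable name.

```stata
clear
input str40 text
"agriculture, mining, and fishing"
"Mining,Agriculture"
"agricultural forensics"
"Roberta"
end

* lowercase, turn commas into spaces, then pad both ends with a space
generate padded = " " + subinstr(lower(text), ",", " ", .) + " "

* the search word is surrounded by spaces, so only whole words match
generate byte wanted = strpos(padded, " agriculture ") > 0

list text wanted, clean
* wanted is 1 for observations 1-2 and 0 for observations 3-4
```

Other punctuation (periods, semicolons, slashes) would need the same treatment, which is where word-break regular expressions start to look more convenient.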
