Grouping in Stata if it does not include a specific word

Cindy Linwood

Join Date: Sep 2020

Posts: 22
#1

24 May 2021, 10:33

Hello,

Is there a way to group in Stata with specific requirements? For example, I want to create a binary variable that is true if the word "agriculture" is in the text, but I don't want it to include "agriculture engineering".

Here is a method I tried that does not work. The error says it's invalid.

Code:

gen agriculture = 0
replace agriculture = 1 if strpos(department, "Agriculture" ) & != "Engineering"

Andrew Musau

Join Date: Oct 2014
Posts: 10195
#2

24 May 2021, 10:40

Code:

input str29 text
"corporate agriculture"
"horticulture"
"agriculture engineering"
"livestock and agriculture"
end

gen wanted = regexm(lower(text), "agriculture") & !regexm(lower(text), "agriculture engineering")

Similar logic with -strpos()-

Result:

Code:

. l

     +------------------------------------+
     |                      text   wanted |
     |------------------------------------|
  1. |     corporate agriculture        1 |
  2. |              horticulture        0 |
  3. |   agriculture engineering        0 |
  4. | livestock and agriculture        1 |
     +------------------------------------+


Nick Cox

Join Date: Mar 2014
Posts: 35698
#3

24 May 2021, 10:41

Code:

 
 gen agriculture = strpos(department, "Agriculture" ) > 0 & strpos(department, "Engineering") == 0

Code:

 
 gen agriculture = strpos(department, "Agriculture" ) & !strpos(department, "Engineering")
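
A minimal sketch of why the two forms above agree: strpos() returns the position of the first match, or 0 if there is none, and Stata treats any nonzero value as true. The toy data and the variable names a1 and a2 are made up for illustration.

```stata
clear
input str30 department
"Agriculture"
"Agriculture Engineering"
"Civil Engineering"
"agriculture"
end

* explicit comparisons with 0
gen byte a1 = strpos(department, "Agriculture") > 0 & strpos(department, "Engineering") == 0

* shorthand: a nonzero position counts as true, and !0 is true
gen byte a2 = strpos(department, "Agriculture") & !strpos(department, "Engineering")

* both forms flag exactly the same observations
assert a1 == a2
list, clean
```

Note that strpos() is case-sensitive, so the lower-case "agriculture" in the last row is not flagged by either form; wrapping the variable in lower() and searching for lower-case strings would be one way around that.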


William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

24 May 2021, 12:39

I couldn't resist extending Andrew's code to Stata's Unicode regular expression matching function. Unlike Andrew's post, it's not aimed at expanding the knowledge of novices: the target audience is experienced users of regular expressions who might find some interesting ideas here.

Code:

input str35 text
"corporate Agriculture"
"horticulture"
"Agriculture Engineering"
"livestock and agriculture"
"Agriculture and Civil Engineering"
"agricultural forensics"
end

generate wanted = ustrregexm(text,"(?i)\bagriculture\b(?! engineering\b)")
list, clean

Code:

. list, clean

                                    text   wanted
  1.               corporate agriculture        1
  2.                        horticulture        0
  3.             Agriculture Engineering        0
  4.           livestock and agriculture        1
  5.   Agriculture and Civil Engineering        1
  6.              agricultural forensics        0

Key ideas:
(?i) triggers case-insensitive matching

\bagriculture\b matches the string "agriculture" with "word breaks" before and after, so not "agricultural", and it leaves the "current position" at the first character after "agriculture"

(?! engineering\b) makes the match fail if "agriculture" is immediately followed by a space, "engineering", and a word break
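
The key ideas above can be checked one at a time with a small sketch. The test strings and the variable name expected are hypothetical, chosen so that each row exercises one part of the pattern.

```stata
clear
input str35 text byte expected
"Corporate AGRICULTURE"             1
"agricultural forensics"            0
"Agriculture Engineering"           0
"Agriculture and Civil Engineering" 1
end

* row 1: (?i) ignores case
* row 2: \b...\b requires the whole word "agriculture"
* row 3: the negative lookahead rejects an immediately following " engineering"
* row 4: the lookahead only inspects what directly follows "agriculture"
gen byte wanted = ustrregexm(text, "(?i)\bagriculture\b(?! engineering\b)")

assert wanted == expected
list, clean
```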
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#5

24 May 2021, 13:20

Code:

gen byte wanted = ustrregexm(text,"(?i)agriculture(?! engineering)")
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

24 May 2021, 14:02

Bjarte is correct that agricultural does not match agriculture - I should have included "pseudoagriculture" in the data to demonstrate searching for whole words.

When I'm looking for a precise whole word I always match the word surrounded by word breaks. Not always perfect - agriculture.com will match because most punctuation is a word break - but it still helps ensure I find Robert and not Roberta.
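
A small sketch of the behaviour described above, using made-up strings and the hypothetical variable names is_robert and is_agri. It assumes, as stated in the post, that a period counts as a word break for \b.

```stata
clear
input str20 text
"Robert"
"Roberta"
"robert.smith"
"agriculture.com"
end

* whole-word, case-insensitive searches
gen byte is_robert = ustrregexm(text, "(?i)\brobert\b")
gen byte is_agri   = ustrregexm(text, "(?i)\bagriculture\b")

* "Roberta" is rejected, but the period after "robert" and after
* "agriculture" is treated as a word break, so those rows match
assert is_robert == inlist(_n, 1, 3)
assert is_agri   == (_n == 4)
list, clean
```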

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#7

25 May 2021, 15:01

Thanks for the example code in #4, with its explanation of the \b regex word boundary anchor. Below I add "pseudoagriculture" to the data and time four variants: two that use \b (1 and 2), the code from #5 (3), which lacks \b and therefore matches "pseudoagriculture" in contrast to 1 and 2, and a final non-regex variant (4).

Code:

ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  

ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      

ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          

strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"

Results of timing:

Code:

1:  ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  

2:  ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      

3:  ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          

4:  strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"  

. timer list
   1:     11.15 /      100 =       0.1115
   2:     10.55 /      100 =       0.1055
   3:      9.72 /      100 =       0.0972
   4:      1.70 /      100 =       0.0170

Code:

clear all

input str35 text
"corporate Agriculture"
"horticulture"
"Agriculture Engineering"
"livestock and agriculture"
"Agriculture and Civil Engineering"
"agricultural forensics"
"pseudoagriculture"
end

local expand 10000
local reps 100

expand `expand'

#delim;
tokenize // read regex into locals 1, 2, ...  
`"
`"  ustrregexm(text,"(?i)\bagriculture\b(?!\s\bengineering\b)")                                  "'                                  
`"  ustrregexm(text,"(?i)\bagriculture(?!\sengineering\b)")                                      "'      
`"  ustrregexm(text,"(?i)agriculture(?!\sengineering)")                                          "'      
`"  strpos(" " + lower(text) + " ", " agriculture ") & lower(text) != "agriculture engineering"  "'      
"'
; #delim cr  

qui forvalues i = 1/`reps' {
 
    local x 1
 
    while ( `"``x''"' != "" ) {
  
      timer on `x'
      gen byte wanted`x' = ``x''
      timer off `x'
      loc ++x
    }
         
    while ( `x' > 1 ) {

      loc --x
      capture assert wanted`x' == wanted1

      if ( _rc == 9 ) {

          assert `x' == 3 // will "fail" matching "pseudoagriculture" when not using word boundary \b
        }
    }
 
    drop wanted?
}    

qui while ( `"``x''"' != "" ) {

   noi di _n as text "`x':" as res `"``x''"'
   loc ++x
}

timer list


William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

25 May 2021, 19:48

For finding whole words in relatively free text using strpos, Bjarte's

Code:

strpos(" " + lower(text) + " ", " agriculture ")

is a helpful approach that apparently escapes most programmers new to searching strings for text.

The important lesson from this discussion has nothing to do with the choice between regular expressions and other techniques.

The important lesson is that you can't just plow into your text search without completely understanding the text within which you are searching, or you risk counting Roberta as Robert and ignoring robert. If you have the possibility of upper- and lower-case characters, and they make no difference to you, then apply lower() or use case-insensitive regular expressions. If you're looking for a word, be sure to at least surround the search string with spaces, and add leading and trailing spaces to the text within which you are searching, or use regular expressions surrounded by word breaks. And if you're unlucky enough to have, for example, text containing comma-separated items (e.g. "agriculture, mining, and fishing") then you have to do something to handle that - perhaps replacing commas with spaces in the text you are searching, or using word break regular expressions. And if you have URLs (e.g. agriculture.com) that you want treated as a single element, then simple word break regular expressions apparently won't do.
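
As a rough sketch of the comma case mentioned above, one possibility is to replace commas with spaces before padding and searching. The data and the variable names naive, padded and wanted are made up for illustration.

```stata
clear
input str40 text
"agriculture, mining, and fishing"
"Agriculture"
"pseudoagriculture"
"horticulture"
end

* padding alone misses "agriculture," because a comma, not a space, follows the word
gen byte naive = strpos(" " + lower(text) + " ", " agriculture ") > 0

* lower-case, turn commas into spaces, then pad both ends with a space
gen padded = " " + subinstr(lower(text), ",", " ", .) + " "
gen byte wanted = strpos(padded, " agriculture ") > 0

assert naive  == (_n == 2)
assert wanted == inlist(_n, 1, 2)
list text naive wanted, clean
```

Other punctuation (periods, semicolons, parentheses) would need the same treatment, which is where the word break regular expressions become the more convenient choice.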
