Keep variables if they contain at least one word of a given list

Lisa Moon

Join Date: Nov 2019

Posts: 8
#1

21 Nov 2019, 03:44

Hello Stata Community

I have a very large number of observations, and I want to keep only those that contain at least one of the 20 key words I have. The problem is that the observations are sentences, not single words.

I have tried the command:

keep if inlist(Resolution,"drill" , "dioxin" , "clean up" , "nuclear" , "environment" , "environmental" , "pollution" , "energy" , "power" , "chlorine" , "trees" , "GHG" , "emissions" , "forest" , "recycling" , "recycled" , "mercury" , "water" , "filter" , "gene-engineered" , "mining" , "PVC" , "old growth wood" , "waste" , "paper" , "radioactive" , "toxic" , "plutonium" , "renewable" , "greenhouse gas" , "climate" , "CO2" , "parabens" , "phthalates")

but there is always an error saying that the expression is too long.

Do you have any idea how I could do this? Thank you very much for your help.
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#2

21 Nov 2019, 05:36

As the help says, for strings there can be no more than 10 arguments, so break what you are doing into several inlist() calls with an "or" (|) between each pair of lists.
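A minimal sketch of that splitting, using the keywords and the variable name Resolution from #1. The four toy observations are made up for illustration. Each call lists at most 9 keywords so that it stays within 10 arguments even if the limit counts Resolution itself. Because of the `///` line continuations, this is meant to be run from a do-file. Note that it still keeps only observations whose Resolution is exactly equal to one of the keywords.

```stata
* Toy data (hypothetical), only to make the sketch runnable
clear
input str40 Resolution
"drill"
"Stop drilling in the Arctic"
"phthalates"
"board diversity"
end

* Several inlist() calls joined with | (logical or), at most 9 keywords per call
keep if inlist(Resolution, "drill", "dioxin", "clean up", "nuclear", "environment", ///
        "environmental", "pollution", "energy", "power") ///
    | inlist(Resolution, "chlorine", "trees", "GHG", "emissions", "forest", ///
        "recycling", "recycled", "mercury", "water") ///
    | inlist(Resolution, "filter", "gene-engineered", "mining", "PVC", "old growth wood", ///
        "waste", "paper", "radioactive", "toxic") ///
    | inlist(Resolution, "plutonium", "renewable", "greenhouse gas", "climate", "CO2", ///
        "parabens", "phthalates")

* Only the two exact matches remain; the sentence about drilling is dropped
list, noobs
```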
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

21 Nov 2019, 05:43

Rich Goldstein is bang on, but see also https://www.stata.com/support/faqs/d...s-for-subsets/ for another way to do it.

On a different level, note that your inlist() call is not a test for "contains". It is a test for "equals".
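To illustrate the difference, here is a small sketch with two made-up observations; the variable names is_equal and has_nuclear are hypothetical. inlist() is true only when the whole string equals one of its arguments, while strpos() returns the position of a substring (0 if it is absent), so `> 0` gives a "contains" test.

```stata
* Toy data (hypothetical)
clear
input str40 Resolution
"nuclear"
"Report on nuclear waste storage"
end

* Equality test: 1 only if the entire string equals one of the arguments
gen byte is_equal = inlist(Resolution, "nuclear", "waste")

* Containment test: 1 if "nuclear" occurs anywhere in the string
gen byte has_nuclear = strpos(Resolution, "nuclear") > 0

list, noobs
```

The sentence in the second observation contains both words, yet is_equal is 0 for it, whereas has_nuclear is 1.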

David Benson

Join Date: Oct 2018
Posts: 489

22 Nov 2019, 22:16

Lisa, you might look at the examples here, here, and here.

Picking an example from the first link:

Code:

* Example data (originally produced with: dataex text)
clear
input str47 text
"The speaker occasionally referred to his notes"
"The speaker often referred to his notes"
"The speaker frequently referred to his notes"
"The speaker occasionelly referred to his notes"
"The speaker occasionally referred to his notes"
"The speaker ocasionally referred to his notes"
"The speaker occasionaly referred to his notes"
"The speaker occassionally referred to his notes"
"The speaker occasionnally referred to his notes"
end

gen has_word = 0

* Flag observations that contain any of the listed spellings (case-sensitive)
foreach word in occasionaly occasionelly occasionally ocasionally Occasionally {
    replace has_word = 1 if strpos(text, "`word'") > 0
}

* To make the above loop case-insensitive
foreach word in occasionaly occasionelly occasionally ocasionally Occasionally {
    replace has_word = 1 if strpos(strupper(text), strupper("`word'")) > 0
}

. list, noobs

  +------------------------------------------------------------+
  |                                            text   has_word |
  |------------------------------------------------------------|
  |  The speaker occasionally referred to his notes          1 |
  |         The speaker often referred to his notes          0 |
  |    The speaker frequently referred to his notes          0 |
  |  The speaker occasionelly referred to his notes          1 |
  |  The speaker occasionally referred to his notes          1 |
  |------------------------------------------------------------|
  |   The speaker ocasionally referred to his notes          1 |
  |   The speaker occasionaly referred to his notes          1 |
  | The speaker occassionally referred to his notes          0 |
  | The speaker occasionnally referred to his notes          0 |
  +------------------------------------------------------------+
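The same loop can be adapted to the question in #1. This is only a sketch under some assumptions: the four observations are made up, the string variable is called Resolution as in #1, and the match is made case-insensitive with lower(). Each keyword is quoted so that multi-word phrases such as "clean up" remain a single list element, and "environmental" is left out because any text containing it also contains "environment". The `///` continuations require running this from a do-file.

```stata
* Toy data (hypothetical)
clear
input str60 Resolution
"Report on greenhouse gas emissions"
"Adopt a CLEAN UP plan for the site"
"Elect the board of directors"
"Phase out PVC packaging"
end

gen byte has_word = 0

* Flag observations that contain at least one keyword, ignoring case
foreach word in "drill" "dioxin" "clean up" "nuclear" "environment" ///
        "pollution" "energy" "power" "chlorine" "trees" "GHG" "emissions" ///
        "forest" "recycling" "recycled" "mercury" "water" "filter" ///
        "gene-engineered" "mining" "PVC" "old growth wood" "waste" "paper" ///
        "radioactive" "toxic" "plutonium" "renewable" "greenhouse gas" ///
        "climate" "CO2" "parabens" "phthalates" {
    replace has_word = 1 if strpos(lower(Resolution), lower("`word'")) > 0
}

* Keep only the flagged observations
keep if has_word
list, noobs
```

Because strpos() matches substrings, a keyword such as "power" would also flag text containing "empower", so it may be worth checking a sample of the kept observations.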


Lisa Moon

Join Date: Nov 2019
Posts: 8

26 Nov 2019, 06:32

Thank you so much David. This worked perfectly!
