How can I create one binary variable that =1 if an observation contains one of multiple substrings?

Joe Mo

Join Date: Feb 2019

Posts: 4
#1

01 Feb 2019, 13:45

Hi, I want to create a binary variable that equals 1 if an observation of var1 contains certain substrings. This is survey data and I am looking for a particular word, but it is often misspelt, so I want to search for all possible spellings.

I am using:
foreach word in xx xy xz {
egen `word' = incss(var1), sub(`word') insensitive
}

This works, but it creates a separate variable for each spelling. I only want one variable that equals 1 if any of the spelling variations is present. Thanks.

Last edited by Joe Mo; 01 Feb 2019, 13:48.

David Benson

Join Date: Oct 2018
Posts: 489

01 Feb 2019, 15:07

You might try:

Code:

gen has_word = 0

foreach word in xx xy xz {
replace has_word=1 if strpos(var1, "`word'") > 0
}

Code:

* Example data (generated with: dataex text)
clear
input str47 text
"The speaker occasionally referred to his notes" 
"The speaker often referred to his notes"        
"The speaker frequently referred to his notes"   
"The speaker occasionelly referred to his notes" 
"The speaker occasionally referred to his notes" 
"The speaker ocasionally referred to his notes"  
"The speaker occasionaly referred to his notes"  
"The speaker occassionally referred to his notes"
"The speaker occasionnally referred to his notes"
end

gen has_word=0

foreach word in occasionaly occasionelly occasionally ocasionally Occasionally {
    replace has_word=1 if strpos(text, "`word'") > 0
}

. list, noobs

  +------------------------------------------------------------+
  |                                            text   has_word |
  |------------------------------------------------------------|
  |  The speaker occasionally referred to his notes          1 |
  |         The speaker often referred to his notes          0 |
  |    The speaker frequently referred to his notes          0 |
  |  The speaker occasionelly referred to his notes          1 |
  |  The speaker occasionally referred to his notes          1 |
  |------------------------------------------------------------|
  |   The speaker ocasionally referred to his notes          1 |
  |   The speaker occasionaly referred to his notes          1 |
  | The speaker occassionally referred to his notes          0 |
  | The speaker occasionnally referred to his notes          0 |
  +------------------------------------------------------------+


Joe Mo

Join Date: Feb 2019
Posts: 4

01 Feb 2019, 15:45

That worked perfectly, thank you very much!


Joe Mo

Join Date: Feb 2019
Posts: 4

01 Feb 2019, 15:48

Also, is there a way to make the match case-insensitive, or should I just include each spelling in both capitalised and uncapitalised form?


David Benson

Join Date: Oct 2018

Posts: 489
#5

01 Feb 2019, 15:55

You can use strupper() to convert both the text and the search word to uppercase, so that the comparison ignores case.

Code:

* Change what's in the loop to:
replace has_word=1 if strpos(strupper(text), strupper("`word'")) > 0
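
For reference, here is a minimal sketch of the full loop with the case-insensitive test, using the example variable text from earlier in the thread. The last part is an assumption rather than something tested in this thread: ustrregexm() needs Stata 14 or later, has_word2 is a hypothetical variable name, and the pattern is only one guess at how the misspellings vary.

Code:

* has_word = 1 if text contains any listed spelling, ignoring case
capture drop has_word
gen has_word = 0
foreach word in occasionaly occasionelly occasionally ocasionally {
    replace has_word = 1 if strpos(strupper(text), strupper("`word'")) > 0
}

* Assumed alternative: one regular expression instead of a spelling list
* (a nonzero third argument makes the match case-insensitive)
gen has_word2 = ustrregexm(text, "oc+as+ion+[ae]l+y", 1)

The pattern allows repeated c, s, n and l and either a or e before the final l, so it should also catch spellings such as occassionally and occasionnally that are missing from the list. A looser pattern like this can pick up unintended words in real survey text, so it is worth listing the matches and checking them by eye.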