Foreach to match partial strings from a separate file

Kirsten Arendse

Join Date: Jun 2023

Posts: 2
#1

17 Jun 2023, 10:56

I would like to set a variable to 1 (cancer = 1) whenever the string variable term contains a partial match with any word in a local list (see below).

use "file 1", clear

local words cancer malignancy metastatic

foreach w of local words {
replace cancer = 1 if strpos(term, "`w'")
}

QUESTION

There are over 200 partial strings that could indicate cancer. I've started with 3 words to check that the code works. Instead of writing out all 200 words under "local words", how do I refer to a separate file (file 2) that contains the list of strings?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

17 Jun 2023, 11:43

In the code below, I use temporary files for both the first data and the separate file of cancer terms. You will, of course, use your real permanent files for these purposes and will replace the references in the code to the temporary files with the corresponding filenames.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float cancer str18 term
0 "adenocarcinoma"
0 "malignancy"
0 "reflux esophagitis"
0 "Wilm's tumor"
0 "glioblastoma"
end
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. For more info, type help dataex
clear
input str14 cancer_term
"Wilm's tumor"
"adenocarcinoma"
"cancer"
"carcinoma"
"glioblastoma"
"malignancy"
"retinoblastoma"
"sarcoma"
end
tempfile cancer_terms
save `cancer_terms'

use `dataset1'
gen `c(obs_t)' obs_no = _n
cross using `cancer_terms'
by obs_no, sort: egen byte has_cancer = max(strpos(term, cancer_term) > 0)
by obs_no: keep if _n == 1
replace cancer = 1 if has_cancer
drop cancer_term has_cancer

Note: If your first data set is very large, this code may cause you to exceed available memory, because the intermediate data set immediately following -cross- will be 200 times as large as the original and might not fit. (The -keep- command two lines later will restore the size of the original data set.) To avoid this problem, you may wish to strip your original data set down to a single observation identifier and the term variable (as was done here) and then, when done, merge the results back to your original data set. If even this leaves you overflowing available memory, post back and I will show you a different way that is more economical of memory.
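
The strip-down-and-merge-back idea in the note could look roughly like the sketch below. It is only an illustration under stated assumptions: "file 1" and "file 2" are the file names from #1, "file 2" is assumed to be a Stata dataset with a single string variable called cancer_term (as in the example), and the names obs_no, has_cancer and full are assumed not to exist already in "file 1".

```stata
* Sketch only. Assumes "file 1" holds term and cancer, and "file 2" is a
* Stata dataset with one string variable named cancer_term (an assumption).
use "file 1", clear
gen `c(obs_t)' obs_no = _n           // observation identifier for merging back
tempfile full
save `full'

keep obs_no term                      // carry only what the matching needs
cross using "file 2"
by obs_no, sort: egen byte has_cancer = max(strpos(term, cancer_term) > 0)
by obs_no: keep if _n == 1
keep obs_no has_cancer

* Bring the flag back to the full data set, one row per original observation
merge 1:1 obs_no using `full', assert(match) nogenerate
replace cancer = 1 if has_cancer
drop obs_no has_cancer
```

Only obs_no and term are multiplied by -cross-, so the intermediate data set stays as narrow as possible; all other variables wait in the temporary file until the merge.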

In the future, when asking for help with code, please use the -dataex- command and show example data. Although sometimes, as here, it is possible to give an answer that has a reasonable probability of being correct, this is usually not the case. Moreover, such answers are necessarily based on experience-based guesses or intuitions about the nature of your data. When those guesses are wrong, both you and the person trying to help you have wasted your time, because you end up with useless code. To avoid this, provide a -dataex- based example, which gives all of the information needed to develop and test a solution.

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1398
#3

17 Jun 2023, 12:59

Building off the example provided by Clyde (thanks!), here are two methods that may be less demanding on memory:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str14 cancer_term
"Wilm's tumor"  
"adenocarcinoma"
"cancer"        
"carcinoma"    
"glioblastoma"  
"malignancy"    
"retinoblastoma"
"sarcoma"      
end

gen byte i_var = 1
gen `c(obs_t)' j_var = _n

reshape wide cancer_term, i(i_var) j(j_var)
egen all_cancer_terms = concat(cancer_term*) , punct(;) // * rather than ? so that multi-digit suffixes (cancer_term10, ...) are included
local all_cancer_terms = all_cancer_terms[1]

* Example generated by -dataex-. For more info, type help dataex
clear
input str40 term
"adenocarcinoma"    
"malignancy"        
"reflux esophagitis"
"Wilm's tumor"      
"glioblastoma"
"god forbid another malignancy"      
end

gen byte is_cancer = (strpos("`all_cancer_terms'", term) > 0) // ignore this line if you fall in case #2

* ignore all the lines below if you fall in case #1
gen byte is_cancer_2 = 0
local current_cancer_terms = "`all_cancer_terms';"
local j = strpos("`current_cancer_terms'", ";")
while `j' > 0 {
    local cancer_term = substr("`current_cancer_terms'", 1 , `=`j'-1')
    replace is_cancer_2 = (strpos(term,"`cancer_term'") > 0) if is_cancer_2 == 0
    local current_cancer_terms = substr("`current_cancer_terms'", `=`j'+1', .)
    local j = strpos("`current_cancer_terms'", ";")
}

which produces:

Code:

. list , noobs sep(0) abbrev(15)

  +---------------------------------------------------------+
  |                          term   is_cancer   is_cancer_2 |
  |---------------------------------------------------------|
  |                adenocarcinoma           1             1 |
  |                    malignancy           1             1 |
  |            reflux esophagitis           0             0 |
  |                  Wilm's tumor           1             1 |
  |                  glioblastoma           1             1 |
  | god forbid another malignancy           0             1 |
  +---------------------------------------------------------+

I added one term to Clyde's list (the last observation) to create a situation where only a part of the term is a cancer term. The variable is_cancer tests whether the whole of term appears somewhere in the concatenated list, so it misses that observation; is_cancer_2 instead tests each cancer term against term in turn. If such situations do not exist in your data (call this case #1), then the subset of my code that creates is_cancer will suffice. If examples like this do exist in your data (case #2), then skip is_cancer and use the rest of the code, which creates is_cancer_2.

If your data contains semi-colons, then use a different character as punctuation to separate cancer terms in the code above.
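
One way to make the separator easy to change is to hold it in a local and to check that no cancer term contains it before concatenating. The following is a minimal sketch, assuming "|" is free to use; the local name sep and the three-term list are illustrative only.

```stata
* Sketch only: the "|" separator and the local name sep are assumptions.
local sep "|"

clear
input str14 cancer_term
"carcinoma"
"malignancy"
"sarcoma"
end

* Stop with an error if any term already contains the separator
assert strpos(cancer_term, "`sep'") == 0

gen byte i_var = 1
gen `c(obs_t)' j_var = _n
reshape wide cancer_term, i(i_var) j(j_var)
egen all_cancer_terms = concat(cancer_term*), punct("`sep'")
local all_cancer_terms = all_cancer_terms[1]
display "`all_cancer_terms'"
```

With this arrangement, each ";" in the while loop would become "`sep'" as well, and the same -assert- could be run on term in the data being searched.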

Last edited by Hemanshu Kumar; 17 Jun 2023, 13:25.

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

17 Jun 2023, 14:14

Actually, I realized that there's a simpler way to do it than either #2 or #3. It's just a slight modification of the approach in #1. You don't want to write out a local with 200 cancer terms in it. But Stata will do the job for you by creating a local from the separate file:

Code:

use `cancer_terms', clear
levelsof cancer_term, local(words)
use `dataset1', clear
foreach w of local words {
    replace cancer = 1 if strpos(term, `"`w'"')
}

It will be slower than the approach in #2, perhaps much slower if dataset1 is very large, but the code is simpler, and it makes negligible demands on memory.
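
With permanent files in place of the temporary ones, the same approach might be written as in the sketch below. "file 1" and "file 2" are the names used in #1; the variable name cancer_term inside "file 2" is an assumption carried over from the example data, and -quietly- and the final -count- are optional additions.

```stata
* Sketch only. "file 1" and "file 2" are the file names from #1; the variable
* name cancer_term inside "file 2" is an assumption.
use "file 2", clear
levelsof cancer_term, local(words)    // each term comes back in compound quotes
use "file 1", clear
foreach w of local words {
    quietly replace cancer = 1 if strpos(term, `"`w'"') > 0
}
count if cancer == 1                  // how many observations were flagged
```

Because -levelsof- wraps each string value in compound double quotes, a multi-word term such as "Wilm's tumor" stays a single item in the loop, which is also why the loop refers to the term as `"`w'"'.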
Kirsten Arendse

Join Date: Jun 2023

Posts: 2
#5

18 Jun 2023, 08:38

Hi Clyde and Hemanshu. I ran it with the last code that Clyde suggested and it worked! Thank you for the help.