Delete everything after a specific string

Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#1

Delete everything after a specific string

10 Nov 2022, 11:23

Hi!

I'm dealing with some data where my variable "address" is contaminated also by the population where my observations live. Here you have some examples (poblacion= village):
CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT
PASAJE ACHAO, POBLACION LA LAGUNA
VICENTE REYES 889 POBLACION PATRONA DE CHILE

I need to remove POBLACION and everything that goes after it, so in the first case, I would need to remove POBLACION ARTURO PRAT.

Is this possible in Stata?

Daniel.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35446

10 Nov 2022, 11:40

Code:

clear 
input str46 tochange
"CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
"PASAJE ACHAO, POBLACION LA LAGUNA"
"VICENTE REYES 889 POBLACION PATRONA DE CHILE"
end 

gen toget = trim(substr(tochange, 1, strpos(tochange, "POBLACION") - 1)) 

list toget 

     +-------------------------+
     |                   toget |
     |-------------------------|
  1. | CALLE CONCHA Y TORO 120 |
  2. |           PASAJE ACHAO, |
  3. |       VICENTE REYES 889 |
     +-------------------------+

Comment

Daniel Perez Parra

Join Date: Jul 2021
Posts: 63

10 Nov 2022, 14:19

Originally posted by Nick Cox View Post

Code:

clear
input str46 tochange
"CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
"PASAJE ACHAO, POBLACION LA LAGUNA"
"VICENTE REYES 889 POBLACION PATRONA DE CHILE"
end

gen toget = trim(substr(tochange, 1, strpos(tochange, "POBLACION") - 1))

list toget

+-------------------------+
| toget |
|-------------------------|
1. | CALLE CONCHA Y TORO 120 |
2. | PASAJE ACHAO, |
3. | VICENTE REYES 889 |
+-------------------------+

Thank you Nick. Is there a way to do it without having to generate another variable? Just curiosity.

Daniel.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35446
#4

10 Nov 2022, 17:11

You can use replace — with the danger that you get it wrong or change your mind.
1 like
Comment
Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#5

11 Nov 2022, 02:34

Originally posted by Nick Cox View Post

You can use replace — with the danger that you get it wrong or change your mind.

Thank you Nick, as helpful as always.
Comment
Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#6

11 Nov 2022, 12:32

Originally posted by Nick Cox View Post

You can use replace — with the danger that you get it wrong or change your mind.

I have tried to add replace since I have just discovered that there are also other population units like "Villa" that should be removed in the same way we did with Poblacion. But it removes all the observations of address, probably because I don't understand quite well the logic behind the command.

Code:

replace address=trim(substr(address, 1, strpos(address, ", POBLACION") -1)) replace address=trim(substr(address, 1, strpos(address, "SECTOR") -1)) replace address=trim(substr(address, 1, strpos(address, "VILLA") -1))

Daniel.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35446
#7

11 Nov 2022, 13:28

Indeed. If a string is not found, then strpos() returns 0 so you need to make the replacement conditional on a positive result.

See

Code:

help strpos()
1 like
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1326

12 Nov 2022, 04:15

Some alternative code using regular expressions:

Code:

clear 
input str46 tochange
"CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
"PASAJE ACHAO, POBLACION LA LAGUNA"
"VICENTE REYES 889 POBLACION PATRONA DE CHILE"
"MI VILLA NUMERO 990 EN CHILE"
"PASAJE ACHAO, LA LAGUNA"
"PASAJE SECTOR ACHAO, LA LAGUNA"
end 

gen toget = ustrregexra(tochange,"\s(POBLACION|VILLA|SECTOR)\s.+$","")

so that you get:

Code:

. li, noobs sep(0)
  +-------------------------------------------------------------------------+
  |                                      tochange                     toget |
  |-------------------------------------------------------------------------|
  | CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT   CALLE CONCHA Y TORO 120 |
  |             PASAJE ACHAO, POBLACION LA LAGUNA             PASAJE ACHAO, |
  |  VICENTE REYES 889 POBLACION PATRONA DE CHILE         VICENTE REYES 889 |
  |                  MI VILLA NUMERO 990 EN CHILE                        MI |
  |                       PASAJE ACHAO, LA LAGUNA   PASAJE ACHAO, LA LAGUNA |
  |                PASAJE SECTOR ACHAO, LA LAGUNA                    PASAJE |
  +-------------------------------------------------------------------------+

Last edited by Hemanshu Kumar; 12 Nov 2022, 04:18.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35446
#9

12 Nov 2022, 04:35

Regular expressions are serious players here. For those who find them intimidating or awkward, other functions remain possible.

Code:

gen where = 0 foreach w in POBLACION VILLA SECTOR { replace where = strpos(tochange, "`w'") } gen toget = cond(where > 0, substr(tochange, 1, where - 1), tochange)

which assumes that each key word occurs at most once.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1326

#10

12 Nov 2022, 05:19

The code in #9 has two issues:

it will pick up even those situations where the letters are embedded in another word, e.g. "SECTORES"
the loop will reset the value of where to the location of "SECTOR" in the string, meaning that it misses every case except those where the string contains SECTOR

The following modification fixes both issues. It also accounts for situations where more than one keyword appears in the string. I retain the code in #9 to compare the results.

Code:

clear
input str46 tochange
"CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
"PASAJE ACHAO, POBLACION LA LAGUNA"
"VICENTE REYES 889 POBLACION PATRONA DE CHILE"
"MI VILLA NUMERO 990 EN CHILE"
"PASAJE ACHAO, LA LAGUNA"
"PASAJE SECTOR ACHAO, LA LAGUNA"
"NUEVOS SECTORES EN CHILE"
"MI VILLA DE NOVA SECTOR 20 CHILE"
"SECTOR 20 CHILE"
end 

gen where = 0
gen where2 = .
foreach w in POBLACION VILLA SECTOR {
      replace where = strpos(tochange, "`w'")
      replace where2 = min(where2, strpos(tochange, " `w' ")) if strpos(tochange," `w' ") > 0
      replace where2 = 1 if strpos(tochange,"`w' ") == 1
} 

replace where2 = 0 if missing(where2)
gen toget = cond(where > 0, substr(tochange, 1, where - 1), tochange)
gen toget2 = cond(where2 > 0, substr(tochange, 1, where2 - 1), tochange)
drop where where2

The results are:

Code:

. li, noobs sep(0)

  +--------------------------------------------------------------------------------------------------------------------------+
  |                                      tochange                                           toget                     toget2 |
  |--------------------------------------------------------------------------------------------------------------------------|
  | CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT   CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT    CALLE CONCHA Y TORO 120 |
  |             PASAJE ACHAO, POBLACION LA LAGUNA               PASAJE ACHAO, POBLACION LA LAGUNA              PASAJE ACHAO, |
  |  VICENTE REYES 889 POBLACION PATRONA DE CHILE    VICENTE REYES 889 POBLACION PATRONA DE CHILE          VICENTE REYES 889 |
  |                  MI VILLA NUMERO 990 EN CHILE                    MI VILLA NUMERO 990 EN CHILE                         MI |
  |                       PASAJE ACHAO, LA LAGUNA                         PASAJE ACHAO, LA LAGUNA    PASAJE ACHAO, LA LAGUNA |
  |                PASAJE SECTOR ACHAO, LA LAGUNA                                         PASAJE                      PASAJE |
  |                      NUEVOS SECTORES EN CHILE                                         NUEVOS    NUEVOS SECTORES EN CHILE |
  |              MI VILLA DE NOVA SECTOR 20 CHILE                               MI VILLA DE NOVA                          MI |
  |                               SECTOR 20 CHILE                                                                            |
  +--------------------------------------------------------------------------------------------------------------------------+

The regular expression equivalent, incidentally, is the single line:

Code:

gen toget3 = ustrregexra(tochange,"(^|\s)(POBLACION|VILLA|SECTOR)\s.+$","")

I am myself trying to learn regular expressions, because as awkward (and yes, intimidating!) as they are, they are both extremely powerful and usually lead to much shorter code even when the old string functions can get the job done.

Last edited by Hemanshu Kumar; 12 Nov 2022, 05:54.

Comment

Daniel Perez Parra

Join Date: Jul 2021
Posts: 63

#11

12 Nov 2022, 07:55

Thank you so much to both!

For the other users, I've solved my issue using the following code:

Code:

foreach var of varlist address{

replace `var' = subinstr(`var', "SECTOR", "POBLACION",.)
replace `var' = subinstr(`var', "VILLA", "POBLACION",.)    
replace `var' = subinstr(`var', "LOCALIDAD", "POBLACION",.)    

replace `var' = subinstr(`var', ", POBLACION", "POBLACION",.)    

}

gen address2=trim(substr(address, 1, strpos(address, "POBLACION") -1))
egen address2dummy=group(address2)
replace address=address2 if address2dummy!=.

First, I rename the other all the location units as LOCATION. Then I apply the code suggested by Nick and I make the replacement (replace) generating an ID and replace when !=.
This is probably not the simplest solution but it works!

Daniel.

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1326
#12

12 Nov 2022, 08:11

Daniel Perez Parra A couple of warnings and a question:
your code will wrongly pick up situations where VILLA, SECTOR or POBLACION are part of a longer word.

your code will not work if any of those words occur at the start of the string.

I hope you are sure that neither of these occur in your data.

Also, I am not clear why you create address2dummy. Here it seems to be used to identify observations where address2 is non-missing. If this is its only purpose, you can substitute for the last two lines:

Code:

replace address = address2 if !missing(address2)
Comment
Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#13

12 Nov 2022, 08:16

Originally posted by Hemanshu Kumar View Post

Daniel Perez Parra A couple of warnings and a question:
your code will wrongly pick up situations where VILLA, SECTOR or POBLACION are part of a longer word.

your code will not work if any of those words occur at the start of the string.

I hope you are sure that neither of these occur in your data.

Also, I am not clear why you create address2dummy. Here it seems to be used to identify observations where address2 is non-missing. If this is its only purpose, you can substitute for the last two lines:

Code:

replace address = address2 if !missing(address2)

Indeed, my data doesn't contain any of those 2 cases, but it is good to point that out.

Regarding your second comment, you're right. Thank you for providing an simpler code.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35446
#14

12 Nov 2022, 08:39

Thanks for the corrections in #10. The replacement should be conditional on the variable being positive. As for false positives, yes, that could be a problem. One can always imagine small problems. The code in #8 could fail whenever spelling was even slightly different.
Comment

Announcement