Remove part of string variable if it contains a word

Daniel Wilmink

Join Date: Jun 2017

Posts: 15
#1

Remove part of string variable if it contains a word

15 Jun 2017, 08:02

Dear Statalist members,

I am trying to remove a part of my name variable if it contains 'DEAD'. This is an example of my data:

name
AAH DEAD
AB AIRLINES DEAD
AB CONSULTING
AB DYNAMICS
AB ELTN.PRDS.
ABACUS GROUP DEAD

Desired result:
name
AAH
AB AIRLINES
AB CONSULTING
AB DYNAMICS
AB ELTN.PRDS.
ABACUS GROUP

Essentially, I need a code that removes ' DEAD' if an observation contains it. I have tried the code below, but that removes the entire name of the company and results in empty observations for companies that had 'DEAD' in their names.

replace name = substr(name, 1, strpos(name,"DEAD")-1)

Thank you for your help!
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#2

15 Jun 2017, 08:34

If you have not already, try looking at the entries in -help string functions- to learn about the various functions that would help with problems related to strings For this question, you will find -subinstr()- to be useful:

Code:

replace name = subinstr(name,"DEAD", .)
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#3

15 Jun 2017, 08:38

Code:

replace name = subinstr(name, " DEAD", "", .)

Note: This will remove " DEAD" wherever it occurs--even if that's not the end of the name. Also, are you sure that all values of name are upper case in your data? If a string contains " dead" this code will fail.

The reason your code removes all the strings that do not contain "DEAD" is that, when "DEAD" does not appear in name, strpos(name, "DEAD") is 0. The code then replaces name by the substring of name beginning at character 1 with length 0-1. A string with negative length is an empty string!

Added: Crossed with #2, which, I believe, has an error because it omits the "" argument.
5 likes
Comment
Daniel Wilmink

Join Date: Jun 2017

Posts: 15
#4

15 Jun 2017, 08:42

Entering this yields an 'Invalid syntax r(198)'

However, if I use

Code:

replace name = subinstr(name,"DEAD","",.)

it works, but leaves a spacebar after the company name (e.g. 'AAH ' instead of 'AAH'). How would I go about taking care of this spacebar?
Comment
Daniel Wilmink

Join Date: Jun 2017

Posts: 15
#5

15 Jun 2017, 08:47

Originally posted by Clyde Schechter View Post

Code:

replace name = subinstr(name, " DEAD", "", .)

Note: This will remove " DEAD" wherever it occurs--even if that's not the end of the name. Also, are you sure that all values of name are upper case in your data? If a string contains " dead" this code will fail.

The reason your code removes all the strings that do not contain "DEAD" is that, when "DEAD" does not appear in name, strpos(name, "DEAD") is 0. The code then replaces name by the substring of name beginning at character 1 with length 0-1. A string with negative length is an empty string!

Added: Crossed with #2, which, I believe, has an error because it omits the "" argument.

Dear Clyde,

This worked! However, in order to be safe, the 'DEAD' part should only be removed at the end of the name. Is that possible too? I.e. that only the last 5 characters are removed (spacebar+DEAD) if spacebar+DEAD are the last 5 characters.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#6

15 Jun 2017, 09:05

So, to be fully general, we need the code to remove " DEAD" when it appears at the end of the string, and, I presume, if a string contains DEAD both at the end and somewhere else, it should remove only the one at the end. That requires a few little tricks:

Code:

clear input str18 name "AAH DEAD" "AB AIRLINES DEAD" "AB CONSULTING" "AB DYNAMICS" "AB ELTN.PRDS." "ABACUS GROUP DEAD" "GRATEFUL DEAD INC" "ONE DEAD1 TWO DEAD" end replace name = reverse(subinstr(reverse(name), "DAED ", "", 1)) /// if substr(name, length(name)-4, 5) == " DEAD" list, noobs clean

Notice by the way that "DAED " in the -replace name =...- line is not a typo. It is " DEAD" typed backwards, which is exactly what is needed here.

By the way, in the future, please post example data using the -dataex- command. While what you have shown was easy enough to import to Stata, more complicated example data sets would have been troublesome. You can get the -dataex- command by running -ssc install dataex- and then you can run -help dataex- to see the instructions for using it. When you use -dataex-, you make it possible for those who want to help you to create a completely faithful replica of your Stata example with a simple copy/paste operation.

Added: There is probably a really short simple way to do this with regular expressions, but I always struggle with them, perhaps because I use them too infrequently to ever really internalize their syntax.

Last edited by Clyde Schechter; 15 Jun 2017, 09:36.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

15 Jun 2017, 10:18

Rising to Clyde's challenge, but only for expository purposes, and to demonstrate technique that removes any excess blanks before the trailing "DEAD".

Code:

clear
input str18 name
"AAH DEAD"        
"AB AIRLINES DEAD"
"AB CONSULTING"    
"AB DYNAMICS"      
"AB ELTN.PRDS."    
"ABACUS GROUP DEAD"
"GRATEFUL DEAD INC"
"ONE DEAD1 TWO DEAD"
"THE UNDEAD"
"TWO BLANKS  DEAD"
end

clonevar rename = name
replace name = reverse(subinstr(reverse(name), "DAED ", "", 1)) ///
    if substr(name, length(name)-4, 5) == " DEAD"
replace rename = regexr(rename," +DEAD$","")

list, noobs clean

Code:

. list, noobs clean

                 name              rename  
                  AAH                 AAH  
          AB AIRLINES         AB AIRLINES  
        AB CONSULTING       AB CONSULTING  
          AB DYNAMICS         AB DYNAMICS  
        AB ELTN.PRDS.       AB ELTN.PRDS.  
         ABACUS GROUP        ABACUS GROUP  
    GRATEFUL DEAD INC   GRATEFUL DEAD INC  
        ONE DEAD1 TWO       ONE DEAD1 TWO  
           THE UNDEAD          THE UNDEAD  
          TWO BLANKS           TWO BLANKS

Comment

Santiago Cantillo

Join Date: Nov 2017

Posts: 17
#8

18 Dec 2017, 21:49

Hi!
What about if you have a list of words you want to remove?
So instead of just having "DEAD" you also want to remove "ALIVE" and "DON'T KNOW".
For example:

Code:

name AAH DEAD AB AIRLINES DEAD AB CONSULTING ALIVE AB DYNAMICS DON'T KNOW AB ELTN.PRDS. ABACUS GROUP DEAD

I have a long list of string i would like to remove so i'm trying to use a foreach loop but haven't had luck yet.
So far my code looks something like this

Code:

foreach scenario in ALIVE DEAD "DON'T KNOW"{ replace name="`scenario'" if strpos(name, "`scenario'") >0 }

My problem here is that code will leave "AAH DEAD" unchanged since it doesn't contain DON'T KNOW (last loop), and so ignores the replace done in the previous loops.
Thanks in advance!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#9

19 Dec 2017, 08:43

I don't believe your loop does what you say it does, but it also doesn't do what you want. Try this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str22 name "AAH DEAD" "AB AIRLINES DEAD" "AB CONSULTING ALIVE" "AB DYNAMICS DON'T KNOW" "AB ELTN.PRDS." "ABACUS GROUP DEAD" end foreach scenario in DEAD ALIVE "DON'T KNOW"{ local target = reverse(`"`scenario'"') local length = length(`"`target'"') replace name = reverse(subinstr(reverse(name), "`target' ", "", 1)) /// if substr(name, length(name)-`length', `=`length'+1') == `" `scenario'"' }

In the future, please use the -dataex- command to show example data, as I have done here. It greatly simplifies the process of replicating your Stata example in another person's Stata, so that code can be tested on it. If you are running version 15.1, -dataex- is part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it.
1 like
Comment
Maia DEBS

Join Date: Apr 2023

Posts: 11
#10

27 Apr 2023, 08:31

Hi all,

A follow-up question to this old thread, which I found very helpful: What about if you have a list of words you want to keep, rather than drop?

My data consists in a list of different financial instruments:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str28 instrument "4 % cons. cum. stock" "6 1/2 % Electric Power debs." "6 % debs." "5 1/4 % stock" "5 % stock org. cons." "5 % stock (1923)" "5 1/4 % stock" end

My goal is to harmonize the names of the instruments. To do this, I want to keep the interest rate + a list of words (cons. cum. and stock.) and drop all words that are not part of this list (In this case, I want to drop "Electric Power", "org." and "(1923)" ) How can I do that?

Code:

* desired result: clear input str25 instrument "4 % cons. cum. stock" "6 1/2 % debs." "6 % debs." "5 1/4 % stock" "5 % stock cons." "5 % stock" "5 1/4 % stock" end

Many thanks in advance for any ideas/leads!

Last edited by Maia DEBS; 27 Apr 2023, 08:37.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36058

#11

27 Apr 2023, 09:29

This may help with #10

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str28 instrument
"4 % cons. cum. stock"              
"6 1/2 % Electric Power debs."      
"6 % debs."    
"5 1/4 % stock"      
"5 % stock org. cons."    
"5 % stock (1923)"    
"5 1/4 % stock"          
end

gen first = substr(instrument, 1, strpos(instrument, "%")) 

gen work = trim(substr(instrument, strpos(instrument, "%") + 1, .))  

gen second = "" 

foreach w in cons cum stock {
  replace second = second + "`w' " if strpos(work, "`w'")
} 

replace second = trim(second) 

list instrument first second 

     +---------------------------------------------------------+
     |                   instrument     first           second |
     |---------------------------------------------------------|
  1. |         4 % cons. cum. stock       4 %   cons cum stock |
  2. | 6 1/2 % Electric Power debs.   6 1/2 %                  |
  3. |                    6 % debs.       6 %                  |
  4. |                5 1/4 % stock   5 1/4 %            stock |
  5. |         5 % stock org. cons.       5 %       cons stock |
     |---------------------------------------------------------|
  6. |             5 % stock (1923)       5 %            stock |
  7. |                5 1/4 % stock   5 1/4 %            stock |
     +---------------------------------------------------------+

Comment

Maia DEBS

Join Date: Apr 2023

Posts: 11
#12

09 May 2023, 10:10

Apologies about the slow reply, that did the job perfectly, many thanks!
Comment

Maia DEBS

Join Date: Apr 2023
Posts: 11

#13

12 May 2023, 08:37

A follow-up question. I have the variable CompanyInstrument, which combines company names and financial instruments. I want to split it in two variables: Company and Instrument.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str40 CompanyInstrument
"Hull Corporatn. 3 1/2 %, red. 1943"
"Harrisons and Crosfield 5 1/2 % cum. prf."
"Green (J. W.) 4 % deb. red."
"Delaware and Hudson. $100 shares"
"Gas Light and Coke (10 % with gas at 3/9)"
"Lynn & Boston cum. bonds 1924"
"Argentine Land pref. shares 10% guaranteed"
"Birmingham Gas cum. shares (max., 9 pr ct.)"
end

a) In most cases, the name of the financial instrument starts with the interest rate. How can I adapt Nick's code to obtain the following result:

Company	Instrument
Hull Corporatn	3 1/2 %, red. 1943
Harrisons and Crosfield	5 1/2 % cum. prf.
Green (J. W.)	4 % deb. red.

b) In other cases it is less straightforward, because the instrument's name doesn't start with the interest rate. It begins with the characteristics of the instrument (cum. , pref.) or with a $ sign, or with a parenthesis. How can I tell Stata to extract the instrument name starting the first of these words that appear in CompanyInterest?

Desired result is:

Company	Instrument
Delaware and Hudson.	$100 shares
Gas Light and Coke	(10 % with gas at 3/9)
Lynn & Boston	cum. Bonds 1924
Argentine Land L.	pref. shares 10% guaranteed
Birmingham Gas	cum. shares (max., 9 pr ct.)

I am struggling to write a code that takes into account both scenarios a) and b). Many thanks for your suggestions!

Last edited by Maia DEBS; 12 May 2023, 08:41.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10483
#14

12 May 2023, 17:52

Any automated solution requires you to specify a specific set of conditions. If you are finding it difficult to spell out your rules, then it also becomes difficult to write code. For example, there appears no obvious way to specify that the parentheses

Green (J. W.) 4% deb. red.

belong to "company" whereas the parentheses

Gas Light and Coke (10 % with gas at 3/9)

do not. Unless you come up with some specific rule, e.g., only parentheses that are placed at the end of the sentence belong to "instrument".
Comment
Maia DEBS

Join Date: Apr 2023

Posts: 11
#15

14 May 2023, 22:46

Hi Andrew. Right, thank you for pointing it out. The rule goes as follows: the code should extract the sentence between the parentheses only if the first word that immediately follows the opening parenthesis is either a percentage (e.g. 10%) or the word 'max'.

Last edited by Maia DEBS; 14 May 2023, 22:50.
Comment

Announcement