Removing unwanted text from within strings (two variations on a problem)

Nate Tamment

Join Date: Jun 2020

Posts: 19
#1


03 May 2022, 23:07

Dear Statalisters,

I am facing two problems with text files that I imported into Stata. The files consist of statements made by different speakers. In the imported dataset, each speaker's statement is a single observation.

Problem #1: How can I use subinstr to remove text only at the end of a string?

For the variable "statement," there are sometimes unwanted words that I would like to get rid of. (They are an artefact of how I split up the text files).

In the example below, I would like to only get rid of the "Secretary" at the end of Observation 1, and not remove the word elsewhere in the string.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str11 speaker str118 statement
"Speaker 1" "The Secretary of State knows that the cost of food will get much higher. Secretary"
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security."
end

If I use the following code:

Code:

gen newvar = strreverse(statement)
replace newvar = subinstr(newvar, "yraterceS", ".", 1)
replace newvar = strreverse(newvar)

- This works for observation 1: it replaces only the last instance of "Secretary".
- It does not work for observation 2: since that statement has no extra text at the end, the command replaces the "Secretary" in the middle of the string.

Is there a way to tell Stata to apply subinstr only within the first 20 characters of the reversed string? Something like the line below, even though substr does not work this way!

Code:

replace staterev1 = subinstr(staterev1, "yraterceS", ".", 1) if substr(1, 20) ==1

Problem #2: How can I use regular expressions (or another method) to remove a certain pattern of text from a string?

In the text files, there are date stamps that appear on page breaks in the format of "DR 10.2008", "DR 1.2011", "DR 2.2022", and so on. They appear randomly throughout the strings: sometimes in the middle and sometimes at the end. They also vary according to the date of the text.

To take a variation on the data example above:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str11 speaker str118 statement
"Speaker 1" "The Secretary of State knows DR 10.2008 that the cost of food will get much higher. Secretary"
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. DR 1.2011"
end

The distinguishing characteristic of these date stamps is that they begin with "DR" and end with a four-digit year. I would like to do something like the below (although again, this doesn't work as I'd like it to):

Code:

replace statement = regexr(statement, "DR*[0-9][0-9][0-9][0-9]", "")

Could anyone help correct the above code to remove this text?

Thanks in advance!

Nate

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

04 May 2022, 00:59

Code:

replace statement = substr(statement,1,length(statement)-length(word(statement,-1))-1) if word(statement,-1) == "Secretary"
replace statement = regexr(statement, " DR.*[0-9][0-9][0-9][0-9]","")
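
A note on the two lines above, with a stricter variant as a sketch. The first line drops the last word only when that word is exactly "Secretary". The second relies on regexr(), which replaces only the first match, and its ".*" is greedy, so it runs to the last four-digit run in the string. That is fine for the example data, but it could remove too much if a statement contained two stamps or a later four-digit number. The version below assumes Stata 14 or later (for ustrregexra() and ustrregexrf()) and assumes every stamp has the form "DR", a space, a one- or two-digit month, a period, and a four-digit year. Neither assumption is confirmed in the thread.

```stata
* Sketch only: assumes Stata 14+ and stamps of the form "DR m.yyyy" or "DR mm.yyyy"
clear
input str11 speaker str118 statement
"Speaker 1" "The Secretary of State knows DR 10.2008 that the cost of food will get much higher. Secretary"
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. DR 1.2011"
end

* Remove every date stamp, together with the space in front of it
replace statement = ustrregexra(statement, " DR [0-9]{1,2}\.[0-9]{4}", "")

* Remove "Secretary" only when it is the last word of the statement
replace statement = ustrregexrf(statement, " Secretary$", "")

list statement
```

Removing the stamps first means a trailing "Secretary" is still caught when a stamp originally followed it.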


Nate Tamment

Join Date: Jun 2020

Posts: 19
#3

04 May 2022, 07:14

Dear Øyvind,

Thanks for your speedy reply, and for the code! Your code works perfectly for resolving these two problems.