Dear Statalisters,
I am facing two problems with text files that I imported into Stata. The files consist of statements made by different speakers. In the imported dataset, each speaker's statement is a single observation.
Problem #1: How can I use subinstr to remove text in only the first part of a string?
For the variable "statement," there are sometimes unwanted words that I would like to get rid of. (They are an artefact of how I split up the text files).
In the example below, I would like to only get rid of the "Secretary" at the end of Observation 1, and not remove the word elsewhere in the string.
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str11 speaker str118 statement
"Speaker 1" "The Secretary of State knows that the cost of food will get much higher. Secretary"
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security."
end
If I use the following code:
Code:
gen newvar = strreverse(statement)
replace newvar = subinstr(newvar, "yraterceS", ".", 1)
replace newvar = strreverse(newvar)
-This works right for observation 1. It removes only the last instance of "Secretary".
-It doesn't work right for observation 2. Since observation 2 did not have extra text at the end, the command removed "Secretary" from the middle of the string.
Is there a way to tell Stata to apply subinstr only within the first 20 characters of the reversed string (that is, the last 20 characters of the original)? Something like the below, even though substr does not work this way!
Code:
replace staterev1 = subinstr(staterev1, "yraterceS", ".", 1) if substr(1, 20) ==1
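One possible workaround, offered only as a sketch: rather than reversing the string, test whether it ends with the unwanted word and cut it off only in that case. The variable name cleaned and the local macro junk are hypothetical, and the sketch assumes that the stray word, when present, is the very last token of statement. The `///` line continuation assumes the code is run from a do-file.

```stata
* Sketch only: -cleaned- and the local -junk- are illustrative names.
* Assumes the unwanted word, when present, is the last token of -statement-.
local junk "Secretary"

* Work on a trimmed copy so trailing blanks do not hide the word
gen cleaned = strtrim(statement)

* Cut the word off only where the string actually ends with it;
* substr() with a negative start counts from the end of the string
replace cleaned = strtrim(substr(cleaned, 1, strlen(cleaned) - strlen("`junk'"))) ///
    if substr(cleaned, -strlen("`junk'"), .) == "`junk'"
```

With the example data, this should shorten only the first observation, because the second ends in "security." rather than "Secretary". Anchoring a regular expression to the end of the string with `$` may be another route to the same result.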
Problem #2: How can I use regular expressions (or another method) to remove a certain pattern of text from a string?
In the text files, there are date stamps that appear on page breaks in the format of "DR 10.2008", "DR 1.2011", "DR 2.2022", and so on. They appear randomly throughout the strings: sometimes in the middle and sometimes at the end. They also vary according to the date of the text.
To take a variation on the data example above:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str11 speaker str118 statement
"Speaker 1" "The Secretary of State knows DR 10.2008 that the cost of food will get much higher. Secretary"
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. DR 1.2011"
end
The distinguishing characteristic of these timestamps is that they begin with "DR", and end with a 4-character year. I would like to do something like the below (although again, this doesn't work as I'd like it to):
Code:
replace statement = regexr(statement, "DR*[0-9][0-9][0-9][0-9]", "")
Could anyone help correct the above code to remove this text?
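For what it is worth, the pattern above probably fails because `DR*` means "D followed by zero or more R", so four digits are required immediately after the letters, and the space, month, and period in the stamp are never matched. The following is a sketch of one direction that might work. It assumes Stata 14 or later (for the Unicode regex function ustrregexra) and that every stamp has the form "DR", one space, a one- or two-digit month, a period, and a four-digit year.

```stata
* Sketch only; assumes Stata 14+ and stamps of the form "DR <month>.<year>".
* ustrregexra() replaces every match, so several stamps per string are handled
replace statement = ustrregexra(statement, "DR [0-9]{1,2}\.[0-9]{4}", "")

* Collapse the double blanks left behind and trim leading/trailing blanks
replace statement = strtrim(stritrim(statement))
```

ustrregexra is used here rather than regexr because, as far as I understand, regexr replaces only the first match in each string, which would leave later stamps in place.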
Thanks in advance!
Nate