  • Removing unwanted text from within strings (two variations on a problem)

    Dear Statalisters,

    I am facing two problems with text files that I imported into Stata. The files consist of statements made by different speakers. In the imported dataset, each speaker's statement is a single observation.

    Problem #1: How can I use subinstr to remove text in only the first part of a string?

    For the variable "statement," there are sometimes unwanted words that I would like to get rid of. (They are an artefact of how I split up the text files).

    In the example below, I would like to only get rid of the "Secretary" at the end of Observation 1, and not remove the word elsewhere in the string.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str11 speaker str118 statement
    "Speaker 1" "The Secretary of State knows that the cost of food will get much higher. Secretary"                
    "Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security."
    end


    If I use the following code:
    Code:
    gen newvar = strreverse(statement) 
    replace newvar = subinstr(newvar, "yraterceS", ".", 1)
    replace newvar = strreverse(newvar)
    - This works for observation 1: it removes only the last instance of "Secretary".
    - It does not work for observation 2. Since observation 2 has no extra text at the end, the command removes the "Secretary" in the middle of the string.

    Is there a way to tell Stata to use subinstr only within the first 20 characters of the string? Something like the below, even though substr does not work this way!
    Code:
    replace staterev1 = subinstr(staterev1, "yraterceS", ".", 1) if substr(1, 20) ==1


    Problem #2: How can I use regular expressions (or another method) to remove a certain pattern of text from a string?

    In the text files, there are date stamps that appear on page breaks in the format of "DR 10.2008", "DR 1.2011", "DR 2.2022", and so on. They appear randomly throughout the strings: sometimes in the middle and sometimes at the end. They also vary according to the date of the text.

    To take a variation on the data example above:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str11 speaker str118 statement
    "Speaker 1" "The Secretary of State knows DR 10.2008 that the cost of food will get much higher. Secretary"               
    "Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. DR 1.2011"
    end

    The distinguishing characteristic of these date stamps is that they begin with "DR" and end with a four-digit year. I would like to do something like the below (although again, this doesn't work as I'd like it to):
    Code:
    replace statement = regexr(statement, "DR*[0-9][0-9][0-9][0-9]", "")
    Could anyone help correct the above code to remove this text?

    Thanks in advance!


    Nate

  • #2
    Code:
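    * Problem 1: drop the last word (and the space before it) only if that word is "Secretary"
    replace statement = substr(statement,1,length(statement)-length(word(statement,-1))-1) if word(statement,-1) == "Secretary"
    * Problem 2: remove the first match of " DR", any characters, then four digits
    replace statement = regexr(statement, " DR.*[0-9][0-9][0-9][0-9]","")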
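    Two notes on the regular expression, plus a stricter variant as a sketch. In the pattern from the question, "DR*" means "D followed by zero or more R", so nothing in it can match the space, month and dot that sit between "DR" and the year. In the pattern above, ".*" is greedy and regexr() replaces only the first match, so a statement with two stamps, or with another four-digit number later on, may lose more (or less) text than intended. The version below assumes Stata 14 or later (for ustrregexra()) and assumes every stamp looks exactly like "DR", a space, a one- or two-digit month, a dot, and a four-digit year. It is run on the example data from Problem #2.

    ```stata
    * Sketch only: rebuild the example data from Problem #2
    clear
    input str11 speaker str118 statement
    "Speaker 1" "The Secretary of State knows DR 10.2008 that the cost of food will get much higher. Secretary"
    "Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. DR 1.2011"
    end

    * Problem 2: remove every stamp, wherever it sits in the string.
    * \. is a literal dot; {1,2} and {4} fix the number of digits;
    * the optional leading space is removed together with the stamp.
    replace statement = ustrregexra(statement, " ?DR [0-9]{1,2}\.[0-9]{4}", "")

    * Problem 1: remove "Secretary" only when it is the very last word.
    * The $ anchors the match to the end of the string.
    replace statement = ustrregexra(statement, " Secretary$", "")

    list statement
    ```

    The stamps are removed first so that a stray "Secretary" followed by a stamp would still end up at the end of the string when the second command runs. Because the second pattern is anchored with $, the "Secretary" in the middle of observation 2 is left alone.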


    • #3
      Dear Øyvind,

      Thanks for your speedy reply, and for the code! Your code works perfectly for resolving these two problems.





