  • Removing middle of a string between certain characters

    Hello Statalist,

    I have a string variable which is interspersed with HTML tags (e.g. "<br>" or "</span>"). I want to get rid of all these tags that are identified by angled brackets.

    To make things complicated:
    1) There is a large variety of these tags, so I cannot simply run a "subinstr()" for a select list of them - I need something that catches them in an automated way via the angled brackets.
    2) There can be more than one of these tags per observation.

    I tried the following code (looping it 9 times to remove up to 9 tags):

    Code:
    foreach num of numlist 1/9 {
       gen htmltag`num'=substr(textwithtags,strpos(textwithtags,"<"),strpos(textwithtags,">"))
       replace textwithtags=subinstr(textwithtags,htmltag`num',"",.)
    }
    But this doesn't work well for cases with multiple tags. Take the following example: "<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?" - this approach doesn't know which pair of brackets belong together as one tag, and in consequence some of the "real" text between the tags is also removed... and I end up with only "Does NAME have an ACT"...

    Any help would be much appreciated!

    Best,
    Felix
    Last edited by Felix Sg; 06 May 2019, 10:31.

  • #2
    Stata's "regular expression" string functions can help with this.
    Code:
    clear
    set obs 1
    generate str100 html = `"<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?"'
    
    generate str100 text = ustrregexra(html,"<[^\>]*>","")
    replace text = trim(stritrim(text))
    list, noobs
    Code:
      +-----------------------------------------------------------------------------------------+
      |                                                                                    html |
      | <br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return? |
      |-----------------------------------------------------------------------------------------|
      |                                                                 text                    |
      |                   Does NAME have an AGREEMENT or CONTRACT to return?                    |
      +-----------------------------------------------------------------------------------------+
    To learn more about constructing regular expressions: to the best of my knowledge, the Statalist post linked here is the only place where it is documented that Stata's new regular expression parser is the ICU regular expression engine, which is itself documented at http://userguide.icu-project.org/strings/regexp.



    • #3
      William's regular expression solution is probably your best choice for tackling your problem. However, you might still want to know what went wrong with your original code.

      In this example the problem isn't that strpos() can't match the brackets. You get the position of the first "<" and the first ">" each time, so unless you have nested tags they should match up. The issue is with your substr() call. You have specified it as (string to subset, start position, end position), but that's not the syntax of substr(). The arguments it takes are the string to subset, the start position, and the length of the substring.

      So if you were going to do it this way, you would need to rewrite your code so that you compute the length of the tag first. You could do something like this:

      Code:
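      foreach num of numlist 1/9 {
         * length of the first tag: from the first "<" through the first ">"
         gen len`num'=strpos(textwithtags,">")-strpos(textwithtags,"<")+1
         * extract that tag; the third argument of substr() is a length
         gen htmltag`num'=substr(textwithtags,strpos(textwithtags,"<"),len`num')
         * remove every occurrence of that tag from the text
         replace textwithtags=subinstr(textwithtags,htmltag`num',"",.)
      }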
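
      As a quick check of the length-versus-end-position point, here is a minimal sketch using a made-up string (not the original data). In "ab<br>cd" the first "<" is at position 3 and the first ">" at position 6, so the tag is 6 - 3 + 1 = 4 characters long.

      Code:
      * hypothetical example string, chosen only for illustration
      display strpos("ab<br>cd","<")
      display strpos("ab<br>cd",">")
      * 6 is read as a length, so this returns 6 characters starting at position 3
      display substr("ab<br>cd",3,6)
      * with the length 6-3+1 = 4, only the tag itself is returned
      display substr("ab<br>cd",3,6-3+1)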



      • #4
        Dear William and Sarah,
        This is perfect advice, thank you both so much for providing both a logical fix for my approach and a more elegant solution!
        Best,
        Felix



        • #5

          Adding the ? operator after a quantifier makes the regex match non-greedy (i.e., it matches as few characters as possible):
          Code:
          gen text = ustrregexra( html, "<.+?>", "" )
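
          A small sketch of the difference, using a made-up string rather than the original data: the greedy pattern "<.+>" runs from the first "<" to the last ">" on the line, while "<.+?>" stops at the first ">" it can reach.

          Code:
          * hypothetical example string, for illustration only
          * greedy: removes everything from the first "<" to the last ">"
          display ustrregexra("<b>bold</b> text", "<.+>", "")
          * non-greedy: removes each tag separately and keeps the text between them
          display ustrregexra("<b>bold</b> text", "<.+?>", "")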



          • #6
            Originally posted by William Lisowski
            Stata's "regular expression" string functions can help with this.
            Code:
            clear
            set obs 1
            generate str100 html = `"<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?"'
            
            generate str100 text = ustrregexra(html,"<[^\>]*>","")
            replace text = trim(stritrim(text))
            list, noobs
            Code:
              +-----------------------------------------------------------------------------------------+
              |                                                                                    html |
              | <br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return? |
              |-----------------------------------------------------------------------------------------|
              |                                                                 text                    |
              |                   Does NAME have an AGREEMENT or CONTRACT to return?                    |
              +-----------------------------------------------------------------------------------------+
            To learn more about constructing regular expressions: to the best of my knowledge, the Statalist post linked here is the only place where it is documented that Stata's new regular expression parser is the ICU regular expression engine, which is itself documented at http://userguide.icu-project.org/strings/regexp.
            Hi William,
            Could you explain what "<[^\>]*>" means? Thanks!!

