Removing middle of a string between certain characters

Felix Sg

Join Date: May 2019

Posts: 4
#1


06 May 2019, 10:18

Hello Statalist,

I have a string variable that is interspersed with HTML tags. I want to get rid of all of these tags, which are identified by angle brackets.

To make things complicated:
1) There is a large variety of these tags, so I cannot simply run a "subinstr()" for a select list of them - I need something that catches them in an automated way via the angled brackets.
2) There can be more than one of these tags per observation.

I tried the following code (looping it 9 times to remove up to 9 tags):

Code:

foreach num of numlist 1/9 {
    gen htmltag`num'=substr(textwithtags,strpos(textwithtags,"<"),strpos(textwithtags,">"))
    replace textwithtags=subinstr(textwithtags,htmltag`num',"",.)
}

But this doesn't work well for cases with multiple tags. Take the following example: `"<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?"' - this approach doesn't know which pair of brackets belongs together as one tag, and in consequence some of the "real" text between the tags is also removed... and I end up with only "Does NAME have an ACT"...

Any help would be much appreciated!

Best,
Felix

Last edited by Felix Sg; 06 May 2019, 10:31.

William Lisowski

Join Date: Dec 2014
Posts: 10150

06 May 2019, 13:35

Stata's "regular expression" string functions can help with this.

Code:

clear
set obs 1
generate str100 html = `"<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?"'

generate str100 text = ustrregexra(html,"<[^\>]*>","")
replace text = trim(stritrim(text))
list, noobs

Code:

  +-----------------------------------------------------------------------------------------+
  |                                                                                    html |
  | <br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return? |
  |-----------------------------------------------------------------------------------------|
  |                                                                 text                    |
  |                   Does NAME have an AGREEMENT or CONTRACT to return?                    |
  +-----------------------------------------------------------------------------------------+

To the best of my knowledge, only the Statalist post linked here documents that Stata's new regular expression parser is the ICU regular expression engine. To learn more about constructing regular expressions, see the ICU documentation at http://userguide.icu-project.org/strings/regexp.
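As a rough sketch of how the pattern reads: "<" matches a literal opening bracket, "[^>]*" matches zero or more characters that are not ">", and the final ">" matches the closing bracket. Because the middle part can never run past a ">", each match covers exactly one tag. As far as I can tell, the backslash in "[^\>]" is only an escape and the pattern behaves the same without it. The scalar name demo and its contents are made up for illustration.

Code:

```stata
* made-up example string; the scalar name demo is hypothetical
scalar demo = "<b>bold</b> and <i>italic</i>"

* each <...> run is matched and removed separately
display ustrregexra(demo, "<[^>]*>", "")

* the escaped form used above should give the same result
display ustrregexra(demo, "<[^\>]*>", "")
```

Both commands should display the text with all four tags removed.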


Sarah Edgington

Join Date: Apr 2014

Posts: 284
#3

06 May 2019, 13:48

William's regular expression solution is probably your best choice for tackling your problem. However, you might still want to know what was wrong with your original code.

In this example the problem isn't that strpos() can't match the brackets. You get the position of the first "<" and the first ">" each time, so unless you have nested tags they should match up. The issue is with your substr() call. You have specified it as (string to subset, start position, end position), but that's not the syntax of substr(). The arguments it takes are the string to subset, the start position, and the length of the substring.

So if you were going to do it this way, you would need to rewrite your code so that you get the length of the desired substring first. You could do something like this:

Code:

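foreach num of numlist 1/9 {
    * length of the first tag, counted from "<" through ">"
    gen len`num'=strpos(textwithtags,">")-strpos(textwithtags,"<")+1
    * extract the tag itself: start position plus length
    gen htmltag`num'=substr(textwithtags,strpos(textwithtags,"<"),len`num')
    * remove every occurrence of that tag
    replace textwithtags=subinstr(textwithtags,htmltag`num',"",.)
}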
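
A small sketch with a made-up string shows why the third argument of substr() matters. When the position of ">" is passed as if it were an end position, substr() takes that many characters from the start position and so runs past the tag whenever the tag does not begin at the first character. The scalar names demo, p1 and p2 are illustrative only.

Code:

```stata
* made-up example string; demo, p1 and p2 are hypothetical names
scalar demo = "ab<br>cd"
* position of the first "<" (3) and of the first ">" (6)
scalar p1 = strpos(demo, "<")
scalar p2 = strpos(demo, ">")

* third argument mistaken for an end position: takes 6 characters from position 3
display substr(demo, p1, p2)

* third argument given as a length: takes exactly the tag
display substr(demo, p1, p2 - p1 + 1)
```

The first display shows "<br>cd", the second "<br>".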
Felix Sg

Join Date: May 2019

Posts: 4
#4

07 May 2019, 03:57

Dear William and Sarah,
This is perfect advice, thank you both so much for providing both a logical fix for my approach and a more elegant solution!
Best,
Felix
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#5

07 May 2019, 06:35

The ? operator makes the regex match non-greedy (i.e., it matches as few characters as possible):

Code:

gen text = ustrregexra( html, "<.+?>", "" )
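
A quick sketch of the difference, on a made-up string: without the ?, ".+" is greedy and runs from the first "<" to the last ">", taking the text in between with it. The scalar name demo is illustrative only. Note that "<.+?>" needs at least one character between the brackets, so it would not match an empty "<>".

Code:

```stata
* made-up example string; the scalar name demo is hypothetical
scalar demo = "<b>bold</b> and <i>italic</i>"

* greedy: a single match from the first "<" to the last ">", so nothing is left
display ustrregexra(demo, "<.+>", "")

* non-greedy: each tag is matched separately and the text survives
display ustrregexra(demo, "<.+?>", "")
```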

ZILU JIANG

Join Date: Mar 2022
Posts: 1

06 Jun 2022, 22:29

Originally posted by William Lisowski

Stata's "regular expression" string functions can help with this.

Code:

clear
set obs 1
generate str100 html = `"<br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return?"'

generate str100 text = ustrregexra(html,"<[^\>]*>","")
replace text = trim(stritrim(text))
list, noobs

Code:

  +-----------------------------------------------------------------------------------------+
  |                                                                                    html |
  | <br> Does NAME have an <span style="color:red"> AGREEMENT or CONTRACT</span> to return? |
  |-----------------------------------------------------------------------------------------|
  |                                                                 text                    |
  |                   Does NAME have an AGREEMENT or CONTRACT to return?                    |
  +-----------------------------------------------------------------------------------------+

Hi William,
Could you explain what "<[^\>]*>" means? Thanks!!
