I am trying to parse a strL variable named "X" in Stata. One of the first steps I need to complete is to remove all characters in the strL variable that fall between the characters "<" and ">". I have attempted using regular expressions via the command:
but this crashes Stata - I suspect because the strL variable X can be very long with lots of text falling between "<" and ">". Sometimes there are hundreds of separate "< text that needs to be removed >" occurrences in a single observation's value of X.
I thought that perhaps I could use Stata's Python integration to (1) load the strL variable X into Python, (2) remove all the text in X between the characters "<" and ">", and (3) return the modified strL variable X back to Stata for further parsing using Stata's excellent substring functions, with which I am already quite familiar. The problem is that I don't have much familiarity with Python, and it appears that working with strL variables in Python is a bit complicated. Whereas I can easily load a str variable from Stata into Python using the sfi module, loading a strL variable seems to work differently for reasons I don't fully understand. I am looking for any advice that might be helpful in this task - whether it be in native Stata (perhaps there is some other approach for removing the unwanted text that won't crash Stata) or through Python integration.
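To make step (2) concrete, here is a minimal sketch of the removal step in plain Python. It only covers stripping the "<...>" spans from a string; reading X from Stata and writing it back through the sfi module is not shown, since how strL values move through sfi is exactly the open question. The names strip_tags and TAG are hypothetical, and the sketch assumes that every "<" has a matching ">" and that such spans are never nested.

```python
import re

# "<", then any run of characters that are not ">", then ">".
# Using [^>]* instead of a greedy ".+" keeps each match confined to a
# single "<...>" span, rather than running from the first "<" in the
# text to the last ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Return text with every "<...>" span removed (hypothetical helper)."""
    return TAG.sub("", text)


# Illustrative value only; in practice this would be one observation of X.
sample = "<p>Hello <b>world</b></p>"
print(strip_tags(sample))
```

Because the negated character class [^>] also matches line breaks, spans that continue over several lines are removed as well, and all occurrences in the string are handled in a single pass.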
Thanks in advance!
Code:
replace X = regexr(X, "\<(.)+\>", "")