  • Using python integration to parse a strL variable

    I am trying to parse a strL variable named "X" in Stata. One of the first steps I need to complete is to remove all characters in the strL variable that fall between the characters "<" and ">". I have attempted using regular expressions via the command:

    Code:
    replace X = regexr(X, "\<(.)+\>", "")
    but this crashes Stata - I suspect because the strL variable X can be very long with lots of text falling between "<" and ">". Sometimes there are hundreds of separate "< text that needs to be removed >" occurrences in a single observation's value of X.

    I thought that perhaps I could use Stata's Python integration to (1) load the strL variable X into Python, (2) remove all the text in X between the characters "<" and ">", and (3) return the modified strL variable X back to Stata for further parsing using Stata's excellent substring functions, with which I am already quite familiar. The problem is that I don't have much familiarity with Python, and it appears that working with strL variables in Python is a bit complicated. Whereas I can easily load a str variable from Stata into Python using the sfi module, loading a strL variable seems to work differently for reasons I don't fully understand. I am looking for any advice that might be helpful in this task, whether it be in native Stata (perhaps there is some other approach for removing the unwanted text that won't crash Stata) or through Python integration.

    Thanks in advance!

  • #2
    In my experience, oddities can occur when using regex functions on strLs, although some of these problems have been fixed over the years. I would encourage you to contact Stata Tech Support, which was quite thoughtful and interested when I did that about another issue in this context several years ago.

    I would consider using an approach that does not use regex. I think you can do what you want with something like the following:
    Code:
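    // This should be correct in concept, but it may well have some mistakes.
    // Corrections welcome.
    // Each pass deletes the first "<...>" span in every observation that has one.
    gen long start = .
    gen long stop = .
    local done = 0
    while !`done'  {
      // look for the next "<" and the first ">" at or after it
      replace start = strpos(X, "<")
      // stop is measured from start, so the ">" sits at position start + stop - 1
      replace stop = cond(start > 0, strpos(substr(X, start, .), ">"), 0)
      count if start > 0 & stop > 0
      if (r(N) > 0)  {
         // still some observations with stuff to delete
         replace X = substr(X, 1, start - 1) + ///
                     substr(X, start + stop, . ) if start > 0 & stop > 0
      }
      else { // no observations with a complete "<...>" pair
         local done = 1
      }
    }
    drop start stop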
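
    For anyone who wants to try the same non-regex idea on the Python side, here is a minimal sketch. It works on a single Python string; getting the values of X into and out of Python (for example with the sfi module) is not shown, and the function name strip_tags_scan is just a hypothetical name for illustration.

```python
def strip_tags_scan(text: str) -> str:
    """Remove every "<...>" span by scanning for delimiters, without regex.

    Hypothetical helper for illustration; an unmatched "<" (one with no
    ">" after it) and everything following it is kept unchanged.
    """
    pieces = []
    pos = 0  # start of the text not yet copied
    while True:
        start = text.find("<", pos)
        if start == -1:
            break  # no more "<"
        stop = text.find(">", start + 1)
        if stop == -1:
            break  # unmatched "<": keep the remainder as is
        pieces.append(text[pos:start])  # keep the text before the "<"
        pos = stop + 1                  # resume just after the ">"
    pieces.append(text[pos:])
    return "".join(pieces)


print(strip_tags_scan("1 <one> 2 <two> 3"))
```

    Collecting the kept pieces in a list and joining once avoids rebuilding a very long string hundreds of times, which may matter when a single value contains many "<...>" spans.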



    • #3
      To the advice from Mike Lacy let me add a few suggestions.

      First, your current regular expression is going to match everything between the leftmost "<" and the rightmost ">". This is not what you want.

      Next, I would suggest that you try using the newer Unicode-supporting regular expression functions (e.g. ustrregexra in your case). They have a different underlying regular expression engine that may be more robust to long strings, although in fact eliminating the problem of long matches may itself solve the crashing problem. Note that ASCII is a proper subset of Unicode and thus these functions work on simple ASCII strings as well as multi-byte Unicode characters.

      Here is an example.
      Code:
      . * Example generated by -dataex-. To install: ssc install dataex
      . clear
      
      . input str17 text
      
                        text
        1. "1 <one> 2 <two> 3"
        2. end
      
      . generate new1 = regexr(text, "\<(.)+\>", "")
      
      . generate new2 = ustrregexra(text, "\<[^>]*\>", "")
      
      . list, clean noobs
      
                       text   new1      new2  
          1 <one> 2 <two> 3   1  3   1  2  3  
      
      .
      To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
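
      If you do end up going the Python route, the same two patterns behave the same way in Python's re module. This is only a sketch on a plain Python string; it does not show how to move the strL values between Stata and Python, and the names GREEDY, TAG, and strip_tags are made up for the illustration.

```python
import re

# Greedy pattern, analogous to "\<(.)+\>": runs from the first "<" to the last ">".
GREEDY = re.compile(r"<.+>")
# Negated character class: each match stops at the first ">" after its "<".
TAG = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove every non-nested "<...>" span (hypothetical helper)."""
    return TAG.sub("", text)


sample = "1 <one> 2 <two> 3"
print(GREEDY.sub("", sample, count=1))  # prints "1  3"
print(strip_tags(sample))               # prints "1  2  3"
```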

      Also, a caveat: if you have nested sets of angle brackets (e.g. "<a <b> c>") you will need a more complicated regular expression than the one I provided, which I believe would match only "<a <b>" and so yield " c>".
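
      For the nested case, one alternative to a more complicated regular expression is to track the nesting depth character by character. Below is a minimal sketch in Python, assuming that brackets are meant to nest and that a stray ">" outside any bracket should be kept; strip_nested is a hypothetical name, and an unmatched "<" would cause everything after it to be dropped.

```python
def strip_nested(text: str) -> str:
    """Drop everything inside "<...>", allowing nested brackets.

    Hypothetical helper for illustration. Characters are kept only while
    the nesting depth is zero; a ">" seen at depth zero is kept as text.
    """
    depth = 0
    kept = []
    for ch in text:
        if ch == "<":
            depth += 1          # entering a (possibly nested) bracket
        elif ch == ">" and depth > 0:
            depth -= 1          # leaving the innermost open bracket
        elif depth == 0:
            kept.append(ch)     # outside all brackets: keep the character
    return "".join(kept)


print(strip_nested("x <a <b> c> y"))  # prints "x  y"
```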
