  • Web scraping / string parsing help

    Hello,

    I'm trying to scrape a webpage, and have imported html code as one variable and am trying to extract my data using string functions.

    To parse with different substrings that vary only in the middle (hence the asterisk below) I want to do something like this:

    split htmlcode, p(`"<div class="ysf-"'*"<a"*">" [2nd string following data])

    However, I don't think including the asterisk is acceptable use for the split command.

    Can anyone recommend another way to do this?

    Thank you!

    -Reese






  • #2
    I think you are trying to feed some kind of regular expression to split as a parsing (punctuation) character. That's way beyond and outside what split does.

    The default case for split is parsing on spaces. So "Stata user" would just get split into "Stata" and "user".

    With e.g. p(,) "a,b,c,d" would get split into "a" "b" "c" "d"

    With e.g. p(@) "somebody@somewhere.edu" would get split into "somebody" and "somewhere.edu".

    You can have multiple parsing characters, but that is it.

    You should get further with moss (SSC).
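
    The split behaviour described above can be sketched as follows. The variable name raw and the generate() stubs are made up for illustration; the last call assumes the documented convention that alternative parsing strings are separated by spaces.

    ```stata
    clear
    input str30 raw
    "Stata user"
    "a,b,c,d"
    "somebody@somewhere.edu"
    end

    * default: parse on spaces
    split raw, generate(bysp)

    * parse on commas
    split raw, parse(,) generate(bycomma)

    * parse on @
    split raw, parse(@) generate(byat)

    * two alternative parsing characters at once
    split raw, parse(, "@") generate(byboth)

    list
    ```

    In every case the parsing strings are matched literally; none of them is treated as a wildcard or pattern.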



    • #3
      I note that what was shown in post #1 does not constitute a "regular expression", and the following advice assumes familiarity with regular expressions, so if you're not a regular expression fanatic, both this code and using moss with regular expressions are going to raise more questions than they answer.

      A possible approach using regular expressions would be to use Stata's regular expression functions to match your "separator" and substitute a fixed string for it, which you then feed into split as the separator.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str99 text
      "firstxxstuffyysecond"
      "xxotheryyanotheryymore"
      end
      generate newtext = ustrregexra(text,"xx.*yy","!!split!!")
      split newtext, parse("!!split!!")
      list
      Code:
      . list
      
           +---------------------------------------------------------------------+
           |                   text                newtext   newtext1   newtext2 |
           |---------------------------------------------------------------------|
        1. |   firstxxstuffyysecond   first!!split!!second      first     second |
        2. | xxotheryyanotheryymore          !!split!!more                  more |
           +---------------------------------------------------------------------+
      The second example was chosen to demonstrate a problem: the regular expression matching is "greedy", making the longest possible match. It does not understand that the first "yy" in my example is intended to stop the match begun at "xx". Not sure how to work around that; I think I've worked with regular expression software that could be set to be greedy or not, but I don't think that is so for the regular expression engine used by Stata.
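
      One thing that may be worth trying, as a sketch rather than a tested fix: the Unicode regular expression functions such as ustrregexra() are documented as being built on the ICU engine, and ICU's syntax includes reluctant quantifiers such as *?. Assuming that holds in your Stata version, replacing .* with .*? should stop the match at the first "yy". The variable name lazytext is made up for illustration; the data are the same two lines as above.

      ```stata
      clear
      input str99 text
      "firstxxstuffyysecond"
      "xxotheryyanotheryymore"
      end

      * .*? is the reluctant (non-greedy) form of .*: it matches as little as possible
      generate lazytext = ustrregexra(text, "xx.*?yy", "!!split!!")

      * the fixed marker is then a literal separator that split can handle
      split lazytext, parse("!!split!!")
      list
      ```

      If the quantifier is honoured, the second observation should become !!split!!anotheryymore, so lazytext2 would hold anotheryymore rather than more.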



      • #4
        Thank you both!

        William, your solution is exactly what I need and is rather simple. I've never used ustrregexra and would not have thought of it.
        Nick, thanks for introducing me to (and writing!) moss. Not going to use it here, but it will be great to have in my repertoire.

        -Reese
