Web scraping / string parsing help

Reese Crispen

Join Date: Jul 2018

Posts: 55
#1


27 Nov 2018, 20:39

Hello,

I'm trying to scrape a webpage. I have imported the HTML code as one variable and am trying to extract my data using string functions.

To parse on separator substrings that vary only in the middle (hence the asterisks below), I want to do something like this:

split htmlcode, p(`"<div class="ysf-"'*"<a"*">" [2nd string following data])

However, I don't think an asterisk is acceptable in the parse string of the split command.

Can anyone recommend another way to do this?

Thank you!

-Reese
Nick Cox

Join Date: Mar 2014

Posts: 35334
#2

28 Nov 2018, 03:33

I think you are trying to feed some kind of regular expression to split as a parsing (punctuation) character. That's way beyond and outside what split does.

The default case for split is parsing on spaces. So "Stata user" would just get split into "Stata" and "user".

With e.g. p(,) "a,b,c,d" would get split into "a" "b" "c" "d".

With e.g. p(@) "somebody@somewhere.edu" would get split into "somebody" and "somewhere.edu".

You can have multiple parsing characters, but that is it.
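The cases above can be sketched as follows. This is a minimal illustration, not part of the original post: the variable name s and the stubs comma, at and either are hypothetical, chosen only for the example.

```stata
clear
input str30 s
"a,b,c,d"
"somebody@somewhere.edu"
end

* Parse on commas only; new variables are named comma1, comma2, ...
split s, parse(",") generate(comma)

* Parse on @ only; new variables are named at1, at2, ...
split s, parse("@") generate(at)

* Two parsing strings at once: a comma or an @ each act as a separator
split s, parse("," "@") generate(either)

list
```

Each parsing string is matched literally, so there is no wildcard that would stand for "anything in between".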

You should get further with moss (SSC).
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

28 Nov 2018, 08:46

I note that what was shown in post #1 does not constitute a "regular expression", and the following advice assumes familiarity with regular expressions, so if you're not a regular expression fanatic, both this code and using moss with regular expressions are going to raise more questions than they answer.

A possible approach using regular expressions would be to use Stata's regular expression functions to match your "separator" and substitute a fixed string for it, which you then feed into split as the separator.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str99 text
"firstxxstuffyysecond"
"xxotheryyanotheryymore"
end
generate newtext = ustrregexra(text,"xx.*yy","!!split!!")
split newtext, parse("!!split!!")
list

Code:

. list

     +---------------------------------------------------------------------+
     |                   text                newtext   newtext1   newtext2 |
     |---------------------------------------------------------------------|
  1. |   firstxxstuffyysecond   first!!split!!second      first     second |
  2. | xxotheryyanotheryymore          !!split!!more                  more |
     +---------------------------------------------------------------------+

The second example was chosen to demonstrate a problem: the regular expression matching is "greedy", making the longest possible match, so it does not understand that the first "yy" in my example is intended to stop the match begun at "xx". I'm not sure how to work around that; I think I've worked with regular expression software that could be set to be greedy or not, but I don't think that is so for the regular expression engine used by Stata.
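One possible workaround for the greedy match, offered as a sketch rather than a confirmed fix: Stata's Unicode regular expression functions are documented as using the ICU engine, and ICU syntax accepts the lazy quantifier ".*?", which stops at the first "yy" rather than the last. The example reuses the data from the code above; only the pattern changes.

```stata
clear
input str99 text
"firstxxstuffyysecond"
"xxotheryyanotheryymore"
end

* ".*?" is the lazy (non-greedy) form of ".*": it matches as little as possible,
* so each "xx" is paired with the nearest following "yy"
generate newtext = ustrregexra(text, "xx.*?yy", "!!split!!")

* Split on the fixed placeholder, as before
split newtext, parse("!!split!!")
list
```

Under that assumption, the second observation should keep "anotheryymore" after the separator instead of only "more". The placeholder "!!split!!" must still be a string that cannot occur in the data.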
Reese Crispen

Join Date: Jul 2018

Posts: 55
#4

28 Nov 2018, 12:00

Thank you both!

William, your solution is exactly what I need and is rather simple. I've never used ustrregexra and would not have thought of it.
Nick, thanks for introducing me to (and writing!) moss. Not going to use it here, but it will be great to have in my repertoire.

-Reese