How to extract data from web page by Stata?

Pengpeng Ye

Join Date: May 2014

Posts: 6
#1

How to extract data from web page by Stata?

12 Jun 2014, 22:53

Hi guys,

Is there any systematic method to extract data from a web page using Stata?

I did not find any very useful information or references. So if anyone could give me a tip or some materials, I would appreciate that very much.

BTW, does Stata support regular expressions?

Many thanks.

Pengpeng
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#2

13 Jun 2014, 02:34

PengPeng: you are recommended to use Stata's search resources before posting! search regular expression or findit regular expression would have answered your second question immediately. (I also suggest that you don't post questions on different topics under a single heading.)
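
For readers who land here looking for the short answer: yes, Stata has regular-expression functions such as regexm() and regexs(). Below is a minimal sketch of how they are typically combined. The sample strings and variable names are made up for illustration and do not come from any real page.

```stata
* Made-up lines standing in for text scraped from a page
clear
input str40 raw
"<td>Population: 1370536875</td>"
"<td>Area: 9596961</td>"
"<td>no number here</td>"
end

* regexm() returns 1 when the pattern matches, 0 otherwise
gen byte hit = regexm(raw, "([A-Za-z]+): ([0-9]+)")

* regexs(n) returns the n-th parenthesized group of the most recent match
gen str20 label  = regexs(1) if regexm(raw, "([A-Za-z]+): ([0-9]+)")
gen double value = real(regexs(2)) if regexm(raw, "([A-Za-z]+): ([0-9]+)")

list raw hit label value
```

Rows that do not match are left with an empty label and a missing value, so they are easy to spot afterwards.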
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

13 Jun 2014, 08:54

PengPeng,

Although it's not well documented, you can use Stata's file commands to read data from a website. Instead of a disk file name you can use a URL:

Code:

file open test using "http://www.google.com", read
file read test line
while r(eof)==0 {
    di `"`line'"'    // Do something with the input...
    file read test line
}
file close test

In this example, test is the file handle name that you will use on all subsequent read operations, and line is a local macro that contains the result of each read.

See help file for more details.
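
As a rough illustration of the "do something with the input" step, here is a minimal sketch that pulls the page title out of the lines as they are read. So that it runs without a network connection, it first writes a tiny stand-in page to a temporary file; the page content, the handle names wh and rh, and the local macro title are all assumptions made for the example. In real use a URL would take the place of the temporary file name.

```stata
* Write a small stand-in page to a temporary file (invented content)
tempfile page
tempname wh rh
file open `wh' using "`page'", write text
file write `wh' "<html>" _n
file write `wh' "<title>Example page</title>" _n
file write `wh' "</html>" _n
file close `wh'

* Read it back line by line and keep the text between the title tags
local title ""
file open `rh' using "`page'", read text
file read `rh' line
while r(eof) == 0 {
    * compound quotes protect against double quotes inside the line
    if regexm(`"`line'"', "<title>(.*)</title>") {
        local title = regexs(1)
    }
    file read `rh' line
}
file close `rh'

display "`title'"
```

This only works when the opening and closing tags sit on the same line, which is one reason parsing raw HTML this way can be fragile.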

Regards,
Joe
Joe Canner

Join Date: Mar 2014

Posts: 580
#4

13 Jun 2014, 14:33

P.S. A question from another user reminded me that the copy command also works; you just substitute the URL for the first filename. Of course, you still need to parse the resulting file. The above code might still be useful for that part of the task.
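
A minimal sketch of the copy route, assuming a working internet connection and that the site answers the request. The URL is just the one mentioned elsewhere in this thread; the macro names page, failed and nlines and the line counter are placeholders for real parsing.

```stata
* Download the page to a temporary file (needs internet access)
tempfile page
capture copy "http://www.statalist.org" "`page'"
local failed = (_rc != 0)

local nlines = 0
if `failed' {
    display as error "download failed"
}
else {
    * Count the lines of the saved copy; swap the counter for real parsing
    tempname fh
    file open `fh' using "`page'", read text
    file read `fh' line
    while r(eof) == 0 {
        local ++nlines
        file read `fh' line
    }
    file close `fh'
    display "lines read: `nlines'"
}
```

Saving a local copy first means the page is downloaded only once, however many times the parsing code is rerun.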
Bert Jung

Join Date: Apr 2014

Posts: 16
#5

14 Jun 2014, 12:25

Parsing complex pages can be tricky if you can only read the HTML pages as text files. You might consider pre-processing your HTML pages to extract the fields you are interested in. In Python you can use BeautifulSoup for web scraping. It might be easier and more reliable than working with the text files.
alva

Join Date: Oct 2014

Posts: 1
#6

10 Oct 2014, 01:07

Hi,
You can use Yellow Pages Spider, a tool that searches most of the popular “Yellow Pages” directories. With this software you can extract data from websites and then export it to an Excel file.

http://www.softwaredownloadcentre.co...ges-spider.php
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

10 Oct 2014, 06:56

My personal favorite is -fileread-:

Code:

set obs 1
gen s = fileread("http://www.statalist.org")
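
Once the whole page sits in a single string variable, ordinary string functions can cut out the piece of interest. Below is a minimal sketch that runs offline by reading a small stand-in file instead of a URL; the file content and the variable names start, stop and title are assumptions made for the example. It also checks for the "fileread() error" text that fileread() returns when the read fails.

```stata
* Write a small stand-in page to a temporary file (invented content)
tempfile page
tempname wh
file open `wh' using "`page'", write text
file write `wh' "<html><title>Example page</title>" _n "<body>text</body></html>" _n
file close `wh'

* Read the whole file into one strL observation
clear
set obs 1
gen strL s = fileread("`page'")

* On failure fileread() returns a string beginning with "fileread() error"
assert strpos(s, "fileread() error") != 1

* Locate the title tags and keep the text between them
gen long start  = strpos(s, "<title>") + strlen("<title>")
gen long stop   = strpos(s, "</title>")
gen str80 title = substr(s, start, stop - start) if stop > start

list title
```

Declaring the variable as strL makes room for pages far longer than a fixed-length str variable can hold.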

Regards, Mike
Phil Schumm

Join Date: Mar 2014

Posts: 169
#8

10 Oct 2014, 07:29

Before you start scraping web pages, are you sure that the website you are interrogating doesn't provide a RESTful (or other) interface for accessing the data via the web?
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#9

10 Oct 2014, 12:15

Check also http://scrapy.org/, if you decide to go down that path.

Andrew Musau

Join Date: Oct 2014

Posts: 10194
#10

26 Dec 2022, 05:27

Originally posted by Mike Lacy

My personal favorite is -fileread-:

Code:

set obs 1
gen s = fileread("http://www.statalist.org")

Regards, Mike

Thank you Mike, your suggestion saved me at least a month's work.