
  • How to extract data from a web page with Stata?

    Hi guys,

    Is there a systematic way to extract data from a web page using Stata?

    I have not found any useful information or references, so if anyone could give me a tip or point me to some materials, I would appreciate it very much.

    BTW, does Stata support regular expressions?

    Many thanks.

    Pengpeng

  • #2
    PengPeng: you are recommended to use Stata's search resources before posting! search regular expression or findit regular expression would have answered your second question immediately. (I also suggest that you don't post questions on different topics under a single heading.)


    • #3
      PengPeng,

      Although it's not well documented, you can use Stata's file commands to read data from a website. Instead of a disk file name you can use a URL:

      Code:
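      file open test using "http://www.google.com", read
      file read test line
      while r(eof)==0 {
         // Compound double quotes keep quotation marks in the HTML from breaking display
         di `"`line'"'
         // Do something with the input...
         file read test line
      }
      // Release the file handle once the end of the input is reached
      file close test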
      In this example, test is the file handle name that you will use on all subsequent read operations, and line is a local macro that contains the result of the read.

      See help file for more details.

      Regards,
      Joe


      • #4
        P.S. A question from another user reminded me that the copy command also works; you just substitute the URL for the first filename. Of course, you still need to parse the resulting file. The above code might still be useful for that part of the task.


        • #5
          Parsing complex pages can be tricky if you can only read the HTML pages as text files. You might consider pre-processing your HTML pages to extract the fields you are interested in. In Python you can use BeautifulSoup for web scraping. It might be easier and more reliable than working with the text files.
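
A minimal sketch of the pre-processing idea, under stated assumptions: it uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without third-party packages, and the names TableCellParser, html_table_to_csv, and SAMPLE_HTML, along with the sample markup itself, are hypothetical illustrations. It pulls the cell text out of each table row and writes it as CSV text.

```python
import csv
import io
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the open cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, etc.) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html_text):
    """Return the table rows found in html_text as CSV text."""
    parser = TableCellParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample page; a real page would first be saved to disk or fetched.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

The CSV text could then be saved to a file and read into Stata with import delimited. Real pages are often messier than this sample (nested tables, missing closing tags), which is where a dedicated library such as BeautifulSoup tends to be more forgiving.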


          • #6
            Hi,
            You can use Yellow Pages Spider, a tool that searches most of the popular “Yellow Pages” directories. With this software you can extract data from websites and then export it to an Excel file.

            http://www.softwaredownloadcentre.co...ges-spider.php


            • #7
              My personal favorite is -fileread-:
              Code:
              set obs 1
              gen s = fileread("http://www.statalist.org")
              Regards, Mike


              • #8
                Before you start scraping web pages, are you sure that the website you are interrogating doesn't provide a RESTful (or other) interface for accessing the data via the web?
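
To illustrate why a data interface is usually easier than scraping, here is a minimal sketch that assumes the site returns a JSON array of records. The payload in SAMPLE_RESPONSE, the field names, and the function name json_records_to_csv are hypothetical; in practice the text would come from the site's documented endpoint (for example via urllib.request.urlopen) rather than from a string literal.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fields):
    """Convert a JSON array of objects into CSV text with the given columns."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=fields,
        extrasaction="ignore",  # drop keys that are not requested columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)   # missing keys are written as empty strings
    return out.getvalue()


# Hypothetical response body, standing in for what an API might return.
SAMPLE_RESPONSE = '[{"id": 1, "name": "alpha", "extra": true}, {"id": 2, "name": "beta"}]'

if __name__ == "__main__":
    print(json_records_to_csv(SAMPLE_RESPONSE, ["id", "name"]))
```

Because the fields arrive already labelled, no HTML parsing or regular expressions are needed before the data is handed to Stata.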


                • #9
                  Check also http://scrapy.org/, if you decide to go down that path.


                  • #10
                    Originally posted by Mike Lacy
                    My personal favorite is -fileread-:
                    Code:
                    set obs 1
                    gen s = fileread("http://www.statalist.org")
                    Regards, Mike
                    Thank you Mike, your suggestion saved me at least a month's work.
