  • Google Scholar web scraping: web error 503

    I am interested in scraping Google Scholar citations for a list of 700+ papers that I have in a CSV file. I've written some fairly simple Stata code to do the web scraping, but I've found that after about 40 queries Google "blocks" me from making any more queries and I receive the error message:

    web error 503
    could not open url


    Does anyone have tips for getting around the 503 error for several hundred queries? I've tried running the queries at non-regular intervals to simulate a human, using Stata's sleep command, but with no luck.

    I've been using Stata's import delimited command. Is this the best option? The relevant portion of my code is:

    foreach title in [list of paper titles] {
        local website "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=`title'"
        import delimited using "`website'", [options]
    }
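    For context, the full pattern I've been running looks roughly like the sketch below; the file name, variable name, and pause lengths are placeholders for my actual setup, and [options] stands in for my import delimited options:

    Code:
    * load the paper titles from the CSV (file and variable names are placeholders)
    import delimited using "papers.csv", varnames(1) clear
    levelsof title, local(titles)

    foreach t of local titles {
        * pause a random 30-90 seconds between queries to look less automated
        local pause = 30000 + int(runiform()*60000)
        sleep `pause'

        * replace spaces with + so the title works as a query string
        local q : subinstr local t " " "+", all
        local website "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=`q'"

        capture import delimited using "`website'", clear
        if _rc {
            di as error "request failed for `t' (rc = " _rc ")"
            continue
        }
        * ... pull out and save the citation counts here ...
    }

    Even with the randomized sleep, I still hit the 503 after roughly 40 queries.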

  • #2
    I don't have a solution for you, but here are some suggestions and things to consider:

    1. Google Scholar does not permit scraping. Google provides APIs for some of its other services (like goo.gl) that allow this kind of automated activity, but not for Scholar.

    2. You might be able to get around this with some combination of: (1) manipulating your IP address (you'll need to consider the implications of this on your own), for example by downloading Tor and then using commands like these in your do-file to change your IP periodically:
    Code:
    !service tor reload
    !sudo killall -HUP tor
    ** check that the IP changed:
    !curl ipinfo.io/ip -o myip.txt
    type myip.txt       // you could have the program regex this file to confirm the IP changed

    mata: input2 = cat(`"myip.txt"')
    getmata input2, force
    I've also seen mention that adding a cookies file to your curl or wget request can help forestall the Google Scholar limit, but I've never tested it (e.g., curl's --cookie / --cookie-jar options, or wget's --load-cookies / --save-cookies, pointing at something like ~/.scholar-cookies.txt); there's a rough sketch at the end of this post.
    Note that Google isn't necessarily blocking you based on a strict queries-per-minute formula -- it is reacting to the behavior of your computer (identified by browser characteristics, cookies, etc.) and your external IP address, plus the time between searches, so to get around the limits you'll have to make adjustments on all of those fronts.

    (2) using an authorized third-party service like Mendeley or Publish or Perish (https://harzing.com/resources/publish-or-perish) to run your queries and then scraping the results files.


    3. For pulling data out of the downloaded HTML files, it depends on the page I'm scraping: sometimes I use insheet or import delimited, but most often I use something like the example below to find the pieces I want:


    Code:
    clear
    set obs 10000
    local file "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=breakingrules"
    copy `"`file'"' myfile.txt, replace public        // download the results page to a local file
    mata: input = cat(`"myfile.txt"')                 // read the file into a Mata string vector, one line per element
    getmata input, force                              // bring it into the dataset as a string variable
    cap drop status
    gen status = regexs(1) if regexm(input, `"(href="http://.*)"')    // keep the lines that contain outbound links
    ta status
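
    On the cookie idea from point 2: a minimal sketch of what the shell call might look like from Stata is below. The cookie file name, user-agent string, and query are placeholders, and I haven't tested whether this actually holds off Scholar's limits:

    Code:
    * -b reads cookies from the file, -c writes updated cookies back to it,
    * -A sends a browser-like user agent (file name, agent string, and query are placeholders)
    !curl -b ~/.scholar-cookies.txt -c ~/.scholar-cookies.txt -A "Mozilla/5.0 (X11; Linux x86_64)" -o myfile.txt "https://scholar.google.com/scholar?hl=en&q=breakingrules"

    The downloaded myfile.txt can then be processed the same way as in the example above.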
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



    • #3
      Have you tried to find out whether that part of the search service requires an API key/token? My guess is that you are running into quota limits, where the number of queries within a 24-hour period exceeds some threshold set by Google.



      • #4
        Originally posted by eric_a_booth:
        You might be able to get around this with some combination of: (1) manipulating your IP address (you'll need to consider the implications of this on your own), for example by downloading Tor and then using commands like these in your do-file to change your IP periodically:
        Code:
        !service tor reload
        !sudo killall -HUP tor
        ** check that the IP changed:
        !curl ipinfo.io/ip -o myip.txt
        type myip.txt       // you could have the program regex this file to confirm the IP changed

        mata: input2 = cat(`"myip.txt"')
        getmata input2, force
        eric_a_booth: does this code work on Windows too? (When googling it, I only found results for Ubuntu.) I've tried it, but my IP address stayed the same.



        • #5
          Giuseppe Ciccolini
          sudo is a *nix-specific command for elevating privileges on *nix-based operating systems, so those lines won't run as-is on Windows. There is likely a comparable set of commands for the Windows command line and/or PowerShell that would do the same thing in that environment.
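          For example, a rough Windows equivalent might look like the sketch below. It assumes Tor has been installed as a Windows service (the service name "tor" is an assumption; check yours with sc query) and that Stata is running with administrator rights; curl ships with Windows 10 and later:

          Code:
          * restart the Tor service to get a new circuit/IP (service name is an assumption)
          !net stop tor
          !net start tor

          * confirm that the external IP actually changed
          !curl ipinfo.io/ip -o myip.txt
          type myip.txt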
