  • Google Scholar web scraping: web error 503

    I am interested in scraping Google Scholar citations for a list of 700+ papers that I have in a CSV file. I've written some fairly simple Stata code to do the web scraping, but I've found that after about 40 queries Google "blocks" me from making any more queries and I receive the error message:

    web error 503
    could not open url


    Does anyone have tips for getting around the 503 error for several hundred queries? I've tried running the queries at non-regular intervals to simulate a human, using Stata's sleep command, but with no luck.

    I've been using Stata's import delimited command. Is this the best option? The relevant portion of my code is:

    foreach title in [list of paper titles] {
        local website "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=`title'"
        import delimited using "`website'", [options]
    }
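    For context, the full pattern I've been running looks roughly like the sketch below; the file name, variable name, and pause lengths are placeholders for my actual setup, and [options] stands in for my import delimited options:

    Code:
    * load the paper titles from the CSV (file and variable names are placeholders)
    import delimited using "papers.csv", varnames(1) clear
    levelsof title, local(titles)

    foreach t of local titles {
        * pause a random 30-90 seconds between queries to look less automated
        local pause = 30000 + int(runiform()*60000)
        sleep `pause'

        * replace spaces with + so the title works as a query string
        local q : subinstr local t " " "+", all
        local website "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=`q'"

        capture import delimited using "`website'", clear
        if _rc {
            di as error "request failed for `t' (rc = " _rc ")"
            continue
        }
        * ... pull out and save the citation counts here ...
    }

    Even with the randomized sleep, I still hit the 503 after roughly 40 queries.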

  • #2
    I don't have a solution for you, but here are some suggestions and things to consider:

    1. Google Scholar does not permit scraping. Google provides APIs for some of its other services (like goo.gl) that allow this kind of automated activity, but not for Scholar.

    2. You might be able to get around this with some combination of: (1) manipulating your IP address (you'll need to consider the implications of this on your own), for example by downloading Tor and then using commands like these in your do-file to change your IP periodically:
    Code:
    !service tor reload
    !sudo killall -HUP tor
    ** check that the IP changed:
    !curl ipinfo.io/ip -o myip.txt
    type myip.txt       // you could have the program regex this file to confirm the IP changed

    mata: input2 = cat(`"myip.txt"')
    getmata input2, force
    I've also seen mention that adding a cookies file to your curl or wget request can help forestall the Google Scholar limit, but I've never tested it (e.g., curl's --cookie / --cookie-jar options, or wget's --load-cookies / --save-cookies, pointing at something like ~/.scholar-cookies.txt); there's a rough sketch at the end of this post.
    Note that Google isn't necessarily blocking you based on a strict queries-per-minute formula -- it is reacting to the behavior of your computer (identified by browser characteristics, cookies, etc.) and your external IP address, plus the time between searches, so to get around the limits you'll have to make adjustments on all of those fronts.

    (2) using an authorized third-party service like Mendeley or Publish or Perish (https://harzing.com/resources/publish-or-perish) to run your queries and then scraping the results files.


    3. For pulling data out of the downloaded HTML files, it depends on the page I'm scraping: sometimes I use insheet or import delimited, but most often I use something like the example below to find the pieces I want:


    Code:
    clear
    set obs 10000
    local file "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&q=breakingrules"
    copy `"`file'"' myfile.txt, replace public        // download the results page to a local file
    mata: input = cat(`"myfile.txt"')                 // read the file into a Mata string vector, one line per element
    getmata input, force                              // bring it into the dataset as a string variable
    cap drop status
    gen status = regexs(1) if regexm(input, `"(href="http://.*)"')    // keep the lines that contain outbound links
    ta status
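
    On the cookie idea from point 2: a minimal sketch of what the shell call might look like from Stata is below. The cookie file name, user-agent string, and query are placeholders, and I haven't tested whether this actually holds off Scholar's limits:

    Code:
    * -b reads cookies from the file, -c writes updated cookies back to it,
    * -A sends a browser-like user agent (file name, agent string, and query are placeholders)
    !curl -b ~/.scholar-cookies.txt -c ~/.scholar-cookies.txt -A "Mozilla/5.0 (X11; Linux x86_64)" -o myfile.txt "https://scholar.google.com/scholar?hl=en&q=breakingrules"

    The downloaded myfile.txt can then be processed the same way as in the example above.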
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



    • #3
      Have you tried to find out whether that part of the search service requires an API key/token? My guess is that you are running into quota limits, where the number of queries within a 24-hour period exceeds some threshold set by Google.



      • #4
        Originally posted by eric_a_booth:
        You might be able to get around this with some combination of: (1) manipulating your IP address (you'll need to consider the implications of this on your own), for example by downloading Tor and then using commands like these in your do-file to change your IP periodically:
        Code:
        !service tor reload
        !sudo killall -HUP tor
        ** check that the IP changed:
        !curl ipinfo.io/ip -o myip.txt
        type myip.txt       // you could have the program regex this file to confirm the IP changed

        mata: input2 = cat(`"myip.txt"')
        getmata input2, force
        eric_a_booth: does this code work on Windows too? (When googling it, I only found results for Ubuntu.) I've tried it, but my IP address stayed the same.



        • #5
          Giuseppe Ciccolini
          sudo is a *nix-specific command for elevating privileges on *nix-based operating systems, so those lines won't run as-is on Windows. There is likely a comparable set of commands for the Windows command line and/or PowerShell that would do the same thing in that environment.
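          For example, a rough Windows equivalent might look like the sketch below. It assumes Tor has been installed as a Windows service (the service name "tor" is an assumption; check yours with sc query) and that Stata is running with administrator rights; curl ships with Windows 10 and later:

          Code:
          * restart the Tor service to get a new circuit/IP (service name is an assumption)
          !net stop tor
          !net start tor

          * confirm that the external IP actually changed
          !curl ipinfo.io/ip -o myip.txt
          type myip.txt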
