Mike Lacy requested that I post an example of how I used his suggestion in https://www.statalist.org/forums/for.../general/12850 to extract data from different webpages. We are interested in the effect of sociopolitical events on attendance in the Kontinental Hockey League (KHL). The data are available from https://en.khl.ru/calendar/202/00/ and are for the seasons 08/09 - 22/23, representing over 11,000 observations.
The issue is that the attendance data are embedded within links (click where the scores are displayed), where each link represents a game.
Therefore, manual extraction would entail opening over 11,000 links and copying the required information from each. Instead, we can extract these data automatically using Stata, as below (for the 08/09 season). First, I read each page into Stata and export it as a text file. I then import and parse the text file to retrieve the information that I want. It helps that the pages are standardized to a large degree, so the same code works for all of them.
The results follow in #2.
Code:
// Frame that accumulates one row per game (the placeholder first row is dropped at the end)
cap frame drop myresults
frame create myresults
frame myresults{
    set obs 1
    gen game=.
    gen time=""
    gen venue=""
    gen attendance=.
}
tempfile b

// Loop over the game IDs in the protocol URLs for this season
forval i= 21650/22321{
    clear
    set obs 1
    // Read the whole protocol page into a single string
    gen s = fileread("https://en.khl.ru/game/160/`i'/protocol/")
    // Round-trip through a text file so that the page is split into rows
    export delimited using myfile2.txt, replace
    import delimited "myfile2.txt", clear
    keep v1
    // Attendance: the digits on the line that contains "Spectators"
    gen attendance = real(ustrregexra(v1, "[^\d]", "")) if regexm(v1, "Spectators")
    // Venue and time are taken from lines at fixed offsets above the attendance line
    gen venue1= v1[_n-4] if !missing(attendance)
    gen venue2= v1[_n-2] if !missing(attendance)
    gen venue= venue1 + venue2
    gen time= v1[_n-15] if !missing(attendance)
    gen Game= subinstr(v1,"â", "", 1) if ustrregexm(v1, "Game â")
    gen teams= subinstr(v1, "<title>Game summary:","", 1) if ustrregexm(v1, "<title>Game summary:")
    // Keep the first nonmissing value of each field: one row per game
    collapse (firstnm) attendance venue time Game teams
    // Strip the remaining HTML and keep only the wanted pieces
    replace venue= trim(itrim(ustrregexra(venue, "(.*)</p>(.*)</p>$", "$1 $2")))
    replace time= trim(ustrregexra(time, ".*(\d{2}:\d{2}).*$", "$1"))
    gen game= real(ustrregexra(ustrregexra(trim(itrim(Game)),".*Game (.*) </h2>", "$1"), "[^\d]", ""))
    gen home= trim(ustrregexra(teams, "(^.*)[-].*", "$1"))
    gen away= trim(ustrregexra(teams, "(^.*)[-](.*)[:].*", "$2"))
    drop Game teams
    save `b', replace
    frame myresults: append using `b'
}

frame change myresults
drop in 1
rename (time venue attendance) (Time Venue Attendance)
gen season="08/09"
gen which="Regular season"
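For anyone adapting this to other pages, it may help to test the regular expressions on a single line before running the full loop. Below is a minimal sketch using the same attendance and time patterns as above. The two input strings are hypothetical, made up for illustration, and the real page markup may differ.

```stata
* Hypothetical input lines, made up for illustration only

* Attendance: delete every non-digit character, then convert to a number
display real(ustrregexra("<p>Spectators: 5 500</p>", "[^\d]", ""))
* expected: 5500

* Time: keep only the hh:mm part of the line
display ustrregexra("<p>07.09.2008 17:00</p>", ".*(\d{2}:\d{2}).*$", "$1")
* expected: 17:00
```

Note that the attendance pattern keeps every digit on the line, so it relies on the "Spectators" line containing no other numbers.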