  • Extracting data from a website using -fileread()-

    Mike Lacy requested that I post an example of how I used his suggestion in https://www.statalist.org/forums/for.../general/12850 to extract data from different webpages. We are interested in the effect of sociopolitical events on attendance in the Kontinental Hockey League (KHL). The data are available from https://en.khl.ru/calendar/202/00/ and are for the seasons 08/09 - 22/23, representing over 11,000 observations.
    [Image: main.png]

    The issue is that the attendance data are embedded within links (click where the scores are displayed), where each link represents a game.
    [Image: link1.png]

    Therefore, manual extraction would entail opening over 11,000 links and copying the wanted information. Instead, we can extract these data automatically using Stata as below (for the 08/09 season). First, I read the data into Stata and then export them as a text file. I then parse the text file to retrieve the information that I want. It helps that the links are standardized to a large degree, and the same code works for all.

    Code:
    * Frame that accumulates one row per game (row 1 is a placeholder, dropped later)
    cap frame drop myresults
    frame create myresults
    frame myresults{
        set obs 1
        gen game=.
        gen time=""
        gen venue=""
        gen attendance=.
    }
    tempfile b
    * Loop over the game pages (this range is used for the 08/09 season)
    forval i= 21650/22321{
        clear
        set obs 1
        * Read the page source into a single string
        gen s = fileread("https://en.khl.ru/game/160/`i'/protocol/")
        * Write it out and re-import it so that the source is split across rows
        export delimited using myfile2.txt, replace
        import delimited "myfile2.txt", clear
        keep v1
        * Locate the fields of interest relative to the "Spectators" line
        gen attendance = real(ustrregexra(v1, "[^\d]", "")) if regexm(v1, "Spectators")
        gen venue1= v1[_n-4] if !missing(attendance)
        gen venue2= v1[_n-2] if !missing(attendance)
        gen venue= venue1 + venue2
        gen time= v1[_n-15] if !missing(attendance)
        gen Game= subinstr(v1,"â", "", 1)  if ustrregexm(v1, "Game â")
        gen teams= subinstr(v1, "<title>Game summary:","", 1) if ustrregexm(v1, "<title>Game summary:")
        * Reduce to one row per game, then tidy each field
        collapse (firstnm) attendance venue time Game teams
        replace venue= trim(itrim(ustrregexra(venue, "(.*)</p>(.*)</p>$", "$1 $2")))
        replace time= trim(ustrregexra(time, ".*(\d{2}:\d{2}).*$", "$1"))
        gen game= real(ustrregexra(ustrregexra(trim(itrim(Game)),".*Game (.*) </h2>", "$1"), "[^\d]", ""))
        gen home= trim(ustrregexra(teams, "(^.*)[-].*", "$1"))
        gen away= trim(ustrregexra(teams, "(^.*)[-](.*)[:].*", "$2"))
        drop Game teams
        * Add this game to the results frame
        save `b', replace
        frame myresults: append using `b'
    }
    
    frame change myresults
    * Drop the placeholder row created above
    drop in 1
    rename (time venue attendance) (Time Venue Attendance)
    gen season="08/09"
    gen which="Regular season"
    The results follow in #2.
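
To see what the two main parsing steps in the loop do, here is a minimal sketch on made-up strings (the sample text is hypothetical, not copied from the KHL pages). Stripping non-digits recovers the attendance, and a capture group pulls out the HH:MM time. The time step uses ustrregexm() with ustrregexs(1) rather than the ustrregexra() replacement used in the loop, only to keep the illustration short.

```stata
* Hypothetical sample strings, for illustration only
local att  "Spectators: 8 400"
local when "01.09.2008, 17:00 MSK"

* Attendance: delete every non-digit character, then convert to a number
display real(ustrregexra("`att'", "[^\d]", ""))

* Time: match HH:MM and read the captured group back with ustrregexs(1)
if ustrregexm("`when'", "([0-9][0-9]:[0-9][0-9])") display ustrregexs(1)
```

Run as is, this should display 8400 and then 17:00.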
    Last edited by Andrew Musau; 26 Dec 2022, 10:54.

  • #2
    Sample results:

    Code:
    . l, sep(0)
    
         +----------------------------------------------------------------------------------------------------------------------------+
         | game    Time                                  Venue   Attend~e             home             away   season            which |
         |----------------------------------------------------------------------------------------------------------------------------|
      1. |    1   17:00                          Ufa-Arena Ufa       8400   Salavat Yulaev        Lokomotiv    08/09   Regular season |
      2. |    2   12:00              Platinum Arena Khabarovsk       7100             Amur         Dinamo R    08/09   Regular season |
      3. |    3   15:00   Metallurgs Sport Palace Novokuznetsk       5900     Metallurg Nk        Dinamo Mn    08/09   Regular season |
      4. |    4   19:30                 Sokolniki Arena Moscow       3200          Spartak           Atlant    08/09   Regular season |
      5. |    5   19:00                   Vityaz Arena Chekhov       2500           Vityaz           HC MVD    08/09   Regular season |
      6. |    6   19:00          Nagorny Arena Nizhny Novgorod       5500          Torpedo             CSKA    08/09   Regular season |
      7. |    7   12:00              Platinum Arena Khabarovsk       7100             Amur         Dinamo R    08/09   Regular season |
      8. |    8   15:00   Metallurgs Sport Palace Novokuznetsk       3600     Metallurg Nk        Dinamo Mn    08/09   Regular season |
      9. |    9   16:00                Sibir Arena Novosibirsk       7500            Sibir         Dynamo M    08/09   Regular season |
     10. |   10   17:00           Arena Metallurg Magnitogorsk       7052     Metallurg Mg              SKA    08/09   Regular season |
     11. |   11   16:30               Yunost Arena Chelyabinsk       3600          Traktor        Severstal    08/09   Regular season |
     12. |   12   19:30                    Tatneft Arena Kazan       7200          Ak Bars         Avangard    08/09   Regular season |
     13. |   13   19:00          Neftekhimik Arena Nizhnekamsk       5500      Neftekhimik            Barys    08/09   Regular season |
     14. |   14   17:30                 Volgar Arena Togliatti       2900             Lada           Khimik    08/09   Regular season |
     15. |   15   19:00              Mytishchi Arena Mytishchi       3500           Atlant           Vityaz    08/09   Regular season |
     16. |   16   16:30               Yunost Arena Chelyabinsk       3400          Traktor        Lokomotiv    08/09   Regular season |
     17. |   17   19:00          Nagorny Arena Nizhny Novgorod       5500          Torpedo          Spartak    08/09   Regular season |
     18. |   18   19:30            Arena Balashikha Balashikha       5700           HC MVD             CSKA    08/09   Regular season |
     19. |   19   12:00              Platinum Arena Khabarovsk       7100             Amur         Dynamo M    08/09   Regular season |
     20. |   20   15:00   Metallurgs Sport Palace Novokuznetsk       4500     Metallurg Nk         Dinamo R    08/09   Regular season |
     21. |   21   16:00                Sibir Arena Novosibirsk       7000            Sibir        Dinamo Mn    08/09   Regular season |
     22. |   22   17:00           Arena Metallurg Magnitogorsk       1783     Metallurg Mg        Severstal    08/09   Regular season |
     23. |   23   17:00                          Ufa-Arena Ufa       8025   Salavat Yulaev              SKA    08/09   Regular season |
     24. |   24   19:00                    Tatneft Arena Kazan       4250          Ak Bars           Khimik    08/09   Regular season |
     25. |   25   19:00          Neftekhimik Arena Nizhnekamsk       4500      Neftekhimik         Avangard    08/09   Regular season |
     26. |   26   17:30                 Volgar Arena Togliatti       2900             Lada            Barys    08/09   Regular season |
     27. |   27   15:00           Arena Metallurg Magnitogorsk       1848     Metallurg Mg        Lokomotiv    08/09   Regular season |
     28. |   28   17:00          Nagorny Arena Nizhny Novgorod       5500          Torpedo           Vityaz    08/09   Regular season |
     29. |   29   17:00            Arena Balashikha Balashikha       5500           HC MVD          Spartak    08/09   Regular season |
     30. |   30   10:00              Platinum Arena Khabarovsk       7100             Amur        Dinamo Mn    08/09   Regular season |
     31. |   31   13:00   Metallurgs Sport Palace Novokuznetsk       4800     Metallurg Nk         Dynamo M    08/09   Regular season |
     32. |   32   14:00                Sibir Arena Novosibirsk       6000            Sibir         Dinamo R    08/09   Regular season |
     33. |   33   15:00               Yunost Arena Chelyabinsk       3200          Traktor              SKA    08/09   Regular season |
     34. |   34   15:00                          Ufa-Arena Ufa       6600   Salavat Yulaev        Severstal    08/09   Regular season |
     35. |   35   17:00                    Tatneft Arena Kazan       4600          Ak Bars            Barys    08/09   Regular season |
     36. |   36   17:00          Neftekhimik Arena Nizhnekamsk       3000      Neftekhimik           Khimik    08/09   Regular season |
     37. |   37   16:00                 Volgar Arena Togliatti       2900             Lada         Avangard    08/09   Regular season |
     38. |   38   17:00                 CSKA Ice Palace Moscow       4500             CSKA           Atlant    08/09   Regular season |
     39. |   39   16:00                Sibir Arena Novosibirsk       5000            Sibir         Dinamo R    08/09   Regular season |
     40. |   40   19:00              Mytishchi Arena Mytishchi       6300           Atlant          Spartak    08/09   Regular season |
     41. |   41   16:00                        Arena Omsk Omsk      10000         Avangard     Metallurg Mg    08/09   Regular season |
     42. |   42   19:00                    Tatneft Arena Kazan       6700          Ak Bars         Dynamo M    08/09   Regular season |
     43. |   43   19:00          Neftekhimik Arena Nizhnekamsk       3000      Neftekhimik             Lada    08/09   Regular season |
     44. |   44   17:00                Kazakhstan Arena Astana       4200            Barys          Traktor    08/09   Regular season |
     45. |   45   19:00            Ice Palace Saint Petersburg       5800              SKA             Amur    08/09   Regular season |
     46. |   46   19:00                 Ice Palace Cherepovets       4500        Severstal     Metallurg Nk    08/09   Regular season |
     47. |   47   20:30                        Arena Riga Riga      10232         Dinamo R           HC MVD    08/09   Regular season |
     48. |   48   19:00                   Vityaz Arena Chekhov       3300           Vityaz          Torpedo    08/09   Regular season |
     49. |   49   16:00                        Arena Omsk Omsk      10200         Avangard   Salavat Yulaev    08/09   Regular season |
     50. |   50   17:00                Kazakhstan Arena Astana       3900            Barys     Metallurg Mg    08/09   Regular season |
     51. |   51   19:00         Arena-2000-Lokomotiv Yaroslavl       9046        Lokomotiv            Sibir    08/09   Regular season |
     52. |   52   19:30                    Sports Palace Minsk       2700        Dinamo Mn           Atlant    08/09   Regular season |
     53. |   53   19:30                 Sokolniki Arena Moscow       5300          Spartak             CSKA    08/09   Regular season |
     54. |   54   17:00          Podmoskovye Arena Voskresensk       4300           Khimik          Traktor    08/09   Regular season |
     55. |   55   17:00            Ice Palace Saint Petersburg       5000              SKA           HC MVD    08/09   Regular season |
     56. |   56   17:00                 Ice Palace Cherepovets       2500        Severstal             Amur    08/09   Regular season |
     57. |   57   17:00         Arena-2000-Lokomotiv Yaroslavl       9046        Lokomotiv     Metallurg Nk    08/09   Regular season |
     58. |   58   18:00                    Sports Palace Minsk       2000        Dinamo Mn          Torpedo    08/09   Regular season |
     59. |   59   17:00                   Vityaz Arena Chekhov       3000           Vityaz         Dynamo M    08/09   Regular season |
     60. |   60   15:00                Kazakhstan Arena Astana       4500            Barys   Salavat Yulaev    08/09   Regular season |
     61. |   61   17:00          Podmoskovye Arena Voskresensk       4200           Khimik     Metallurg Mg    08/09   Regular season |
     62. |   62   17:00                 Ice Palace Cherepovets       2300        Severstal            Sibir    08/09   Regular season |
     63. |   63   18:00                        Arena Riga Riga       7502         Dinamo R           Atlant    08/09   Regular season |
     64. |   64   17:00                   Vityaz Arena Chekhov       3300           Vityaz      Neftekhimik    08/09   Regular season |
     65. |   65   17:30                 Volgar Arena Togliatti       2900             Lada          Ak Bars    08/09   Regular season |
     66. |   66   16:00                        Arena Omsk Omsk       7500         Avangard          Traktor    08/09   Regular season |
     67. |   67   19:00            Ice Palace Saint Petersburg       3400              SKA     Metallurg Nk    08/09   Regular season |
     68. |   68   19:00         Arena-2000-Lokomotiv Yaroslavl       9046        Lokomotiv             Amur    08/09   Regular season |
     69. |   69   20:30                        Arena Riga Riga       4936         Dinamo R          Torpedo    08/09   Regular season |
     70. |   70   19:30                    Sports Palace Minsk       2800        Dinamo Mn          Spartak    08/09   Regular season |
     71. |   71   19:30                  Luzhniki Arena Moscow       4500         Dynamo M           HC MVD    08/09   Regular season |
     72. |   72   17:00                Kazakhstan Arena Astana       3500            Barys      Neftekhimik    08/09   Regular season |
     73. |   73   18:30          Podmoskovye Arena Voskresensk       4100           Khimik   Salavat Yulaev    08/09   Regular season |
     74. |   74   19:00            Ice Palace Saint Petersburg       2700              SKA            Sibir    08/09   Regular season |
     75. |   75   19:00                   Vityaz Arena Chekhov       3000           Vityaz             CSKA    08/09   Regular season |
     76. |   76   17:00                Kazakhstan Arena Astana       5200            Barys          Ak Bars    08/09   Regular season |
     77. |   77   19:30                    Sports Palace Minsk       1500        Dinamo Mn           HC MVD    08/09   Regular season |
     78. |   78   19:30                  Luzhniki Arena Moscow       3900         Dynamo M          Torpedo    08/09   Regular season |
     79. |   79   15:00   Metallurgs Sport Palace Novokuznetsk       6800     Metallurg Nk   Salavat Yulaev    08/09   Regular season |
     80. |   80   17:00           Arena Metallurg Magnitogorsk       2150     Metallurg Mg      Neftekhimik    08/09   Regular season |
     81. |   81   19:30                 Sokolniki Arena Moscow       1000          Spartak           Vityaz    08/09   Regular season |
     82. |   82   19:00              Mytishchi Arena Mytishchi       6800           Atlant         Dynamo M    08/09   Regular season |
     83. |   83   17:00           Arena Metallurg Magnitogorsk       5098     Metallurg Mg          Ak Bars    08/09   Regular season |
     84. |   84   16:00                        Arena Omsk Omsk       8501         Avangard             Lada    08/09   Regular season |
     85. |   85   19:45                 CSKA Ice Palace Moscow       3600             CSKA              SKA    08/09   Regular season |
     86. |   86   15:00               Yunost Arena Chelyabinsk       3400          Traktor      Neftekhimik    08/09   Regular season |
     87. |   87   17:00          Nagorny Arena Nizhny Novgorod       5150          Torpedo              SKA    08/09   Regular season |
     88. |   88   17:00            Arena Balashikha Balashikha       3750           HC MVD        Severstal    08/09   Regular season |
     89. |   89   17:00              Mytishchi Arena Mytishchi       6600           Atlant        Lokomotiv    08/09   Regular season |
     90. |   90   10:00              Platinum Arena Khabarovsk       7100             Amur            Barys    08/09   Regular season |
     91. |   91   13:00   Metallurgs Sport Palace Novokuznetsk       6100     Metallurg Nk           Khimik    08/09   Regular season |
     92. |   92   14:00                Sibir Arena Novosibirsk       7500            Sibir         Avangard    08/09   Regular season |
     93. |   93   15:00           Arena Metallurg Magnitogorsk       3012     Metallurg Mg             Lada    08/09   Regular season |
     94. |   94   15:00               Yunost Arena Chelyabinsk       3500          Traktor          Ak Bars    08/09   Regular season |
     95. |   95   17:00                          Ufa-Arena Ufa       7200   Salavat Yulaev      Neftekhimik    08/09   Regular season |
     96. |   96   19:45                 CSKA Ice Palace Moscow       2500             CSKA         Dinamo R    08/09   Regular season |
     97. |   97   19:30                 Sokolniki Arena Moscow       1100          Spartak        Dinamo Mn    08/09   Regular season |
     98. |   98   19:00          Nagorny Arena Nizhny Novgorod       5200          Torpedo        Severstal    08/09   Regular season |
     99. |   99   19:30            Arena Balashikha Balashikha       3500           HC MVD        Lokomotiv    08/09   Regular season |
    100. |  100   19:30              Mytishchi Arena Mytishchi       5000           Atlant              SKA    08/09   Regular season |
    
    600. |  600   17:00           Arena Metallurg Magnitogorsk       5289     Metallurg Mg           Khimik    08/09   Regular season |
    601. |  601   17:30              Traktor Arena Chelyabinsk       7500          Traktor            Barys    08/09   Regular season |
    602. |  602   19:00          Nagorny Arena Nizhny Novgorod       5600          Torpedo           Vityaz    08/09   Regular season |
    603. |  603   19:00            Arena Balashikha Balashikha       5050           HC MVD         Dynamo M    08/09   Regular season |
    604. |  604   19:00              Mytishchi Arena Mytishchi       5900           Atlant        Dinamo Mn    08/09   Regular season |
    605. |  605   17:00                          Ufa-Arena Ufa       8400   Salavat Yulaev         Avangard    08/09   Regular season |
    606. |  606   19:00                    Tatneft Arena Kazan       4600          Ak Bars      Neftekhimik    08/09   Regular season |
    607. |  607   19:45                 CSKA Ice Palace Moscow       5600             CSKA          Spartak    08/09   Regular season |
    608. |  608   17:00              Mytishchi Arena Mytishchi       6000           Atlant         Dinamo R    08/09   Regular season |
    609. |  609   10:00              Platinum Arena Khabarovsk       7100             Amur        Severstal    08/09   Regular season |
    610. |  610   13:00   Metallurgs Sport Palace Novokuznetsk       2800     Metallurg Nk        Lokomotiv    08/09   Regular season |
    611. |  611   14:00                Sibir Arena Novosibirsk       5000            Sibir              SKA    08/09   Regular season |
    612. |  612   15:00           Arena Metallurg Magnitogorsk       7215     Metallurg Mg            Barys    08/09   Regular season |
    613. |  613   15:00                          Ufa-Arena Ufa       7560   Salavat Yulaev           Khimik    08/09   Regular season |
    614. |  614   17:00          Nagorny Arena Nizhny Novgorod       5600          Torpedo         Dynamo M    08/09   Regular season |
    615. |  615   17:00            Arena Balashikha Balashikha       4400           HC MVD        Dinamo Mn    08/09   Regular season |
    616. |  616   15:00              Traktor Arena Chelyabinsk       7500          Traktor         Avangard    08/09   Regular season |
    617. |  617   17:00                    Tatneft Arena Kazan       5810          Ak Bars             Lada    08/09   Regular season |
    618. |  618   17:00          Neftekhimik Arena Nizhnekamsk       4500      Neftekhimik           Vityaz    08/09   Regular season |
    619. |  619   19:00          Nagorny Arena Nizhny Novgorod       5600          Torpedo         Dinamo R    08/09   Regular season |
    620. |  620   19:00              Mytishchi Arena Mytishchi       7000           Atlant             CSKA    08/09   Regular season |
    621. |  621   12:00              Platinum Arena Khabarovsk       7100             Amur              SKA    08/09   Regular season |
    622. |  622   15:00   Metallurgs Sport Palace Novokuznetsk       2600     Metallurg Nk        Severstal    08/09   Regular season |
    623. |  623   16:00                Sibir Arena Novosibirsk       4500            Sibir        Lokomotiv    08/09   Regular season |
    624. |  624   17:30              Traktor Arena Chelyabinsk       7500          Traktor           Khimik    08/09   Regular season |
    625. |  625   17:00                          Ufa-Arena Ufa       8400   Salavat Yulaev            Barys    08/09   Regular season |
    626. |  626   17:00           Arena Metallurg Magnitogorsk       5810     Metallurg Mg         Avangard    08/09   Regular season |
    627. |  627   19:30                  Luzhniki Arena Moscow       6500         Dynamo M          Ak Bars    08/09   Regular season |
    628. |  628   17:00           Arena Metallurg Magnitogorsk       7704     Metallurg Mg           HC MVD    08/09   Regular season |
    629. |  629   19:30                  Luzhniki Arena Moscow       3000         Dynamo M      Neftekhimik    08/09   Regular season |
    630. |  630   19:30                 Sokolniki Arena Moscow       3500          Spartak     Metallurg Nk    08/09   Regular season |
    631. |  631   19:00                   Vityaz Arena Chekhov       1800           Vityaz          Torpedo    08/09   Regular season |
    632. |  632   19:00              Mytishchi Arena Mytishchi       7200           Atlant   Salavat Yulaev    08/09   Regular season |
    633. |  633   16:00                        Arena Omsk Omsk       9500         Avangard             Lada    08/09   Regular season |
    634. |  634   16:00                Kazakhstan Arena Astana       5000            Barys          Ak Bars    08/09   Regular season |
    635. |  635   20:30                        Arena Riga Riga       8200         Dinamo R            Sibir    08/09   Regular season |
    636. |  636   19:45                 CSKA Ice Palace Moscow       3200             CSKA          Traktor    08/09   Regular season |
    637. |  637   17:00            Arena Balashikha Balashikha       4700           HC MVD           Vityaz    08/09   Regular season |
    638. |  638   14:00                        Arena Omsk Omsk       9300         Avangard      Neftekhimik    08/09   Regular season |
    639. |  639   18:00                        Arena Riga Riga       8100         Dinamo R            Sibir    08/09   Regular season |
    640. |  640   18:00                    Sports Palace Minsk       3000        Dinamo Mn             Amur    08/09   Regular season |
    641. |  641   17:00                  Luzhniki Arena Moscow       2000         Dynamo M     Metallurg Nk    08/09   Regular season |
    642. |  642   17:00                 Sokolniki Arena Moscow       4500          Spartak          Torpedo    08/09   Regular season |
    643. |  643   14:00                Kazakhstan Arena Astana       5000            Barys             Lada    08/09   Regular season |
    644. |  644   17:00          Podmoskovye Arena Voskresensk       4200           Khimik          Ak Bars    08/09   Regular season |
    645. |  645   17:00     Yubileiny Complex Saint Petersburg       6500              SKA     Metallurg Mg    08/09   Regular season |
    646. |  646   17:00                 Ice Palace Cherepovets       3200        Severstal          Traktor    08/09   Regular season |
    647. |  647   17:00         Arena-2000-Lokomotiv Yaroslavl       9046        Lokomotiv   Salavat Yulaev    08/09   Regular season |
    648. |  648   16:00                Kazakhstan Arena Astana       5000            Barys      Neftekhimik    08/09   Regular season |
    649. |  649   20:30                        Arena Riga Riga       6100         Dinamo R     Metallurg Nk    08/09   Regular season |
    650. |  650   20:00                    Sports Palace Minsk       2600        Dinamo Mn            Sibir    08/09   Regular season |
    651. |  651   17:00                  Luzhniki Arena Moscow       2800         Dynamo M             Amur    08/09   Regular season |
    652. |  652   13:00                 CSKA Ice Palace Moscow       3500             CSKA          Torpedo    08/09   Regular season |
    653. |  653   17:00                   Vityaz Arena Chekhov       2800           Vityaz           Atlant    08/09   Regular season |
    654. |  654   16:00                        Arena Omsk Omsk      10200         Avangard          Ak Bars    08/09   Regular season |
    655. |  655   18:30          Podmoskovye Arena Voskresensk       2100           Khimik             Lada    08/09   Regular season |
    656. |  656   19:00            Ice Palace Saint Petersburg       5500              SKA          Traktor    08/09   Regular season |
    657. |  657   19:00                 Ice Palace Cherepovets       2800        Severstal   Salavat Yulaev    08/09   Regular season |
    658. |  658   19:00         Arena-2000-Lokomotiv Yaroslavl       9046        Lokomotiv     Metallurg Mg    08/09   Regular season |
    659. |  659   19:30                 Sokolniki Arena Moscow       3000          Spartak           HC MVD    08/09   Regular season |
    660. |  660   16:00                Kazakhstan Arena Astana       5000            Barys          Torpedo    08/09   Regular season |
    661. |  661   18:30          Podmoskovye Arena Voskresensk       1600           Khimik      Neftekhimik    08/09   Regular season |
    662. |  662   20:30                        Arena Riga Riga       6300         Dinamo R             Amur    08/09   Regular season |
    663. |  663   20:00                    Sports Palace Minsk       2600        Dinamo Mn     Metallurg Nk    08/09   Regular season |
    664. |  664   19:30                  Luzhniki Arena Moscow       2600         Dynamo M            Sibir    08/09   Regular season |
    665. |  665   19:00              Mytishchi Arena Mytishchi       7100           Atlant          Spartak    08/09   Regular season |
    666. |  666   16:00                Kazakhstan Arena Astana       5000            Barys           Vityaz    08/09   Regular season |
    667. |  667   19:00            Ice Palace Saint Petersburg      10500              SKA   Salavat Yulaev    08/09   Regular season |
    668. |  668   19:00                 Ice Palace Cherepovets       2800        Severstal     Metallurg Mg    08/09   Regular season |
    669. |  669   19:00         Arena-2000-Lokomotiv Yaroslavl       8800        Lokomotiv          Traktor    08/09   Regular season |
    670. |  670   20:30                        Arena Riga Riga       6500         Dinamo R             Amur    08/09   Regular season |
    671. |  671   20:00                    Sports Palace Minsk       2700        Dinamo Mn     Metallurg Nk    08/09   Regular season |
    672. |  672   19:45                 CSKA Ice Palace Moscow       1200             CSKA           HC MVD    08/09   Regular season |
         +----------------------------------------------------------------------------------------------------------------------------+
    
    .


    • #3
      Andrew Musau, many thanks.

      For those interested in learning a tool as powerful as regular expressions, I recommend Asjad Naqvi's post

      https://medium.com/the-stata-guide/regular-expressions-regex-in-stata-6e5c200ef27c

      and his cheatsheet

      github.com/asjadnaqvi/The-Stata-Guide/blob/master/Stata_regex_cheatsheet_v1.pdf


      However, regex work always involves a lot of testing, and generating new variables each time is a bit annoying, so I tend to try things out with the display command before using generate, like this:

      Code:
      . di "`=ustrregexs(0) if ustrregexm("My email address is [email protected]", "\b[a-zA-Z]+[_|\-|\.]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+\b")'"
      
      
      . di "`=ustrregexrf("My email address is [email protected]", "[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]"," >>> replaces <<<")'"
      My email address is other- >>> replaces <<<om
      
      . di "`=ustrregexs(0) if ustrregexm("My email address is [email protected]", "\b[a-zA-Z]+[_|\-|\.]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+\b")'"
      [email protected]
      My question is: why does the third display command (note that it is identical to the first) only work after running the second? And if you restart Stata, the behaviour occurs again.

      Any clue?


      • #4
        #3 see
        Code:
        help ifcmd
        Code:
        if ustrregexm("My email address is [email protected]", "\b[a-zA-Z]+[_|\-|\.]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+\b") display ustrregexs(0)
        or
        Code:
        local str My email address is [email protected]
        local re \b[a-zA-Z]+[_|\-|\.]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+\b
        
        if ( ustrregexm(`"`str'"', `"`re'"') )  {
            
            di ustrregexs(0)  // subexpression from a previous ustrregexm() match
        }
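
A side note on the pattern used in #3 and #4: `[com|net]+` is a character class, so it matches any run of the characters c, o, m, |, n, e and t rather than the literal strings "com" or "net". Grouping with parentheses gives alternation. A minimal sketch, using a made-up address (user@example.ten is purely illustrative and is not the address from #3):

```stata
* Character class: ".ten" is accepted because t, e and n are all in the class
display ustrregexm("user@example.ten", "@[a-zA-Z]+\.[com|net]+")

* Alternation: only ".com" or ".net" are accepted
display ustrregexm("user@example.ten", "@[a-zA-Z]+\.(com|net)")
display ustrregexm("user@example.net", "@[a-zA-Z]+\.(com|net)")
```

The three commands should display 1, 0 and 1 respectively.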


        • #5
          What incredible coding prowess. I wanna ensure I'm understanding what "fileread" does. I tested it very quickly with
          Code:
          set obs 1
          gen s = fileread("https://en.khl.ru/game/160/21650/protocol/")
          export delimited using myfile3.txt, replace
          import delim myfile3.txt, clear
          So........ is it just taking the raw binary output of.... an html file(???) and reading it into Stata? I wonder, where else might this have ready applications to Stata.
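
On what fileread() returns: it gives back the contents of a file (or, as in #1, of a URL) as one string, so `s` holds the page's HTML source as text. A quick way to inspect it is sketched below. The check on the "fileread() error" prefix assumes that fileread() reports a failed read by returning a string beginning with that text, which is worth confirming in `help fileread`.

```stata
clear
set obs 1
gen strL s = fileread("https://en.khl.ru/game/160/21650/protocol/")

* Assumption: a failed read comes back as a string starting with "fileread() error"
assert substr(s, 1, 16) != "fileread() error"

display strlen(s[1])           // length of the page source in bytes
display substr(s[1], 1, 200)   // first 200 bytes of the HTML
```

This needs a working internet connection; the output depends on the page at the time it is read.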

          For example, might it be possible for me to scrape the first table of this page into Stata? Normally I outsource the more complicated webscraping tasks to Python, but I'm a big proponent of doing anything in Stata that I can do in Stata. I wonder how flexible this is, and what other uses it might have. I wonder why this method isn't talked about or taught more, and why there aren't SJ articles about it (not that I've looked everywhere for one!). Perhaps Mike Lacy or Andrew Musau could elaborate further?


          EDIT: I'll post a followup to this later and link it to this thread. But this method is super cool, and could have PLENTY of real application for people who need data from interesting places.
          Last edited by Jared Greathouse; 26 Dec 2022, 16:15.


          • #6
            Originally posted by Jared Greathouse
            For example, might it be possible for me to scrape the first table of this page into Stata?
            There appears to be a very definite structure to the table and your ensuing post gets to it. I do not see much difficulty extracting the contents of the table.


            • #7
              Thanks for sharing Andrew! This post looks very helpful for anyone looking to get started web scraping in Stata.

              Jared, I notice the table linked in #5 is a straightforward HTML <table> with some custom Wikipedia CSS and a sorting function. When data are pre-arranged in a table like this, it is easy enough to import the HTML table with Excel. As I understand it, the OP solves two related problems that are somewhat more difficult than importing an HTML <table> element. First, the data come from many different URLs, and second, the data are not already organized into an HTML table; instead, they are stored in various <div> elements.


              • #8
                Jared, I notice the table linked in #5 is a straightforward HTML <table> with some custom Wikipedia CSS and a sorting function. When data is pre-arranged in a table like this, it is easy enough to import the html table with excel.
                I'm sorry could you please clarify? I had no idea! Is there a straightforward way to do this in Stata? Did you mean it would be easy "with Excel", or something different? As you can imagine, I usually use Python for this kind of thing, but hey, if it's possible in native Stata, I'd love to learn the code for it!



                • #9
                  I'm sorry, could you please clarify? I had no idea!
                  Yep. I usually just stumble my way through the Excel interface, but this video provides a pretty good overview.

                  Is there a straightforward way to do this in Stata?
                  I don't know. I wouldn't be surprised, but I've never tried. If not, one could also write something like this for Stata over the course of a few days. I think the hard part in Stata would be figuring out what to do if the web page contains multiple tables. Maybe an argument that gives the number of the table of interest? Regardless, things become much harder to generalize when data isn't neatly stored in HTML tables.
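
                  A minimal sketch of the "number of the table" idea, assuming the page is already held in a single strL observation (as fileread() would return it) and that tables are not nested. The local n and the variables start, found, stop and tbl are hypothetical names used only for illustration, and the inline HTML is a stand-in for a real page.

                  ```stata
                  clear
                  set obs 1
                  // Inline stand-in for a page; in practice s could come from fileread()
                  gen strL s = "<p>intro</p><table><tr><td>A</td></tr></table><p>x</p><table><tr><td>B</td></tr></table>"

                  local n 2                      // hypothetical argument: which table to extract
                  gen long start = 0
                  gen byte found = 1
                  forvalues k = 1/`n' {
                      // look for the next "<table" after the previous hit
                      replace start = ustrpos(s, "<table", start + 1) if found
                      replace found = 0 if start == 0
                  }
                  // the first "</table>" at or after the opening tag ends the table (no nesting assumed)
                  gen long stop = ustrpos(s, "</table>", start) if found
                  // 8 is the length of "</table>", so the closing tag is included
                  gen strL tbl = usubstr(s, start, stop - start + 8) if found & stop > 0
                  list tbl
                  ```

                  If the page has fewer than n tables, found ends up 0 and tbl stays empty, which a wrapper command could turn into an error message.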

                  As you can imagine, I usually use Python for this kind of thing
                  I also prefer to do web scraping in Python, although I think Andrew has made it clear that it's not all that much more difficult in Stata. Ultimately, what you really need are regular expressions.
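
                  As a small illustration of the regular-expression point, here is a sketch using Stata's Unicode regex functions on two made-up lines of page text. The sample strings and the pattern are assumptions for illustration, not the actual markup of any page discussed in this thread.

                  ```stata
                  clear
                  set obs 2
                  // Hypothetical lines of page text
                  gen v1 = "<b>Spectators:</b> 5500" in 1
                  replace v1 = "<b>Venue:</b> Example Arena" in 2

                  // ustrregexm() tests the pattern; ustrregexs(1) returns the first captured group
                  gen attendance = real(ustrregexs(1)) if ustrregexm(v1, "Spectators\D*(\d+)")
                  list
                  ```

                  Rows that do not match the pattern are left with a missing value, so the same line can be run over every row of an imported page.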



                  • #10
                    Code:
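                    clear *
                    set obs 1
                    
                    // Read the whole page into a single observation
                    gen s = fileread("https://en.wikipedia.org/wiki/Tourism_in_China")
                    
                    // Round-trip through a text file so the HTML is split into rows and columns
                    export delimited using myfile3.txt, replace
                    import delim myfile3.txt, clear
                    keep v1-v4
                    
                    // Flag the rows that contain a flag icon (one per country)
                    g flag = 1 if strpos(v1, "flagicon")
                    
                    // Hard-coded row range of the table; specific to the page as retrieved
                    keep in 133/283
                    
                    /*
                    local slash strpos(v1, "_")
                    gen wanted = trim(cond(`slash', substr(v1, 1, `slash' -1), v1))
                    */
                    
                    // Continue the count on the rows following each flagged row (up to 7), drop the rest
                    replace flag= flag[_n-1]+1 if missing(flag) & !missing(flag[_n-1]) & flag[_n-1]<=6
                    drop if flag == .
                    
                    // Cut v1 just before ".svg", then keep what follows "/Flag_of" as the country name
                    replace v1 = substr(v1, 1, strpos(v1, ".svg") - 1)
                    split v1, parse("/Flag_of")
                    drop v11
                    replace v12 = subinstr(v12, "_", " ",4)
                    
                    // Country id: 17 blocks of 7 rows each
                    egen id = seq(), f(1) t(17) b(7)
                    
                    // Copy the country name to every row of its block
                    qui forv i = 1/17 {
                        levelsof v12 if id ==`i', loc(state)
                        replace v12 = `state' if id == `i'
                    }
                    
                    // Tidy names that carry extra text
                    replace v12 = "Australia" if strpos(v12, "Australia")
                    replace v12 = "Canada" if strpos(v12, "Canada")
                    replace v1=v12
                    drop v12 flag
                    drop if v3 == ""
                    
                    // Reassemble the visitor counts from the pieces left in v2-v4
                    g num = substr(v2,-3,.)
                    drop v2
                    replace num = subinstr(num, ">", "",1)
                    replace  v4 = abbrev(v4,3)
                    replace v4 = subinstr(v4, "~>", "",1)
                    egen visitors = concat(num v3 v4)
                    
                    // Discard any trailing markup after "<"
                    split visitors, p("<")
                    drop visitors visitors2 v3 v4 num
                    rename vis visitors
                    
                    // Years run from 2018 down to 2013 within each country
                    egen year = seq(), f(2018) t(2013)
                    destring visitors, replace
                    
                    // Declare the panel structure and inspect the result
                    xtset id year, y
                    br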
                    Again, string matching is key here (this version relies on strpos(), substr() and subinstr() rather than regexm()), and it is super cool that I got this to work. The approach I took here can be refined a lot, but it remains interesting conceptually in case anyone wants to take it further.
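
                    One possible refinement, sketched on made-up rows rather than the real page: instead of a hard-coded row range, the rows of the first table can be located by their <table> and </table> markers. The variable names tableno and closed are hypothetical, and the sketch assumes each tag sits on its own row after import and that tables are not nested.

                    ```stata
                    clear
                    set obs 7
                    // Hypothetical rows standing in for an imported page
                    gen v1 = ""
                    replace v1 = "<p>intro</p>"        in 1
                    replace v1 = "<table>"             in 2
                    replace v1 = "<tr><td>A</td></tr>" in 3
                    replace v1 = "</table>"            in 4
                    replace v1 = "<p>between</p>"      in 5
                    replace v1 = "<table>"             in 6
                    replace v1 = "</table>"            in 7

                    // Running count of opening tags seen so far
                    gen long tableno = sum(strpos(v1, "<table") > 0)
                    // Running count of closing tags seen on earlier rows (lagged so the closing row is kept)
                    gen long closed = sum(strpos(v1[_n-1], "</table>") > 0)

                    // Keep the first table only, from its opening row through its closing row
                    keep if tableno == 1 & closed == 0
                    list v1
                    ```

                    This only replaces the row selection; the later steps would still depend on how the page is laid out.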



                    • #11
                      Re creative uses of fileread()/filewrite():

                      While Stata has long had facilities for reading files in a very low-level way (see e.g. -help file read-), the introduction of the -fileread()- function and the strL datatype about 10 yr. ago (?) made it a lot easier to use Stata to read and process files in that way. The fact that the whole collection of Stata's string functions accepts strL variables also helps, of course. I think I discovered quite by accident that -fileread()- would accept URLs as "filenames," since while nothing in the documentation precludes doing that, nothing mentions or illustrates it.
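
                      A minimal round-trip sketch of the two functions, using a temporary local file so that it runs without network access. The HTML string and the variable names are made up for illustration. Per the point above, a URL could be passed to fileread() in place of the path.

                      ```stata
                      clear
                      set obs 1
                      tempfile page
                      // Hypothetical content to write out and read back
                      gen strL html = "<html><body><p>Spectators: 5500</p></body></html>"

                      // filewrite() returns the number of bytes written; third argument 1 = overwrite if the file exists
                      gen long nbytes = filewrite("`page'", html, 1)

                      // fileread() returns the file contents as one string
                      gen strL s = fileread("`page'")

                      // filereaderror() is 0 when the read succeeded, otherwise a return code
                      assert filereaderror(s) == 0
                      assert s == html
                      list nbytes
                      ```

                      Checking filereaderror() matters more with URLs, where a failed request would otherwise leave an error message in s instead of the page.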

                      I have long thought of -fileread()-/-filewrite()- as under-appreciated programming tools. For example, I have used them (mostly as a proof of concept) to create a very short and fast one-time-pad file encryption/decryption program.
                      Last edited by Mike Lacy; 27 Dec 2022, 09:42.



                      • #12
                        RE #10 - I was considering trying to write a generic HTML table-reading command, but it looks like someone beat me to the punch. I believe the -htmltab2csv- command is also available on SSC. Looks like the author uses mata's ReadFileIn() function (line 21) instead of the -fileread()- function, but the principle is the same. Also, check out the nesting structure of those conditional statements starting on line 69! I can already spot a few possible simplifications...
