Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using insheetjson on nested JSON data on New York City pedestrian injuries

    Hi All

    I am trying to get the new york city pedestrian injury data into stata:

    http://www.nyc.gov/html/dot/download...ry_yearly.json

    I am unclear how I have to specify the "columns" function of the insheetjson command to make this work.

    A line from the file looks like this:
    {"type":"FeatureCollection","features":[{"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}}

    and using

    insheetjson stringA (etc) using ".../Pedestrian Injuries/fatality_yearly.json", col("features","1","geometry","coordinates","1")

    to extract the latitude yields the error message "invalid expression"

    Help is much appreciated!

    Best

    Fabian

  • #2
    Fabian Eckert
    The first thing I would note is that you are not reading JSON, but a very specific application of JSON called GeoJSON. Depending on your specific needs/requirements, you may be able to use the querying functionality/capabilities in jsonio (https://wbuchanan.github.io/StataJSON).

    Comment


    • #3
      As I have mentioned before on this forum, JSON files are just plain ol' text files and I have yet to come across one that would not yield to targeted extraction of desired data using standard Stata data management tools.

      If you click on the link in #1, you see the content of the file in the browser. The only problem is that there are no line breaks so this is just a long data stream. You can however pick up the pattern quickly, and here are a few resulting lines once you insert a line break just before {"geometry":
      Code:
      {"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}},
      {"geometry": {"type": "Point", "coordinates": [-73.92232152582858, 40.72038062648219]}, "type": "Feature", "id": 1, "properties": {"BikeInjuri": 1, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 0, "Injuries": 1}},
      {"geometry": {"type": "Point", "coordinates": [-73.78568492829518, 40.7277301915846]}, "type": "Feature", "id": 2, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 2, "Injuries": 2}},
      {"geometry": {"type": "Point", "coordinates": [-73.96297460245067, 40.762487075452434]}, "type": "Feature", "id": 3, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 4, "PedInjurie": 1, "Injuries": 5}},
      {"geometry": {"type": "Point", "coordinates": [-73.92987364834694, 40.66399077159336]}, "type": "Feature", "id": 4, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 10, "PedInjurie": 0, "Injuries": 10}},
      {"geometry": {"type": "Point", "coordinates": [-73.94255814386595, 40.593588490323526]}, "type": "Feature", "id": 5, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 1, "Injuries": 2}},
      {"geometry": {"type": "Point", "coordinates": [-74.14864554649671, 40.564597628435045]}, "type": "Feature", "id": 6, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 0, "Injuries": 1}},
      {"geometry": {"type": "Point", "coordinates": [-73.96864226156205, 40.76159995950388]}, "type": "Feature", "id": 7, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 3, "PedInjurie": 1, "Injuries": 4}},
      You can use filefilter to insert the line breaks for the whole JSON file. Then it's just a matter of reading each line as a long string to input the data into Stata. In this case, all the data to extract appear to be numeric so I use moss (from SSC) to extract them all in one pass. Finally, you destring each match and name each variable as desired.

      Code:
      * insert line breaks
      filefilter injury_yearly.json injury_yearly.txt, from({"geometry") to(\r{"geometry") replace
      
      * read each like as a long string, increase the width if needed
      infix str300 line 1-300 using "injury_yearly.txt", clear
      compress
      
      * make sure that we did not truncate any lines
      assert length(line) < 300
      
      * since all the desired data appears to be numbers, just extract those
      moss line, match("([0-9\.]+)") regex
      drop _pos*
      
      * convert to numeric
      destring _match1 _match2, gen(lon lat)
      destring _match3, gen(id)
      destring _match4, gen(BikeInjuri)
      destring _match5, gen(year)
      destring _match6, gen(MVOInjurie)
      destring _match7, gen(PedInjurie)
      destring _match8, gen(Injuries)

      Comment


      • #4
        @RobertPicard
        * make sure that we did not truncate any lines
        . assert length(line)<300
        148,045 contradictions in 148,049 observations
        assertion is false

        Comment


        • #5
          Robert Picard assuming your line break reference is related to separating objects from one another that would violate the JSON specification.

          Originally posted by Robert Picard View Post
          As I have mentioned before on this forum, JSON files are just plain ol' text files and I have yet to come across one that would not yield to targeted extraction of desired data using standard Stata data management tools.

          If you click on the link in #1, you see the content of the file in the browser. The only problem is that there are no line breaks so this is just a long data stream. You can however pick up the pattern quickly, and here are a few resulting lines once you insert a line break just before {"geometry":
          Code:
          {"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}},
          {"geometry": {"type": "Point", "coordinates": [-73.92232152582858, 40.72038062648219]}, "type": "Feature", "id": 1, "properties": {"BikeInjuri": 1, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 0, "Injuries": 1}},
          {"geometry": {"type": "Point", "coordinates": [-73.78568492829518, 40.7277301915846]}, "type": "Feature", "id": 2, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 2, "Injuries": 2}},
          {"geometry": {"type": "Point", "coordinates": [-73.96297460245067, 40.762487075452434]}, "type": "Feature", "id": 3, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 4, "PedInjurie": 1, "Injuries": 5}},
          {"geometry": {"type": "Point", "coordinates": [-73.92987364834694, 40.66399077159336]}, "type": "Feature", "id": 4, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 10, "PedInjurie": 0, "Injuries": 10}},
          {"geometry": {"type": "Point", "coordinates": [-73.94255814386595, 40.593588490323526]}, "type": "Feature", "id": 5, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 1, "Injuries": 2}},
          {"geometry": {"type": "Point", "coordinates": [-74.14864554649671, 40.564597628435045]}, "type": "Feature", "id": 6, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 0, "Injuries": 1}},
          {"geometry": {"type": "Point", "coordinates": [-73.96864226156205, 40.76159995950388]}, "type": "Feature", "id": 7, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 3, "PedInjurie": 1, "Injuries": 4}},
          You can use filefilter to insert the line breaks for the whole JSON file. Then it's just a matter of reading each line as a long string to input the data into Stata. In this case, all the data to extract appear to be numeric so I use moss (from SSC) to extract them all in one pass. Finally, you destring each match and name each variable as desired.

          Code:
          * insert line breaks
          filefilter injury_yearly.json injury_yearly.txt, from({"geometry") to(\r{"geometry") replace
          
          * read each like as a long string, increase the width if needed
          infix str300 line 1-300 using "injury_yearly.txt", clear
          compress
          
          * make sure that we did not truncate any lines
          assert length(line) < 300
          
          * since all the desired data appears to be numbers, just extract those
          moss line, match("([0-9\.]+)") regex
          drop _pos*
          
          * convert to numeric
          destring _match1 _match2, gen(lon lat)
          destring _match3, gen(id)
          destring _match4, gen(BikeInjuri)
          destring _match5, gen(year)
          destring _match6, gen(MVOInjurie)
          destring _match7, gen(PedInjurie)
          destring _match8, gen(Injuries)

          Comment

          Working...
          X