Using insheetjson on nested JSON data on New York City pedestrian injuries

Fabian Eckert

Join Date: Sep 2017

Posts: 2
#1

Using insheetjson on nested JSON data on New York City pedestrian injuries

24 Sep 2017, 14:26

Hi All

I am trying to get the new york city pedestrian injury data into stata:

http://www.nyc.gov/html/dot/download...ry_yearly.json

I am unclear how I have to specify the "columns" function of the insheetjson command to make this work.

A line from the file looks like this:
{"type":"FeatureCollection","features":[{"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}}

and using

insheetjson stringA (etc) using ".../Pedestrian Injuries/fatality_yearly.json", col("features","1","geometry","coordinates","1")

to extract the latitude yields the error message "invalid expression"

Help is much appreciated!

Best

Fabian
Tags: import, importing, insheet, insheetjson, JSON
wbuchanan

Join Date: Mar 2014

Posts: 1361
#2

29 Sep 2017, 20:41

Fabian Eckert
The first thing I would note is that you are not reading JSON, but a very specific application of JSON called GeoJSON. Depending on your specific needs/requirements, you may be able to use the querying functionality/capabilities in jsonio (https://wbuchanan.github.io/StataJSON).
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

30 Sep 2017, 11:24

As I have mentioned before on this forum, JSON files are just plain ol' text files and I have yet to come across one that would not yield to targeted extraction of desired data using standard Stata data management tools.

If you click on the link in #1, you see the content of the file in the browser. The only problem is that there are no line breaks so this is just a long data stream. You can however pick up the pattern quickly, and here are a few resulting lines once you insert a line break just before {"geometry":

Code:

{"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}},
{"geometry": {"type": "Point", "coordinates": [-73.92232152582858, 40.72038062648219]}, "type": "Feature", "id": 1, "properties": {"BikeInjuri": 1, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 0, "Injuries": 1}},
{"geometry": {"type": "Point", "coordinates": [-73.78568492829518, 40.7277301915846]}, "type": "Feature", "id": 2, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 2, "Injuries": 2}},
{"geometry": {"type": "Point", "coordinates": [-73.96297460245067, 40.762487075452434]}, "type": "Feature", "id": 3, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 4, "PedInjurie": 1, "Injuries": 5}},
{"geometry": {"type": "Point", "coordinates": [-73.92987364834694, 40.66399077159336]}, "type": "Feature", "id": 4, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 10, "PedInjurie": 0, "Injuries": 10}},
{"geometry": {"type": "Point", "coordinates": [-73.94255814386595, 40.593588490323526]}, "type": "Feature", "id": 5, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 1, "Injuries": 2}},
{"geometry": {"type": "Point", "coordinates": [-74.14864554649671, 40.564597628435045]}, "type": "Feature", "id": 6, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 0, "Injuries": 1}},
{"geometry": {"type": "Point", "coordinates": [-73.96864226156205, 40.76159995950388]}, "type": "Feature", "id": 7, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 3, "PedInjurie": 1, "Injuries": 4}},

You can use filefilter to insert the line breaks for the whole JSON file. Then it's just a matter of reading each line as a long string to input the data into Stata. In this case, all the data to extract appear to be numeric so I use moss (from SSC) to extract them all in one pass. Finally, you destring each match and name each variable as desired.

Code:

* insert line breaks
filefilter injury_yearly.json injury_yearly.txt, from({"geometry") to(\r{"geometry") replace

* read each like as a long string, increase the width if needed
infix str300 line 1-300 using "injury_yearly.txt", clear
compress

* make sure that we did not truncate any lines
assert length(line) < 300

* since all the desired data appears to be numbers, just extract those
moss line, match("([0-9\.]+)") regex
drop _pos*

* convert to numeric
destring _match1 _match2, gen(lon lat)
destring _match3, gen(id)
destring _match4, gen(BikeInjuri)
destring _match5, gen(year)
destring _match6, gen(MVOInjurie)
destring _match7, gen(PedInjurie)
destring _match8, gen(Injuries)

Comment

Andrés Lahur Talavera Cuya

Join Date: Dec 2018

Posts: 54
#4

16 Jan 2022, 00:54

@RobertPicard
* make sure that we did not truncate any lines
. assert length(line)<300
148,045 contradictions in 148,049 observations
assertion is false
Comment

wbuchanan

Join Date: Mar 2014
Posts: 1361

01 Feb 2022, 09:05

Robert Picard assuming your line break reference is related to separating objects from one another that would violate the JSON specification.

Originally posted by Robert Picard View Post

Code:

{"geometry": {"type": "Point", "coordinates": [-73.92834752578327, 40.64219748869916]}, "type": "Feature", "id": 0, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 9, "PedInjurie": 0, "Injuries": 9}},
{"geometry": {"type": "Point", "coordinates": [-73.92232152582858, 40.72038062648219]}, "type": "Feature", "id": 1, "properties": {"BikeInjuri": 1, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 0, "Injuries": 1}},
{"geometry": {"type": "Point", "coordinates": [-73.78568492829518, 40.7277301915846]}, "type": "Feature", "id": 2, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 0, "PedInjurie": 2, "Injuries": 2}},
{"geometry": {"type": "Point", "coordinates": [-73.96297460245067, 40.762487075452434]}, "type": "Feature", "id": 3, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 4, "PedInjurie": 1, "Injuries": 5}},
{"geometry": {"type": "Point", "coordinates": [-73.92987364834694, 40.66399077159336]}, "type": "Feature", "id": 4, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 10, "PedInjurie": 0, "Injuries": 10}},
{"geometry": {"type": "Point", "coordinates": [-73.94255814386595, 40.593588490323526]}, "type": "Feature", "id": 5, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 1, "Injuries": 2}},
{"geometry": {"type": "Point", "coordinates": [-74.14864554649671, 40.564597628435045]}, "type": "Feature", "id": 6, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 1, "PedInjurie": 0, "Injuries": 1}},
{"geometry": {"type": "Point", "coordinates": [-73.96864226156205, 40.76159995950388]}, "type": "Feature", "id": 7, "properties": {"BikeInjuri": 0, "YR": "2016", "MVOInjurie": 3, "PedInjurie": 1, "Injuries": 4}},

Code:

* insert line breaks
filefilter injury_yearly.json injury_yearly.txt, from({"geometry") to(\r{"geometry") replace

* read each like as a long string, increase the width if needed
infix str300 line 1-300 using "injury_yearly.txt", clear
compress

* make sure that we did not truncate any lines
assert length(line) < 300

* since all the desired data appears to be numbers, just extract those
moss line, match("([0-9\.]+)") regex
drop _pos*

* convert to numeric
destring _match1 _match2, gen(lon lat)
destring _match3, gen(id)
destring _match4, gen(BikeInjuri)
destring _match5, gen(year)
destring _match6, gen(MVOInjurie)
destring _match7, gen(PedInjurie)
destring _match8, gen(Injuries)

Announcement