Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reading JSON files into Stata

    This is partly a request to Stata and partly a request for guidance. There are an increasing number of databases that fill data requests by exporting JSON files. Unfortunately, JSON is a long way from being a standardised format, though there are modules in Python and other languages that identify and handle various JSON formats. However, it would be really nice if Stata could write the equivalent of import delimited to handle different JSON formats.

    For now, there is the user written procedure insheetjson.ado but I am unable to find a way of getting it to import one kind of JSON file format that I encounter frequently. To explain, the conventional format for JSON files is a set of nested tables or n-tuples as follows

    Code:
    {<headers> , "tablename":[{"vara":x1, "varb":y1, "varc":z1, ...}{"vara":x2,"varb":y2,"varc":z2, ...}{ ... } ]
    In effect, each row in the table is presented as an n-tuple and the input process is to identify the table and then read each row as a separate observation in the Stata datasheet. So far so good.

    However, there is another variant of JSON files in which tables are presented one column at time as in:

    Code:
    <headers>, "tablename":[{"vara":[x1,x2,x3, ...],"varb":[y1,y2,y3,...],"varc:[z1,z2,z3, ...], ...}]
    which is much messier to handle. Think of the standard variant as row-wise format and the alternative as column-wise format. I haven't seen this terminology used but it seems to be the logical way of thinking about different JSON presentations of tabular data. Generating column-wise formatted data seems a bit strange to me but there may be reasons why database designers have chosen to organise their data in this way.

    I am able to use insheetjson.ado to read row-wise JSON files but it gets column-wise formats completely wrong. Maybe I don't understand the options fully, but there is nothing that seems to allow reading column-wise formatted data.

    Any suggestions?

  • #2
    While waiting, fwiw:
    Code:
    python:
    
    import pandas as pd
    import json
    
    with open('data.json', 'r', encoding='utf-8') as file:
        json_data = json.load(file)
        
    df = pd.DataFrame(json_data["tablename"][0])
        
    df.to_stata('data_from_json.dta', version=118)
    
    end
    Code:
    use data_from_json.dta
    
    . list 
    
         +----------------------------+
         | index   vara   varb   varc |
         |----------------------------|
      1. |     0     x1     y1     z1 |
      2. |     1     x2     y2     z2 |
      3. |     2     x3     y3     z3 |
         +----------------------------+
    
    . type data.json
    {
        "headers": ["vara", "varb", "varc"],
        "tablename": [
            {"vara": ["x1", "x2", "x3"], "varb": ["y1", "y2", "y3"], "varc": ["z1", "z2", "z3"]}
        ]
    }

    Comment

    Working...
    X