Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • JSON File into Stata

    Dear all,

    I face a problem using a .json file (https://www.sec.gov/files/company_tickers.json) in Stata. Using the insheetjson command gets me the following problem with this code.

    Any help would be appriciated!

    Code:

    Code:
    gen str10 cik_str = ""
    gen str10 ticker = ""
    gen str50 title = ""
    
    insheetjson cik_str ticker title using "cik_ticker_company.json", columns("cik_str" "ticker" "title") tableselector("*") flatten replace
    Response:


    Code:
    libjson::parse: unexpected character '9' at position 68609 forced error #-22
    VERSAL DISPLAY CORP \\PA\\"},"968":{"cik_str":6201,"ticker":
                                  ^  parse error here!
      IONAL INC /DE/"},"967":{"cik_str":1005284,"ticker":"OLED","title":"UNIVERSAL DISPLAY CORP \\PA\\"},"968":{"cik_
    > str":6201,"ticker":
     Warning: No response from source?!?

  • #2
    If you can't get that to work for you, then you can try a Python approach if you have it set up. For example, the following code works. (Do-file and log file attached if you want to explore it further.)
    Code:
    version 18.0
    
    clear *
    
    copy "https://www.sec.gov/files/company_tickers.json" "company_tickers.json"
    
    python:
    
    import json
    import pandas as pd
    from sfi import Data
    
    f = open("company_tickers.json", "r")
    data_frame = pd.DataFrame(json.load(f)).transpose()
    f.close()
    
    Data.setObsTotal(len(data_frame))
    
    Data.addVarLong("cik_str")
    Data.addVarStr("ticker", 10)
    Data.addVarStr("title", 100)
    
    Data.store("cik_str", None, data_frame["cik_str"], None)
    Data.store("ticker", None, data_frame["ticker"], None)
    Data.store("title", None, data_frame["title"], None)
    
    end
    
    compress
    list in 1/5, noobs
    
    count
    
    exit
    Unfortunately, for the moment in Stata proper you can't use context manager or anything in Python that requires indentation (which is just about everything), and so it's a little clunky if you want to run it from Stata's Python environment as above. But you can put the Python code (with indentation) in a script file and call it with python script.
    Attached Files

    Comment


    • #3
      In the meanwhile, an alternative, inspired in

      https://www.stata.com/meeting/chicag..._matsuoka.pptx

      would be:

      Code:
      clear all
      tempfile f1 f2
      copy "company_tickers.json" `f1'
      hexdump `f1', analyze
      filefilter `f1' `f2', from ("\r\n") to ("")
      filefilter `f2' `f1', replace from ("{") to ("\r\n{")
      import delimited `f1', delimit("@@@", asstring) clear
      split v1  , p(`",""') gen (x)
      list x* in 1/3
      replace x1 = subinstr( x1 , `"{"cik_str":"',  "", .)
      replace x2 = subinstr( x2 , `"ticker":"',  "", .)
      replace x3 = subinstr( x3 , `"title":"',  "", .)
      replace x3 = subinstr( x3 , `"}"',  "", .)
      replace x4 = subinstr( x4 , `"":"',  "", .)
      list x* in 1/10
      
           +----------------------------------------------------------------------+
           |      x1        x2                                            x3   x4 |
           |----------------------------------------------------------------------|
        1. |   {"0":                                                              |
        2. |  320193    "AAPL"                                  "Apple Inc."    1 |
        3. |  789019    "MSFT"                              "MICROSOFT CORP"    2 |
        4. | 1067983   "BRK-B"                      "BERKSHIRE HATHAWAY INC"    3 |
        5. |  731766     "UNH"                      "UNITEDHEALTH GROUP INC"    4 |
           |----------------------------------------------------------------------|
        6. |  200406     "JNJ"                           "JOHNSON & JOHNSON"    5 |
        7. |   34088     "XOM"                            "EXXON MOBIL CORP"    6 |
        8. | 1045810    "NVDA"                                 "NVIDIA CORP"    7 |
        9. | 1046179     "TSM"   "TAIWAN SEMICONDUCTOR MANUFACTURING CO LTD"    8 |
       10. |  104169     "WMT"                                "Walmart Inc."    9 |
           +----------------------------------------------------------------------+
      Last edited by Luis Pecht; 15 Nov 2023, 03:58. Reason: remembered,also, using jsonio, somewhere

      Comment


      • #4
        or, using jsonio (from SSC)

        Code:
        jsonio kv, file("company_tickers.json") nourl elem("[0-9]+")
        split key  , p("/") gen (x)
        drop x1 key
        reshape wide value,i(x2) j(x3) string
        . list in -10/-1
        
               +---------------------------------------------------------------------+
               |   x2   valuec~r   valuet~r                               valuetitle |
               |---------------------------------------------------------------------|
        11637. | 9990    1856948      CHEAW                  Chenghe Acquisition Co. |
        11638. | 9991    1878074      BWAQU              Blue World Acquisition Corp |
        11639. | 9992    1878074      BWAQR              Blue World Acquisition Corp |
        11640. | 9993    1871983      ANGHW                              Anghami Inc |
        11641. | 9994    1870404      PBAXW        PHOENIX BIOTECH ACQUISITION CORP. |
               |---------------------------------------------------------------------|
        11642. | 9995    1870404      PBAXU        PHOENIX BIOTECH ACQUISITION CORP. |
        11643. | 9996    1842937      HCVIW     Hennessy Capital Investment Corp. VI |
        11644. | 9997    1842937      HCVIU     Hennessy Capital Investment Corp. VI |
        11645. | 9998    1843762      EQRXW                               EQRx, Inc. |
        11646. | 9999    1845550      APTMU   Alpha Partners Technology Merger Corp. |
               +---------------------------------------------------------------------+
        
        ​​​​​​​

        Comment


        • #5
          Dear Joseph and Luis, thank you very much for your effort and the corresponding solutions!

          Since the second approach from Luis seems to be the shortest and (from my point of view) the most logical one, I'll go with this one.

          Comment


          • #6
            Actually, insheetjson (from SSC) presents two problems here:

            1) you will need to replace any "\" to "", externally, before parse it;
            2) this JSON structure embodies 11645 different objects, sequentially numbered, not a array, so you would need to loop thru it;

            Code:
            . gen str10 cik_str = ""
            . gen str10 ticker = ""
            . gen str50 title = ""
            . insheetjson cik_str ticker title using "company_tickers.json", columns("cik_str" "ticker" "title") tableselector("11645") replace
             1 observations updated/written.
            
            . list
            
                 +---------------------------------------+
                 | cik_str   ticker                title |
                 |---------------------------------------|
              1. |  753308   NEE-PP   NEXTERA ENERGY INC |
                 +---------------------------------------+

            Comment


            • #7
              While insheetjson, jsonio and Python occasionally work, my experience is that they are more trouble than they are worth for any but the simplest JSON structures. The problem is files that are nested more than one level deep, especially when the data is not organised as a table, even if the core data is tabular. For some reason jsonio seems to generate Java errors in many cases (using StataNow 18.5) and similarly I get Python errors when trying to follow Joseph's example. I find that insheetjson only works when the JSON structure is strictly tabular. That is as advertised but what seems to be tabular data is often organised differently in JSON files. I have no doubt that part of the blame lies with database managers who generate excessively complicated JSON structures.

              From experience, my preferred option is to convert any JSON file that I want to import into Stata to CSV format first and then use import delimited ... which is easy to control. [Of course, things would be so much easier if data providers would export CSV files but in some cases they appear to deliberately avoid that option. Nested tables need only some thought to export as CSV files but data managers prefer you to do the work rather than them.] How you do this will vary from whether you are carrying out an occasional conversion of small files, in which case online programs are an easy option. Since I have moderately complex and recurrent data extraction programs, I prefer to use a command line utility that can be called from Stata. These vary from the simple to the very powerful (but costly) such as jq.

              My current utility of choice is provided by Gunamoi Software (https://www.gunamoi.com.au/soft/savejson2csv/index.html) which offers a program called SaveJson2Csv as a menu driven GUI or a command line interface. I have no connection with them. The program is freeware but with donations invited. No doubt, at some point I will get dissatisfied with it but with the amount of time that I have spent trying to get other procedures working, I can strongly recommend it.

              Comment

              Working...
              X