Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Accessing Wrds data with JDBC

    My coauthor and I are trying to download data from a WRDS dataset using desktop Stata (17). The following code succeeds in downloading a very small sample:

    Code:
    local vars  "fyear,ceq,conm,ni"
    
    local gvkeylist "'001891', '001300', '028924'"
    
    jdbc load, exec("select `vars' from comp.funda where fyear between 2018 and 2020 and gvkey in (`gvkeylist') ")
    We need to read data not just for the 3 gvkeys currently specified in gvkeylist, but for several thousand stored in a csv file key1.csv. key1.csv has one gvkey per row and is stored on the local hard drive. (It should be easy to add the surrounding single pops if necessary.) Is there a way to read the file into the local gvkeylist? Or is some other strategy required?

    Thanks,
    Devra
    Devra Golbe
    Professor Emerita, Dept. of Economics
    Hunter College, CUNY

  • #2
    Perhaps this will start you in a useful direction.
    Code:
    input str6 gvkey
    001891
    001300
    028924
    end
    generate gk = "'" + gvkey + "'"
    quietly levelsof gk, clean local(gvkeylist) separate(,)
    display `" in (`gvkeylist') "'
    Code:
    . display `" in (`gvkeylist') "'
     in ('001300','001891','028924')
    With that said, perhaps you have so many gvkeys that levelsof will fail, although it seems unlikely. In that case, this may help.
    Code:
    local gvkeylist = "'" + gvkey[1] + "'"
    forvalues i = 2/`=c(N)' {
        local gvkeylist = "`gvkeylist'" + ",'" + gvkey[`i'] + "'"
    }
    display `" in (`gvkeylist') "'
    Code:
    . display `" in (`gvkeylist') "'
     in ('001891','001300','028924')

    Comment


    • #3
      Thanks! Very helpful. This code runs fine when there are 3 observations in key1.csv -- although the capture display statement displays nothing.

      Code:
      infile str6 gvkey using key1.csv
      generate gk="'" + gvkey + "'"
      quietly levelsof gk, clean local (gvkeylist) separate(,)
      quietly display `" in (`gvkeylist') "'
      drop gk gvkey
      
      local sql_statement ///
          select `vars' from comp.funda where ///
          fyear between 2018 and 2020 and gvkey in (`gvkeylist') ///
          and consol in ('N', 'C', 'P') and INDFMT in ('INDL') ///
          and DATAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')
      
      capture display in smcl as text "`sql_statement'"
      jdbc load, exec("`sql_statement'")
      However, when there are 2000 + observations in key1.csv, we get the following error:

      .
      Code:
       jdbc load, exec("`sql_statement'")
      Incorrect number of args...
              Expected <string sqlstmt> <long rows> <boolean allstring> <boolean batch> <boolean filter>
      >   | <int resultType>
      r(198);
      I'd be grateful for the correct syntax for that display statement-- though the resulting text will probably be super long. I'd also be grateful for an explanation of the error statement from the jdbc load command, and some insight on how to fix it or get around it.

      Thanks,
      Devra



      Devra Golbe
      Professor Emerita, Dept. of Economics
      Hunter College, CUNY

      Comment


      • #4
        I do not understand why you have capture in front of the display command, which is precisely why nothing is displayed.
        Code:
        set obs 2000
        generate gvkey = string(100000+_n)
        
        generate gk="'" + gvkey + "'"
        quietly levelsof gk, clean local (gvkeylist) separate(,)
        quietly display `" in (`gvkeylist') "'
        drop gk gvkey
        
        local sql_statement ///
            select `vars' from comp.funda where ///
            fyear between 2018 and 2020 and gvkey in (`gvkeylist') ///
            and consol in ('N', 'C', 'P') and INDFMT in ('INDL') ///
            and DATAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')
        
        display in smcl as text "`sql_statement'"
        Code:
        . display in smcl as text "`sql_statement'"
        select  from comp.funda where     fyear between 2018 and 2020 and gvkey in ('100001','100002','1
        > 00003','100004','100005','100006','100007','100008','100009','100010','100011','100012','10001
        > 3','100014','100015','100016','100017','100018','100019','100020','100021','100022','100023','
           [intervening output edited out]
        > ','101967','101968','101969','101970','101971','101972','101973','101974','101975','101976','1
        > 01977','101978','101979','101980','101981','101982','101983','101984','101985','101986','10198
        > 7','101988','101989','101990','101991','101992','101993','101994','101995','101996','101997','
        > 101998','101999','102000')     and consol in ('N', 'C', 'P') and INDFMT in ('INDL')     and DA
        > TAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')
        So I'm guessing that the problem is either (a) an overlooked error in one of the lines of key1.csv, or perhaps a blank line at the beginning or end, which the display command will show, or a problem with jdbc.

        Comment


        • #5
          Thanks! I had forgotten that capture would suppress any output, as well as error messages. :-)
          Devra Golbe
          Professor Emerita, Dept. of Economics
          Hunter College, CUNY

          Comment


          • #6
            And for the "best of both worlds," there is -capture noisily- which shows the output and the error messages but does not stop execution.

            Comment


            • #7
              Devra Golbe -

              A few thoughts occurred to me overnight.

              First, the error message you are seeing is being passed through by Stata from - I expect - the underlying WRDS database, and there are perhaps several layers of middleware between you and the database. It is possible that your generated SQL select command - if it looks good in the display output - is being mangled by the middleware on its way to the database, which may have different limits on SQL commands than the underlying database.

              If you haven't been able to make your command work, one possibility is to create a loop that selects the data for some number greater than 3 and less than 2000+ of your gvkeys at a time, and then appends the selected data to a Stata dataset saved on disk, so you can build up your data by pieces.

              In doing this you'll want to be sure you haven't accidentally included any gvkey twice in key1.csv, so after you import it with infile, run duplicates drop to ensure there are no duplicates. When you process the entire list at once, levelsof takes care of deduplication, but if you're processing in batches, the same gvkey can appear in two separate batchts, and the result will be that you have duplicated observations in the appended data.

              Comment


              • #8
                Thanks, William! I bet you're right. I like your suggestion to build the dataset in pieces. Alternatively, I think I need to learn more about how to join the two datasets inside the SQL.

                Devra
                Devra Golbe
                Professor Emerita, Dept. of Economics
                Hunter College, CUNY

                Comment

                Working...
                X