Accessing Wrds data with JDBC

Devra Golbe

Join Date: Apr 2014

Posts: 170
#1

Accessing Wrds data with JDBC

04 Mar 2022, 13:28

My coauthor and I are trying to download data from a WRDS dataset using desktop Stata (17). The following code succeeds in downloading a very small sample:

Code:

local vars "fyear,ceq,conm,ni" local gvkeylist "'001891', '001300', '028924'" jdbc load, exec("select `vars' from comp.funda where fyear between 2018 and 2020 and gvkey in (`gvkeylist') ")

We need to read data not just for the 3 gvkeys currently specified in gvkeylist, but for several thousand stored in a csv file key1.csv. key1.csv has one gvkey per row and is stored on the local hard drive. (It should be easy to add the surrounding single pops if necessary.) Is there a way to read the file into the local gvkeylist? Or is some other strategy required?

Thanks,
Devra

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

04 Mar 2022, 14:01

Perhaps this will start you in a useful direction.

Code:

input str6 gvkey
001891
001300
028924
end
generate gk = "'" + gvkey + "'"
quietly levelsof gk, clean local(gvkeylist) separate(,)
display `" in (`gvkeylist') "'

Code:

. display `" in (`gvkeylist') "'
 in ('001300','001891','028924')

With that said, perhaps you have so many gvkeys that levelsof will fail, although it seems unlikely. In that case, this may help.

Code:

local gvkeylist = "'" + gvkey[1] + "'"
forvalues i = 2/`=c(N)' {
    local gvkeylist = "`gvkeylist'" + ",'" + gvkey[`i'] + "'"
}
display `" in (`gvkeylist') "'

Code:

. display `" in (`gvkeylist') "'
 in ('001891','001300','028924')

Comment

Devra Golbe

Join Date: Apr 2014
Posts: 170

05 Mar 2022, 15:12

Thanks! Very helpful. This code runs fine when there are 3 observations in key1.csv -- although the capture display statement displays nothing.

Code:

infile str6 gvkey using key1.csv
generate gk="'" + gvkey + "'"
quietly levelsof gk, clean local (gvkeylist) separate(,)
quietly display `" in (`gvkeylist') "'
drop gk gvkey

local sql_statement ///
    select `vars' from comp.funda where ///
    fyear between 2018 and 2020 and gvkey in (`gvkeylist') ///
    and consol in ('N', 'C', 'P') and INDFMT in ('INDL') ///
    and DATAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')

capture display in smcl as text "`sql_statement'"
jdbc load, exec("`sql_statement'")

However, when there are 2000 + observations in key1.csv, we get the following error:

.

Code:

 jdbc load, exec("`sql_statement'")
Incorrect number of args...
        Expected <string sqlstmt> <long rows> <boolean allstring> <boolean batch> <boolean filter>
>   | <int resultType>
r(198);

I'd be grateful for the correct syntax for that display statement-- though the resulting text will probably be super long. I'd also be grateful for an explanation of the error statement from the jdbc load command, and some insight on how to fix it or get around it.

Thanks,
Devra

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

05 Mar 2022, 15:31

I do not understand why you have capture in front of the display command, which is precisely why nothing is displayed.

Code:

set obs 2000
generate gvkey = string(100000+_n)

generate gk="'" + gvkey + "'"
quietly levelsof gk, clean local (gvkeylist) separate(,)
quietly display `" in (`gvkeylist') "'
drop gk gvkey

local sql_statement ///
    select `vars' from comp.funda where ///
    fyear between 2018 and 2020 and gvkey in (`gvkeylist') ///
    and consol in ('N', 'C', 'P') and INDFMT in ('INDL') ///
    and DATAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')

display in smcl as text "`sql_statement'"

Code:

. display in smcl as text "`sql_statement'"
select  from comp.funda where     fyear between 2018 and 2020 and gvkey in ('100001','100002','1
> 00003','100004','100005','100006','100007','100008','100009','100010','100011','100012','10001
> 3','100014','100015','100016','100017','100018','100019','100020','100021','100022','100023','
   [intervening output edited out]
> ','101967','101968','101969','101970','101971','101972','101973','101974','101975','101976','1
> 01977','101978','101979','101980','101981','101982','101983','101984','101985','101986','10198
> 7','101988','101989','101990','101991','101992','101993','101994','101995','101996','101997','
> 101998','101999','102000')     and consol in ('N', 'C', 'P') and INDFMT in ('INDL')     and DA
> TAFMT in ('STD') and POPSRC in ('D') and CURCD in ('USD', 'CAD')

So I'm guessing that the problem is either (a) an overlooked error in one of the lines of key1.csv, or perhaps a blank line at the beginning or end, which the display command will show, or a problem with jdbc.

Comment

Devra Golbe

Join Date: Apr 2014

Posts: 170
#5

05 Mar 2022, 15:44

Thanks! I had forgotten that capture would suppress any output, as well as error messages. :-)

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

05 Mar 2022, 15:49

And for the "best of both worlds," there is -capture noisily- which shows the output and the error messages but does not stop execution.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

06 Mar 2022, 09:05

Devra Golbe -

A few thoughts occurred to me overnight.

First, the error message you are seeing is being passed through by Stata from - I expect - the underlying WRDS database, and there are perhaps several layers of middleware between you and the database. It is possible that your generated SQL select command - if it looks good in the display output - is being mangled by the middleware on its way to the database, which may have different limits on SQL commands than the underlying database.

If you haven't been able to make your command work, one possibility is to create a loop that selects the data for some number greater than 3 and less than 2000+ of your gvkeys at a time, and then appends the selected data to a Stata dataset saved on disk, so you can build up your data by pieces.

In doing this you'll want to be sure you haven't accidentally included any gvkey twice in key1.csv, so after you import it with infile, run duplicates drop to ensure there are no duplicates. When you process the entire list at once, levelsof takes care of deduplication, but if you're processing in batches, the same gvkey can appear in two separate batchts, and the result will be that you have duplicated observations in the appended data.
Comment
Devra Golbe

Join Date: Apr 2014

Posts: 170
#8

06 Mar 2022, 10:42

Thanks, William! I bet you're right. I like your suggestion to build the dataset in pieces. Alternatively, I think I need to learn more about how to join the two datasets inside the SQL.

Devra

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY
Comment

Announcement