Performing one code on 3,000 files, generating variable with file name

Nicole Grausam

Join Date: Apr 2022

Posts: 4
#1

Performing one code on 3,000 files, generating variable with file name

30 Apr 2022, 06:24

Hi all! I am totally new to stata and I was wondering if someone could help me out with my problem.

On my laptop, I have a folder with roughly 3,000 csv files for 3,000 different stocks (data from Robintrack). Each of these csv files contains information about the number of Robinhood accounts that held this stock for every hour from beginning of 2018 to August 2020. Therefore, there are 24 observations per day. However, I only need 1 observation per day (the last one), and I only need the data from January 2019 to August 2020. Also, I want to generate a new variable that lists the name of the file (which is the ticker of the stock).

For one of the files (named "_PRN"), I figured out a way to only keep one observation per day, to generate a new variable with the file name and to drop the observations from 2018. I split the variable "timestamp" because it listed the date and the time - after the split, timestamp1 is the variable that only contains the date (so for example 2019-01-01):

clear all
set more off
cd C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test
insheet using _PRN.csv, clear
gen file_name = "_PRN"
split timestamp
bysort timestamp1 : gen seq=_n
bysort timestamp1 : keep if _n==_N
drop if strpos(timestamp1,"2018")>0

Does somebody know how to perform this for all of the 3,000 csv files in my folder? Especially regarding the variable with the file name? Thank you so so much in advance, your help is really appreciated.

All the best,
Nicole
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

30 Apr 2022, 06:55

Please provide a dataex example of one of the files.

And, include the name of that file in a // comment. You will need a loop to accomplish this.
Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4358

30 Apr 2022, 07:02

It would be like the following. I've commented out your lines of code that are replaced by mine, and those that are not needed at all.

Code:

version 17.0

clear all

// set more off
// cd C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test

local dir "C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test"

local files : dir "`dir'" files "*.csv", respectcase

tempfile dataset
quietly save `dataset', emptyok

foreach file of local files {

    // insheet using _PRN.csv, clear
    import delimited using "`dir'/`file'", varnames(1) clear

    // gen file_name = "_PRN"
    local file_name : subinstr local file ".csv" ""
    generate str file_name = "`file_name'"

    // split timestamp
    // bysort timestamp1 : gen seq=_n
    // bysort timestamp1 : keep if _n==_N
    // drop if strpos(timestamp1,"2018")>0
    drop if substr(timestamp, 1, 4) > "2018
    generate str hold_dt = substr(timestamp, 1, 10)
    bysort hold_dt (timestamp): keep if _n == _N

    append using `dataset'
    quietly save `dataset', replace
}

It's obviously untested and there are some questions that you'd need to answer, for example, is there really a space in the directory names?

Also, you'll need to make sure that your Stata is set up to accommodate the names of your 3000 files in a local macro. See

Code:

help limits

and scroll down to footnote (2) for help on that.

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4358
#4

30 Apr 2022, 07:12

You'll need a closing double-quotation mark on the line that restricts the data to the first year.

Code:

drop if substr(timestamp, 1, 4) > "2018"

Again, it's untested. And as Jared mentions, in the absence of actual example, I've assumed based upon the text in your post that the date-time information follows international convention.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4358
#5

30 Apr 2022, 07:39

Originally posted by Nicole Grausam View Post

I only need the data from January 2019 to August 2020.

Ah, sorry, misread that. Change that troublesome line to

Code:

drop if substr(timestamp, 1, 4) <= "2018"

And you will want to add quietly in front of the foreach in order to avoid the voluminous output incumbent upon 3000 files' worth of processing.
Comment
Nicole Grausam

Join Date: Apr 2022

Posts: 4
#6

30 Apr 2022, 07:48

Thank you so much Joseph!!! It worked!! Thank you I really appreciate it!
Comment

Announcement

Performing one code on 3,000 files, generating variable with file name

Comment

Comment

Comment

Comment

Comment