Hi all! I am totally new to Stata and was wondering if someone could help me out with my problem.
On my laptop, I have a folder with roughly 3,000 CSV files for 3,000 different stocks (data from Robintrack). Each of these CSV files contains the number of Robinhood accounts that held the stock, recorded every hour from the beginning of 2018 to August 2020. Therefore, there are 24 observations per day. However, I only need one observation per day (the last one), and I only need the data from January 2019 to August 2020. Also, I want to generate a new variable that holds the name of the file (which is the ticker of the stock).
For one of the files (named "_PRN"), I figured out a way to keep only one observation per day, to generate a new variable with the file name, and to drop the observations from 2018. I split the variable "timestamp" because it contains both the date and the time. After the split, timestamp1 is the variable that contains only the date (for example, 2019-01-01):
clear all
set more off
cd "C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test"
insheet using _PRN.csv, clear
gen file_name = "_PRN"
split timestamp
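* sort within each day by the full timestamp so the last row is the latest hour
bysort timestamp1 (timestamp) : gen seq=_n
bysort timestamp1 (timestamp) : keep if _n==_N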
drop if strpos(timestamp1,"2018")>0
Does somebody know how to do this for all 3,000 CSV files in my folder, especially the variable with the file name? Thank you so much in advance, your help is really appreciated.
All the best,
Nicole