Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Performing one code on 3,000 files, generating variable with file name

    Hi all! I am totally new to stata and I was wondering if someone could help me out with my problem.

    On my laptop, I have a folder with roughly 3,000 csv files for 3,000 different stocks (data from Robintrack). Each of these csv files contains information about the number of Robinhood accounts that held this stock for every hour from beginning of 2018 to August 2020. Therefore, there are 24 observations per day. However, I only need 1 observation per day (the last one), and I only need the data from January 2019 to August 2020. Also, I want to generate a new variable that lists the name of the file (which is the ticker of the stock).

    For one of the files (named "_PRN"), I figured out a way to only keep one observation per day, to generate a new variable with the file name and to drop the observations from 2018. I split the variable "timestamp" because it listed the date and the time - after the split, timestamp1 is the variable that only contains the date (so for example 2019-01-01):

    clear all
    set more off
    cd C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test
    insheet using _PRN.csv, clear
    gen file_name = "_PRN"
    split timestamp
    bysort timestamp1 : gen seq=_n
    bysort timestamp1 : keep if _n==_N
    drop if strpos(timestamp1,"2018")>0

    Does somebody know how to perform this for all of the 3,000 csv files in my folder? Especially regarding the variable with the file name? Thank you so so much in advance, your help is really appreciated.

    All the best,
    Nicole

  • #2
    Please provide a dataex example of one of the files.

    And, include the name of that file in a // comment. You will need a loop to accomplish this.

    Comment


    • #3
      It would be like the following. I've commented out your lines of code that are replaced by mine, and those that are not needed at all.
      Code:
      version 17.0
      
      clear all
      
      // set more off
      // cd C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test
      
      local dir "C:\Users\graus\OneDrive\Dokumente\Masterstudium\MA \DATA\test"
      
      local files : dir "`dir'" files "*.csv", respectcase
      
      tempfile dataset
      quietly save `dataset', emptyok
      
      foreach file of local files {
      
          // insheet using _PRN.csv, clear
          import delimited using "`dir'/`file'", varnames(1) clear
      
          // gen file_name = "_PRN"
          local file_name : subinstr local file ".csv" ""
          generate str file_name = "`file_name'"
      
          // split timestamp
          // bysort timestamp1 : gen seq=_n
          // bysort timestamp1 : keep if _n==_N
          // drop if strpos(timestamp1,"2018")>0
          drop if substr(timestamp, 1, 4) > "2018
          generate str hold_dt = substr(timestamp, 1, 10)
          bysort hold_dt (timestamp): keep if _n == _N
      
          append using `dataset'
          quietly save `dataset', replace
      }
      It's obviously untested and there are some questions that you'd need to answer, for example, is there really a space in the directory names?

      Also, you'll need to make sure that your Stata is set up to accommodate the names of your 3000 files in a local macro. See
      Code:
      help limits
      and scroll down to footnote (2) for help on that.

      Comment


      • #4
        You'll need a closing double-quotation mark on the line that restricts the data to the first year.
        Code:
        drop if substr(timestamp, 1, 4) > "2018"
        Again, it's untested. And as Jared mentions, in the absence of actual example, I've assumed based upon the text in your post that the date-time information follows international convention.

        Comment


        • #5
          Originally posted by Nicole Grausam View Post
          I only need the data from January 2019 to August 2020.
          Ah, sorry, misread that. Change that troublesome line to
          Code:
          drop if substr(timestamp, 1, 4) <= "2018"
          And you will want to add quietly in front of the foreach in order to avoid the voluminous output incumbent upon 3000 files' worth of processing.

          Comment


          • #6
            Thank you so much Joseph!!! It worked!! Thank you I really appreciate it!

            Comment

            Working...
            X