  • Importing Very Large CSV into Stata

    I am trying to import a CSV with ~15 variables totaling about 80 GB. I am using Stata MP on a desktop with 512 GB of RAM, which I would presume is sufficient for a task like this. Yet I started the import with -import delimited- over 3 days ago, and it still has not loaded (Stata still shows it as importing). Has anyone else run into an issue like this, and if so, is this the type of situation where I just need to wait longer and it will eventually import, or will I end up waiting forever?

    Would anyone have recommendations on how best to tackle a situation like this? My initial thought was to use the rowrange option of -import delimited-, but I have been told this doesn't actually solve the problem because Stata still has to read the entire dataset into memory before it can filter down to a subset of rows. I've heard that SAS tends to be much more efficient for big-data tasks, since it doesn't need to read an entire dataset into memory at once, but I have no SAS coding knowledge and am not even sure I'd have a way to access it. I do know some Python and R, but I am unsure whether either would help, since I believe they also read an entire dataset into memory. Any help would be greatly appreciated!
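    For concreteness, the chunked loop I had in mind looks roughly like this (untested; "bigfile.csv", the slice size, and the 50-pass cap are placeholders, and I gather each pass may still rescan the file from the top, so this trades memory for time rather than eliminating the cost):

      local chunksize = 10000000              // rows per slice (placeholder)
      local first = 2                         // row 1 holds the variable names;
                                              // check -help import delimited- for how
                                              // rowrange() interacts with varnames(1)
      forvalues i = 1/50 {
          local last = `first' + `chunksize' - 1
          capture import delimited using "bigfile.csv", ///
              rowrange(`first':`last') varnames(1) clear
          if _rc | _N == 0 continue, break    // past the end of the file
          save chunk`i', replace
          local first = `last' + 1
      }
      use chunk1, clear
      forvalues i = 2/50 {
          capture append using chunk`i'       // silently skips slices never created
      }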

  • #2
    Sounds like a waiting game to me. Out of curiosity, what's in this dataset that makes it all worth it?

  • #3
    This sounds to me like your machine is hung. I'm usually the one here telling everybody to be patient and go read a good novel while they wait. But your experience seems disproportionate.

    Not that long ago, on a machine with only 24 GB of RAM, I read in a 5.5 GB text file, and it took about 25 minutes. In fairness, that was a fixed-format file read with -infix-, so there was essentially no parsing to be done, and it held about 5 million observations of about 40 variables, so a different layout. But tasks like this are, in any case, I/O bound. Even if we scale the file size up by a factor of 15, to a little over 80 GB, we are still talking about less than 7.5 hours (25 minutes × 15 ≈ 6.25 hours). Throw in any reasonable fudge factors you want for differences in processors and the inherently slower process of parsing a delimited file, and I still don't think you can get to 3 days.

    While I'm reasonably confident in my diagnosis, I don't have a treatment to suggest. I think the most likely reason -import delimited- is hung is that something in the csv file itself is corrupted. Generally speaking, -import delimited- is not very robust to irregularities in the source file: fields with unmatched quotes, lines with the wrong number of input fields (variables), missing (or excess) delimiters, and lines that don't terminate properly can all wreak havoc with it. I don't know of any good purpose-built tools for this, though a brute-force scan is sketched below.

  • #4
    Though R typically loads a dataset fully into memory, I recently learned about Apache Arrow, which is better equipped to handle large datasets. Here is the R documentation: https://arrow.apache.org/docs/r/

    If you'd like to integrate the import into your Stata code after you've figured it out in R, you can call R from Stata with this package: https://github.com/haghish/rcall

    Good luck! And if you don't mind sharing your do-file when you're done, please do message me so that I can learn from your experience. Using Arrow and calling R from Stata has been on my to-do list for a while, but I just haven't gotten around to it.
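    From the documentation, though, I'd expect the combination to look roughly like this (untested on my end; "bigfile.csv", the year filter, and "subset.csv" are all placeholders; arrow's open_dataset() scans the CSV lazily, so only the collected subset should ever sit in memory):

      rcall vanilla: library(arrow); library(dplyr);               ///
          ds <- open_dataset("bigfile.csv", format = "csv");       ///
          sub <- collect(filter(ds, year >= 2015));                ///
          write_csv_arrow(sub, "subset.csv")
      import delimited using "subset.csv", varnames(1) clear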

    Incidentally, I once tried to import a 30 GB CSV using Stata MP with 64 GB of RAM and it kept crashing, but I got around it with the rowrange option, which is why I haven't needed Arrow yet.

  • #5
    Cross-posted at https://stackoverflow.com/questions/...tatistical-sof

  • #6
    The user-written module -chunky- is easy to use and might help here. See -ssc describe chunky-.
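    For what it's worth, a minimal sketch of that route (untested; the option names below are from -chunky-'s help file as I remember them, so check -help chunky-, and the file name and 2 GB chunk size are placeholders):

      ssc install chunky
      chunky using "bigfile.csv", chunksize(2g) header(include) stub(part) replace

      * import each piece, assuming chunky keeps the .csv extension
      local parts : dir . files "part*.csv"
      foreach f of local parts {
          import delimited using "`f'", varnames(1) clear
          save "`f'.dta", replace            // crude names; adjust to taste
      }

    Since -chunky- splits the raw text file on disk, nothing large ever has to fit in memory at once.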
