  • Import of 18GB csv file

    I am a long-time user of Stata, but this is the first large dataset I have tried to manage. I have what may be a very basic question.

    Can I use Stata to open this dataset if I simply wait long enough (I've waited ~30 minutes with no success)? Is my computer's memory the limiting factor?

    .csv file size: ~18 GB
    Stata version: Stata/IC 16.1 for Mac (Intel 64-bit)
    Computer: 2019 MacBook Pro, 16 GB RAM (2.3 GHz 8-core Intel i9)

    I want to drop ~90% of the observations as soon as the file is open, but it is this first step that is defeating me.

    Many thanks



  • #2
    You can do this, but I would hesitate unless I had > 18 GB of RAM, since you may need to read the entire file into memory before you can drop any observations. You can also use -if/in- qualifiers with -import delimited-, but I am unsure offhand whether that filtering happens as the observations are read in, or only after the entire file has been read.

    If I had to do this, I would consider chunking the file by some other means (e.g., Python) into more manageable pieces for import into Stata. Since you mention wanting to drop a substantial fraction of the observations, I would write a loop to import each chunk, apply the filter, and then append the result to a growing final dataset.



    • #3
      The user-written command -chunky- (see -search chunky-) does the kind of thing Leonardo suggests. It will break a large file into pieces and, as would likely be desired for a CSV file with variable names on the first line (a header), insert that header into each piece so that each piece can be imported on its own.

      I have not used this program for some time, but my recollection is that it was fast and robust. In the current situation, the command might be used something like this:
      Code:
      chunky using "YourFile.csv", stub(work) header(include) chunksize(1 GB)
      //
      // This has broken the original file into chunks about 1GB (user choice) in size, and saved them to
      // files named work0001, work0002, ... in the current directory
      //
      // Import and append each one, dropping the extraneous observations along the way.
      clear
      save wholefile.dta, emptyok         // file that will hold the pieces
      foreach f in `s(filelist)' {        // chunky leaves behind a list of chunk file names
         clear
         di "Working on file: `f'"
         import delimited using "`f'", ...   // fill in import options to suit your file
         drop if SomeCondition               // 90% of observations to be dropped, right?
         append using "wholefile.dta"
         save "wholefile.dta", replace       // -replace- since the file already exists
      }
      I don't have any good data around to test this on, but the syntax should be close. Even if your computer *can* handle the whole 18 GB file, this might actually be just about as fast. You'd also know where the process stood at any point, and if something failed, you'd at least have part of the job done.
      Last edited by Mike Lacy; 15 Feb 2024, 11:45. Reason: P.S., afterthought: I never tried -chunk- on something this big, and it's an older program, so hearing whether it worked or not would be interesting.



      • #4
        I'm curious about this myself, and I had forgotten that -chunk- exists. If I get some time this weekend, I might experiment to see how it works.



        • #5
          To avoid any confusion: the correct name is -chunky-. Sorry for the error in my follow-up comment.



          • #6
            To add to the excellent guidance others have already given... You may need much more, or maybe even less, memory to import that entire file; it really depends on its content and is difficult to predict. For example, "1.2" needs 3 bytes of storage in the csv file but will require 4 bytes of storage (float) in Stata. Once a float is needed, even a simple value like "6" in the same column (variable) will require 4 bytes once imported into Stata. Similarly, a string variable will consume as many bytes per observation as its widest value in that column. The same rule can work in reverse for floating-point numbers: "1.234567", for example, needs 8 bytes of storage as text in the csv file but requires only 4 bytes to store as a float variable in Stata. With that said, a good starting point is to expect to need about 1.5 times as much memory as the original file size.

            -import delimited- does have some facilities that can help you directly: see the rowrange() and colrange() options. They take effect as the csv file is parsed, rather than pruning the dataset after it has been imported; in other words, they are memory efficient.
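
            For what it's worth, here is a rough, untested sketch of how rowrange() plus the width of an imported sample could be used to gauge memory needs before attempting the full import. The file name, sample size, and guessed total row count are placeholders, and I have not checked whether rowrange() counts the header row, so treat the exact range as approximate:
            Code:
            * read a modest sample of rows (placeholder file name and range)
            clear
            import delimited using "YourFile.csv", rowrange(1:100000) varnames(1)
            * c(width) is the number of bytes Stata needs per observation
            display "bytes per observation: " c(width)
            * extrapolate to a guessed total row count (500 million is a placeholder)
            display "approx. GB for the full dataset: " %6.1f c(width) * 500e6 / 1e9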



            • #7
              If you're willing and able to use Python, this program might be just what you're looking for: check out -readit- by William Buchanan and Brian Quistorff.



              • #8
                Yet another flexible alternative: read the csv into a database (SQLite, for example), then use JDBC or ODBC to connect and pull in any subset of the data.
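
                In case it is useful, here is a minimal sketch of the final Stata step, assuming the csv has already been loaded into a database table reachable through an ODBC data source; the DSN name, table name, and WHERE clause below are placeholders rather than anything from this thread:
                Code:
                * pull only the wanted subset straight from the database (names are placeholders)
                odbc load, exec("SELECT * FROM bigtable WHERE keepflag = 1") dsn("mydb") clear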

