  • Importing large .txt file

    Dear Statalist,

    I have a ~50GB tab-delimited .txt file and I'm having trouble importing it into Stata.
    Stata keeps crashing after 10-15 minutes.

    The dataset has ~20 vars and ~1.2 billion observations. No special characters.
    I'm using Stata MP/14.2 (64-bit).

    I would appreciate any solutions and suggestions!
    Thanks!

  • #2
    Hi, Botir.

    My experience with large datasets in Stata suggests that it is better to run the analysis on multiple smaller datasets and then combine the results. Even though Stata MP holds up to 20 billion observations, the efficiency of loading very large datasets is questionable. I would recommend splitting the data into smaller pieces. If that approach is not possible, e.g. because you need to run regression models, I would try to load only the variables required for each analysis. Are you on Windows? If so, I would recommend a Unix operating system.


    • #3
      Hi Tiago,

      Thanks for the tips. I tried splitting my file into two parts by selecting the first 600 million rows, but Stata couldn't handle that either. It loads ~20,000 observations per second and at some point just crashes. I also tried loading only one variable at a time, but the loading speed (~20,000 rows per second) stayed the same and it crashed anyway.

      Yes, I'm on Windows, and I'm just curious what could be causing Stata to crash...


      • #4
        Do you have a machine with 64gb or more RAM? If not, then when Stata hits the RAM limit during the import it switches to virtual memory (hd swap), which is slow. I'm not sure that would by itself cause a crash, but depending on other conditions in the OS I could see it becoming non-responsive and then crashing. (This is a bit dated, but useful info: http://www.stata.com/support/faqs/wi...-requirements/ )
        Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX


        • #5
          You should provide some details about your OS (at minimum, amount of RAM) and the command(s) you are using to import the data.
          You could try -import delimited- in lieu of -insheet- with options to load the first column only or the first 1000 rows or something to make sure there isn't some aspect of the data structure that is the culprit.
          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX


          • #6
            Hi Eric,

            My machine has 8gb RAM and is running on Windows 7 (64-bit).

            I'm using the following command to import:
            Code:
             import delimited X:\Data\file.txt, delimiter("|") rowrange(1:6000)
            When I specify a small number as the last row, it loads perfectly and I can see that my data structure is okay. But when I try to load the last 100 or so rows, Stata crashes anyway. I also tried a machine with 30gb RAM and limited the data to only the first 500 million rows (out of ~1.2 billion), and it didn't help.


            • #7
              500 million rows might still be too much data for 30gb of RAM depending on available resources (I'm guessing that is a 32gb RAM machine with the OS overhead using 2 gb?). Certainly it's too much for an 8 gb laptop to open.
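
              As a rough back-of-envelope check on that point, the memory needed is about rows × variables × bytes per value. The sketch below assumes all 20 variables are stored as 4-byte floats, which is an assumption and not something stated in this thread; doubles or strings would need more, bytes or ints less.

              ```stata
              * Back-of-envelope estimate of the memory needed to hold the data.
              * Assumption: every variable is a 4-byte float (not stated in the thread).
              local nobs  = 500e6    // rows in the subset tried on the 30gb machine
              local nvars = 20       // approximate variable count from the first post
              local bytes = 4        // bytes per value for a float
              local gb    = `nobs' * `nvars' * `bytes' / 1024^3
              display "Approx. GB needed: " %6.1f `gb'
              ```

              Under that assumption, 500 million rows come to roughly 37 GB for the data alone, which is already more than the 30gb available.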

              Your issue with loading the last rows of data with -import delimited- might indicate that you are hitting some sort of limit, or that your installation has a problem (in which case you should contact Stata Tech support or reinstall Stata). I don't know how -import delimited- reads through the observations to find those last 100 rows that cause a crash when a smaller row number doesn't; it may require loading a lot more data than the last 100 rows. I had an issue with -import delimited- crashing my Stata and it seemed to be linked to my Java installation (that's just my guess based on the error log in my Mac OSX console after my crash). I upgraded my Java and re-installed Stata and that fixed the issue.

              Maybe try opening a subset of data with -chunky- from SSC and then step up your data size (that is, read it in 1gb at a time or something) to see when you hit your limit? Regardless, managing 1.2 billion rows / 50+gb of data in Stata is going to require a machine that has a lot of RAM (probably a minimum of 64gb).
              Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX
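
              The idea of stepping up the data size in blocks can also be sketched with the rowrange() option of -import delimited- that appears in post #6, instead of -chunky-. This is only a sketch under stated assumptions: the path is the one from post #6, the delimiter is assumed to be a tab, the file is assumed to have no header row, and the block size, number of blocks, and output names (chunk_1.dta, ...) are hypothetical. Whether Stata still has to scan the file from the top to reach later blocks is not something this thread establishes.

              ```stata
              * Sketch: import the text file in fixed-size blocks of rows and save
              * each block as its own .dta file.
              * Assumptions: tab delimiter, no header row, hypothetical block size
              * and output file names.
              local chunk = 5000000                      // rows per block (hypothetical)
              forvalues i = 1/3 {                        // first three blocks only
                  local first = (`i' - 1) * `chunk' + 1  // first row of this block
                  local last  = `i' * `chunk'            // last row of this block
                  import delimited using "X:\Data\file.txt", delimiters("\t") ///
                      varnames(nonames) rowrange(`first':`last') clear
                  compress                               // shrink storage types where possible
                  save "chunk_`i'.dta", replace
              }
              ```

              Increasing the block size or the number of blocks step by step shows at what point memory runs out, and the saved blocks can be analyzed separately or appended later on a machine with enough RAM.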


              • #8
                Thank you, Eric, for your comments. I will try to get access to a machine with more RAM and see what happens.


                • #9
                  I have a similar problem that I'm curious if people have a solution to. I was able to successfully load the file yesterday after three previous failed attempts (each ending with Stata crashing). Stupidly, I turned off the computer at the end of the day yesterday and now Stata has continually failed to load the data.

                  File Specs:
                  - 5.4gb
                  - tab delimited
                  - .txt
                  - 14.9 million observations

                  Computer Specs:
                  - OS = Windows 10 64 bit
                  - Ram = 16gb
                  - Processor = 4.2 GHz (3.6 with boost, overclocked to 4.2; speed-tracking software shows that Stata is consuming approx. 3.8 to 4.0 GHz)
                  - No other background or foreground processes running

                  Stata syntax used (Stata-se 15):
                  - set niceness 1
                  - import delimited using

                  This file will be merged with two other files, with approximately 2 million observations each.

                  .... Again, the file loaded (after 3 hours) yesterday. So, I'm curious as to why it won't load today. Any help with this would be great! ... Also, I've attempted to load the file on a linux OS, but unfortunately that machine does not have the same ram or processing power and crashed almost immediately as expected.

                  Cheers,
                  David


                  • #10
                    If anyone is curious, I was able to "fix" the problem described above by moving the file to an SSD. Although this solution "works," I'm still curious as to why Stata succeeded in loading the file from an HDD once and then failed every other time. Has anyone else had this issue?


                    • #11
                      Dear Stata members,

                      Sorry for reviving this thread. I have a large text data file that I want to import.


                      File specs:

                      - Text Document (.txt)
                      - Size: 17GB
                      - Variables: 20
                      - Observations: 20 million

                      Comp specs:
                      - Core i5
                      - 16.0 GB RAM
                      - Stata/SE 17.0

                      I tried to import this file using the import delimited command. It runs for about 15 minutes, and then Stata crashes with the following messages:

                      Note: Unmatched quote while processing row 36851985; this can be due to a formatting problem in the file or because a quoted data
                      element spans multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict)
                      if quoted data spans multiple lines or option bindquote(nobind) if quotes are not used for binding data.


                      insufficient memory
                      Stata could not obtain sufficient memory either because set max_memory is set too low or because the operating system said
                      no. Stata needed more memory to process character strings, probably strLs.

                      To contact Technical Services, see https://www.stata.com/tech-support/contact/
                      r(9377);





                      Could anyone help me to solve this issue?
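
                      The note in the log above itself points to the bindquote() option of import delimited. A minimal sketch of trying it on a small slice of rows around the flagged row, before attempting the whole file: the file name bigfile.txt is hypothetical, the tab delimiter is an assumption, and bindquote(nobind) is only appropriate if quotes in the file are ordinary characters and not wrappers around fields.

                      ```stata
                      * Sketch: inspect the rows around the one flagged in the note
                      * (row 36851985), treating quotes as plain characters.
                      * Assumptions: hypothetical file name, tab delimiter.
                      import delimited using "bigfile.txt", delimiters("\t") ///
                          bindquote(nobind) rowrange(36851000:36852000) clear
                      describe    // check storage types, in particular any strL variables
                      ```

                      If the slice imports cleanly and no variable comes in as a strL, the same options can then be tried on larger row ranges.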

                      Last edited by Mohamed Mahmoud; 29 Jul 2022, 08:38.
