  • How to reduce the size of large data files in Stata?


    Hi,

    I am working with high-frequency datasets. A one-month Stata data file comes to around 4 GB (4 million rows and 39 variables). For a yearly analysis, the appended dataset grows to about 48 GB. This slows the system down, even though I am using a machine with 32 GB of DDR4 RAM and a 512 GB SSD.

    Is there a way to reduce the file size without removing any variables?

    In R, however, the one-month data takes only 150 MB. Why does Stata need so much space when R needs so little?
    Any suggestions for reducing the file size would be greatly appreciated.

  • #2
    As far as I know, there is nothing you can do in Stata (or, I think, in any other software) to reduce the data size without losing information, apart from running

    Code:
    compress
    You might also have some success reducing the data size by saving only the variables that you need, and storing them in numeric format (any statistical analysis needs the data in numeric format anyway).
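
    A minimal sketch of the two suggestions above, using Stata's shipped auto dataset as a stand-in for the real data. The tempfile name and the variables kept are illustrative assumptions, not anything from the original poster's files. compress demotes each variable to the smallest storage type that holds its values without loss, and use with a variable list loads only the named variables.

    ```stata
    * Stand-in data; replace with your own monthly file
    sysuse auto, clear

    * Storage type of each variable before compressing
    describe

    * Demote variables to smaller storage types where no information is lost
    compress

    * Save to a temporary file (hypothetical name) so the example is self-contained
    tempfile small
    save `small'

    * Later, load only the variables needed for a given analysis
    use make price mpg using `small', clear
    describe
    ```

    On the real data, the variable list after use would be whichever subset of the 39 variables a given analysis needs.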

    Finally, my experience with R versus Stata differs from yours. Apart from its horrendous syntax, the main reason R never grew on me is that when I tried it around 2006-2008, R was a disaster for "big data" files.


    • #3
      If your dataset contains string variables with repeated values, those variables can be encoded as numeric variables with value labels. Each label is stored only once per unique value.
      Roger Newson's user-written commands sencode and sdecode are quite helpful for this.
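
      A small sketch of the idea, using the official encode command rather than sencode so that it runs without installing anything. The toy data and the variable names place and place_id are made up for illustration.

      ```stata
      * Toy data: 1,000 observations with only two distinct string values
      clear
      set obs 1000
      generate str20 place = cond(mod(_n, 2), "Warehouse North", "Warehouse South")

      * Create a labeled numeric copy; each label text is stored once
      encode place, generate(place_id)

      * encode creates a long; compress demotes it to byte here (values 1 and 2)
      compress place_id

      * Compare storage types, then drop the string version
      describe place place_id
      drop place

      * The text is still visible through the value label
      tabulate place_id
      ```

      The original string can be recovered at any time with decode (or sdecode).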


      • #4
        You don't show what your dataset looks like, but the advice that you have already received here is an excellent place to start. Your dataset likely includes several string variables, or is otherwise using larger storage types for numeric data than necessary.

        One thought: R may use some form of compression when saving datasets. The file is then smaller on disk, but it still has to be decompressed in memory to work with, so it will be inflated in RAM.
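
        A rough back-of-the-envelope check of the storage-type point, assuming 4 GB means about 4e9 bytes. The figures come from the first post; the auto dataset is only a stand-in for the real file. Stata's numeric types take between 1 byte (byte) and 8 bytes (double) per value, so an average width far above 8 bytes per variable would be consistent with wide string variables.

        ```stata
        * Approximate bytes per observation: 4 GB over 4 million rows
        display 4e9 / 4e6

        * Approximate average bytes per variable across 39 variables
        display 4e9 / 4e6 / 39

        * On an actual dataset (auto used as a stand-in), list the string
        * variables and inspect every variable's storage type
        sysuse auto, clear
        ds, has(type string)
        describe
        ```

        Running the last two commands on one monthly file should show which variables are strings and how wide they are.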


        • #5
          If your dataset's variables are all essentially categorical or integer-valued, then contract may reduce its size. You would then use the frequency weights that contract generates when you conduct analyses.

          With 39 variables over 4M observations it's unlikely to help much (or at all). But if your analyses involve subsets of those 39 variables then it may be worth a try once those subsets are defined.
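
          A minimal sketch of how contract and frequency weights fit together, using the shipped auto dataset. The choice of variables and the weight name n are illustrative assumptions.

          ```stata
          sysuse auto, clear

          * Collapse to one row per distinct combination of the two variables;
          * n holds the number of original observations in each combination
          contract foreign rep78, freq(n)
          list

          * Analyses then use n as a frequency weight
          tabulate foreign rep78 [fweight = n]
          ```

          The contracted dataset contains only the listed variables plus the frequency variable, which is why this works best once the subset of variables for an analysis is fixed.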


          • #6

            Thank you, Joro Kolev. I will try -compress-. I am not an R person, so I have no real experience of R versus Stata. I only tried R because the large files slowed Stata down so much that it became difficult to work.

            George Hoffman, I will try -sencode- and -sdecode-.

            Leonardo Guizzetti, you are right. The data contains a lot of string variables. It records the movement of goods from place A to place B across traders and dealers (part of indirect tax data).

            John Mullahy, yes, some of the variables are categorical. I will try -contract-.
