  • Are there any faster alternatives to the -import delimited- command for importing standard datasets in Stata?

    Dear Statalisters, I am eager to hear your insights/thoughts and experiences on this matter:
    I have been analyzing a considerable number of datasets, one by one. Each dataset is relatively small: around 70 kB, with 10,000 to 20,000 observations, in tab-delimited format (.txt). The time Stata takes to read any single dataset into memory is negligible, but across billions of datasets these small costs add up, and a full run sometimes takes a few days.
    Are you aware of any potentially faster alternatives to the import delimited command? I have tried infile but it is slower than import delimited.
    All the best,

    Tiago




  • #2
    Well, -import delimited-, as you have already found out, is a speed improvement over -infile-. You might consider the StatTransfer program. It is not made by StataCorp, so you have to buy it separately. It can translate almost any kind of data file into any other kind, and it's generally fast. It certainly was much faster than -infile- back when I used it. I'll be honest and say that I don't actually use StatTransfer myself anymore, because contemporary Stata can import the file types I need to work with and speed has not been an issue for me, so why spend extra money? But it was really helpful back when I did use it. And, back then, it was reasonably priced and customer support was outstanding. (I'm not implying it isn't any more, I just don't know.)



    • #3
      Well it’s safe to say you’re in the edge case scenario with billions of files. Even things that take milliseconds will be noticeable at that scale. Do you really have billions of files?

      Clyde’s suggestion is good, and I have also noticed that StatTransfer is pretty fast, but I can’t hazard a guess at whether you’ll do better here.

      I would seriously consider R or Python here to do the heavy lifting of importing the data and saving it back out as Stata datasets. Both can handle the conversion easily enough (haven for R works well, and I don’t recall what I’ve used in Python). Also, since your workflow relies heavily on disk I/O, you’d be very well served by as fast an SSD as you can get.

      Another potential speed increase, depending on your specifics, would be to aggregate multiple files together, as either csv files or dta files. This could also serve as a temporary intermediate step if you need to keep the same number of files for whatever reason.
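
The aggregation idea can be sketched in Python using only the standard library. This is a minimal sketch, not a tested solution for the original poster's data: it assumes every input file is tab-delimited, UTF-8 encoded, and has the same header row, and the function name `combine_tab_files` and the added `source_file` column are hypothetical choices made for illustration.

```python
import csv
from pathlib import Path


def combine_tab_files(input_dir, output_path, pattern="*.txt"):
    """Concatenate tab-delimited files that share one header into a single file.

    A `source_file` column (hypothetical name) records which input file each
    row came from. Returns the number of data rows written.
    """
    output_path = Path(output_path)
    rows_written = 0
    header = None
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in sorted(Path(input_dir).glob(pattern)):
            # Never read the output file back in if it matches the pattern.
            if path.resolve() == output_path.resolve():
                continue
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src, delimiter="\t")
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file defines the header of the output.
                    header = file_header
                    writer.writerow(["source_file"] + header)
                elif file_header != header:
                    raise ValueError(f"{path.name}: header differs from first file")
                for row in reader:
                    writer.writerow([path.stem] + row)
                    rows_written += 1
    return rows_written
```

The combined file could then be read with a single -import delimited- call rather than one call per small file, and the `source_file` column used to separate the original datasets again where the analysis needs them one at a time. Whether this is actually faster overall would need to be timed on the real data.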

      In my own work, I’ve encountered the reverse problem, where I needed to write out tens of thousands of files to a text-based format (glares in the direction of Mplus). Python was a tremendous speed up over native Stata or Mata solutions here, and I imagine the reverse could be true.
      Last edited by Leonardo Guizzetti; 06 Mar 2024, 17:41.



      • #4
        As always, thank you so much Clyde and Leonardo, for your remarkably helpful insights and suggestions.

