  • Comparison with R

    Hello,

    I've run some benchmarks comparing the speed of R and Stata for common data manipulations, on datasets ranging from 100MB to 5GB. Below are the results for 500MB (1e7 rows), which may interest some of you: http://www.princeton.edu/~mattg/pictures/1e7.png



    You can find more information in the GitHub repository https://github.com/matthieugomez/benchmark-stata-r , which contains a quick summary of the results, the R and Stata scripts I ran, and the results as a .csv file. Feel free to edit the scripts if you spot mistakes.

    I've also written a guide on data manipulation in R for Stata users: http://www.princeton.edu/~mattg/statar/
    The guide covers topics I could not find elsewhere: equivalents of egen and by commands, panel data commands, macros, and in-place transformations of large datasets. The guide is centered on the packages data.table and dplyr, which bring R syntax closer to Stata while being generally an order of magnitude faster than Stata for common data manipulations. I hope this guide will be useful to some Stata users.

    Matthieu
    Last edited by Matthieu Gomez; 12 Nov 2014, 18:09.

  • #2
    Hi Matthieu, thanks very much. Your website is very useful.

    Best, wanhaiyou


    • #3
      Matthieu, it would be nice if you used your real name here. Regarding Stata and R, I have been thinking about this issue for a long time. A few questions that I want to explore:
      1. Is it a good investment of time to learn two packages when Stata and R do almost the same thing, with an efficiency difference of seconds or minutes?
      2. R is comparatively less user-friendly (my assumption). Is the efficiency difference large enough to justify learning R?
      3. Does the choice between the two packages have any implications for cost? Are there any savings?
      Regards
      --------------------------------------------------
      Attaullah Shah, PhD.
      Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
      FinTechProfessor.com
      https://asdocx.com
      Check out my asdoc program, which sends outputs to MS Word.
      For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.


      • #4
        Hello Attaullah.

        If speed is not a problem for you, and if you don't feel limited by Stata, I don't think learning another language is a good investment. In my own case, I was spending too much time optimizing my Stata code when working with large datasets (>1 GB). If this happens to you, learning R makes a lot of sense.

        Beyond speed, comparisons are more subjective. I think that R is less user-friendly than Stata. That said, in my opinion, R offers more flexibility, easier data exploration (you can examine multiple subsets or collapsed versions of your dataset), a wider range of statistical estimators, easier ways to write and test new functions (through the devtools and testthat packages), and better visualization (through ggplot2).

        Like virtually every R package, the two packages are free.
        Last edited by Matthieu Gomez; 13 Nov 2014, 09:44.


        • #5
          Matthieu, my understanding is that the principal, unavoidable limitation of R is that it passes parameters by value (as a rule). This means that your 1 GB dataset will get copied in RAM 3-4 times, taking up 4 GB. Stata can handle everything rather compactly, although it still incurs some overhead (maybe 30-50%, unless we are talking about -merge- or -reshape- operations). Any comments on this aspect of the package comparison?
          -- Stas Kolenikov || http://stas.kolenikov.name
          -- Principal Survey Scientist, Abt SRBI
          -- Opinions stated in this post are mine only


          • #6
            Yes, this is an excellent point. However, since R 3.1 (released this year), data.frame objects are shallow copied when possible. In particular, operations such as renaming, selecting, or adding variables make only a *shallow* copy of the existing columns.
            Moreover, data.table objects can be modified by reference, similarly to Stata: in particular, you can sort them in place, and you can modify a few rows in place without creating whole new variables. Compared to Stata, the only exception for now is that one cannot add or delete rows in place (this includes merges other than left joins), although this feature is to be implemented in the near future. Still, in this case, the dataset is copied once, not several times. You may be interested in this page: http://www.princeton.edu/~mattg/statar/memory.html
            Last edited by Matthieu Gomez; 13 Nov 2014, 10:40.
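
The distinction in #6 between a shallow copy, a full copy, and modification in place can be sketched in a few lines. This is only an analogy written in Python, not R or data.table code: the "table" is a plain dict of column lists, and the helper names (`add_column_shallow`, `add_column_deep`, `update_rows_in_place`) are hypothetical, introduced purely for illustration. One difference to keep in mind: R keeps copy-on-modify semantics, so a later change to a shared column would not show through in the other object the way it does here.

```python
import copy


def add_column_shallow(table, name, values):
    """Return a new table that reuses the existing column objects."""
    new_table = dict(table)  # copies only the name -> column mapping
    new_table[name] = values
    return new_table


def add_column_deep(table, name, values):
    """Return a new table in which every existing column is duplicated."""
    new_table = copy.deepcopy(table)  # memory cost grows with the data size
    new_table[name] = values
    return new_table


def update_rows_in_place(table, name, rows, value):
    """Overwrite a few rows of one column without allocating a new column."""
    column = table[name]
    for i in rows:
        column[i] = value


table = {"id": [1, 2, 3, 4], "x": [10.0, 20.0, 30.0, 40.0]}

shallow = add_column_shallow(table, "y", [0, 0, 0, 0])
deep = add_column_deep(table, "y", [0, 0, 0, 0])

# The shallow copy points at the same column object; the deep copy does not.
shares_after_shallow = shallow["x"] is table["x"]
shares_after_deep = deep["x"] is table["x"]

# Modify two rows of the original column in place.
update_rows_in_place(table, "x", [0, 2], -1.0)
```

Adding a column through the shallow path costs memory only for the new column, while the deep path duplicates the whole table, which is the 3-4x blow-up raised in #5. The in-place update touches two rows and allocates nothing.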


            • #7
              Hi, Matthieu -
              Thanks very much for this. Also, of course, a beautiful use of ggplot2. :-)

              I haven't yet documented this, but I wonder if the speed differential depends on (1) the complexity of the statistical model and (2) the "flavor" of Stata.

              In my own (as yet undocumented) case, I ran a latent class regression model on 170,000 cases in Stata/SE. I had a terrible time in R and had to run the initial analysis on a subsample.

              Long story short: do these large differentials hold up when using more complex statistical models (e.g. polca() in my case, but also winbugs [Andy Gelman admits it's quite slow] and lavaan())?

              Thanks much for this!

              Best,
              - Nate
              Nathan E. Fosse, PhD


              • #8
                Hello Nathan.

                No, I don't think that estimating statistical models is generally faster in R than in Stata. For instance, Stata's reg is an order of magnitude faster than lm in R. I would expect the same to hold across a wide range of estimation commands written directly by StataCorp (another example is logit: http://ekonometrics.blogspot.com/201...and-stata.html). That said, in my experience, user-written R packages tend to be faster than user-written Stata programs, since R packages tend to use C while user-written Stata programs just use Stata (Mata in the best case).

                My benchmark is more about typical data manipulation; I agree it would be nice to know more about the estimation of statistical models.
