  • #16
    I strongly agree with William in #15.



    • #17
      Originally posted by Robert Picard
      I'm not seeing any speed advantage with respect to another player in this area (asreg). But the more options the merrier I guess.
      Since regressby is only a marginal improvement over asreg, the difference won't show up in a meaningful way unless there are millions of observations (and perhaps more covariates; note also that regressby tries to sort the data internally). At any rate, consider upping the number of observations (the code below also installs a version with a few modifications I've made):

      Code:
      cap net uninstall regressby
      net install regressby, from(https://raw.githubusercontent.com/mcaceresb/stata-regressby/master/) replace
      
      * Set up
      clear all
      set obs 10000000
      set seed 123
      
      * Generate a dataset
      gen g = ceil(runiform()*1000)
      gen x = runiform()
      gen y = g + g*x + rnormal()
      sort g
      tempfile t1
      save `t1'
      
      * Test with rangestat
      use `t1', clear
      timer on 1
      rangestat (reg) y x, interval(g 0 0) by(g)
      timer off 1
      list in 1
      
      * Test with regressby
      use `t1', clear
      timer on 2
      regressby y x, by(g)
      timer off 2
      list in 1
      
      * Test with asreg
      use `t1', clear
      timer on 3
      by g: asreg y x, se
      timer off 3
      list in 1
      
      timer list
      This gives:

      Code:
         1:     54.31 /        1 =      54.3080
         2:      5.46 /        1 =       5.4630
         3:     10.89 /        1 =      10.8890
      That is roughly a 2x improvement over asreg (5.46 vs. 10.89 seconds). Surely some of that is because asreg computes more statistics, but a good portion is probably a genuine speed improvement.
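As a quick check on the arithmetic, the ratios can be computed directly from the timer output above. This is only a sketch; the scalar names are made up for illustration, and the timings are the ones reported in this post, so they will differ on other machines.

```stata
* Timings in seconds, copied from the timer list in this post
scalar t_rangestat = 54.31
scalar t_regressby = 5.46
scalar t_asreg     = 10.89

* Run-time ratios relative to regressby
display "asreg / regressby:     " %5.2f (t_asreg / t_regressby)
display "rangestat / regressby: " %5.2f (t_rangestat / t_regressby)
```

On these numbers the ratios come out at about 1.99 and 9.95.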



      • #18
        I'm not seeing the speed improvements you see in #17 if I run the exact same code on my computer:

        Code:
        . * Set up
        . clear all
        
        . set obs 10000000
        number of observations (_N) was 0, now 10,000,000
        
        . set seed 123
        
        . 
        . * Generate a dataset
        . gen g = ceil(runiform()*1000)
        
        . gen x = runiform()
        
        . gen y = g + g*x + rnormal()
        
        . sort g
        
        . tempfile t1
        
        . save `t1'
        file /var/folders/cp/z8cssshn6935x9p181c71_7m0000gn/T//S_04610.000001 saved
        
        . 
        . * Test with rangestat
        . use `t1', clear
        
        . timer on 1
        
        . rangestat (reg) y x, interval(g 0 0) by(g)
        
        . timer off 1
        
        . list in 1
        
             +------------------------------------------------------------------------------------------------------------+
             | g          x           y   reg_nobs      reg_r2   reg_adj~2         b_x     b_cons        se_x     se_cons |
             |------------------------------------------------------------------------------------------------------------|
          1. | 1   .5980907   -.1704494      10105   .07682875   .07673738   1.0043109   .9939418   .03463556   .01994009 |
             +------------------------------------------------------------------------------------------------------------+
        
        . 
        . * Test with regressby
        . use `t1', clear
        
        . timer on 2
        
        . regressby y x, by(g)
        Running regressby with normal OLS standard errors.
        (0 observations deleted)
        
        . timer off 2
        
        . list in 1
        
             +-------------------------------------------------------------------+
             | g       N       _b_x      _se_x    _b_cons   _se_cons   _cov_co~x |
             |-------------------------------------------------------------------|
          1. | 1   10105   1.004311   .0346356   .9939418   .0199401   -.0005983 |
             +-------------------------------------------------------------------+
        
        . 
        . * Test with asreg
        . use `t1', clear
        
        . timer on 3
        
        . by g: asreg y x, se
        
        . timer off 3
        
        . list in 1
        
             +-------------------------------------------------------------------------------------------------------+
             | g          x           y   _Nobs         _R2      _adjR2    _b_cons        _b_x   _se_cons      _se_x |
             |-------------------------------------------------------------------------------------------------------|
          1. | 1   .5980907   -.1704494   10105   .07682875   .07673738   .9939418   1.0043109   .0199401   .0346356 |
             +-------------------------------------------------------------------------------------------------------+
        
        . 
        . timer list
           1:     34.59 /        1 =      34.5950
           2:      6.51 /        1 =       6.5090
           3:      6.18 /        1 =       6.1760
        
        . 
        end of do-file
        Again, rangestat was not designed with this type of problem in mind. rangestat independently calculates results for each observation. When it came out, it was the only tool that could quickly perform regressions using Mata and was truly several orders of magnitude faster than any alternative at the time.

        So while you can get rangestat to calculate statistics within by-groups, both regressby and asreg show that there are ways to do it even faster. I agree with William and Nick that programs that do not make all their code visible are less desirable.
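To illustrate the per-observation point, here is a minimal sketch on a smaller sample than the one used in this thread. It assumes rangestat is installed (for example via ssc install rangestat); the sample size, the number of groups and the tolerance are arbitrary choices made for illustration.

```stata
* Smaller version of the thread's test data (sizes are illustrative)
clear all
set seed 123
set obs 100000
gen g = ceil(runiform()*100)
gen x = runiform()
gen y = g + g*x + rnormal()
sort g

* Same call as in the thread: each observation's window is its own group
rangestat (reg) y x, interval(g 0 0) by(g)

* Every observation in a group carries the same slope estimate
bysort g: assert reldif(b_x, b_x[1]) < 1e-8

* Keep one row per group, comparable to what regressby leaves in memory
bysort g: keep if _n == 1
count
```

Because the results are stored on every observation, most of that work is redundant when only one set of estimates per group is needed, which is the case the by-group tools target.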



        • #19
          Originally posted by Robert Picard
          I'm not seeing the speed improvements you see in #17 if I run the exact same code on my computer:
          Did you pull from my repo? It looks like you might not have; I had made some tweaks to make it run faster (you can check via "which regressby"; I just added a header so it says "version 0.1").
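A minimal sketch of that check, using only built-in Stata commands. It assumes regressby may or may not be installed, so the calls are wrapped in capture noisily to avoid stopping a do-file; the exact text printed depends on the installed copy.

```stata
* Show where regressby.ado lives and print its header comment lines
capture noisily which regressby

* List the installed package; the listing should include the URL it came from
capture noisily ado describe regressby
```

If the header does not show the version line mentioned above, the installed copy is probably not the one from the fork.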

          Originally posted by Robert Picard
          Again, rangestat was not designed with this type of problem in mind.
          I definitely agree with that part. "asreg" should have been the point of comparison from the start, not "rangestat".



          • #20
            For those of us relatively new to GitHub, post #17 would have been clearer had you been as explicit about installing a modified version of regressby, forked into your own GitHub repository, as you were about increasing the number of observations. Instead, the reader was left to infer the code changes from your parenthetical remark about modifications and from the substitution of "mcaceresb" for "mdroste" in the installation URL. Or, if your point was primarily to increase the number of observations to sharpen the comparison, you could have held the code base constant, as I expect Robert did.



            • #21
              Indeed, I did not pick up on the fact that you were using a forked version. Looks like you managed to squeeze some added efficiency out of it. Here's the full run, this time copying the full content of the code window in #17:
              Code:
              . do test4
              
              . cap net uninstall regressby
              
              . net install regressby, from(https://raw.githubusercontent.com/mcaceresb/stata-regressby/master/) replace
              checking regressby consistency and verifying not already installed...
              installing into ./...
              installation complete.
              
              . 
              . * Set up
              . clear all
              
              . set obs 10000000
              number of observations (_N) was 0, now 10,000,000
              
              . set seed 123
              
              . 
              . * Generate a dataset
              . gen g = ceil(runiform()*1000)
              
              . gen x = runiform()
              
              . gen y = g + g*x + rnormal()
              
              . sort g
              
              . tempfile t1
              
              . save `t1'
              file /var/folders/cp/z8cssshn6935x9p181c71_7m0000gn/T//S_04610.000002 saved
              
              . 
              . * Test with rangestat
              . use `t1', clear
              
              . timer on 1
              
              . rangestat (reg) y x, interval(g 0 0) by(g)
              
              . timer off 1
              
              . list in 1
              
                   +------------------------------------------------------------------------------------------------------------+
                   | g          x          y   reg_nobs      reg_r2   reg_adj~2         b_x      b_cons        se_x     se_cons |
                   |------------------------------------------------------------------------------------------------------------|
                1. | 1   .1208283   .2628437      10105   .07836112   .07826989   1.0053896   .99283439   .03430357   .01974896 |
                   +------------------------------------------------------------------------------------------------------------+
              
              . 
              . * Test with regressby
              . use `t1', clear
              
              . timer on 2
              
              . regressby y x, by(g)
              Running regressby with normal OLS standard errors.
              
              . timer off 2
              
              . list in 1
              
                   +------------------------------------------------------------------+
                   | g       N      _b_x      _se_x    _b_cons   _se_cons   _cov_co~x |
                   |------------------------------------------------------------------|
                1. | 1   10105   1.00539   .0343036   .9928344    .019749   -.0005868 |
                   +------------------------------------------------------------------+
              
              . 
              . * Test with asreg
              . use `t1', clear
              
              . timer on 3
              
              . by g: asreg y x, se
              
              . timer off 3
              
              . list in 1
              
                   +-------------------------------------------------------------------------------------------------------+
                   | g          x          y   _Nobs         _R2      _adjR2     _b_cons        _b_x   _se_cons      _se_x |
                   |-------------------------------------------------------------------------------------------------------|
                1. | 1   .1208283   .2628437   10105   .07836112   .07826989   .99283439   1.0053896    .019749   .0343036 |
                   +-------------------------------------------------------------------------------------------------------+
              
              . 
              . timer list
                 1:     34.22 /        1 =      34.2170
                 2:      3.50 /        1 =       3.4980
                 3:      6.04 /        1 =       6.0410
              
              . 
              end of do-file
              
              .



              • #22
                I will try to be more specific in the future about GitHub forks and such; the script I posted re-installed regressby from my fork, but I can see why that step would seem unnecessary, since I hadn't made it clear that it was required.
