Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • New Stata program: regressby (statsby: regress replacement)

    Hi Statalisters,

    I wanted to share with you a program that I've been working on that some of you may find really useful.

    Suppose that you are interested in running the same OLS regression model on a set of mutually-exclusive subsets of your dataset, which I'll suppose is indexed by a variable g. You want the slopes and intercepts to vary across g. Here are a few more concrete motivating examples:

    1. You want to run the same regression model on all observations in a given year for many different years.
    2. You want to run the same regression model on all observations in a given geographic unit (US state, county, ZIP code, commuting zone, Census tract, Census block, etc) across all such units.

    You have a couple options to do this. You can do this manually with a quick loop over each distinct value (or tuple) of your 'by' group(s). You can also do this with statsby, but it is pretty slow, and for whatever reason you can't access the full VCV matrix of each estimate from statsby: regress. If there are only a small number of -by- groups and there aren't too many independent variables in your regression model, you can fully interact each regressor variable with your by group, so long as total number of regressors doesn't exceed Stata's limit of just under 11,000. Or you can use this new program, regressby, which can be hundreds of times faster than any of these options. In my research team's usage, this has sped up some of our estimation scripts by a factor of about 300, reducing the runtime of Stata scripts that would have otherwise taken weeks on a server to under an hour. In particular, this script generates huge improvements in performance relative to statsby when you have many -by- groups.

    You can read more about regressby on Github here: https://github.com/mdroste/stata-regressby

    Here is a minimal working example comparing the usage of statsby vs. regressby:
    Code:
    * Set up
    clear all
    set obs 100000
    set seed 123
    set rmsg on
    
    * Generate a dataset
    gen g = ceil(runiform()*1000)
    gen x = runiform()
    gen y = g + g*x + rnormal()
    tempfile t1
    save `t1'
    
    * Test with statsby
    use `t1', clear
    statsby _b _se, clear by(g): regress y x
    
    * Test with regressby
    use `t1', clear
    regressby y x, by(g)

    This program supports the usual OLS asymptotic standard errors, heteroskedasticity-robust (White) standard errors, cluster-robust standard errors, and analytic weights. Support for frequency weights, absorbed fixed effects, and outputting additional diagnostics (R2, RMSE, etc) will be coming very soon, maybe by the time you read this.

    I am especially eager to hear your feedback on features you would like and any bugs that you might encounter. I hope you find this useful - thanks so much!

    Best,
    Mike
    Last edited by Michael Droste; 30 Jul 2018, 15:00.

  • #2
    Acknowledgements are given on Github, but are due also on this thread. This program is based heavily on code that Michael Stepner wrote for his Health Inequality Project. He had the insight that a simple Mata function could calculate regression coefficients in this setting much more efficiently than statsby or a manual loop. I merely packaged this functionality a program that supported various standard errors, analytic weights, exception handling, and generalized it to allow an arbitrary number of regressors (w/ Wilbur Townsend, who quickly came up with a much more elegant solution than I had in mind).
    Last edited by Michael Droste; 30 Jul 2018, 14:54.

    Comment


    • #3
      Dear Michael, Many thanks for sharing with us this interesting code. I tried it out but found some errors
      Code:
      clear
      set seed 123
      
      // 2000 firms
      set obs 2000
      gen long id = _n
      
      // 18 years 
      expand 18
      bys id: gen year = 1999 + _n
      
      // 250 daily / 50 weekly observations
      expand 50
      
      gen ri = runiform()
      gen rm = runiform()
      
      bys id (year): gen t = _n
      xtset id t
      
      gen L1rm = L1.rm
      gen L2rm = L2.rm
      gen F1rm = F1.rm
      gen F2rm = F2.rm
      
      // statsby
      regressby ri L2rm L1rm rm F1rm F2rm, by(id year)
      and the error message is
      Code:
      . // statsby
      . regressby ri L2rm L1rm rm F1rm F2rm, by(id year)
      < is not a valid command name
      (error occurred while loading regressby.ado)
      r(199); t=0.00 12:09:48
      Any suggestion?

      Code:
      . net install regressby, from(https://raw.githubusercontent.com/mdroste/stata-regressby/master/)
      checking regressby consistency and verifying not already installed...
      all files already exist and are up to date.
      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)

      Comment


      • #4
        Hi River,

        Thanks for your feedback!

        Interestingly, the example code you have provided above works perfectly for me, even after uninstalling regressby and re-installing to make sure that I was using the build currently on Github.
        Unfortunately, I have not seen an error like this before, either on my Windows machine or on the server environments in which my team has deployed this script. The '<' operator is only used a handful of times in my script (either alone or as <=), and I am not at all sure how any of the instances in which it is used would generate that particular error. I have disabled one instance where it is used for exception handling.

        If you wouldn't mind helping me out, perhaps you could -set trace on-, uninstall this program, reinstall from Github, and then try again? If -set trace on- gives you error text, I'd love to see it (perhaps in a private message). I would greatly appreciate your help tracking this bug down. Thank you so much!

        Comment


        • #5
          Dear Michael, Thanks for your reply. I re-run your example above, and still got the same message.
          Code:
          . * Set up
          . clear all
          
          . set obs 100000
          number of observations (_N) was 0, now 100,000
          
          . set seed 123
          
          . set rmsg on
          r; t=0.00 16:40:02
          
          . 
          . * Generate a dataset
          . gen g = ceil(runiform()*1000)
          r; t=0.01 16:40:02
          
          . gen x = runiform()
          r; t=0.00 16:40:02
          
          . gen y = g + g*x + rnormal()
          r; t=0.01 16:40:02
          
          . tempfile t1
          r; t=0.00 16:40:02
          
          . save `t1'
          file C:\Users\ADMINI~1\AppData\Local\Temp\ST_29bc_000001.tmp saved
          r; t=0.02 16:40:02
          
          . 
          . * Test with statsby
          . use `t1', clear
          r; t=0.00 16:40:02
          
          . statsby _b _se, clear by(g): regress y x
          (running regress on estimation sample)
          
                command:  regress y x
                     by:  g
          
          Statsby groups
          ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
          ..................................................    50
          ..................................................   100
          ..................................................   150
          ..................................................   200
          ..................................................   250
          ..................................................   300
          ..................................................   350
          ..................................................   400
          ..................................................   450
          ..................................................   500
          ..................................................   550
          ..................................................   600
          ..................................................   650
          ..................................................   700
          ..................................................   750
          ..................................................   800
          ..................................................   850
          ..................................................   900
          ..................................................   950
          ..................................................  1000
          r; t=8.79 16:40:11
          
          . 
          . * Test with regressby
          . use `t1', clear
          r; t=0.00 16:40:11
          
          . regressby y x, by(g)
          < is not a valid command name
          (error occurred while loading regressby.ado)
          r(199); t=0.00 16:40:11
          
          end of do-file
          
          r(199); t=0.00 16:40:11
          Then, I perform the following command and find that
          Code:
          . set trace on 
          r; t=0.00 16:45:16
          
          . ado uninstall regressby
          
          package regressby from https://raw.githubusercontent.com/mdroste/stata-regressby/master
                ^
          
          (package uninstalled)
          r; t=0.09 16:45:16
          I re-install the package
          Code:
          . net install regressby, from(https://raw.githubusercontent.com/mdroste/stata-regressby/master/)
          checking regressby consistency and verifying not already installed...
          installing into c:\ado\plus\...
          installation complete.
          But still got the same error message.
          Code:
          . regressby y x, by(g)
          < is not a valid command name
          (error occurred while loading regressby.ado)
          r(199); t=0.00 16:48:06
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)

          Comment


          • #6
            Just for curiosity, am I the only one encountering such problem?
            Ho-Chuan (River) Huang
            Stata 19.0, MP(4)

            Comment


            • #7
              Just to help identify the issue: I do not get an error when installing from github and running the example in post #3. That's on Stata 15.1 MP.

              River, can you try again with the same dataset and set trace on before running the regression, i.e.:
              Code:
              clear
              set seed 123
              // 2000 firms
              set obs 2000
              gen long id = _n
              // 18 years 
              expand 18
              bys id: gen year = 1999 + _n
              // 250 daily / 50 weekly observations
              expand 50
              gen ri = runiform()
              gen rm = runiform()
              bys id (year): gen t = _n
              xtset id t
              gen L1rm = L1.rm
              gen L2rm = L2.rm
              gen F1rm = F1.rm
              gen F2rm = F2.rm
              
              
              set trace on 
              // statsby
              regressby ri L2rm L1rm rm F1rm F2rm, by(id year)
              And then report back the last few lines in the output, i.e., the last few lines before Stata runs into an error?

              Comment


              • #8
                Dear Jorrit, The message is
                Code:
                . clear
                
                . set seed 123
                
                . // 2000 firms
                . set obs 2000
                number of observations (_N) was 0, now 2,000
                
                . gen long id = _n
                
                . // 18 years 
                . expand 18
                (34,000 observations created)
                
                . bys id: gen year = 1999 + _n
                
                . // 250 daily / 50 weekly observations
                . expand 50
                (1,764,000 observations created)
                
                . gen ri = runiform()
                
                . gen rm = runiform()
                
                . bys id (year): gen t = _n
                
                . xtset id t
                       panel variable:  id (strongly balanced)
                        time variable:  t, 1 to 900
                                delta:  1 unit
                
                . gen L1rm = L1.rm
                (2,000 missing values generated)
                
                . gen L2rm = L2.rm
                (4,000 missing values generated)
                
                . gen F1rm = F1.rm
                (2,000 missing values generated)
                
                . gen F2rm = F2.rm
                (4,000 missing values generated)
                
                . 
                . 
                . set trace on 
                
                . // statsby
                . regressby ri L2rm L1rm rm F1rm F2rm, by(id year)
                < is not a valid command name
                (error occurred while loading regressby.ado)
                r(199);
                
                end of do-file
                
                r(199);
                Ho-Chuan (River) Huang
                Stata 19.0, MP(4)

                Comment


                • #9
                  This looks like a good piece of work, but the alternatives are understated in #1. For example, rangestat (SSC) by Robert Picard and friends can do the simply stated tasks

                  1. You want to run the same regression model on all observations in a given year for many different years.
                  2. You want to run the same regression model on all observations in a given geographic unit (US state, county, ZIP code, commuting zone, Census tract, Census block, etc) across all such units.
                  just fine and has been around a while now (hundreds of posts here). It doesn't support weights and doesn't have the same options, but conversely rangestat does plenty that regressby doesn't.

                  Not trying to compete here, and no one can be familiar with the entire range of community-contributed software, and above all programs can be complementary and overlap in what they do. Nevertheless comparisons are better for being fairly stated.

                  There are yet other such programs, but I'll let their authors speak if so minded.

                  I have suggestion for future versions of the help. The minimal

                  Examples

                  See the Github page.

                  is likely to undermine take-up. Many -- I daresay most -- users of this program will expect more than an cross-reference. As I write, that's not a clickable link even.

                  Comment


                  • #10
                    Originally posted by Nick Cox View Post
                    This looks like a good piece of work, but the alternatives are understated in #1. For example, rangestat (SSC) by Robert Picard and friends can do the simply stated tasks



                    just fine and has been around a while now (hundreds of posts here). It doesn't support weights and doesn't have the same options, but conversely rangestat does plenty that regressby doesn't.

                    Not trying to compete here, and no one can be familiar with the entire range of community-contributed software, and above all programs can be complementary and overlap in what they do. Nevertheless comparisons are better for being fairly stated.

                    There are yet other such programs, but I'll let their authors speak if so minded.

                    I have suggestion for future versions of the help. The minimal



                    is likely to undermine take-up. Many -- I daresay most -- users of this program will expect more than an cross-reference. As I write, that's not a clickable link even.
                    Hi Nick,

                    Thanks for your feedback, which is super helpful as always. I have a few quick comments:

                    1. Yes, it is true that there are many ways to run grouped regressions in Stata. I note this important fact immediately below the text that you have quoted from my original post. I absolutely agree that I should be more explicit about third-party alternatives for grouped regressions - of which I believe there are now around a half-dozen - and I will make sure to amend the Github page to make these comparisons clear (I can no longer edit the original post of this thread). However, I stand by my assertion that regressby is several orders of magnitude faster than *all* of these alternatives in most use cases. I would be perfectly happy to be proven wrong here!

                    2. You can find an updated usage script below that benchmarks regressby against the rangestat program that you developed with Robert Picard. Some quick benchmarking indicates that regressby is approximately 30 times faster than rangestat in the canned dataset provided the original posts. More thorough benchmarks across a range of reasonable simulated datasets will be provided on Github later today.

                    3. The help file providing examples has been updated, per your helpful comment and William's comment below. Thanks!


                    I very much appreciate your feedback and would be eager to hear any more thoughts you have. Thank you so much, as always, for your time!

                    Best,
                    Mike

                    Code:
                    * Set up
                    clear all
                    set obs 100000
                    set seed 123
                    set rmsg on
                    timer clear
                    
                    * Generate a dataset
                    gen g = ceil(runiform()*1000)
                    gen x = runiform()
                    gen y = g + g*x + rnormal()
                    tempfile t1
                    save `t1'
                    
                    quietly {
                    
                        * Test with statsby
                        use `t1', clear
                        timer on 1
                        statsby _b _se, clear by(g): regress y x
                        timer off 1
                    
                        * Test with rangestat
                        use `t1', clear
                        gen o = _n
                        timer on 2
                        rangestat (reg) y x, interval(o 1 100000) by(g)
                        timer off 2
                    
                        * Test with regressby
                        use `t1', clear
                        timer on 3
                        regressby y x, by(g)
                        timer off 3
                        
                        timer list
                    
                    }
                    
                    * List timers
                    di "Time for statsby: `r(t1)'"
                    di "Time for rangestat: `r(t2)'"
                    di "Time for regressby: `r(t3)'"
                    In my run of the benchmark above, I get 6.769 seconds for statsby, 5.439 seconds for rangestat, and 0.146 seconds for regressby.
                    Last edited by Michael Droste; 31 Jul 2018, 14:56.

                    Comment


                    • #11
                      Michael Droste - I think you misunderstand Nick's GitHub comment.

                      As I read it, Nick makes no comment on the relative merits of GitHub v. SSC distribution.

                      My restatement of the point he makes is that someone with no experience with GitHub - which I hazard a guess is true of many Statalist members - who is told they can obtain regressby by running
                      Code:
                      net install regressby, from(https://raw.githubusercontent.com/mdroste/stata-regressby/master/)
                      and then runs help regressby will see only the unclickable "See the GitHub page" with no idea what or where that page might be. And if they try to copy and paste the only GitHub URL they know - https://raw.githubusercontent.com/mdroste/stata-regressby/master/ - into their web browser, that will be of no help.

                      So the point is not that you tell the user to visit the GitHub page, but rather, that you don't tell the user how to do that, or preferably, make it easy for the user to do so.

                      I expect this is an artifact of initial distribution via GitHub that required downloading a ZIP archive, and that you overlooked this when you made regressby installable with the net command, obviating the need for the user to have visited the page in question. It's an oversight in your documentation that can be improved.
                      Last edited by William Lisowski; 31 Jul 2018, 14:52.

                      Comment


                      • #12
                        Originally posted by William Lisowski View Post
                        Michael Droste - I think you misunderstand Nick's GitHub comment.

                        As I read it, Nick makes no comment on the relative merits of GitHub v. SSC distribution.

                        My restatement of the point he makes is that someone with no experience with GitHub - which I hazard a guess is true of many Statalist members - who is told they can obtain regressby by running
                        Code:
                        net install regressby, from(https://raw.githubusercontent.com/mdroste/stata-regressby/master/)
                        and then runs help regressby will see only the unclickable "See the GitHub page" with no idea what or where that page might be. And if they try to copy and paste the only GitHub URL they know - https://raw.githubusercontent.com/mdroste/stata-regressby/master/ - into their web browser, that will be of no help.

                        So the point is not that you tell the user to visit the GitHub page, but rather, that you don't tell the user how to do that, or preferably, make it easy for the user to do so.

                        I expect this is an artifact of initial distribution via GitHub that required downloading a ZIP archive, and that you overlooked this when you made regressby installable with the net command, obviating the need for the user to have visited the page in question. It's an oversight in your documentation that can be improved.
                        Hi William,

                        Thanks for this clarification, this is helpful (no pun intended). To be honest, I had totally forgotten about that line in the help file! The help file has been updated with a few examples using sysuse datasets. Please let me know if you have additional suggestions for improving the help file.

                        Best,
                        Mike

                        Comment


                        • #13
                          I truly hate to jump into a my thing is faster than your thing thing but claims that regressby is orders of magnitude faster than available alternatives simply show how little you have researched the subject.

                          Now it's well known that statsby is slow. A faster alternative is runby (from SSC) but it would still be slower than regressby because you would call regress and that command does far more than simply computing the regression. runby is far more versatile however as you can do anything you want on each by-group subsample. rangestat is very fast but it was designed to calculate statistics for each observation using a very flexible model for how you determine the sample for each observation. rangestat also return results for each original observation so there's some extra overhead for that as well.

                          You data example is a bit odd as there is no order within groups and groups are not sorted. Here's what I get when I perform the same benchmarks using the same data example you use in #10:
                          Code:
                          * Set up
                          clear all
                          set obs 100000
                          set seed 123
                          
                          * Generate a dataset
                          gen g = ceil(runiform()*1000)
                          gen x = runiform()
                          gen y = g + g*x + rnormal()
                          sort g
                          tempfile t1
                          save `t1'
                          
                          * Test with rangestat
                          use `t1', clear
                          timer on 1
                          rangestat (reg) y x, interval(g 0 0) by(g)
                          timer off 1
                          list in 1
                          
                          * Test with regressby
                          use `t1', clear
                          timer on 2
                          regressby y x, by(g)
                          timer off 2
                          list in 1
                          
                          * Test with asreg
                          use `t1', clear
                          timer on 3
                          by g: asreg y x, se
                          timer off 3
                          list in 1
                          
                          timer list
                          and the full output
                          Code:
                          . * Test with rangestat
                          . use `t1', clear
                          
                          . timer on 1
                          
                          . rangestat (reg) y x, interval(g 0 0) by(g)
                          
                          . timer off 1
                          
                          . list in 1
                          
                               +-----------------------------------------------------------------------------------------------------------+
                               | g          x          y   reg_nobs      reg_r2   reg_adj~2         b_x     b_cons        se_x     se_cons |
                               |-----------------------------------------------------------------------------------------------------------|
                            1. | 1   .5146425   1.068691         98   .03226088   .02218027   .64124098   1.161035   .35844843   .21360415 |
                               +-----------------------------------------------------------------------------------------------------------+
                          
                          . 
                          . * Test with regressby
                          . use `t1', clear
                          
                          . timer on 2
                          
                          . regressby y x, by(g)
                          Running regressby with normal OLS standard errors.
                          (0 observations deleted)
                          
                          . timer off 2
                          
                          . list in 1
                          
                               +---------------------------------------------------------------+
                               | g    N      _b_x      _se_x    _b_cons   _se_cons   _cov_co~x |
                               |---------------------------------------------------------------|
                            1. | 1   98   .641241   .3584484   1.161035   .2136042   -.0678853 |
                               +---------------------------------------------------------------+
                          
                          . 
                          . * Test with asreg
                          . use `t1', clear
                          
                          . timer on 3
                          
                          . by g: asreg y x, se
                          
                          . timer off 3
                          
                          . list in 1
                          
                               +------------------------------------------------------------------------------------------------------+
                               | g          x          y   _Nobs         _R2      _adjR2    _b_cons        _b_x   _se_cons      _se_x |
                               |------------------------------------------------------------------------------------------------------|
                            1. | 1   .5146425   1.068691      98   .03226088   .02218027   1.161035   .64124098   .2136042   .3584484 |
                               +------------------------------------------------------------------------------------------------------+
                          
                          . 
                          . timer list
                             1:      0.30 /        1 =       0.2960
                             2:      0.10 /        1 =       0.0980
                             3:      0.10 /        1 =       0.1050
                          
                          .
                          I could work on optimizing rangestat for this particular task (statistic on a by-group) but I don't really think it's worth the effort; it's a versatile tool that is .2 seconds slower in this particular test. In addition, I'm not seeing any speed advantage with respect to another player in this area (asreg). But the more options the merrier I guess.

                          Comment


                          • #14
                            Michael: Thanks in turn for this reply. I am clear that you have a good command there, which in due course will be further enhanced and more fully documented. But I have to say that your reply is just a little disingenuous. In #1 you outline the options

                            You can do this manually with a quick loop over each distinct value (or tuple) of your 'by' group(s). You can also do this with statsby, but it is pretty slow, and for whatever reason you can't access the full VCV matrix of each estimate from statsby: regress. If there are only a small number of -by- groups and there aren't too many independent variables in your regression model, you can fully interact each regressor variable with your by group, so long as total number of regressors doesn't exceed Stata's limit of just under 11,000. Or you can use this new program, regressby, which can be hundreds of times faster than any of these options.
                            and there's precisely no mention there of community-contributed alternatives. That's my first point, negative though it may seem. Clearly you're not obliged to mention them all -- I don't think I know about all of the programs you allude to -- but my comment.was, I think, fair. (I would say that...)

                            The speed comparisons with rangestat are striking, but fairly easy to explain without being unduly defensive. rangestat is constructed with some generality of scope. Indeed, a major motivation for its development was to tackle a bunch of awkward problems concerning moving intervals. The challenge for the next major version of rangestat is to recognise, and optimise code for, simpler problems within its scope. rangestat is mostly Robert's work, and his stance might differ, but I see it as a kind of Swiss Army knife that happens to cover quite a range of problems. At the same time, it is a command to generate new variables rather than a strongly statistical command. (EDIT: He posted while this was being written.)

                            I agree this isn't the best place to argue for or against GitHub. No one is trying to force me to use it, so that's fine. For me SSC offers a homegrown alternative for Stata programmers that fits my needs. GitHub is clearly something much, much more elaborate and extensive than SSC, but I don't need it. It's like starting Proust or finishing the Divine Comedy (I got stuck in the middle of Hell three times, and none was fun): some people rave about them, but I doubt I'll get to either in my life. You mention some very good Stata programs developed on GitHub, which is more than fine, but there are hundreds and hundreds on SSC and in the Stata Journal that never touched GitHub. The case for GitHub is not how many Stata programmers use it, but what the others may be missing.

                            A crucial point is that most Stata programs that are community-contributed were written by one or two people. With one, no problem; with two, the easiest mechanism for collaboration is trivial, called email. If half-a-dozen people collaborate on a Stata program, they really will need something else unless they all work in the same place (which, come to think of it, is exactly the case for StataCorp). .

                            My point about your help file was well developed by William Lisowski, so I'll leave that there.

                            Comment


                            • #15
                              Let me add a comment: I note and commend regressby, like rangestat, for making all its code visible in ado files, rather than distributing parts within a Mata library, hewing to the ideal of open source software and allowing users to build on it if necessary to better meet their needs. Or to fix a problem they chance upon while crashing on a deadline. That's a generous approach for developers to take.

                              Comment

                              Working...
                              X