reghdfe 10X faster

David Roodman

Join Date: Jul 2014
Posts: 465

reghdfe 10X faster

27 Nov 2023, 15:35

I'm excited to announce a pair of packages that radically speeds up reghdfe--at least for hard problems where speed is an issue. The spark for this project is the observation that the equivalent in the Julia programming language of Sergio Correia's path-breaking reghdfe--the FixedEffectModels.jl package by Matthieu Gomez--is much faster. The new command reghdfejl is meant to mimic reghdfe while calling Julia to do the hard work.

Below is an edited log comparing three ways to perform the same estimate on a data set with 100,000,000 observations and two sets of absorbed fixed effects. It is run in Stata with 1 processor, on a high-end Windows laptop. First reghdfe is used, then reghdfejl, then reghdfejl with the "gpu" option to request computation on an NVIDIA GPU. (Apple Silicon GPUs are also supported.) The run times are 312, 36, and 27 seconds.

To do the same on your computer you need to:

Install the "julia" module for Stata, which includes the "jl" command for accessing Julia. Type "ssc install julia".
Type "help jl" and read and follow the instructions in the Installation section to install the Julia programming language (which is free). On Windows and macOS machines, there is a complication related to assuring that Stata can find Julia once installed.
In Stata, type "ssc install reghdfejl" to add reghdfejl.

If Julia and the Stata "jl" command are working, then you should be able to do things like this at the Stata prompt:

Code:

. jl: "Hello world!"
Hello world!
. jl: sqrt(2)
1.4142135623730951

Because Julia uses just-in-time compilation, first calls to "jl" and Julia-based programs such as "reghdfejl" are slow--very slow if required Julia packages need to be downloaded and installed. Be patient.

The "jl" command is meant as low-level infrastructure. Its core is written in C/C++ and separately compiled for four platforms (Windows, Linux, Intel Mac, ARM Mac). It includes subcommand for high-speed, C-based copying of data between Stata and Julia. But see the readme for a simple of example of its use in estimation.

reghdfejl starts by copying the data needed for estimation into a Julia DataFrame. This duplication of data takes a bit of time and potentially a lot of RAM. In extreme cases, it will more than double the storage demand because even variables stored in Stata in small types, such as byte, will be stored double-precision--8 bytes per value--in Julia. If the memory demand is too great, performance will plummet. reghdfejl therefore is most useful when you have plenty of RAM, when the number of non-absorbed regressors is low (fewer variables need copying), and when the number of absorbed terms is high (for then the computational efficiency of Julia shines).

reghdfejl lacks some reghdfe features that are typically secondary for users:

It does not correct the estimates of the degrees of freedom consumed by absorbed fixed effects for collinearity and redundance among fixed-effect dummies. reghdfe displays these corrections in a table after the main results. reghdfejl does not.
It does not offer the Group FE features.
It does not allow control over whether the constant term is reported. The constant is always absorbed.
It does not offer options such as technique() that give finer control over the algorithm. But these are largely obviated by reghdfejl's speed.

reghdfejl has two novel options. The "threads()" can reduce the number of CPU threads used. (Type "jl: Threads.nthreads()" to see the default, which is also upper limit.) And it offers the "gpu" option, which offers generally modest time savings on computers with NVIDIA or Apple Silicon GPUs.

This is a brand new project with a lot of moving parts. Think of this as the beta release. Please post about any installation problems.

Benchmark log:

Code:

.     scalar N = 100000000
.     scalar K = 100
.     set obs `=N'
.     gen id1 = runiformint(1, N/K)
.     gen id2 = runiformint(1, K)
.     drawnorm x1 x2
.     gen double y = 3 * x1 + 2 * x2 + sin(id1) + cos(id2) + runiform()

.     set rmsg on
. 
.  set processors 1

. reghdfe y x1 x2, a(id1 id2) cluster(id1 id2)
(MWFE estimator converged in 4 iterations)
Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller applied.

HDFE Linear regression                            Number of obs   =  100000000
Absorbing 2 HDFE groups                           F(   2,     99) =   9.35e+09
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.9941
                                                  Adj R-squared   =     0.9941
Number of clusters (id1)     =  1,000,000         Within R-sq.    =     0.9936
Number of clusters (id2)     =        100         Root MSE        =     0.2887

                              (Std. err. adjusted for 100 clusters in id1 id2)
------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   2.999985   .0000282  1.1e+05   0.000     2.999929    3.000041
          x2 |   2.000006   .0000271  7.4e+04   0.000     1.999952     2.00006
       _cons |   .4946679   4.33e-09  1.1e+08   0.000     .4946679     .494668
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
         id1 |   1000000     1000000           0    *|
         id2 |       100         100           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
r; t=312.21 16:51:54


. reghdfejl y x1 x2, a(id1 id2) cluster(id1 id2)
(MWFE estimator converged in 6 iterations)

HDFE linear regression with Julia                 Number of obs   =  100000000
Absorbing 2 HDFE groups                           F(   2,     99) =   9.35e+09
Statistics cluster-robust                         Prob > F        =     0.0000
                                                  R-squared       =     0.9941
                                                  Adj R-squared   =     0.9941
Number of clusters (id1)     =    1000000         Within R-sq.    =     0.9936
Number of clusters (id2)     =        100         Root MSE        =     0.2887

                              (Std. err. adjusted for 100 clusters in id1 id2)
------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   2.999985   .0000282  1.1e+05   0.000     2.999929    3.000041
          x2 |   2.000006   .0000271  7.4e+04   0.000     1.999952     2.00006
------------------------------------------------------------------------------
r; t=35.94 16:56:13


. reghdfejl y x1 x2, a(id1 id2) cluster(id1 id2) gpu
(MWFE estimator converged in 8 iterations)

HDFE linear regression with Julia                 Number of obs   =  100000000
Absorbing 2 HDFE groups                           F(   2,     99) =   9.35e+09
Statistics cluster-robust                         Prob > F        =     0.0000
                                                  R-squared       =     0.9941
                                                  Adj R-squared   =     0.9941
Number of clusters (id1)     =    1000000         Within R-sq.    =     0.9936
Number of clusters (id2)     =        100         Root MSE        =     0.2887

                              (Std. err. adjusted for 100 clusters in id1 id2)
------------------------------------------------------------------------------
             |               Robust
           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   2.999985   .0000282  1.1e+05   0.000     2.999929    3.000041
          x2 |   2.000006   .0000271  7.4e+04   0.000     1.999952     2.00006
------------------------------------------------------------------------------
r; t=27.28 16:58:22

Tags: None

ericmelse

Join Date: May 2014

Posts: 425
#2

28 Nov 2023, 02:26

Dear David,

Most interesting contribution that I have exercised on my system:

Code:

OS Windows 11 Pro 64-bit RAM 64.0GB CPU 13th Gen Intel Core i9-13900K - 12 processors GPU NVIDIA GeForce RTX 4070 Ti (Gigabyte) 4089MB

with the following results:

Code:

t=294.66 // 1 processor reghdfe t=188.88 // 1 processor reghdfejl t=18.98 // 1 processor reghdfejl gpu t=25.72 // 12 processors reghdfejl t=22.73 // 12 processors reghdfejl gpu

I am totally inexperienced using Julia, reghdfe or, reghdfejl, so I am not in a position to assess these results.
It might be either the specificities of the calculations involved with reghdfejl or how tasks are distributed between the CPU and the GPU, using Julia, why the difference between using the GPU or not is rather small when I employ 12 processors instead of 1 processor. Or, possibly, the difference will be greater when even larger datasets are processed. Maybe you can shed some light on this result.

http://publicationslist.org/eric.melse
Comment
Kevin Denny

Join Date: May 2014

Posts: 8
#3

28 Nov 2023, 02:47

Unlike reghdfe string variables cannot be used as factor variables. In this case I will need to create a numeric version of cnt which is a country code.

reghdfejl mathbar $controls3 , a(cnt schoolid)
cnt: string variables may not be used as factor variables
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#4

28 Nov 2023, 06:44

ericmelse Those results look about right to me--almost exactly the same as mine. I'm on an 2023 Dell XPS-17. I think my CPU was a bit hot when I started the benchmarks in my post, so the CPU times are a little lower.

BUT a few things require explanation. I think most are mentioned in the help file:
Because Julia uses just-in-time compilation, the first time you call a routine it can be slow. There can be large lags the first time ever and smaller ones on the first time in each Stata/Julia session. I would rerun the whole test suite.

In my laptop-based experience, the gpu option, is not transformative. Possibly there is an opportunity for improvement to the FixedEffectModels.jl package. Possibly it's just intrinsic to this computation problem. Or our GPUs are not high-end.

Sometimes more processors is bad because the overhead of multithreading is not compensated by the efficiency. In addition, it can especially backfire on CPUs with a mix of high-speed Performance cores and energy-efficient Efficiency cores; there it can be best to limit to the number of P cores (you have 6). Something similar goes for (Intel) CPUs with hyperthreading, which make one CPU look like two and can be great for applications where threads are often bottlenecked by waiting for memory access, but maybe not so great for us.

Note that the numbers of threads in Stata and Julia are controlled separately, and the same issues pertain to both. With reghdfejl, you can reduce the number of threads from its default/max with the threads() option. On my computer, I find little benefit beyond 4 or 6. You can determine the number of threads being used in Julia by typing this at the Stata prompt:

Code:

jl: Threads.nthreads()
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#5

28 Nov 2023, 21:49

Thanks for the feedback Kevin Denny. I have just posted a new version that I believe handles strings. It also now accepts interaction terms in cluster(). These are both non-standard features, but convenient.

The latest version is not yet on SSC, but you can install it now with:

Code:

net install reghdfejl, replace from(https://raw.github.com/droodman/reghdfejl/v0.3.0)

Separately, I added a section to the jl help file explaining how to check and control the number of CPU threads Julia uses. I realized Julia as installed generally defaults to 1, which is not optimal. Get that update that with:

Code:

net install julia, replace from(https://raw.github.com/droodman/julia.ado/v0.5.8)
3 likes
Comment
ericmelse

Join Date: May 2014

Posts: 425
#6

29 Nov 2023, 05:23

Dear David,
After using your net install julia... in 5#, using which jl, still reports *! jl 0.5.6 25 November 2023
Also the version history in the ado file does not go beyond * 0.5.6 File reorganization
So, my guess is that the ado was not updated in the designated folder.

http://publicationslist.org/eric.melse
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#7

29 Nov 2023, 07:14

The only thing that changed was the help file.
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#8

29 Nov 2023, 21:06

I haven't succeeded yet in getting reghdfejl to run, but I'll keep trying.

I am curious is to why, with two-way fixed effects (with, say, N = 10,000,000, T = 6, 25 covariates) reghdfe runs much slower than xtreg. I'm talking about a run time that's five times longer for reghdfe. I've had the impression that reghdfe is more efficient coding than xtreg, fe but that doesn't always seem to be the case.
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#9

30 Nov 2023, 06:08

Hi Jeff, feel free to share specifics of your travails. I need to diagnose and fix. Are you working in Linux? Over the weekend I hope to revamp it to make it better at finding the Julia libraries, which I think is the chief source of fragility.

In the xtreg example, only one set of fixed effects is absorbed and the other is listed before the comma, right? I can't speak for reghdfe. But I can imagine that if the absorbed variable doesn't have too many levels, that xtreg would be faster. The data might already be sorted by the absorbed variable, because xtset does that automatically. A routine written in C to demean a set of variables within one set of groups should be extremely fast. Whereas reghdfe is implemented in Mata, and is of course designed for a more general problem.
Comment
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#10

30 Nov 2023, 07:17

Is there a formula that can predict the amount of memory required for any of these fixed effect programs? I had a user approach me with -reghdfe- apparently requesting 16 terabytes of memory (according to top) when there was only 200 megabytes in the .dta file they were using. They did have an inordinant number of fixed effect variiables, does that affect the memory required? As I understand the method of solution, memory requirements don't depend on the number of fixed effect error terms that are absorbed. Is that wrong?
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#11

30 Nov 2023, 07:52

Originally posted by Daniel Feenberg View Post

Is there a formula that can predict the amount of memory required for any of these fixed effect programs? I had a user approach me with -reghdfe- apparently requesting 16 terabytes of memory (according to top) when there was only 200 megabytes in the .dta file they were using. They did have an inordinant number of fixed effect variiables, does that affect the memory required? As I understand the method of solution, memory requirements don't depend on the number of fixed effect error terms that are absorbed. Is that wrong?

Hi Dan,

16tb of RAM indeed feels quite crazy. A possible scenario is this:

Instead of running reghdfe y x1 x2, absorb(id), the user ran reghdfe y x1 x2 i.id so ended up including a gajillion variables.

If that's not the case, then the other possibility is they just have a lot of dummies they want to estimate, such as reghdfe y x1 x2 i.id2, absorb(id1) , and they do want to compute SEs for the id2 dummies. In this case, each possible level of id2 will create a new variable in Mata.

In terms of memory requirements, if you run reghdfe y x1 x2, a(id) pool(0) on a dataset with 1m rows, you will use about 1e6 (rows) * 3 (columns) * 8 (bytes in double) * 3 (overhead) = 72mb even thought the data has 1e6*3*8=24mb. This difference is affected by a few things:
Difference will be larger if the data is stored as byte in the dataset, because there is no "byte" type in Mata

It will be larger if there are dummy variables such as in the example above

It will be smaller if there are a lot of variables not used in the regression.

reghdfe has a few tricks to help optimize memory usage, at a small slowdown cost:
The "compact" option will trim down the dataset as needed, so it will e.g. save it as a tempfile (preserve+restore) once the data is in Mata. This means memory usage peaks to 2x of the dataset

The pool(#) option defines how many variables are simultaneously partialled out. When you run "x = partial_out(x)", there is a moment when two copies of X are stored. pool(0) does all variables simultaneously, at the cost of large memory usage, with pool(1) being the slowest but that uses less memory

You can also edit reghdfe.mata and inject "mata memory" calls to see what's going on, and in which part of the program memory usage peaks.

Best,
S
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#12

30 Nov 2023, 10:35

Originally posted by David Roodman View Post

Hi Jeff, feel free to share specifics of your travails. I need to diagnose and fix. Are you working in Linux? Over the weekend I hope to revamp it to make it better at finding the Julia libraries, which I think is the chief source of fragility.

In the xtreg example, only one set of fixed effects is absorbed and the other is listed before the comma, right? I can't speak for reghdfe. But I can imagine that if the absorbed variable doesn't have too many levels, that xtreg would be faster. The data might already be sorted by the absorbed variable, because xtset does that automatically. A routine written in C to demean a set of variables within one set of groups should be extremely fast. Whereas reghdfe is implemented in Mata, and is of course designed for a more general problem.

Thanks David. I agree with your analysis. xtreg becomes much more attractive when you only have to absorb one set of high-dimensional fixed effects. In a small-T panel with large N, putting in the time dummies is more efficient than the two-way demeaning used by reghdfe. When I think about it, computing averages of 56 variables across 1,000,000 observations, repeated for t = 1, 2, ..., 30, is pretty time consuming.

Oh, I'm using a Dell with Windows running Stata 18 SE.
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#13

30 Nov 2023, 10:57

Jeff Wooldridge, if you want to take the time, can you share what you try and what happens? If you go to the command line, e.g., by hitting the Windows button, typing "cmd" and Enter, can you get into Julia simply by typing "julia"? In the Julia installation, did you go through some dialog boxes and click "Add Julia to PATH" (see https://julialang.org/downloads/platform/)?
Comment
Filippos Petroulakis

Join Date: Dec 2023

Posts: 1
#14

01 Dec 2023, 18:14

Originally posted by Jeff Wooldridge View Post

I haven't succeeded yet in getting reghdfejl to run, but I'll keep trying.

I had an old Julia version and it wasn’t working. I removed and installed the newest one, with the Add Path option, still not working. Then I just closed and opened Stata again and it worked great! So you can try either or both of these and it should work.

The improvement I got was exactly 10x. My GPU is really basic so it was in fact slower than basic reghdfejl, and about 60% of reghdfe.

Awesome package David, many thanks for this!
Comment
David Roodman

Join Date: Jul 2014

Posts: 465
#15

01 Dec 2023, 18:50

Oh good points. I just edited the help file to mention needing Julia 1.9 or later and to add the suggestion of restarting Stata.

I should have a version of the julia package out soon that is better at finding the needed Julia files on your computer. I think this is the main failure mode.
Comment

Announcement

reghdfe 10X faster

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment