Parallel Bootstrapping using parallel package in Stata for Esteban-Ray Index

Sergey Alexeev

Join Date: Oct 2016
Posts: 34

08 Apr 2025, 00:56

Note: This post is about the parallel package for multicore computing, not parallel trends in DiD.

Hi everyone,

I'm estimating the Esteban-Ray polarisation index in Stata using the er package and trying to bootstrap standard errors across multiple countries, years, and measures. With thousands of combinations to run, the sequential bootstrap is too slow — I'm looking at ~5 months for the full dataset. 😅

The goal is to use parallel computing via the parallel package (by George G. Vega Yon), but I'm running into issues. The sequential version works fine, but every parallel bs variant I've tried fails with an error.

❓The problem

Code:

* Required packages
* ssc install er, replace
* net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
* mata mata mlib index

clear all
set more off
cls

* Sample data
input ///
year    income    r_income    str2 country
2001    34741.34    34741.34    "AU"
2001    51427.33    53573.95    "AU"
2001    9920        17120       "AU"
2002    31028       34108       "AU"
2002    38488.99    43497.78    "AU"
2002    29254.92    30756.03    "AU"
2003    22482.5     22482.5     "AU"
2003    14344.27    14344.27    "AU"
2003    14787.72    14787.72    "AU"
2004    54711.39    65734.1     "AU"
2004    68986.75    100099.5    "AU"
2004    32581.61    33158.96    "AU"
2005    36805.5     44545.5     "AU"
2005    35426       69426       "AU"
2005    87865.09    99744.48    "AU"
2006    22325       30325       "AU"
2006    10102       13114       "AU"
2006    10031.46    15751.27    "AU"
2007    52181.65    60440.66    "AU"
2007    12393       12393       "AU"
2007    30776.12    37847.18    "AU"
2001    40572.92    40572.92    "CH"
2001    35774.6     35774.6     "CH"
2002    65736.88    65736.88    "CH"
2003    41624.76    41624.76    "CH"
2004    31950       31950       "CH"
2004    37470.07    47556.53    "CH"
2005    53032.09    60051.34    "CH"
2005    59281.06    59281.06    "CH"
2006    106161.1    106161.1    "CH"
2006    91924.49    91924.49    "CH"
2007    49548       49548       "CH"
2007    48583.47    48583.47    "CH"
end

* Define combinations
local countries AU US DE CH
local years 2001 2002 2003 2004 2005 2006 2007

tempfile er_results
capture postclose handle

postfile handle str10 country int year str10 measure double a double er_index double er_se double ci_lower double ci_upper using `er_results', replace

parallel initialize 4, f

foreach country in `countries' {
    foreach year in `years' {
        foreach measure in "income" "r_income" {
            foreach a in 0 1.6 {
                preserve
                keep if country == "`country'" & year == `year'
                if _N > 0 {
                    di "Processing `country' in `year' for `measure' with alpha=`a'"
                    
                    * THIS WORKS:
                    bs er_index = r(er_1), reps(2): er `measure', alpha(`a') normalize(none)

                    * ALL THESE FAIL:
                    * parallel bs er_index = r(er_1), reps(2): er `measure', alpha(`a') normalize(none)
                    * parallel bs, reps(2): er `measure', alpha(`a') normalize(none)
                    * parallel bs, er_index = r(er_1) reps(2): er `measure', alpha(`a') normalize(none)

                    matrix b = e(b)
                    matrix se = e(se)
                    matrix ci = e(ci_percentile)

                    scalar er_mean = b[1,1]
                    scalar er_se = se[1,1]
                    scalar ci_lower = ci[1,1]
                    scalar ci_upper = ci[2,1]

                    post handle ("`country'") (`year') ("`measure'") (`a') (er_mean) (er_se) (ci_lower) (ci_upper)
                }
                restore
            }
        }
    }
}

postclose handle
use `er_results', clear
list

🔍 What I’ve tried

The er command works well inside bs sequentially.
The parallel package works with other functions.
I’ve tried increasing memory, testing on smaller samples, tweaking syntax, etc.

🧠 Final thoughts

If anyone has experience using parallel bs with user-written commands like er, or insight into what might break inside a parallel session, I’d love your help.

Thanks a ton in advance!

Best,
Serge

Last edited by Sergey Alexeev; 08 Apr 2025, 01:05. Reason: Added tags and clarified that 'parallel' refers to multithreading, not the common trend assumption

Kind regards,
Sergey Alexeev | The University of Sydney
https://alexeev.pw/

Tags: bootstrap, multithreading, parallel (package), parallel-computing, performance

Sergey Alexeev

Join Date: Oct 2016
Posts: 34

08 Apr 2025, 01:46

Hey everyone,

Apologies for bugging folks earlier and for the extra noise on the forum — I really appreciate the help and patience. I ended up finding a solution that works, so just wanted to follow up and share in case it helps someone else later.

The key was passing the statistic to parallel bs through its expression() option, instead of placing it before the comma as with bs. Here's the working code for reference:

Code:

* Required packages
* ssc install er, replace
* net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
* mata mata mlib index

clear all
set more off
cls

* Sample data
input ///
year    income    r_income    str2 country
2001    34741.34    34741.34    "AU"
2001    51427.33    53573.95    "AU"
2001    9920        17120       "AU"
2002    31028       34108       "AU"
2002    38488.99    43497.78    "AU"
2002    29254.92    30756.03    "AU"
2003    22482.5     22482.5     "AU"
2003    14344.27    14344.27    "AU"
2003    14787.72    14787.72    "AU"
2004    54711.39    65734.1     "AU"
2004    68986.75    100099.5    "AU"
2004    32581.61    33158.96    "AU"
2005    36805.5     44545.5     "AU"
2005    35426       69426       "AU"
2005    87865.09    99744.48    "AU"
2006    22325       30325       "AU"
2006    10102       13114       "AU"
2006    10031.46    15751.27    "AU"
2007    52181.65    60440.66    "AU"
2007    12393       12393       "AU"
2007    30776.12    37847.18    "AU"
2001    40572.92    40572.92    "CH"
2001    35774.6     35774.6     "CH"
2002    65736.88    65736.88    "CH"
2003    41624.76    41624.76    "CH"
2004    31950       31950       "CH"
2004    37470.07    47556.53    "CH"
2005    53032.09    60051.34    "CH"
2005    59281.06    59281.06    "CH"
2006    106161.1    106161.1    "CH"
2006    91924.49    91924.49    "CH"
2007    49548       49548       "CH"
2007    48583.47    48583.47    "CH"
end

* Define combinations
local countries AU US DE CH
local years 2001 2002 2003 2004 2005 2006 2007

tempfile er_results
capture postclose handle

postfile handle str10 country int year str10 measure double a double er_index double er_se double ci_lower double ci_upper using `er_results', replace

parallel initialize 4, f

foreach country in `countries' {
    foreach year in `years' {
        foreach measure in "income" "r_income" {
            foreach a in 0 1.6 {
                preserve
                keep if country == "`country'" & year == `year'
                if _N > 0 {
                    di "Processing `country' in `year' for `measure' with alpha=`a'"
                    
                    * THIS WORKS: the statistic goes inside expression()
                    parallel bs, expression(r(er_1)) reps(2): er `measure', alpha(`a') normalize(none)

                    matrix b = e(b)
                    matrix se = e(se)
                    matrix ci = e(ci_percentile)

                    scalar er_mean = b[1,1]
                    scalar er_se = se[1,1]
                    scalar ci_lower = ci[1,1]
                    scalar ci_upper = ci[2,1]

                    post handle ("`country'") (`year') ("`measure'") (`a') (er_mean) (er_se) (ci_lower) (ci_upper)
                }
                restore
            }
        }
    }
}

postclose handle
use `er_results', clear
list
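
One possible variation on the loops above, offered as a sketch and not as part of the tested code: the hard-coded country list includes US and DE, which have no rows in the sample data, so the `if _N > 0` guard has to skip them. Building the lists from the data with levelsof and checking the cell size with count would avoid a preserve/restore cycle for empty combinations. It assumes the sample data above is in memory.

```stata
* Sketch only: derive the loop lists from the data instead of hard-coding them.
levelsof country, local(countries)
levelsof year, local(years)

foreach country of local countries {
    foreach year of local years {
        * Skip country-year cells with no observations
        quietly count if country == "`country'" & year == `year'
        if r(N) == 0 continue
        display "`country' `year': " r(N) " observations"
    }
}
```

The display line only marks where the preserve / keep / parallel bs / restore body from the code above would go.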

Thanks again, and hope this helps someone else down the line! 🙌


daniel klein

Join Date: Mar 2014

Posts: 3818
#3

08 Apr 2025, 01:57

The documented syntax of parallel bs is

Code:

parallel bs , expression(exp_list) ... : ...

Applied to your example

Code:

parallel bs , expression(er_index = r(er_1)) reps(2) : ...
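
As a minimal sketch of the difference in where the exp_list goes, the following uses the built-in auto dataset and summarize in place of er, so it runs without the er package. The dataset, the name mean_price, the two clusters, and reps(50) are illustrative assumptions and do not come from this thread.

```stata
* Sketch only: auto data and summarize stand in for the er example.
* Assumes the parallel package is installed.
sysuse auto, clear
parallel initialize 2, f

* Official bootstrap prefix: exp_list sits before the comma.
bootstrap mean_price = r(mean), reps(50): summarize price

* parallel bs: the same exp_list moves inside expression().
parallel bs, expression(mean_price = r(mean)) reps(50): summarize price

* Results are read from e() as in the sequential case.
matrix list e(b)
matrix list e(se)
```

In the loop from the original post, the analogous change is to move `er_index = r(er_1)` from before the comma into `expression()`.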
Sergey Alexeev

Join Date: Oct 2016

Posts: 34
#4

08 Apr 2025, 13:54

Hi Daniel,

Thank you so much — this is exactly what I was missing! I really appreciate the quick and clear explanation. I somehow managed to overlook the expression() bit in the parallel bs syntax, and your example made it instantly click.

Thanks again for taking the time — your help is massively appreciated!

Best,
Sergey

George Vega

Join Date: May 2014

Posts: 13
#5

08 Apr 2025, 15:52

I love to see users helping each other :D! I'm glad you got it, Sergey.
Sergey Alexeev

Join Date: Oct 2016

Posts: 34
#6

08 Apr 2025, 22:22

Hey George,

Mate, thank you so much for the package — honestly a lifesaver! I’d been running the code sequentially for like 4 days and only got through maybe 2% of the dataset. But last night, after getting parallel bs working properly, I processed around 15% in one go. My CPU is basically crying right now — it's burning hot 😅 — but hey, it works beautifully!

Really wish Stata would just natively parallelise more stuff and support GPU acceleration, especially with Stata 19 on the horizon. It’s kind of wild that we’re still stuck in single-threaded hell for so many things, given how compute-heavy modern workflows are. But anyway, your package seriously fills that gap — thank you again!

Cheers,
Serge
