  • Parallel Bootstrapping using parallel package in Stata for Esteban-Ray Index

    Note: This post is about the parallel package for multicore computing, not parallel trends in DiD.

    Hi everyone,

    I'm estimating the Esteban-Ray polarisation index in Stata using the er package and trying to bootstrap standard errors across multiple countries, years, and measures. With thousands of combinations to run, the sequential bootstrap is too slow — I'm looking at ~5 months for the full dataset. 😅

    The goal is to use parallel computing via the parallel package (by George G. Vega Yon), but I'm running into issues. The sequential version works fine, but every parallel bs variant I've tried fails with an error.

    ❓The problem
    Code:
    * Required packages
    * ssc install er, replace
    * net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
    * mata mata mlib index
    
    clear all
    set more off
    cls
    
    * Sample data
    input ///
    year    income    r_income    str2 country
    2001    34741.34    34741.34    "AU"
    2001    51427.33    53573.95    "AU"
    2001    9920        17120       "AU"
    2002    31028       34108       "AU"
    2002    38488.99    43497.78    "AU"
    2002    29254.92    30756.03    "AU"
    2003    22482.5     22482.5     "AU"
    2003    14344.27    14344.27    "AU"
    2003    14787.72    14787.72    "AU"
    2004    54711.39    65734.1     "AU"
    2004    68986.75    100099.5    "AU"
    2004    32581.61    33158.96    "AU"
    2005    36805.5     44545.5     "AU"
    2005    35426       69426       "AU"
    2005    87865.09    99744.48    "AU"
    2006    22325       30325       "AU"
    2006    10102       13114       "AU"
    2006    10031.46    15751.27    "AU"
    2007    52181.65    60440.66    "AU"
    2007    12393       12393       "AU"
    2007    30776.12    37847.18    "AU"
    2001    40572.92    40572.92    "CH"
    2001    35774.6     35774.6     "CH"
    2002    65736.88    65736.88    "CH"
    2003    41624.76    41624.76    "CH"
    2004    31950       31950       "CH"
    2004    37470.07    47556.53    "CH"
    2005    53032.09    60051.34    "CH"
    2005    59281.06    59281.06    "CH"
    2006    106161.1    106161.1    "CH"
    2006    91924.49    91924.49    "CH"
    2007    49548       49548       "CH"
    2007    48583.47    48583.47    "CH"
    end
    
    * Define combinations
    local countries AU US DE CH
    local years 2001 2002 2003 2004 2005 2006 2007
    
    tempfile er_results
    capture postclose handle
    
    postfile handle str10 country int year str10 measure double a double er_index double er_se double ci_lower double ci_upper using `er_results', replace
    
    parallel initialize 4, f
    
    foreach country in `countries' {
        foreach year in `years' {
            foreach measure in "income" "r_income" {
                foreach a in 0 1.6 {
                    preserve
                    keep if country == "`country'" & year == `year'
                    if _N > 0 {
                        di "Processing `country' in `year' for `measure' with alpha=`a'"
                        
                        * THIS WORKS:
                        bs er_index = r(er_1), reps(2): er `measure', alpha(`a') normalize(none)
    
                        * ALL THESE FAIL:
                        * parallel bs er_index = r(er_1), reps(2): er `measure', alpha(`a') normalize(none)
                        * parallel bs, reps(2): er `measure', alpha(`a') normalize(none)
                        * parallel bs, er_index = r(er_1) reps(2): er `measure', alpha(`a') normalize(none)
    
                        matrix b = e(b)
                        matrix se = e(se)
                        matrix ci = e(ci_percentile)
    
                        scalar er_mean = b[1,1]
                        scalar er_se = se[1,1]
                        scalar ci_lower = ci[1,1]
                        scalar ci_upper = ci[2,1]
    
                        post handle ("`country'") (`year') ("`measure'") (`a') (er_mean) (er_se) (ci_lower) (ci_upper)
                    }
                    restore
                }
            }
        }
    }
    
    postclose handle
    use `er_results', clear
    list
    🔍 What I’ve tried
    • The er command works well inside bs sequentially.
    • The parallel package works with other functions.
    • I’ve tried increasing memory, testing on smaller samples, tweaking syntax, etc.
    🧠 Final thoughts


    If anyone has experience using parallel bs with user-written commands like er, or insight into what might break inside a parallel session, I’d love your help.

    Thanks a ton in advance!

    Best,
    Serge
    Last edited by Sergey Alexeev; 08 Apr 2025, 01:05. Reason: Added tags and clarified that 'parallel' refers to multithreading, not the common trend assumption
    Kind regards,
    Sergey Alexeev | ​The University of Sydney
    https://alexeev.pw/

  • #2
    Hey everyone,

    Apologies for bugging folks earlier and for the extra noise on the forum — I really appreciate the help and patience. I ended up finding a solution that works, so just wanted to follow up and share in case it helps someone else later.

    The key was passing the statistic to parallel bs through its expression() option, rather than placing it before the comma as in the sequential bs call. Here's the working code for reference:

    Code:
    * Required packages
    * ssc install er, replace
    * net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
    * mata mata mlib index
    
    clear all
    set more off
    cls
    
    * Sample data
    input ///
    year    income    r_income    str2 country
    2001    34741.34    34741.34    "AU"
    2001    51427.33    53573.95    "AU"
    2001    9920        17120       "AU"
    2002    31028       34108       "AU"
    2002    38488.99    43497.78    "AU"
    2002    29254.92    30756.03    "AU"
    2003    22482.5     22482.5     "AU"
    2003    14344.27    14344.27    "AU"
    2003    14787.72    14787.72    "AU"
    2004    54711.39    65734.1     "AU"
    2004    68986.75    100099.5    "AU"
    2004    32581.61    33158.96    "AU"
    2005    36805.5     44545.5     "AU"
    2005    35426       69426       "AU"
    2005    87865.09    99744.48    "AU"
    2006    22325       30325       "AU"
    2006    10102       13114       "AU"
    2006    10031.46    15751.27    "AU"
    2007    52181.65    60440.66    "AU"
    2007    12393       12393       "AU"
    2007    30776.12    37847.18    "AU"
    2001    40572.92    40572.92    "CH"
    2001    35774.6     35774.6     "CH"
    2002    65736.88    65736.88    "CH"
    2003    41624.76    41624.76    "CH"
    2004    31950       31950       "CH"
    2004    37470.07    47556.53    "CH"
    2005    53032.09    60051.34    "CH"
    2005    59281.06    59281.06    "CH"
    2006    106161.1    106161.1    "CH"
    2006    91924.49    91924.49    "CH"
    2007    49548       49548       "CH"
    2007    48583.47    48583.47    "CH"
    end
    
    * Define combinations
    local countries AU US DE CH
    local years 2001 2002 2003 2004 2005 2006 2007
    
    tempfile er_results
    capture postclose handle
    
    postfile handle str10 country int year str10 measure double a double er_index double er_se double ci_lower double ci_upper using `er_results', replace
    
    parallel initialize 4, f
    
    foreach country in `countries' {
        foreach year in `years' {
            foreach measure in "income" "r_income" {
                foreach a in 0 1.6 {
                    preserve
                    keep if country == "`country'" & year == `year'
                    if _N > 0 {
                        di "Processing `country' in `year' for `measure' with alpha=`a'"
                        
                        * THIS WORKS:
                        parallel bs, expression(r(er_1)) reps(2): er `measure', alpha(`a') normalize(none)
    
                        matrix b = e(b)
                        matrix se = e(se)
                        matrix ci = e(ci_percentile)
    
                        scalar er_mean = b[1,1]
                        scalar er_se = se[1,1]
                        scalar ci_lower = ci[1,1]
                        scalar ci_upper = ci[2,1]
    
                        post handle ("`country'") (`year') ("`measure'") (`a') (er_mean) (er_se) (ci_lower) (ci_upper)
                    }
                    restore
                }
            }
        }
    }
    
    postclose handle
    use `er_results', clear
    list
    Thanks again, and hope this helps someone else down the line! 🙌

    • #3
      The documented syntax of parallel bs is
      Code:
      parallel bs [, expression(exp_list) ...] : ...
      Applied to your example
      Code:
      parallel bs , expression(er_index = r(er_1)) reps(2) : ...
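
      For anyone who wants to try the difference without installing er, here is a minimal sketch using the auto dataset that ships with Stata. The name mean_price and the choice of summarize are only illustrative assumptions, and the sketch presumes parallel is installed as described in #1. It shows the expression list moving from before the comma (bootstrap) into expression() (parallel bs):

      ```stata
      * Illustrative sketch only: mean_price and summarize stand in for er_index and er
      sysuse auto, clear
      parallel initialize 2, f

      * Sequential bootstrap: the expression list comes before the comma
      bootstrap mean_price = r(mean), reps(50): summarize price

      * parallel bs: the same expression list goes inside expression()
      parallel bs, expression(mean_price = r(mean)) reps(50): summarize price

      * As in the code in #2, results are then read from e()
      matrix list e(b)
      matrix list e(se)
      ```

      The number of replications and child processes here is arbitrary; with so few reps the parallel call will likely be slower than the sequential one because of start-up overhead.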


      • #4
        Hi Daniel,

        Thank you so much — this is exactly what I was missing! I really appreciate the quick and clear explanation. I somehow managed to overlook the expression() bit in the parallel bs syntax, and your example made it instantly click.

        Thanks again for taking the time — your help is massively appreciated!

        Best,
        Sergey

        • #5
          I love to see users helping each other :D! I'm glad you got it, Sergey.


          • #6
            Hey George,

            Mate, thank you so much for the package — honestly a lifesaver! I’d been running the code sequentially for like 4 days and only got through maybe 2% of the dataset. But last night, after getting parallel bs working properly, I processed around 15% in one go. My CPU is basically crying right now — it's burning hot 😅 — but hey, it works beautifully!

            Really wish Stata would just natively parallelise more stuff and support GPU acceleration, especially with Stata 19 on the horizon. It’s kind of wild that we’re still stuck in single-threaded hell for so many things, given how compute-heavy modern workflows are. But anyway, your package seriously fills that gap — thank you again!

            Cheers,
            Serge