Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Speed up bootstrapped t-stat calcilations

    Dear Stata users,

    My dataset has 776 variables in time-series format.

    I am running the following code:

    forval i=1/776 {
    regress var_MA22Return`i', vce(bootstrap, reps(500))
    eststo var_MA22Return`i'
    }
    I need to obtain bootstrapped t-stat (with 500 replications) of mean stock return.

    My problem is that current code setting is very time consuming. It took Stata 18 hrs to calculate t-stats for 176 variables. And I need to repeat this code over another 2328 variables.

    Any idea how to incease prodactivity of this estimation? I may reshape data into panel format if it may help...

    Thank you.

  • #2
    If you are just looking for bootstrap estimates of the mean, I think that -bootstrap r(mean), reps(500): summarize var_MA22Return`i', meanonly- would run a lot faster. You would not be able to use -eststo- after that, but the results of -bootstrap- are placed in r(table), which you can keep as a matrix and then dispose of in some appropriate way. I ran some experiments on my own setup and found that using -regress- instead of -summarize- multiplies the run time by a factor of about 2.9.

    Comment


    • #3
      Clyde

      I run the code you suggested:


      forval i=1/776 {
      bootstrap r(mean), reps(500): summarize var_MA22Return`i', meanonly

      }
      It did estimation but I have no experience how to extract t-stat out of this results.

      May you please let me know how to get t-stat? Eventually I need to export results to the spreadsheet.

      Thank you.

      Comment


      • #4
        Olena:
        as far as bootstrapped t-stat can be obtained, you may be interested in developing some chuncks of code along the following lines:
        Code:
        . sysuse auto.dta, clear
        (1978 Automobile Data)
        . ttest price, by(foreign) unequal
        
        Two-sample t test with unequal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
        ---------+--------------------------------------------------------------------
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        ---------+--------------------------------------------------------------------
        combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |           -312.2587    704.9376               -1730.856    1106.339
        ------------------------------------------------------------------------------
            diff = mean(Domestic) - mean(Foreign)                         t =  -0.4430
        Ho: diff = 0                     Satterthwaite's degrees of freedom =  46.4471
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.3299         Pr(|T| > |t|) = 0.6599          Pr(T > t) = 0.6701
        
        . bootstrap r(mu_1) r(mu_2) r(t), reps(200) strata(foreign) saving(C:\Users\user\Desktop\Carlo_try.dta, replace) : ttest price, by(foreign) unequal
        (running ttest on estimation sample)
        
        Warning:  Because ttest is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations
                  are used in calculating the statistics and so assumes that all observations are used.  This means that no observations will
                  be excluded from the resampling because of missing values or other reasons.
        
                  If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that
                  the dataset in memory contains only the relevant data.
        
        Bootstrap replications (200)
        ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
        ..................................................    50
        ..................................................   100
        ..................................................   150
        ..................................................   200
        
        Bootstrap results
        
        Number of strata   =         2                  Number of obs     =         74
                                                        Replications      =        200
        
              command:  ttest price, by(foreign) unequal
                _bs_1:  r(mu_1)
                _bs_2:  r(mu_2)
                _bs_3:  r(t)
        
        ------------------------------------------------------------------------------
                     |   Observed   Bootstrap                         Normal-based
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               _bs_1 |   6072.423   468.8205    12.95   0.000     5153.552    6991.294
               _bs_2 |   6384.682   565.3566    11.29   0.000     5276.603     7492.76
               _bs_3 |  -.4429594   1.098561    -0.40   0.687      -2.5961    1.710181
        ------------------------------------------------------------------------------
        
        . use "C:\Users\user\Desktop\Carlo_try.dta", clear
        (bootstrap: ttest)
        Kind regards,
        Carlo
        (StataNow 18.5)

        Comment


        • #5
          It is a z-stat, not t. Anyway, See Buis (2007).

          Code:
          sysuse auto , clear
          replace price = price - 6100
          bootstrap r(mean) , reps(500) : summarize price
          tempname z
          scalar `z' = _b[_bs_1]/_se[_bs_1]
          display "z = " `z'
          display "p = " 2 * normal(-abs(`z'))
          You can skip the last step since you are interested in the z-stat, only, not the p-value.

          Best
          Daniel


          Buis, M., L. (2007). Stata tip 53: Where did my p-values go? The Stata Journal, 7(4), pp. 584-586.
          Last edited by daniel klein; 01 Mar 2016, 00:16.

          Comment


          • #6
            Daniel

            I am not 100% sure I understand your code. In your case you run bootstrap r(mean) for one variable - Price. My case is for 776 variables and that's where difficulty arises.

            Following your advice do I run estimates using loop?
            When I try something like that it does not display anything.

            Code:
            forval i=1/776 {
            bootstrap r(mean), reps(500): summarize var_MA22Return`i', meanonly
            tempname z
            scalar `z' = _b[_bs_1]/_se[_bs_1]
            display "z = " `z'
            }

            Comment


            • #7
              Details depend very much on what exactly you are trying to do. In your initial post you obtain bootstrapped standard errors only. Adopting Clyde's suggestion you now obtain point estimates and the standard errors via bootstrap. Sticking with this latter approach, I guess you want something along these lines

              Code:
              // set up a 776 by 2 matrix
              matrix mean_z = J(776, 2, .)
              matrix colnames mean_z = "mean" "z"
              
              // now fill the matirx with means and z-stats
              forvalues j = 1/776 {
                  quietly bootstrap r(mean) , reps(500) : ///
                      summarize var_MA22Return`j' , meanonly
                  matrix mean_z[`j', 1] = _b[_bs_1]
                  matrix mean_z[`j', 2] = _b[_bs_1]/_se[bs_1]
              }
              
              // now put the matrix into a new Stata dataset
              preserve // <- makes a temporary copy of the curret dataset
              clear
              set obs 776
              svmat mean_z , names(col)
              
              // optionally save the dataset
              save mean_z.dta
              
              // ... or export to Excel
              export excel ...
              
              // ... or ...
              whatever
              restore // <- brings back the original dataset
              Whether there are (much) better ways to get what you want, I cannot tell for sure. Perhaps something like statsby is an alternative here?

              Best
              Daniel

              Comment


              • #8
                Daniel

                I am running the code you have suggested:
                Code:
                set matsize 11000
                matrix mean_z = J(776, 2, .)
                matrix colnames mean_z = "mean" "z"
                forvalues j=1/776 {
                    bootstrap r(mean), reps(500):summarize var_MA22Return`j', meanonly
                    matrix mean_z[`j', 1] = _b[_bs_1]
                    matrix mean_z[`j', 2] = _b[_bs_1]/_se[bs_1]
                }
                It starts bootstrap replications and finishes shortly sending the following error message:
                [bs_1] not found
                r(111);
                What may cause this error message?

                Thank you.

                Comment


                • #9
                  There is a typo in the code. Look for [bs_1] instead of [_bs_1] in the last line of the loop and change it.

                  Comment


                  • #10
                    Hello all,

                    I have a similar issue that I'd like to address as well. I am a first-time poster so please bear with me.

                    Context:
                    I am conducting an analysis where I have 13 sites, and am conducting a risk factor analysis of tuberculosis (TB) on the outcome of COPD, adjusting for just 7 variables. Unfortunately, although the dataset that I've been given is quite large (12k participants), among the 13 sites, the amount of reported TB cases can be very low, in some cases <10. Therefore, to get the site-specific 95% CI, my advisor (who is an R man, hence this posting), suggested I bootstrap the mixed-effects regression I'm using, with the reps equal to the site N. That all sounded great, but when I test out my code even for 11 reps, let alone the nearly 2,000 I'll have to do for some sites, it takes at least 20 minutes. I would ideally like to get the OR and 95% CI from each of the covariates, but if push came to shove, I would settle simply for the TB variable.

                    I am using Stata 14.2 I/C and the melogit package.



                    Code:
                    bootstrap if site_city==1,reps(1102) seed(1984): melogit copd tb center_age sex center_bmi center_pack_years10 secondary_education biomass || site_city: ,or
                    estat bootstrap, percentile


                    Primary issue: What can I do to speed up the code?
                    Secondary Issue: What is the best way to pull out the 5th and 95th values for the ORs that I need? I'm currently using estat bootstrap, percentile, which only gives me the tb covariate, but then also generates the following error, breaking up my dofile.

                    equation [var(_cons[site_city] not found
                    r(303);
                    I think this is because the random intercept for the site variable is not being stored through estat, but I also haven't been able to find where it might be in the Stata file on melogit, if it is stored at all.

                    Thank you for the time, I hope to hear from you soon!

                    Best,
                    Reuben

                    Comment

                    Working...
                    X