Bootstrap t-test and OLS-Regression

Ariane Arbol

Join Date: May 2021
Posts: 36

06 Jan 2022, 12:06

Dear all,

Because my data violate the homoscedasticity and normality assumptions, I think bootstrapping is an adequate solution. Unfortunately, I do not fully understand how to do it in Stata.

The regression part looks like this in the un-bootstrapped version:

Code:

.    reg             ch_helpfreq_abs ///
>    i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
>    c.nr_sons c.nr_daught           ///
>    c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim         if sample_main==1

      Source |       SS           df       MS      Number of obs   =       229
-------------+----------------------------------   F(19, 209)      =      5.56
       Model |  390155.989        19  20534.5257   Prob > F        =    0.0000
    Residual |  771894.928       209  3693.27717   R-squared       =    0.3357
-------------+----------------------------------   Adj R-squared   =    0.2754
       Total |  1162050.92       228  5096.71455   Root MSE        =    60.772

-------------------------------------------------------------------------------------
    ch_helpfreq_abs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
          ch_female |
            1. Yes  |   13.01426    12.3828     1.05   0.294    -11.39694    37.42546
                    |
      ch_employment |
Employed part-time  |   10.28111   15.54928     0.66   0.509    -20.37241    40.93464
Employed full-time  |   4.257441   12.18111     0.35   0.727    -19.75615    28.27103
                    |
         ch_partner |
            1. Yes  |  -8.814668   10.92722    -0.81   0.421    -30.35637    12.72703
          ch_nrkids |  -.2130886   4.415956    -0.05   0.962    -8.918613    8.492435
                    |
      ch_coresiding |
            1. Yes  |   29.34034   11.27993     2.60   0.010     7.103317    51.57735
                    |
         ch_faraway |
            1. Yes  |  -11.79207   9.713842    -1.21   0.226    -30.94174      7.3576
                    |
        ch_educhigh |
            1. Yes  |   17.65725   9.575949     1.84   0.067    -1.220576    36.53508
                    |
  transfer_childpar |
            1. Yes  |   83.87799   32.69203     2.57   0.011      19.4296    148.3264
                    |
  transfer_parchild |
            1. Yes  |   -6.66581   10.32876    -0.65   0.519    -27.02772     13.6961
             ch_age |   1.341725   .9014449     1.49   0.138    -.4353653    3.118814
            nr_sons |   1.576808   6.353965     0.25   0.804    -10.94927    14.10288
          nr_daught |  -1.500502    5.74762    -0.26   0.794    -12.83124    9.830239
              r_age |  -.8002276    .908177    -0.88   0.379    -2.590589    .9901339
                    |
           r_female |
            1. Yes  |  -5.036884   8.952904    -0.56   0.574    -22.68646    12.61269
                    |
          r_partner |
            1. Yes  |  -8.088559   11.96358    -0.68   0.500    -31.67332    15.49621
                    |
         r_educhigh |
            1. Yes  |  -3.394952   10.96352    -0.31   0.757    -25.00822    18.21831
       lnr_hhincome |   7.603345   5.235145     1.45   0.148    -2.717113     17.9238
         health_lim |   10.04284   1.382689     7.26   0.000     7.317033    12.76864
              _cons |  -75.22747   60.65407    -1.24   0.216    -194.7997    44.34472
-------------------------------------------------------------------------------------

I added the vce(bootstrap) option to get the bootstrap estimates (with 1000 replicates and a seed for reproducibility):

Code:

.  // bootstrap
.  reg             ch_helpfreq_abs ///
>  i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
>  c.nr_sons c.nr_daught           ///
>  c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim         if sample_main==1, vce(bootstrap, reps(1000) seed(8086411))
(running regress on estimation sample)

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
............................x.....................   100
......x...........................................   150
..................................................   200
............x.................x...................   250
........x.........................................   300
..x.......x.......................................   350
x......................x..........................   400
..................................................   450
..................................................   500
..................................................   550
x........................................x........   600
.................................................x   650
..................................................   700
..................................................   750
.......x..........................................   800
.x................................................   850
..................................................   900
......x..........x....................x...........   950
........x.........................................  1000

Linear regression                               Number of obs     =        229
                                                Replications      =        982
                                                Wald chi2(19)     =      21.81
                                                Prob > chi2       =     0.2939
                                                R-squared         =     0.3357
                                                Adj R-squared     =     0.2754
                                                Root MSE          =    60.7723

-------------------------------------------------------------------------------------
                    |   Observed   Bootstrap                         Normal-based
    ch_helpfreq_abs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
          ch_female |
            1. Yes  |   13.01426   12.65994     1.03   0.304    -11.79876    37.82728
                    |
      ch_employment |
Employed part-time  |   10.28111   15.54569     0.66   0.508    -20.18788    40.75011
Employed full-time  |   4.257441   11.95636     0.36   0.722     -19.1766    27.69148
                    |
         ch_partner |
            1. Yes  |  -8.814668   10.55624    -0.84   0.404    -29.50452    11.87518
          ch_nrkids |  -.2130886   5.190599    -0.04   0.967    -10.38648    9.960298
                    |
      ch_coresiding |
            1. Yes  |   29.34034    14.3009     2.05   0.040     1.311084    57.36959
                    |
         ch_faraway |
            1. Yes  |  -11.79207   6.482636    -1.82   0.069     -24.4978    .9136635
                    |
        ch_educhigh |
            1. Yes  |   17.65725   10.12226     1.74   0.081    -2.182018    37.49653
                    |
  transfer_childpar |
            1. Yes  |   83.87799   81.37532     1.03   0.303    -75.61471    243.3707
                    |
  transfer_parchild |
            1. Yes  |   -6.66581   11.79164    -0.57   0.572    -29.77699    16.44537
             ch_age |   1.341725   1.077577     1.25   0.213    -.7702883    3.453737
            nr_sons |   1.576808   5.564748     0.28   0.777    -9.329897    12.48351
          nr_daught |  -1.500502   4.458608    -0.34   0.736    -10.23921     7.23821
              r_age |  -.8002276   .9779172    -0.82   0.413     -2.71691    1.116455
                    |
           r_female |
            1. Yes  |  -5.036884   9.440082    -0.53   0.594    -23.53911    13.46534
                    |
          r_partner |
            1. Yes  |  -8.088559   13.75266    -0.59   0.556    -35.04328    18.86617
                    |
         r_educhigh |
            1. Yes  |  -3.394952   9.514389    -0.36   0.721    -22.04281    15.25291
       lnr_hhincome |   7.603345   6.883106     1.10   0.269    -5.887295    21.09399
         health_lim |   10.04284   2.891411     3.47   0.001     4.375774     15.7099
              _cons |  -75.22747   57.22399    -1.31   0.189    -187.3844    36.92949
-------------------------------------------------------------------------------------
Note: One or more parameters could not be estimated in 18 bootstrap replicates;
      standard-error estimates include only complete replications.

Is this how it works? Does the note at the bottom of the output matter? As I understand it, it means that 18 of the 1000 replicates could not be estimated, which in my opinion should not be much of a problem. Or, put differently, at how many failed replications would it become problematic?

However, the first part of my analysis is a test of the difference in means between two groups (t-test), and I do not understand how this is done in a bootstrapped version. The vce(bootstrap) option does not work here, so I tried to build the command with Stata's help for bootstrap, but I'm nearly lost ...
This is the result of the "normal" t-test, preceded by a test for equal variances (sdtest, which is the F variance-ratio test rather than Levene's test):

Code:

. sdtest  ch_helpfreq_abs if sample_main==1, by(ch_female)                      

Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   0. No |     107          14    4.962334    51.33078     4.16169    23.83831
  1. Yes |     122    24.58197     7.70499    85.10439    9.327908    39.83603
---------+--------------------------------------------------------------------
combined |     229    19.63755    4.717668    71.39128    10.34175    28.93336
------------------------------------------------------------------------------
    ratio = sd(0. No) / sd(1. Yes)                                f =   0.3638
Ho: ratio = 1                                    degrees of freedom = 106, 121

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.0000         2*Pr(F < f) = 0.0000           Pr(F > f) = 1.0000

.
. * t-Test
. ttest   ch_helpfreq_abs if sample_main==1, by(ch_female) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   0. No |     107          14    4.962334    51.33078     4.16169    23.83831
  1. Yes |     122    24.58197     7.70499    85.10439    9.327908    39.83603
---------+--------------------------------------------------------------------
combined |     229    19.63755    4.717668    71.39128    10.34175    28.93336
---------+--------------------------------------------------------------------
    diff |           -10.58197    9.164694               -28.65247    7.488534
------------------------------------------------------------------------------
    diff = mean(0. No) - mean(1. Yes)                             t =  -1.1546
Ho: diff = 0                     Satterthwaite's degrees of freedom =  202.439

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1248         Pr(|T| > |t|) = 0.2496          Pr(T > t) = 0.8752

Then I requested the stored results and tried to create the bootstrap command with the following results:

Code:

. return list

scalars:
              r(level) =  95
                 r(sd) =  71.39127781724939
               r(sd_2) =  85.10439288697191
               r(sd_1) =  51.33078079090336
                 r(se) =  9.16469442146575
                r(p_u) =  .8752014715105874
                r(p_l) =  .1247985284894127
                  r(p) =  .2495970569788254
                  r(t) =  -1.154644849732189
               r(df_t) =  202.4387773576472
               r(mu_2) =  24.58196721311475
                r(N_2) =  122
               r(mu_1) =  14
                r(N_1) =  107

.

. * t-Test (bootstrap)
. set seed 8086411

. bootstrap meanM=r(mu_1) meanF=r(mu_2) sig=r(p), reps(1000): ttest       ch_helpfreq_abs if sample_main==1, by(ch_female) unequal
(running ttest on estimation sample)

Warning:  Because ttest is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used.  This means
          that no observations will be excluded from the resampling because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
..................................................  1000

Bootstrap results                               Number of obs     =        229
                                                Replications      =      1,000

      command:  ttest ch_helpfreq_abs, by(ch_female) unequal
        meanM:  r(mu_1)
        meanF:  r(mu_2)
          sig:  r(p)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meanM |         14   5.098768     2.75   0.006     4.006598     23.9934
       meanF |   24.58197   7.851312     3.13   0.002     9.193678    39.97026
         sig |   .2495971   .2908453     0.86   0.391    -.3204492    .8196433
------------------------------------------------------------------------------

.

But I really don't know whether this is what I want or whether the command is correct. After all, I need the group means and the p-value (significance) of the difference between the means ... and here a separate significance is reported for each of the three estimates? All in all, this confuses me ...

Thanks for any hints and help!


George Ford

Join Date: Aug 2014

Posts: 3152
#2

06 Jan 2022, 13:27

For the bootstrap t-test, "sig" is the significance level (the p-value, r(p)). You can add ttest=r(t) to the expression list to get the t-statistic.
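
A minimal sketch of that suggestion, assuming Stata's shipped auto dataset as a stand-in for the original data (mpg and foreign are placeholders for ch_helpfreq_abs and ch_female, and the names mean0, mean1 and tstat are arbitrary labels). The `///` line continuation assumes the code is run from a do-file.

```stata
* Stand-in data (assumption): auto.dta replaces the poster's dataset
sysuse auto, clear
set seed 8086411

* Collect both group means and the t-statistic from every resample.
* nowarn suppresses the e(sample) warning that ttest triggers.
bootstrap mean0=r(mu_1) mean1=r(mu_2) tstat=r(t), reps(200) nowarn: ///
    ttest mpg, by(foreign) unequal
```

Note that the z and P>|z| columns in the resulting table test whether each collected statistic equals zero, so they are not a test of the difference between the two means.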
George Ford

Join Date: Aug 2014

Posts: 3152
#3

06 Jan 2022, 13:28

I suspect you'll want to bootstrap the robust standard errors in the regression if you've got heteroskedasticity.

I wouldn't stress about the non-normality of the disturbance. You might transform the dependent variable if that helps (it would probably help with both problems).

Probably just use robust errors and call it a day.
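
As a rough sketch of the two options, again assuming the auto dataset as a stand-in (price, mpg, weight and foreign are placeholders for the variables in the original model):

```stata
* Stand-in data (assumption): auto.dta replaces the poster's dataset
sysuse auto, clear

* Heteroskedasticity-robust (Huber/White) standard errors
regress price mpg weight i.foreign, vce(robust)

* Bootstrap standard errors for the same model, for comparison
regress price mpg weight i.foreign, vce(bootstrap, reps(200) seed(8086411))
```

The coefficients are identical in both runs; only the standard errors, test statistics and confidence intervals differ.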
Felix Bittmann

Join Date: Aug 2018

Posts: 695
#4

07 Jan 2022, 00:01

Regarding the first question, I think the 18 lost replications are acceptable, but in general I would use bias-corrected (BC) intervals or at least compare them to the normal-based ones, like:

Code:

bootstrap...
estat bootstrap, all
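
Spelled out as a runnable sketch, assuming the auto dataset and a placeholder model in place of the original regression:

```stata
* Stand-in data and model (assumption): not the poster's regression
sysuse auto, clear

* Bootstrap the regression coefficients with the prefix command
bootstrap, reps(200) seed(8086411): regress price mpg weight i.foreign

* Show normal-based, percentile and bias-corrected intervals side by side
estat bootstrap, all
```

If the three interval types roughly agree, the normal-based intervals in the default output are probably fine; large differences suggest a skewed bootstrap distribution.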

For the t-test you can either bootstrap the difference of the two means (if you want a CI for the difference) or use a permutation test (if you prefer p-values). The bootstrap version then becomes:

Code:

bootstrap diff=(r(mu_1) - r(mu_2)), reps(1000): ttest...
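
A sketch of both routes, assuming the auto dataset as a stand-in (mpg and foreign are placeholders for ch_helpfreq_abs and ch_female). The permutation test uses Stata's permute command, which shuffles the grouping variable; the `///` continuation assumes a do-file.

```stata
* Stand-in data (assumption): auto.dta replaces the poster's dataset
sysuse auto, clear
set seed 8086411

* Route 1: bootstrap the difference in means to get a CI for it
bootstrap diff=(r(mu_1) - r(mu_2)), reps(1000) nowarn: ///
    ttest mpg, by(foreign) unequal
estat bootstrap, all

* Route 2: permutation test, shuffling group membership (foreign)
* to get a p-value for the observed difference in means
permute foreign diff=(r(mu_1) - r(mu_2)), reps(1000): ///
    ttest mpg, by(foreign) unequal
```

If the bootstrap interval for diff excludes zero, that points the same way as a small permutation p-value, although the two procedures answer slightly different questions.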

Best wishes
