  • Bootstrap t-test and OLS-Regression

    Dear all,

    Because my data violate the homoscedasticity and normality assumptions, I think bootstrapping is an adequate solution. Unfortunately, I do not fully understand how to do it in Stata.

    The regression part looks like this in the un-bootstrapped version:
    Code:
    .    reg             ch_helpfreq_abs ///
    >    i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
    >    c.nr_sons c.nr_daught           ///
    >    c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim         if sample_main==1
    
          Source |       SS           df       MS      Number of obs   =       229
    -------------+----------------------------------   F(19, 209)      =      5.56
           Model |  390155.989        19  20534.5257   Prob > F        =    0.0000
        Residual |  771894.928       209  3693.27717   R-squared       =    0.3357
    -------------+----------------------------------   Adj R-squared   =    0.2754
           Total |  1162050.92       228  5096.71455   Root MSE        =    60.772
    
    -------------------------------------------------------------------------------------
        ch_helpfreq_abs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
              ch_female |
                1. Yes  |   13.01426    12.3828     1.05   0.294    -11.39694    37.42546
                        |
          ch_employment |
    Employed part-time  |   10.28111   15.54928     0.66   0.509    -20.37241    40.93464
    Employed full-time  |   4.257441   12.18111     0.35   0.727    -19.75615    28.27103
                        |
             ch_partner |
                1. Yes  |  -8.814668   10.92722    -0.81   0.421    -30.35637    12.72703
              ch_nrkids |  -.2130886   4.415956    -0.05   0.962    -8.918613    8.492435
                        |
          ch_coresiding |
                1. Yes  |   29.34034   11.27993     2.60   0.010     7.103317    51.57735
                        |
             ch_faraway |
                1. Yes  |  -11.79207   9.713842    -1.21   0.226    -30.94174      7.3576
                        |
            ch_educhigh |
                1. Yes  |   17.65725   9.575949     1.84   0.067    -1.220576    36.53508
                        |
      transfer_childpar |
                1. Yes  |   83.87799   32.69203     2.57   0.011      19.4296    148.3264
                        |
      transfer_parchild |
                1. Yes  |   -6.66581   10.32876    -0.65   0.519    -27.02772     13.6961
                 ch_age |   1.341725   .9014449     1.49   0.138    -.4353653    3.118814
                nr_sons |   1.576808   6.353965     0.25   0.804    -10.94927    14.10288
              nr_daught |  -1.500502    5.74762    -0.26   0.794    -12.83124    9.830239
                  r_age |  -.8002276    .908177    -0.88   0.379    -2.590589    .9901339
                        |
               r_female |
                1. Yes  |  -5.036884   8.952904    -0.56   0.574    -22.68646    12.61269
                        |
              r_partner |
                1. Yes  |  -8.088559   11.96358    -0.68   0.500    -31.67332    15.49621
                        |
             r_educhigh |
                1. Yes  |  -3.394952   10.96352    -0.31   0.757    -25.00822    18.21831
           lnr_hhincome |   7.603345   5.235145     1.45   0.148    -2.717113     17.9238
             health_lim |   10.04284   1.382689     7.26   0.000     7.317033    12.76864
                  _cons |  -75.22747   60.65407    -1.24   0.216    -194.7997    44.34472
    -------------------------------------------------------------------------------------
    I added the vce(bootstrap) option to get the bootstrap estimates (with 1000 replications and a seed for reproducibility):
    Code:
    .  // bootstrap
    .  reg             ch_helpfreq_abs ///
    >  i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
    >  c.nr_sons c.nr_daught           ///
    >  c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim         if sample_main==1, vce(bootstrap, reps(1000) seed(8086411))
    (running regress on estimation sample)
    
    Bootstrap replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ............................x.....................   100
    ......x...........................................   150
    ..................................................   200
    ............x.................x...................   250
    ........x.........................................   300
    ..x.......x.......................................   350
    x......................x..........................   400
    ..................................................   450
    ..................................................   500
    ..................................................   550
    x........................................x........   600
    .................................................x   650
    ..................................................   700
    ..................................................   750
    .......x..........................................   800
    .x................................................   850
    ..................................................   900
    ......x..........x....................x...........   950
    ........x.........................................  1000
    
    Linear regression                               Number of obs     =        229
                                                    Replications      =        982
                                                    Wald chi2(19)     =      21.81
                                                    Prob > chi2       =     0.2939
                                                    R-squared         =     0.3357
                                                    Adj R-squared     =     0.2754
                                                    Root MSE          =    60.7723
    
    -------------------------------------------------------------------------------------
                        |   Observed   Bootstrap                         Normal-based
        ch_helpfreq_abs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
              ch_female |
                1. Yes  |   13.01426   12.65994     1.03   0.304    -11.79876    37.82728
                        |
          ch_employment |
    Employed part-time  |   10.28111   15.54569     0.66   0.508    -20.18788    40.75011
    Employed full-time  |   4.257441   11.95636     0.36   0.722     -19.1766    27.69148
                        |
             ch_partner |
                1. Yes  |  -8.814668   10.55624    -0.84   0.404    -29.50452    11.87518
              ch_nrkids |  -.2130886   5.190599    -0.04   0.967    -10.38648    9.960298
                        |
          ch_coresiding |
                1. Yes  |   29.34034    14.3009     2.05   0.040     1.311084    57.36959
                        |
             ch_faraway |
                1. Yes  |  -11.79207   6.482636    -1.82   0.069     -24.4978    .9136635
                        |
            ch_educhigh |
                1. Yes  |   17.65725   10.12226     1.74   0.081    -2.182018    37.49653
                        |
      transfer_childpar |
                1. Yes  |   83.87799   81.37532     1.03   0.303    -75.61471    243.3707
                        |
      transfer_parchild |
                1. Yes  |   -6.66581   11.79164    -0.57   0.572    -29.77699    16.44537
                 ch_age |   1.341725   1.077577     1.25   0.213    -.7702883    3.453737
                nr_sons |   1.576808   5.564748     0.28   0.777    -9.329897    12.48351
              nr_daught |  -1.500502   4.458608    -0.34   0.736    -10.23921     7.23821
                  r_age |  -.8002276   .9779172    -0.82   0.413     -2.71691    1.116455
                        |
               r_female |
                1. Yes  |  -5.036884   9.440082    -0.53   0.594    -23.53911    13.46534
                        |
              r_partner |
                1. Yes  |  -8.088559   13.75266    -0.59   0.556    -35.04328    18.86617
                        |
             r_educhigh |
                1. Yes  |  -3.394952   9.514389    -0.36   0.721    -22.04281    15.25291
           lnr_hhincome |   7.603345   6.883106     1.10   0.269    -5.887295    21.09399
             health_lim |   10.04284   2.891411     3.47   0.001     4.375774     15.7099
                  _cons |  -75.22747   57.22399    -1.31   0.189    -187.3844    36.92949
    -------------------------------------------------------------------------------------
    Note: One or more parameters could not be estimated in 18 bootstrap replicates;
          standard-error estimates include only complete replications.
    Is this how it works? Does the note at the end of the output matter? As I understand it, 18 of the 1000 replications could not be estimated, which in my opinion should not be much of a problem. Or, put differently: at how many failed replications would it become problematic?


    However, the first part of my analysis tests the difference in means between two groups (t-test), and I do not understand how to do this in a bootstrapped version. The vce(bootstrap) option is not available here, so I tried to build the command using Stata's help for bootstrap, but I'm nearly lost ...
    This is the result of the "normal" t-test, preceded by a test for equal variances (sdtest performs the variance-ratio F test; Levene's test is available via robvar):
    Code:
    . sdtest  ch_helpfreq_abs if sample_main==1, by(ch_female)                      
    
    Variance ratio test
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       0. No |     107          14    4.962334    51.33078     4.16169    23.83831
      1. Yes |     122    24.58197     7.70499    85.10439    9.327908    39.83603
    ---------+--------------------------------------------------------------------
    combined |     229    19.63755    4.717668    71.39128    10.34175    28.93336
    ------------------------------------------------------------------------------
        ratio = sd(0. No) / sd(1. Yes)                                f =   0.3638
    Ho: ratio = 1                                    degrees of freedom = 106, 121
    
        Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
      Pr(F < f) = 0.0000         2*Pr(F < f) = 0.0000           Pr(F > f) = 1.0000
    
    .
    . * t-Test
    . ttest   ch_helpfreq_abs if sample_main==1, by(ch_female) unequal
    
    Two-sample t test with unequal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       0. No |     107          14    4.962334    51.33078     4.16169    23.83831
      1. Yes |     122    24.58197     7.70499    85.10439    9.327908    39.83603
    ---------+--------------------------------------------------------------------
    combined |     229    19.63755    4.717668    71.39128    10.34175    28.93336
    ---------+--------------------------------------------------------------------
        diff |           -10.58197    9.164694               -28.65247    7.488534
    ------------------------------------------------------------------------------
        diff = mean(0. No) - mean(1. Yes)                             t =  -1.1546
    Ho: diff = 0                     Satterthwaite's degrees of freedom =  202.439
    
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.1248         Pr(|T| > |t|) = 0.2496          Pr(T > t) = 0.8752
    Then I listed the stored results and tried to build the bootstrap command, with the following results:
    Code:
    . return list
    
    scalars:
                  r(level) =  95
                     r(sd) =  71.39127781724939
                   r(sd_2) =  85.10439288697191
                   r(sd_1) =  51.33078079090336
                     r(se) =  9.16469442146575
                    r(p_u) =  .8752014715105874
                    r(p_l) =  .1247985284894127
                      r(p) =  .2495970569788254
                      r(t) =  -1.154644849732189
                   r(df_t) =  202.4387773576472
                   r(mu_2) =  24.58196721311475
                    r(N_2) =  122
                   r(mu_1) =  14
                    r(N_1) =  107
    
    .
    
    . * t-Test (bootstrap)
    . set seed 8086411
    
    . bootstrap meanM=r(mu_1) meanF=r(mu_2) sig=r(p), reps(1000): ttest       ch_helpfreq_abs if sample_main==1, by(ch_female) unequal
    (running ttest on estimation sample)
    
    Warning:  Because ttest is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used.  This means
              that no observations will be excluded from the resampling because of missing values or other reasons.
    
              If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains only the relevant data.
    
    Bootstrap replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ..................................................   100
    ..................................................   150
    ..................................................   200
    ..................................................   250
    ..................................................   300
    ..................................................   350
    ..................................................   400
    ..................................................   450
    ..................................................   500
    ..................................................   550
    ..................................................   600
    ..................................................   650
    ..................................................   700
    ..................................................   750
    ..................................................   800
    ..................................................   850
    ..................................................   900
    ..................................................   950
    ..................................................  1000
    
    Bootstrap results                               Number of obs     =        229
                                                    Replications      =      1,000
    
          command:  ttest ch_helpfreq_abs, by(ch_female) unequal
            meanM:  r(mu_1)
            meanF:  r(mu_2)
              sig:  r(p)
    
    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           meanM |         14   5.098768     2.75   0.006     4.006598     23.9934
           meanF |   24.58197   7.851312     3.13   0.002     9.193678    39.97026
             sig |   .2495971   .2908453     0.86   0.391    -.3204492    .8196433
    ------------------------------------------------------------------------------
    
    .
    But I really don't know whether this is what I want or whether the command is correct. After all, I need the group means and the p-value (significance) of the difference in means, yet the output reports a separate significance test for each of the three statistics. All in all, this confuses me ...

    Thanks for any hints and help!

  • #2
    For the bootstrap ttest, "sig" is the significance level. You can add ttest=r(t) to the expression list to get the t-statistic.
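
    A minimal sketch of that suggestion, reusing the variable names and seed from post #1. It assumes a do-file (because of the /// line continuation) and follows the advice in the bootstrap warning by restricting the data in memory before resampling; nowarn then suppresses that warning. Note that the z test that bootstrap prints for a resampled r(p) is probably not a meaningful test of the mean difference, so only r(t) is collected here:

    Code:
    * keep only the analysis sample so every resample draws from the right rows
    preserve
    keep if sample_main==1 & !missing(ch_helpfreq_abs, ch_female)

    set seed 8086411
    * collect the Welch t statistic from each bootstrap sample
    bootstrap t=r(t), reps(1000) nowarn: ///
        ttest ch_helpfreq_abs, by(ch_female) unequal

    * bring back the full dataset
    restore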


    • #3
      I suspect you'll want to bootstrap the robust standard errors in the regression if you've got heteroskedasticity.

      I wouldn't stress about the non-normality of the disturbance. You might transform the dependent variable if that helps (it would probably help with both problems).

      Probably just use robust errors and call it a day.
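
      As a sketch of what "robust errors" could look like here, the regression from post #1 can be rerun with vce(robust). The variable list is copied from that post; the vce(hc3) line in the comment is only an optional small-sample variant, not something the original poster used. The /// continuations assume a do-file:

      Code:
      * same model as in post #1, with heteroskedasticity-robust standard errors
      regress ch_helpfreq_abs ///
          i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding ///
          i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
          c.nr_sons c.nr_daught ///
          c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim ///
          if sample_main==1, vce(robust)

      * optional: replace vce(robust) with vce(hc3) for a more conservative
      * small-sample variant (N = 229 here)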


      • #4
        Regarding the first question, I think the 18 lost replications are acceptable, but in general I would use bias-corrected (BC) intervals, or at least compare them with the normal-based ones, like:

        Code:
        bootstrap...
        estat bootstrap, all
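
        Spelled out for the regression in post #1, the sequence might look like the sketch below. The local macro rhs is a hypothetical shorthand for the covariate list, so the block has to be run in one go from a do-file. The final tabulate is only a guess at where the 18 failed replications come from: the bootstrap standard error of transfer_childpar is much larger than the conventional one (81.4 vs. 32.7), which would be consistent with very few "Yes" cases, so that some resamples contain none and the coefficient cannot be estimated.

        Code:
        * hypothetical shorthand for the covariates used in post #1
        local rhs i.ch_female i.ch_employment i.ch_partner c.ch_nrkids i.ch_coresiding ///
            i.ch_faraway i.ch_educhigh i.transfer_childpar i.transfer_parchild c.ch_age ///
            c.nr_sons c.nr_daught ///
            c.r_age i.r_female i.r_partner i.r_educhigh c.lnr_hhincome c.health_lim

        regress ch_helpfreq_abs `rhs' if sample_main==1, ///
            vce(bootstrap, reps(1000) seed(8086411))

        * normal-based, percentile, and bias-corrected intervals side by side
        estat bootstrap, all

        * assumption to check: a sparse category may explain the failed replications
        tabulate transfer_childpar if e(sample)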
        For the ttest you can either bootstrap the difference of the two means (if you want a CI for the difference) or use a permutation test (if you prefer p-values). The bootstrap version then becomes

        Code:
        bootstrap diff =(r(mu_1) - r(mu_2)), reps(1000): ttest...
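
        A fuller sketch of both routes, under some assumptions not stated in the thread: the data are first restricted to the analysis sample (as the bootstrap warning in post #1 recommends), resampling is stratified by ch_female so the group sizes stay at 107 and 122, and the seed from post #1 is reused. The permutation test treats the two groups as exchangeable under the null, which may be questionable when the variances differ as much as they do here. Run from a do-file:

        Code:
        preserve
        keep if sample_main==1 & !missing(ch_helpfreq_abs, ch_female)

        * bootstrap CI for the difference in means, resampling within each group
        bootstrap diff=(r(mu_1) - r(mu_2)), reps(1000) seed(8086411) ///
            strata(ch_female) nowarn: ///
            ttest ch_helpfreq_abs, by(ch_female) unequal
        estat bootstrap, percentile bc

        * permutation test: shuffle group membership, recompute the difference
        permute ch_female diff=(r(mu_1) - r(mu_2)), reps(1000) seed(8086411): ///
            ttest ch_helpfreq_abs, by(ch_female) unequal

        restore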
        Best wishes

        (Stata 16.1 MP)
