  • what does the size() option on bootstrap do?

    I apparently do not understand the documentation for the bootstrap command.
    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . set seed 42
    
    . bootstrap e(N) e(r2), size(30): regress price weight
    (running regress on estimation sample)
    
    Bootstrap replications (50)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
    ..................................................    50
    
    Linear regression                                           Number of obs = 74
                                                                Replications  = 50
    
          Command: regress price weight
            _bs_1: e(N)
            _bs_2: e(r2)
    
    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 | coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           _bs_1 |         74          .        .       .            .           .
           _bs_2 |   .2901023   .1183702     2.45   0.014      .058101    .5221036
    ------------------------------------------------------------------------------
    
    .
    What I expected was "Number of obs = 30" and _bs_1 to also be 30, with no variation. Instead, bootstrap has apparently used size(_N), which is the default, and reported 74 observations.

    Can someone explain to me what effect the size(30) option had?

  • #2
    This is confusing, and I recall a thread from some years ago about this behaviour, but I cannot find it now. The -size()- option is supposed to draw a simple random sample of size n with replacement, where n <= _N (ignoring clustering). The statistics for each replication are based on the specified size, as the noisy (-noi-) output of 2 bootstrap samples shows.

    Code:
    sysuse auto, clear
    set seed 42
    
    tempfile res
    bootstrap n=(e(N)) _b, reps(2) noi size(30) saving(`res', replace): regress price weight
    
    use `res', clear
    list
    Result

    Code:
    . bootstrap n=(e(N)) _b, reps(2) noi size(30) saving(`res', replace): regress price weight
    bootstrap: First call to regress with data as is:
    
    . regress price weight
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =     29.42
           Model |   184233937         1   184233937   Prob > F        =    0.0000
        Residual |   450831459        72  6261548.04   R-squared       =    0.2901
    -------------+----------------------------------   Adj R-squared   =    0.2802
           Total |   635065396        73  8699525.97   Root MSE        =    2502.3
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
           _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
    ------------------------------------------------------------------------------
    (file C:\tmp\ST_3fe4_000001.tmp not found)
    
    Bootstrap replications (2)
    . regress price weight
    
          Source |       SS           df       MS      Number of obs   =        30
    -------------+----------------------------------   F(1, 28)        =      5.08
           Model |  21802684.2         1  21802684.2   Prob > F        =    0.0323
        Residual |   120244319        28  4294439.96   R-squared       =    0.1535
    -------------+----------------------------------   Adj R-squared   =    0.1233
           Total |   142047003        29  4898172.52   Root MSE        =    2072.3
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          weight |   1.301304   .5775335     2.25   0.032     .1182806    2.484328
           _cons |   1359.267   1738.839     0.78   0.441    -2202.584    4921.118
    ------------------------------------------------------------------------------
    . regress price weight
    
          Source |       SS           df       MS      Number of obs   =        30
    -------------+----------------------------------   F(1, 28)        =     20.14
           Model |   126219603         1   126219603   Prob > F        =    0.0001
        Residual |   175456520        28  6266304.28   R-squared       =    0.4184
    -------------+----------------------------------   Adj R-squared   =    0.3976
           Total |   301676123        29  10402624.9   Root MSE        =    2503.3
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          weight |   2.548572   .5678573     4.49   0.000     1.385369    3.711775
           _cons |  -1629.684   1936.397    -0.84   0.407    -5596.213    2336.845
    ------------------------------------------------------------------------------
    
    Linear regression                                           Number of obs = 74
                                                                Replications  =  2
    
          Command: regress price weight
          [_eq2]n: e(N)
    
    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 | coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    _eq1         |
          weight |   2.044063   .8819513     2.32   0.020     .3154698    3.772655
           _cons |  -6.707353   2113.507    -0.00   0.997    -4149.106    4135.691
    -------------+----------------------------------------------------------------
    _eq2         |
               n |         74          .        .       .            .           .
    ------------------------------------------------------------------------------
    
    . list
         +-------------------------------+
         | _b_wei~t     _b_cons   _eq2_n |
         |-------------------------------|
      1. | 1.301304    1359.267       30 |
      2. | 2.548572   -1629.684       30 |
         +-------------------------------+
    So each bootstrap replication draws a sample of the requested size, and bootstrap correctly saves the statistics for each of those samples (the _eq2_n column in the listing above shows 30 for both). The hiccup here, I think, is that the estimation table overwrites that value: the table uses the coefficient point estimates, and those were estimated on the full sample. I would have expected that number to match the size requested.
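
    To see the two sample sizes side by side without bootstrap's own table in the way, one replication can be mimicked by hand. This is only a sketch: it assumes that -bsample 30- is a fair stand-in for what size(30) does on each replication, and the draws need not match the ones listed above. Since price and weight have no missing values in auto.dta, the first e(N) should be 74 and the second 30.

    ```stata
    sysuse auto, clear
    set seed 42

    * Full-sample fit: the source of the "Observed coefficient" column
    * and of the "Number of obs" line in bootstrap's table
    regress price weight
    display "full-sample N = " e(N)

    * One hand-made replicate: 30 observations drawn with replacement
    preserve
    bsample 30
    regress price weight
    display "replicate N = " e(N)
    restore
    ```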



    • #3
      Perhaps this is the thread you meant? Leonardo Guizzetti



      • #4
        Originally posted by Jared Greathouse
        Perhaps this is the thread you meant? Leonardo Guizzetti
        Honestly, I don't recall at this point. These two threads may also be of interest for how to see bootstrap at work and how to (not) abuse it.



        • #5
          My thanks to Leonardo Guizzetti and Jared Greathouse for their discussion and links. How disconcerting to see my earlier participation in discussions similar to the one I have just started. Clearly not all of my 8700+ (at this time) prior posts here are readily accessible in my memory.

          I now (and apparently, for some limited time into the future) understand that "Number of obs = 74" is in fact the number of observations used in the initial regression on the full estimation sample. And I now recall something I once knew: the "observed coefficient" is the value from that initial regression. The bootstrap process gives us only estimates of the standard errors of those coefficients; it does not provide new estimates of the coefficients themselves. This is noticeable in Leonardo's example: the two bootstrapped constants are about 1359 and -1630, while the "observed constant" is about -6.7, which would be an incongruous melding of the two if it had been derived from them.

          So the "observed coefficient" for e(N) is indeed correctly reported as that from the full estimation sample.
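
          The same point can be checked against the saved replicates. The sketch below reruns the example from #2 without -noi- and summarizes the replicate file; the variable names are the ones shown in the -list- output there. Assuming the default variance calculation (no -mse- option), the "Std. dev." column from -summarize- should reproduce the bootstrap standard errors in the table (.8819513 for weight and 2113.507 for _cons), while the means of the replicates will not equal the observed coefficients, and _eq2_n should be 30 in every row.

          ```stata
          sysuse auto, clear
          set seed 42

          tempfile res
          bootstrap n=(e(N)) _b, reps(2) size(30) saving(`res', replace): regress price weight

          * The saved file holds one row per replication
          use `res', clear
          summarize _b_weight _b_cons _eq2_n
          ```

          With only two replications the check can also be done by hand: the standard deviation of two values is their absolute difference divided by sqrt(2), so -display abs(1.301304 - 2.548572)/sqrt(2)- gives roughly .882.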
