  • Bootstrapping Confidence Interval for bins/groups

    Hello everyone,

    First my dataset: The bins were created by recoding my initial x for a specific date:

    Code:
    recode x*2012xxxx (2000/2500=2500)(2500/3000=3000)(3000/3500=3500)(3500/4000=4000)(4000/4500=4500)(4500/5000=5000)(5000/5500=5000)(5500/6000=5500)(6000/6500=6000)(6500/7000=6500)(7000/max=7000), gen(bin)
    When I run my whole code and use dataex, the data look like this. I don't know if this is useful for you, since the bins have these values and not the ones above, but my code seems to work nevertheless.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(bin change)
                0 .08722616 
    1.9073486e-06 .20833333 
    .000014781952 .08333334 
    .000061035156      .125 
              .01 .28812057 
        .01999919 .08333334 
              .02  .2240991 
              .03 .19109195 
         .0390625  .6666667 
              .04     .1875 
        .04999995      .125 
              .05  .2847222 
              .05  .2872807 
              .06  .2847222 
              .06      .125 
              .07 .23214285 
              .08 .20833333 
              .09 .28030303 
               .1  .3333333 
              .11 .29166666
              .12 .24358974 
              .13 .23958333 
              .14  .2857143 
              .15  .3090278 
              .16 .24264705 
              .17 .21666667 
              .18 .19791667 
              .19 .25462964 
               .2       .25 
               .2 .29166666 
              .21 .21666667 
              .22 .26190478 
              .23 .15277778 
              .24 .20833333 
              .25 .28333333 
              .26 .23148148 
        .26999998 .08333334 
              .27  .3166667 
              .28 .20833333 
              .29 .27314815 
               .3       .25 
               .3 .19791667 
              .31 .30555555
              .32  .3583333 
              .33  .4583333 
              .33 .27604166 
              .34 .22222222 
              .35 .29166666 
              .36 .08333334 
              .36 .25833333 
              .37 .20833333 
              .38 .22916667 
              .39 .20833333 
               .4       .25 
               .4  .2559524 
              .41  .2638889
              .42 .11111111 
              .43 .22916667 
              .44 .29166666 
              .45 .30555555 
              .45 .08333334
              .46 .20833333
              .47 .18333334
              .48 .22916667
              .49 .30555555
               .5       .25 
              .51 .20833333 
              .52 .22916667 
              .53    .40625 
              .54 .08333334
              .55 .20833333
              .56        .2 
              .57 .22916667 
              .58 .15833333 
              .59        .2 
               .6 .20833333 
               .6  .3020833 
              .61        .5 
              .62 .15833333 
              .63  .4166667 
              .64  .3333333 
              .65    .21875 
              .66 .29166666 
              .67    .15625 
              .68  .2777778 
              .69      .125 
               .7 .29166666 
              .71  .4583333 
              .72 .20833333 
              .72 .29166666 
              .73 .08333334 
              .75 .23333333 
              .76 .08333334 
              .77  .4166667 
              .78       .25 
              .79  .4583333 
               .8  .3020833 
              .81 .22916667
              .82  .3333333 
              .83      .125 
    end
    My dataset includes the bin variable and the change variable, which is constructed as the change in x from x-1. For each bin I plotted the change variable in a connected graph:

    Code:
    gen low_limit = change if bin< 4750 & bin> 2000
    gen big_limit =  change if bin> 5250 
    
    twoway connected low_limit big_limit bin if bin != 5000 & bin > 2000, msymbol(D D) xlabel(2500(500)7500) xtitle("") xlabel(3000 "3000" 4000 "4000" 5000 "5000" 6000 "6000" 7000 "7000") ylabel(.30(.10)0.7) ytitle("change") xline(5000) color(red) title("")
    Now I want to add confidence intervals for each bin (drawn with rcap) by bootstrapping. I have looked at @Clyde Schechter's approach from #6:
    https://www.statalist.org/forums/for...dence-interval
    but I haven't been able to produce any useful result for my context with it. Either I get error messages like "invalid syntax" or "not valid command" or, in the case of graphs, they looked completely disastrous.

    Code:
    use temp_data, clear
    set seed 756124839
    
    levelsof bin_debt, local(bins)
    
    /*Bootstrapping*/
    foreach b of local bins {
            bootstrap _b, reps(1000) bca: mean change if bin == `b' 
            estat bootstrap
            bysort bin: egen upperConf = pctile(change), p(95)  
            bysort bin: egen lowerConf = pctile(change), p(5)
        }
    
    collapse (mean) change (p5) lowerConf = change (p95) upperConf = change, by(bin)
    
    /*Graph*/
    gen low_limit = change if bin< 4750 & bin> 2000
    gen big_limit =  change if bin> 5250 
    local xmin = r(min)       
    local xmax = r(max)     
    twoway  (rcap upperConf lowerConf bin, lcolor(gs6) lwidth(medthick)) ///  
            (connected low_limit big_limit bin, mcolor(navy)) ///  
            , xlabel(`bins', valuelabel) graphregion(color(white) icolor(white) lwidth(0)) xtitle("") ytitle("%{&Delta} change") ///
            legend(off) xscale(alt range(`=`xmin'-0.5' `=`xmax'+0.5')) bgcolor(white) title("`title'")
    I really hope someone can help me. If you need more information about the code or the dataset, I will try to provide it as well as I can.

  • #2
    Your example data has no values for bin above 0.83 and so the rule for creating bin given as

    Code:
    (2000/2500=2500)
    (2500/3000=3000)
    (3000/3500=3500)
    (3500/4000=4000)
    (4000/4500=4500)
    (4500/5000=5000)
    (5000/5500=5000)
    (5500/6000=5500)
    (6000/6500=6000)
    (6500/7000=6500)
    (7000/max=7000)
    is irrelevant to your example data, or vice versa.

    You also have different instructions for values equal to the bin limits 2500(500)7000 and your rules involve rounding up for low values and rounding down for high values.

    I am not a big fan of recode for reasons you need not care about, but I ran your rule against some invented data and was surprised that the rule named first applies when bins overlap. My guess was the opposite. (In fact, this is one of my reasons for disliking recode: quite what several sub-rules do when they clash is not transparent from the command.)

    The resulting bins look capriciously defined to me and certainly are not of equal width. Perhaps there is a rationale for this. Perhaps there was a mistake in writing down the bin limits as can easily happen with repetitive and tedious code.

    In https://journals.sagepub.com/doi/pdf...867X1801800311 and elsewhere I have pushed the use of floor() and ceil() functions as much easier machinery for binning systematically.

    Code:
    clear
    set obs 25
    gen x = 1750 + 250 * _n
    recode x (2000/2500=2500)(2500/3000=3000)(3000/3500=3500) ///
    (3500/4000=4000)(4000/4500=4500)(4500/5000=5000)(5000/5500=5000)   ///
    (5500/6000=5500)(6000/6500=6000)(6500/7000=6500)(7000/max=7000), gen(bin)
    
    list  , sepby(bin)
    
         +-------------+
         |    x    bin |
         |-------------|
      1. | 2000   2500 |
      2. | 2250   2500 |
      3. | 2500   2500 |
         |-------------|
      4. | 2750   3000 |
      5. | 3000   3000 |
         |-------------|
      6. | 3250   3500 |
      7. | 3500   3500 |
         |-------------|
      8. | 3750   4000 |
      9. | 4000   4000 |
         |-------------|
     10. | 4250   4500 |
     11. | 4500   4500 |
         |-------------|
     12. | 4750   5000 |
     13. | 5000   5000 |
     14. | 5250   5000 |
     15. | 5500   5000 |
         |-------------|
     16. | 5750   5500 |
     17. | 6000   5500 |
         |-------------|
     18. | 6250   6000 |
     19. | 6500   6000 |
         |-------------|
     20. | 6750   6500 |
     21. | 7000   6500 |
         |-------------|
     22. | 7250   7000 |
     23. | 7500   7000 |
     24. | 7750   7000 |
     25. | 8000   7000 |
         +-------------+
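As a point of comparison with the recode rule listed above, here is a minimal sketch of binning with ceil() and floor(). It assumes bins of equal width 500 and uses the same invented x values; the variable names bin_upper and bin_lower are hypothetical, not from the original question.

```stata
* Invented data, as in the recode demonstration above
clear
set obs 25
gen x = 1750 + 250 * _n

* Label each bin by its upper limit: intervals are (lower, upper]
gen bin_upper = 500 * ceil(x / 500)

* Label each bin by its lower limit: intervals are [lower, upper)
gen bin_lower = 500 * floor(x / 500)

list x bin_upper bin_lower, sepby(bin_upper)
```

With either function, what happens to values exactly on a bin limit follows from one arithmetic rule rather than from the order in which sub-rules are listed.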


    "seems to work nevertheless" is a better report than the opposite but does not convey any clarity or conviction about what you are trying to do when there are such disconnects between parts of your question.

    I didn't really get as far as your bootstrap code as it seems to me that there is not much point in discussing those results until you have clear and consistent rules for binning.

    If I understand correctly, you don't use bootstrap results for your 5 and 95% limits, but the empirical 5% and 95% percentiles for the raw data, so your graph won't show confidence intervals at all.
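To illustrate the difference, here is a hedged sketch of how bootstrap percentile intervals for the mean of change could be collected bin by bin and then drawn with rcap. It runs on invented data; the variable names g, mean, lower and upper and the choice of percentile intervals are assumptions for illustration, not the original poster's method. Each bin is isolated with preserve and keep before bootstrap is called, so that resampling happens within that bin only.

```stata
* Invented data standing in for the real bin and change variables
clear
set seed 756124839
set obs 300
gen bin = 2500 + 500 * floor(10 * runiform())
gen change = 0.25 + 0.1 * rnormal()

* Integer group identifier avoids precision problems in == comparisons
egen long g = group(bin)
summarize g, meanonly
local G = r(max)

* Results file: one row per bin
tempname handle
tempfile cis
postfile `handle' bin mean lower upper using `cis'

forvalues i = 1/`G' {
    preserve
    quietly keep if g == `i' & !missing(change)
    * A bin with a single observation carries no resampling variation
    if _N >= 2 {
        quietly bootstrap m = r(mean), reps(1000): summarize change
        matrix b  = e(b)
        matrix ci = e(ci_percentile)
        post `handle' (bin[1]) (b[1,1]) (ci[1,1]) (ci[2,1])
    }
    restore
}
postclose `handle'

* One observation per bin: observed mean and bootstrap limits
use `cis', clear
twoway (rcap upper lower bin) (connected mean bin), ///
    legend(off) xtitle("") ytitle("change")
```

Here lower and upper are limits for the bin mean taken from the bootstrap replicates, whereas pctile(change) with p(5) and p(95) describes the spread of the raw data.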

    Without commenting on everything I note that you carry out this calculation

    Code:
    bysort bin: egen upperConf = pctile(change), p(95)  
    bysort bin: egen lowerConf = pctile(change), p(5)
    again and again. If the results are interesting or useful you should take those commands out of the loop.

    Worse, I can't see how that code could possibly work, as second time around the loop the egen command should complain that a variable already exists.
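A small sketch on invented data shows the complaint and the remedy. The data and the scalar rc are hypothetical; the point is only that a second egen for the same new variable fails, so the commands belong outside any loop and should run once.

```stata
* Invented data
clear
set seed 2020
set obs 20
gen bin = mod(_n, 2)
gen change = runiform()

* Run once, outside any loop
bysort bin: egen upperConf = pctile(change), p(95)
bysort bin: egen lowerConf = pctile(change), p(5)

* A second attempt to create the same variable is an error;
* capture traps it so that the return code can be inspected
capture bysort bin: egen upperConf = pctile(change), p(95)
scalar rc = _rc
* 110 is the return code for "already defined"
display rc
```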

    It's hard to help much more here, unfortunately.

    Rightly or wrongly I sense a kind of bricolage in finding bits of code here and there and copying and pasting them together without really understanding what each code segment does or even whether it is consistent with others.

    The error reports aren't consistent either. As the foreach loop should fail on my diagnosis, I can't see that you would ever get to a graph at all, even a graph that looked bizarre.

    So I have to guess that you are showing some kind of amalgam of bits of code from different attempts, and hoping that we can sort it out for you. Unfortunately that is optimistic.

    What is your situation? If you are a student, you need support from your institution. Alternatively, if you want much more help from us, you need to pose one problem at a time.
    Last edited by Nick Cox; 18 Jul 2020, 05:51.



    • #3
      Hello,
      first of all, I apologize for the confusion and for the effort you had to make to understand the code. I created the bins to replicate a graph, to better understand tutorial exercises from uni. Since the semester is over here, contacting a course instructor isn't feasible right now. At first the bins looked similar to the way you constructed them. In an effort to replicate the graph better, I changed the bin limits (unfortunately I cannot recall with what code), which gave a better replication, but afterwards my bin data looked as you can see above. So I continued.

      I did try out code from the tutorial and from this website, and tried to put together the bits that I thought would apply to my dataset, because at one point I got flustered. The graphs I did get came from using forvalues at one point.

      I do get the general idea of bootstrapping, after reading the manual and the overview, but there are many different ways to code it, with commands I haven't been able to completely understand yet.

      I guess it would be better if I carefully read the manual entries for bootstrapping again. If you want, I can delete this thread, since it probably won't be solvable without a better data example from me, and due to time constraints I probably won't get further into this exercise.

      Thanks again for your time. In future I will try to ask questions using more understandable code than this.
      Last edited by Pierre Kind; 18 Jul 2020, 06:46.



      • #4
        OK, but you can't delete a thread, as we do explain at https://www.statalist.org/forums/help#closure
