Bootstrapping Confidence Interval for bins/groups

Pierre Kind

Join Date: Jul 2020
Posts: 10

18 Jul 2020, 04:36

Hello everyone,

First my dataset: The bins were created by recoding my initial x for a specific date:

Code:

recode x*2012xxxx (2000/2500=2500)(2500/3000=3000)(3000/3500=3500)(3500/4000=4000)(4000/4500=4500)(4500/5000=5000)(5000/5500=5000)(5500/6000=5500)(6000/6500=6000)(6500/7000=6500)(7000/max=7000), gen(bin)

When I run my whole code and use dataex, the data look like this. I don't know if this is useful for you, since the bins have these values and not the ones above, but my code seems to work nevertheless.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(bin change)
            0 .08722616 
1.9073486e-06 .20833333 
.000014781952 .08333334 
.000061035156      .125 
          .01 .28812057 
    .01999919 .08333334 
          .02  .2240991 
          .03 .19109195 
     .0390625  .6666667 
          .04     .1875 
    .04999995      .125 
          .05  .2847222 
          .05  .2872807 
          .06  .2847222 
          .06      .125 
          .07 .23214285 
          .08 .20833333 
          .09 .28030303 
           .1  .3333333 
          .11 .29166666
          .12 .24358974 
          .13 .23958333 
          .14  .2857143 
          .15  .3090278 
          .16 .24264705 
          .17 .21666667 
          .18 .19791667 
          .19 .25462964 
           .2       .25 
           .2 .29166666 
          .21 .21666667 
          .22 .26190478 
          .23 .15277778 
          .24 .20833333 
          .25 .28333333 
          .26 .23148148 
    .26999998 .08333334 
          .27  .3166667 
          .28 .20833333 
          .29 .27314815 
           .3       .25 
           .3 .19791667 
          .31 .30555555
          .32  .3583333 
          .33  .4583333 
          .33 .27604166 
          .34 .22222222 
          .35 .29166666 
          .36 .08333334 
          .36 .25833333 
          .37 .20833333 
          .38 .22916667 
          .39 .20833333 
           .4       .25 
           .4  .2559524 
          .41  .2638889
          .42 .11111111 
          .43 .22916667 
          .44 .29166666 
          .45 .30555555 
          .45 .08333334
          .46 .20833333
          .47 .18333334
          .48 .22916667
          .49 .30555555
           .5       .25 
          .51 .20833333 
          .52 .22916667 
          .53    .40625 
          .54 .08333334
          .55 .20833333
          .56        .2 
          .57 .22916667 
          .58 .15833333 
          .59        .2 
           .6 .20833333 
           .6  .3020833 
          .61        .5 
          .62 .15833333 
          .63  .4166667 
          .64  .3333333 
          .65    .21875 
          .66 .29166666 
          .67    .15625 
          .68  .2777778 
          .69      .125 
           .7 .29166666 
          .71  .4583333 
          .72 .20833333 
          .72 .29166666 
          .73 .08333334 
          .75 .23333333 
          .76 .08333334 
          .77  .4166667 
          .78       .25 
          .79  .4583333 
           .8  .3020833 
          .81 .22916667
          .82  .3333333 
          .83      .125 
end

My dataset includes the bin variable and the change variable, which is constructed as the change to x from x-1. For each bin I plotted the change variable in a connected graph:

Code:

gen low_limit = change if bin< 4750 & bin> 2000
gen big_limit =  change if bin> 5250 

twoway connected low_limit big_limit bin if bin != 5000 & bin > 2000, msymbol(D D) xlabel(2500(500)7500) xtitle("") xlabel(3000 "3000" 4000 "4000" 5000 "5000" 6000 "6000" 7000 "7000") ylabel(.30(.10)0.7) ytitle("change") xline(5000) color(red) title("")

Now I want to add confidence intervals for each bin (drawn with rcap) by bootstrapping. I have looked at Clyde Schechter's approach in #6 of
https://www.statalist.org/forums/for...dence-interval
but I haven't been able to produce any useful result for my context with it. Either I get error messages like "invalid syntax" or "not valid command" or, in the case of graphs, they looked completely disastrous.

Code:

use temp_data, clear
set seed 756124839

levelsof bin_debt, local(bins)

/*Bootstrapping*/
foreach b of local bins {
        bootstrap _b, reps(1000) bca: mean change if bin == `b' 
        estat bootstrap
        bysort bin: egen upperConf = pctile(change), p(95)  
        bysort bin: egen lowerConf = pctile(change), p(5)
    }

collapse (mean) change (p5) lowerConf = change (p95) upperConf = change, by(bin)

/*Graph*/
gen low_limit = change if bin< 4750 & bin> 2000
gen big_limit =  change if bin> 5250 
local xmin = r(min)       
local xmax = r(max)     
twoway  (rcap upperConf lowerConf bin, lcolor(gs6) lwidth(medthick)) ///  
        (connected low_limit big_limit bin, mcolor(navy)) ///  
        , xlabel(`bins', valuelabel) graphregion(color(white) icolor(white) lwidth(0)) xtitle("") ytitle("%{&Delta} change") ///
        legend(off) xscale(alt range(`=`xmin'-0.5' `=`xmax'+0.5')) bgcolor(white) title("`title'")

I really hope someone can help me. If you need more information about the code or the dataset, I will try to provide it as well as I can.


Nick Cox

Join Date: Mar 2014

Posts: 35444
#2

18 Jul 2020, 05:37

Your example data has no values for bin above 0.83 and so the rule for creating bin given as

Code:

(2000/2500=2500) (2500/3000=3000) (3000/3500=3500) (3500/4000=4000) (4000/4500=4500) (4500/5000=5000) (5000/5500=5000) (5500/6000=5500) (6000/6500=6000) (6500/7000=6500) (7000/max=7000)

is irrelevant to your example data, or vice versa.

You also have different instructions for values equal to the bin limits 2500(500)7000 and your rules involve rounding up for low values and rounding down for high values.

I am not a big fan of recode for reasons you need not care about, but I ran your rule against some invented data and was surprised that the rule named first applies when bins overlap. My guess was the opposite. (In fact, this is one of my reasons for disliking recode: quite what several sub-rules do when they clash is not transparent from the command.)

The resulting bins look capriciously defined to me and certainly are not of equal width. Perhaps there is a rationale for this. Perhaps there was a mistake in writing down the bin limits as can easily happen with repetitive and tedious code.

In https://journals.sagepub.com/doi/pdf...867X1801800311 and elsewhere I have pushed the use of floor() and ceil() functions as much easier machinery for binning systematically.

Code:

clear
set obs 25
gen x = 1750 + 250 * _n
recode x (2000/2500=2500)(2500/3000=3000)(3000/3500=3500) ///
    (3500/4000=4000)(4000/4500=4500)(4500/5000=5000)(5000/5500=5000) ///
    (5500/6000=5500)(6000/6500=6000)(6500/7000=6500)(7000/max=7000), gen(bin)

list , sepby(bin)

     +-------------+
     |    x    bin |
     |-------------|
  1. | 2000   2500 |
  2. | 2250   2500 |
  3. | 2500   2500 |
     |-------------|
  4. | 2750   3000 |
  5. | 3000   3000 |
     |-------------|
  6. | 3250   3500 |
  7. | 3500   3500 |
     |-------------|
  8. | 3750   4000 |
  9. | 4000   4000 |
     |-------------|
 10. | 4250   4500 |
 11. | 4500   4500 |
     |-------------|
 12. | 4750   5000 |
 13. | 5000   5000 |
 14. | 5250   5000 |
 15. | 5500   5000 |
     |-------------|
 16. | 5750   5500 |
 17. | 6000   5500 |
     |-------------|
 18. | 6250   6000 |
 19. | 6500   6000 |
     |-------------|
 20. | 6750   6500 |
 21. | 7000   6500 |
     |-------------|
 22. | 7250   7000 |
 23. | 7500   7000 |
 24. | 7750   7000 |
 25. | 8000   7000 |
     +-------------+
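For comparison, here is a minimal sketch of the floor() and ceil() approach on the same kind of invented data. The bin width of 500 and the variable names bin_up and bin_down are assumptions for illustration, not the rule the original poster intended.

```stata
* Sketch only: equal-width bins of 500 (an assumed width)
clear
set obs 25
gen x = 1750 + 250 * _n

* label each bin by its upper limit: 2000 -> 2000, 2250 -> 2500, 2500 -> 2500
gen bin_up = 500 * ceil(x / 500)

* label each bin by its lower limit: 2000 -> 2000, 2250 -> 2000, 2500 -> 2500
gen bin_down = 500 * floor(x / 500)

list, sepby(bin_up)
```

With either function, what happens to values exactly on a bin limit is visible from the single expression, and every bin has the same width.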

"seems to work nevertheless" is a better report than the opposite but does not convey any clarity or conviction about what you are trying to do when there are such disconnects between parts of your question.

I didn't really get as far as your bootstrap code as it seems to me that there is not much point in discussing those results until you have clear and consistent rules for binning.

If I understand correctly, you don't use bootstrap results for your 5 and 95% limits, but the empirical 5% and 95% percentiles for the raw data, so your graph won't show confidence intervals at all.
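To show what limits taken from the bootstrap itself might look like, here is a minimal sketch, assuming the variables bin and change from the question. The names g, first, bmean, lowerConf and upperConf are introduced here for illustration. It uses percentile intervals rather than the bca intervals requested in the question, it has not been run against the real data, and bins with only one or two observations cannot give meaningful intervals.

```stata
* Sketch only: percentile bootstrap limits for the mean of change in each bin
set seed 756124839

egen long g = group(bin)        // integer ids avoid testing floats for equality
egen byte first = tag(g)        // marks one observation per bin for plotting
gen double bmean = .
gen double lowerConf = .
gen double upperConf = .

summarize g, meanonly
local G = r(max)

forvalues i = 1/`G' {
    * resample within bin i only; level(90) gives 5% and 95% limits
    capture bootstrap m = r(mean), reps(1000) level(90): summarize change if g == `i'
    if _rc continue             // skip any bin where the bootstrap fails
    matrix b  = e(b)
    matrix ci = e(ci_percentile)
    quietly replace bmean     = b[1,1]  if g == `i'
    quietly replace lowerConf = ci[1,1] if g == `i'
    quietly replace upperConf = ci[2,1] if g == `i'
}

twoway (rcap upperConf lowerConf bin if first) ///
       (connected bmean bin if first, sort), legend(off) ytitle("change")
```

The variables are created once before the loop and filled with replace inside it, so nothing is generated twice.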

Without commenting on everything I note that you carry out this calculation

Code:

bysort bin: egen upperConf = pctile(change), p(95)
bysort bin: egen lowerConf = pctile(change), p(5)

again and again. If the results are interesting or useful you should take those commands out of the loop.

Worse, I can't see how that code could possibly work, as second time around the loop the egen command should complain that a variable already exists.

It's hard to help much more here, unfortunately.

Rightly or wrongly I sense a kind of bricolage in finding bits of code here and there and copying and pasting them together without really understanding what each code segment does or even whether it is consistent with others.

The error reports aren't consistent either. As the foreach loop should fail on my diagnosis, I can't see that you would ever get to a graph at all, even a graph that looked bizarre.

So I have to guess that you are showing some kind of amalgam of bits of code from different attempts, and hoping that we can sort it out for you. Unfortunately that is optimistic.

What is your situation? If you are a student, you need support from your institution. Alternatively, if you want much more help from us, you need to pose one problem at a time.

Last edited by Nick Cox; 18 Jul 2020, 05:51.
Pierre Kind

Join Date: Jul 2020

Posts: 10
#3

18 Jul 2020, 06:43

Hello,
first of all, I apologize for the confusion and for the effort you had to make to understand the code. I created the bins to replicate a graph, to better understand tutorial exercises from uni. Since the semester is over where I am, contacting an instructor of the course isn't feasible right now. At first the bins looked similar to the way you constructed them. In an effort to replicate the graph better, I changed the bin limits (unfortunately I cannot recall with what code), which gave a better replication, but afterwards my bin data looked as you can see above. So I continued.

I did try out code from the tutorial and from this website, and tried to put together the bits that I thought would apply to my dataset, because at one point I got flustered. The graphs I got came from using forvalues at one point.

I do get the general idea of bootstrapping after reading the manual and the overview, but there are many different ways to code it, with commands I haven't been able to completely understand yet.

I guess it would be better if I carefully read the manual entries for bootstrapping again. If you want, I can delete this thread, since it probably won't be solvable without a better-written data example from me, and due to time constraints I probably won't pursue this exercise further.

Thanks again for your time. In future I will try to ask questions using more understandable code than this.

Last edited by Pierre Kind; 18 Jul 2020, 06:46.
Nick Cox

Join Date: Mar 2014

Posts: 35444
#4

18 Jul 2020, 08:25

OK, but you can't delete a thread, as we do explain at https://www.statalist.org/forums/help#closure