bootstrap cinfidence interval for differences in median between two groups

sungwook kim

Join Date: Jul 2015
Posts: 62

bootstrap cinfidence interval for differences in median between two groups

28 Feb 2025, 07:58

Hello I want to get 95% bootstrap confidence interval for differences in median between two groups, using 'rank sum' (mann-whitney U test). The code doesn't work and not sure how to recall the stored median values for each group (i.e. median1 and median 2) after running tabstat command. Any help is welcome.

Code:

ranksum cost_5year_incQ6_new_adj, by(CA_binary)

// Calculate group medians
tabstat cost_5year_incQ6_new_adj, by(CA_binary) statistics(median)

bootstrap (median1 - median2), reps(1000) seed(12345): ///
    tabstat cost_5year_incQ6_new_adj, by(CA_binary) statistics(median)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(cost_5year_incQ6_new_adj CA_binary)
  3924.18 0
 162.6696 0
 4928.931 0
12978.354 1
  984.852 0
   7820.9 0
  905.237 0
 4101.727 0
 10151.54 0
3420.6704 0
  147.312 1
 676.0845 0
  96.2562 0
 465.0858 0
 45124.98 0
 354.5962 0
 761.8788 0
 196.5998 0
  5965.98 0
        0 0
 16783.17 0
 113.6655 0
 131.7492 0
  2626.91 0
 2039.158 0
        0 0
4187.7344 0
 9091.484 0
  344.324 0
2027.5046 0
 5022.307 0
 65420.32 0
 135.3106 0
        0 0
   303.45 0
 7685.859 0
 12454.23 1
  624.994 0
 122.8616 0
 1212.978 1
 499.7064 0
 45853.01 0
   24.948 0
  22.1598 0
  660.852 0
        0 0
  2538.18 0
541.66724 0
 14546.33 0
  411.312 0
end

Thank you! BW Kim

Tags: None

Felix Bittmann

Join Date: Aug 2018
Posts: 662

28 Feb 2025, 08:25

This requires a custom program.

Code:

cap program drop meddiff
program define meddiff, rclass
    sum cost_5year_incQ6_new_adj if CA_binary == 0, det
    local med1 = r(p50)
    sum cost_5year_incQ6_new_adj if CA_binary == 1, det
    local med2 = r(p50)
    return scalar meddiff = `med1' - `med2'
end


bootstrap r(meddiff), seed(123): meddiff

Best wishes

(Stata 16.1 MP)

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#3

28 Feb 2025, 08:43

Your command isn't working because -tabstat- does not return anything for -bootstrap- to work with: it just sends its results to the screen.

So you need to calculate the mediasn with a command that returns something, and you also need to calculate the difference between them in a program that returns that and then use that under -bootstrap:-. Also, although I do not know exactly what you are trying to calculate and how you will interpret the results, but in situations like yours, people often want the bootstrap sampling to be stratified by the grouping variable (CA_binary).

Code:

capture program drop median_diff program define median_diff, rclass sortpreserve syntax varname(numeric), by(varname) levelsof `by', local(by_values) capture assert `:word count `by_values'' == 2 if c(rc) != 0 { display as error "Variable `by' must take on exactly 2 values." exit 9 } forvalues i = 1/2 { centile `varlist' if `by' == `:word `i' of `by_values'' local median`i' = r(c_1) } return scalar median_diff = `median2' - `median1' exit end bootstrap r(median_diff), strata(CA_binary) reps(1000) seed(12345): /// median_diff cost_5year_incQ6_new_adj, by(CA_binary)

If you don't want stratified bootstrap sampling, remove the -strata(CA_binary)- option.

Added: Crossed with #2. The solution there is specific to the variables mentioned in your post. The code here is more general, and you can use it with any numeric variable in place of cost_5year_incQ6_new_adj, and any dichotomous grouping variable in place of CA_binary. In principle, however, the two solutions are the same. Well, almost: #2 does not call for stratified bootstrap sampling.

Last edited by Clyde Schechter; 28 Feb 2025, 08:45.
1 like
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 662
#4

28 Feb 2025, 10:16

Thanks to Clyde for providing this excellent and highly reusable program. One additional difference: my version used summarize while the second solution uses centile to compute the median. Both work the same but I have the impression that summarize is faster for this application. However, unless you have very large datasets or need many many resamples, this should not make a huge difference.

Best wishes

(Stata 16.1 MP)
1 like
Comment
sungwook kim

Join Date: Jul 2015

Posts: 62
#5

28 Feb 2025, 13:32

Dear Clyde and Felix,

Thank you very much indeed for your help. It is great! For Clyde's question, I need to estimate whether costs of treatment group is significantly different from the costs of control group. Due to the skewed distribution of costs, I was considering using median difference rather than mean difference. Thank you again! BW Kim
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 662
#6

28 Feb 2025, 13:39

For this aim, a quantile regression also works:

Code:

qreg depvar i.group

Best wishes

(Stata 16.1 MP)
1 like
Comment
Diya Ganguly

Join Date: Mar 2025

Posts: 4
#7

12 Mar 2025, 12:43

Hello,

I am working on an analysis in Stata and I am having difficulty calculating bootstrapped median values with confidence intervals for multiple grouping variables.

Specifically, I have data where the variable delay_weeks takes on the values 25, 52, and 103. I am trying to compute bootstrapped median values of the variable cons, which is defined as:
cons = sooner_choice / (sooner_choice + later_choice) For each combination of budget_number and delay_weeks, I would like to compute the bootstrapped median value of cons and generate confidence intervals for these values. My goal is to plot these median values against the variable pratio (gross interest rate).

Here are the key variables in my dataset:
budget_number (grouping variable 1)

delay_weeks (grouping variable 2, with values 25, 52, and 103)

cons (the variable to bootstrap)

pratio (gross interest rate)

type (participant type, coded as 0 and 1)

I’ve tried a few approaches, but I’ve been struggling to correctly group the data and calculate the bootstrapped medians with confidence intervals. I’d appreciate any help or suggestions on how to approach this in Stata.

I am attaching a sample dataset to provide further clarity on the structure of the data.

Thank you in advance for your assistance!

Attached Files

Sample_data_1.xlsx (11.7 KB, 1 view)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#8

12 Mar 2025, 13:04

Welcome to Statalist. Thank you for thinking of showing example data. But please read the Forum FAQ, as everybody is asked to do before posting, for excellent advice on how to get the most from your Statalist experience. Among the things you will learn there is that attachments are generally discouraged, and that the most helpful way to show example data is from your Stata data set using the -dataex- command. Please post back with your example data given in that way.

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1119
#9

12 Mar 2025, 13:47

I did not notice this thread until today (Mar 12), and would like to add a couple of comments.

Comment on #1: The Wilcoxon-Mann-Whitney test does not test for a difference between medians. See Conroy (2012), for example.
Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test?. The Stata Journal, 12(2), 182-190.
Comment on #6: To get bootstrapped CIs, I think you would need to use either -sqreg- or -bsqreg-. See page 2 of the online help for -qreg-. For example:

Code:

// Modification of Example 5 in the help for -qreg- found here: // https://www.stata.com/manuals/rqreg.pdf use https://www.stata-press.com/data/r18/auto, clear // First, use -qreg- to estimate conditional median prices by foreign qreg price i.foreign // Now use -sqreg- set seed 1001 sqreg price i.foreign, reps(1000) // Finally, use -bsqreg- set seed 1001 bsqreg price i.foreign, reps(1000)

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
1 like
Comment

Diya Ganguly

Join Date: Mar 2025
Posts: 4

#10

12 Mar 2025, 19:54

Thank you for your prompt response Clyde and apologies for missing the dataex command in FAQ. Below please find the data generated from the command, I am using version STATA BE/17.0

Code:

* Example generated by -dataex-. For more info,    type help dataex
clear
input byte type int budget_number double(pratio    sooner_choice) int(later_choice    sdelay_weeks)    float    cons1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1   0 500 26 1
1 101 1   0 500 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
0 101 1 300 200 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 200 300 26 1
0 101 1 500   0 26 1
1 101 1 200 300 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1   0 500 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
1 101 1 200 300 26 1
1 101 1 400 100 26 1
1 101 1 100 400 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 300 200 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 400 100 26 1
0 101 1 200 300 26 1
1 101 1 500   0 26 1
0 101 1 200 300 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1   0 500 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 200 300 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 500   0 26 1
0 101 1 200 300 26 1
0 101 1   0 500 26 1
1 101 1 500   0 26 1
1 101 1 300 200 26 1
1 101 1 500   0 26 1
1 101 1 500   0 26 1
0 101 1 300 200 26 1
1 101 1 500   0 26 1
1 101 1   0 500 26 1
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#11

12 Mar 2025, 20:52

OK. Your data example is a bit odd. In your original post you say that the variable cons is the one whose median you want to bootstrap and that it is defined as sooner_choice/(sooner_choice+later_choice). But your example data has no variable by that name. The closest thing to that is a variable cons1, but it is only sometimes equal to that expression, and its value is constantly 1, so its median is 1 with standard error 0 without any fancy calculations. Moreover you said youwant to group the analysis by budget_number and sdelay_weeks, but you provide only a single value of each of those, so there is no grouping to be done in the example. Finally, the distribution of cons, calculated by your formula, is very lopsided, with 75% of the values being 1, so the median is going to be 1 in pretty much every bootstrap sample you can draw with any reasonable probability, so the bootstrap itself is pretty vacuous even here.

Nevertheless, assuming your full data set is more suitable to the problem, I think the following is the simplest way to code a solution for this:

Code:

capture program drop one_centile_bootstrap program define one_centile_bootstrap bsqreg cons, quantile(.5) gen centile = r(table)["b", 1] gen std_err = r(table)["se", 1] exit end set seed 1234 runby one_centile_bootstrap, by(budget_number sdelay_weeks)

Notes: You can set the seed to any integer value you like; 1234 is just a simple example.
-runby- is written by Robert Picard and me, and is available from SSC.

If in your full data set, the distribution of cons is also so lopsided that calculating its median is a waste of time, you might choose to instead do this with some other quantile of the distribution. For example, in the example data, the 20th percentile produces some numbers that are non-trivial. You can just change the number in the -quantile()- option of bsqreg accordingly.

Last edited by Clyde Schechter; 12 Mar 2025, 20:57.
Comment
Diya Ganguly

Join Date: Mar 2025

Posts: 4
#12

14 Mar 2025, 10:38

Hi Clyde,

Thank you for sharing the code—I really appreciate it. I ran it throughout yesterday, but it was still running this morning, which I suspect may be due to the version of Stata I’m using.

Apologies for the issue with the data. Yes, cons1 is the variable I initially described; its value is always 1 because participants frequently chose corner solutions, consistently opting for a specific payment bundle (the later choice). Majority of the cons1 are reported as 0 and 1. Stata selected the first 100 observations, which is why you only saw the 26-week choices.

Based on your suggestion, it seems that running bootstrapped means instead of medians might be more appropriate. I’d appreciate any advice you have on this.

Thanks again
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#13

14 Mar 2025, 11:34

Based on your suggestion, it seems that running bootstrapped means instead of medians might be more appropriate. I’d appreciate any advice you have on this.

It really depends on a) the distribution of values in the full data set, not just the first 100 observations, and, b) how you will be using the results (what questions you want to answer with them).
Comment
Diya Ganguly

Join Date: Mar 2025

Posts: 4
#14

14 Mar 2025, 13:50

Thanks Clyde, my goal is to plot these meadian/mean values against the variable pratio (gross interest rate) for type = 0 and type = 1, separately for each delay_weeks (values 25, 52, and 103)and graphically present how they are or are not different from each other.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#15

14 Mar 2025, 14:32

my goal is to plot these meadian/mean values against the variable pratio (gross interest rate) for type = 0 and type = 1, separately for each delay_weeks (values 25, 52, and 103)and graphically present how they are or are not different from each other.

If that is really your goal, then I would suggest doing it both ways and going with whichever produces the more aesthetically pleasing graph. But I hope that isn't your goal.

I would like to think that creating graphs is not your goal, but is a possible action that will help you achieve your goal. I want to believe that your desire to make a comparison between some aspect of the distributions is based on something more substantive: that these numbers have some meaning in the real world and that differences between them could reasonably influence how we understand what causes these numbers to be what they are, or perhaps even influence decisions we make based on them. That is what I had in mind when I spoke of "what questions you want to answer with them." Can you elaborate on that?

Last edited by Clyde Schechter; 14 Mar 2025, 14:38.
1 like
Comment

Announcement

bootstrap cinfidence interval for differences in median between two groups

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment