bootstrap confidence interval for differences in median between two groups

sungwook kim

Join Date: Jul 2015
Posts: 62

28 Feb 2025, 08:58

Hello, I want to obtain a 95% bootstrap confidence interval for the difference in medians between two groups that I compare using 'rank sum' (Mann-Whitney U test). The code below doesn't work, and I am not sure how to recall the stored median values for each group (i.e. median1 and median2) after running the tabstat command. Any help is welcome.

Code:

ranksum cost_5year_incQ6_new_adj, by(CA_binary)

// Calculate group medians
tabstat cost_5year_incQ6_new_adj, by(CA_binary) statistics(median)

bootstrap (median1 - median2), reps(1000) seed(12345): ///
    tabstat cost_5year_incQ6_new_adj, by(CA_binary) statistics(median)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(cost_5year_incQ6_new_adj CA_binary)
  3924.18 0
 162.6696 0
 4928.931 0
12978.354 1
  984.852 0
   7820.9 0
  905.237 0
 4101.727 0
 10151.54 0
3420.6704 0
  147.312 1
 676.0845 0
  96.2562 0
 465.0858 0
 45124.98 0
 354.5962 0
 761.8788 0
 196.5998 0
  5965.98 0
        0 0
 16783.17 0
 113.6655 0
 131.7492 0
  2626.91 0
 2039.158 0
        0 0
4187.7344 0
 9091.484 0
  344.324 0
2027.5046 0
 5022.307 0
 65420.32 0
 135.3106 0
        0 0
   303.45 0
 7685.859 0
 12454.23 1
  624.994 0
 122.8616 0
 1212.978 1
 499.7064 0
 45853.01 0
   24.948 0
  22.1598 0
  660.852 0
        0 0
  2538.18 0
541.66724 0
 14546.33 0
  411.312 0
end

Thank you! BW Kim

Felix Bittmann

Join Date: Aug 2018
Posts: 627

28 Feb 2025, 09:25

This requires a custom program.

Code:

cap program drop meddiff
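program define meddiff, rclass
    // Median of the outcome in each group (r(p50) requires the detail option)
    sum cost_5year_incQ6_new_adj if CA_binary == 0, det
    local med1 = r(p50)
    sum cost_5year_incQ6_new_adj if CA_binary == 1, det
    local med2 = r(p50)
    // Return the difference: group 0 minus group 1
    return scalar meddiff = `med1' - `med2'
end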


bootstrap r(meddiff), seed(123): meddiff

Best wishes

(Stata 16.1 MP)


Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#3

28 Feb 2025, 09:43

Your command isn't working because -tabstat-, as called here, does not return anything for -bootstrap- to work with: it just sends its results to the screen.
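
On the question of recalling the medians after -tabstat-: a minimal sketch, assuming the example data from #1 are in memory. With the save option, -tabstat- leaves the by-group statistics in r() as matrices (r(Stat1), r(Stat2)), which you can inspect with -return list-. They are matrices rather than scalars, so this is handy for a one-off look at the difference but is not something to hand to -bootstrap- directly. The matrix names m1 and m2 are just illustrative.

Code:

tabstat cost_5year_incQ6_new_adj, by(CA_binary) statistics(median) save
return list

// r(Stat1) and r(Stat2) are 1 x 1 matrices, one per level of CA_binary
matrix m1 = r(Stat1)
matrix m2 = r(Stat2)
display "Median difference (group 1 - group 0): " m2[1,1] - m1[1,1]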

So you need to calculate the medians with a command that returns something, and you also need to calculate the difference between them in a program that returns it, and then use that program under -bootstrap-. Also, although I do not know exactly what you are trying to calculate or how you will interpret the results, in situations like yours people often want the bootstrap sampling to be stratified by the grouping variable (CA_binary).

Code:

capture program drop median_diff
program define median_diff, rclass sortpreserve
    syntax varname(numeric), by(varname)
    levelsof `by', local(by_values)
    capture assert `:word count `by_values'' == 2
    if c(rc) != 0 {
        display as error "Variable `by' must take on exactly 2 values."
        exit 9
    }
    forvalues i = 1/2 {
        centile `varlist' if `by' == `:word `i' of `by_values''
        local median`i' = r(c_1)
    }
    return scalar median_diff = `median2' - `median1'
    exit
end

bootstrap r(median_diff), strata(CA_binary) reps(1000) seed(12345): ///
    median_diff cost_5year_incQ6_new_adj, by(CA_binary)

If you don't want stratified bootstrap sampling, remove the -strata(CA_binary)- option.

Added: Crossed with #2. The solution there is specific to the variables mentioned in your post. The code here is more general, and you can use it with any numeric variable in place of cost_5year_incQ6_new_adj, and any dichotomous grouping variable in place of CA_binary. In principle, however, the two solutions are the same. Well, almost: #2 does not call for stratified bootstrap sampling.
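
One further note, since the original question asked for a 95% bootstrap confidence interval: by default -bootstrap- reports a normal-approximation interval. With a skewed outcome you may prefer the percentile or bias-corrected interval, which -estat bootstrap- can display from the same replications. A small sketch, to be run immediately after the -bootstrap- command above; which interval type suits your analysis is your call.

Code:

// Percentile and bias-corrected 95% intervals from the stored replications
estat bootstrap, percentile bc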

Last edited by Clyde Schechter; 28 Feb 2025, 09:45.
Felix Bittmann

Join Date: Aug 2018

Posts: 627
#4

28 Feb 2025, 11:16

Thanks to Clyde for providing this excellent and highly reusable program. One additional difference: my version uses summarize, while the second solution uses centile to compute the median. Both give the same median, but I have the impression that summarize is faster for this application. However, unless you have very large datasets or need a very large number of resamples, this should not make a huge difference.
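
If you want to check the speed on your own machine, here is a rough sketch using -timer-, assuming the example data from #1 are in memory. The 1000 repetitions are an arbitrary choice, and the timings will depend on your data and Stata flavor.

Code:

timer clear

// Timer 1: median via summarize, detail
timer on 1
forvalues i = 1/1000 {
    quietly summarize cost_5year_incQ6_new_adj if CA_binary == 0, detail
}
timer off 1

// Timer 2: median via centile
timer on 2
forvalues i = 1/1000 {
    quietly centile cost_5year_incQ6_new_adj if CA_binary == 0
}
timer off 2

timer list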

Best wishes

(Stata 16.1 MP)
sungwook kim

Join Date: Jul 2015

Posts: 62
#5

28 Feb 2025, 14:32

Dear Clyde and Felix,

Thank you very much indeed for your help. It is great! In answer to Clyde's question, I need to assess whether the costs of the treatment group differ significantly from the costs of the control group. Because the cost distribution is skewed, I was considering the median difference rather than the mean difference. Thank you again! BW Kim
Felix Bittmann

Join Date: Aug 2018

Posts: 627
#6

28 Feb 2025, 14:39

For this aim, a quantile regression also works:

Code:

qreg depvar i.group
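
With the variables from this thread, that could look like the sketch below (assuming the example data from #1 are in memory). The coefficient on 1.CA_binary estimates the median of group 1 minus the median of group 0. Because the median is not unique when a group has an even number of observations, the point estimate may differ slightly from the difference of the two -summarize- medians. The bootstrap prefix here is optional and only shown to mirror the earlier posts.

Code:

// Median regression: coefficient on 1.CA_binary is the median difference
qreg cost_5year_incQ6_new_adj i.CA_binary

// Optional: stratified bootstrap of the same model, percentile interval
bootstrap, strata(CA_binary) reps(1000) seed(12345): ///
    qreg cost_5year_incQ6_new_adj i.CA_binary
estat bootstrap, percentile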

Best wishes

(Stata 16.1 MP)