  • Using bootstrapping to calculate ratios over sample subsets.

    Hi all,

    I'm trying to use bootstrapping to calculate a confidence interval for the ratio of two different predicted values, by a third variable. The data below aren't exactly like my data, but they come close enough to be practical in this instance, I think. In this data, I have observations at the person level. I am trying to calculate the confidence interval for the ratio predicted_height/predicted_weight by race. So, ideally, for each race I would have variables holding the lower and upper bounds of the confidence interval for the ratio of the average predicted_height to the average predicted_weight. I would be fine collapsing the data by race at some point, but I would prefer to have the confidence interval listed at the observation level as well (just denoting for each patient the confidence interval associated with their race).


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(race predicted_weight predicted_height)
    1 67 180
    1 65 175
    1 63 170
    2 62 173
    2 52 169
    2 51 180
    2 77 155
    3 63 170
    3 29 190
    3 18 200
    end


    Code:
    program define calc_ratio
        * Assuming the original dataset is loaded and the program is called in the context of bootstrap
        egen mean_predicted_weight = mean(predicted_weight), by(race)
        egen mean_predicted_height = mean(predicted_height), by(race)
        gen ratio = mean_predicted_height / mean_predicted_weight
    end
    
    * Run the bootstrap command
    bootstrap r(ratio), reps(1000) seed(12345): calc_ratio
    
    * Calculate the confidence intervals
    gen lower_ci = r(ratio) - invttail(r(df_r), 0.025)*r(se)
    gen upper_ci = r(ratio) + invttail(r(df_r), 0.025)*r(se)
    When I run the program itself, I get the actual averaged predicted_height / averaged predicted_weight, as I would hope. However, when I run the bootstrapping command, I get an "Evaluation to missing" error. I am not 100% sure why, except that perhaps in the course of the bootstrapping some values are generated as missing because a resample contains no observations for a specific race?

    As always, thanks in advance!

    Using Stata 18.0 on OS14.0

  • #2
    OK, lots of problems here.

    First and foremost, the problem that is giving you the error message is that r(ratio) is always missing. It's always missing because you never define it in program calc_ratio. And there are two reasons, apart perhaps from not realizing that you needed to, why you can't define it. The first is that you didn't declare the program to be -rclass-, and the second is that there is no logical way to define it. Your program calculates 3 different ratios: one for each race. So which one of those would you choose to return as "the" ratio? Or would you average them in some way? To me, what seems most sensible is to return 3 different ratios, one for each race.

    Once you clear up those conceptual problems, there are other technical issues. The first time you call program calc_ratio, it creates new variables mean_predicted_weight, mean_predicted_height, and ratio. The second time calc_ratio gets called, those variables already exist, so -egen- and -gen- are all blocked. So you need to either drop those variables after you return the results, or, a more Stata-ish solution, create those as tempvars instead of regular variables--then they will make themselves disappear at the right time.
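
    A minimal sketch of the tempvar point, using hypothetical toy programs (make_mean_perm, make_mean_temp, and the variable x are made up for illustration and are not part of the original code). The first program creates a regular variable and so fails when called a second time; the second uses a tempvar and can be called repeatedly, which is what -bootstrap- needs:

    ```stata
    * Toy data: x = 1, 2, ..., 5 (illustrative only)
    clear
    set obs 5
    gen x = _n

    * Version 1: creates a permanent variable, so a second call is blocked
    capture program drop make_mean_perm
    program define make_mean_perm
        egen mean_x = mean(x)
    end

    * Version 2: uses a tempvar, which is dropped when the program exits
    capture program drop make_mean_temp
    program define make_mean_temp, rclass
        tempvar mean_x
        egen `mean_x' = mean(x)
        summarize `mean_x', meanonly
        return scalar mean = r(mean)
    end

    make_mean_perm
    capture noisily make_mean_perm   // fails: mean_x already exists
    display "return code of second call: " _rc

    make_mean_temp
    make_mean_temp                   // works every time
    display r(mean)
    ```

    Dropping the variables at the end of the program would also work, but the tempvar version cleans up even if the program exits early with an error.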

    Then there are those last two commands -generate-ing lower_ci and upper_ci. These make no sense for two reasons: first, we have already recognized that there is no such thing as r(ratio) in the first place. But even after splitting it into r(ratio1), r(ratio2), and r(ratio3), there is still a problem, because everything that calc_ratio returns in -r()- gets clobbered by -bootstrap- and is no longer accessible. Fortunately, you don't even need to compute these confidence intervals yourself anyway. -bootstrap- has already done that for you and has saved them, along with numerous other statistics that may be of interest to you, in a matrix r(table). If you grab r(table) and store it as a real matrix immediately after the bootstrap command, you can then pull the statistics you want from it, or, alternatively, just leave them in that matrix and access them later directly from that matrix where and when you need them.

    So here's what I come up with:
    Code:
    program define calc_ratio, rclass
        * Assuming the original dataset is loaded and the program is called in the context of bootstrap
        tempvar mean_predicted_height mean_predicted_weight ratio
        egen `mean_predicted_weight' = mean(predicted_weight), by(race)
        egen `mean_predicted_height' = mean(predicted_height), by(race)
        gen `ratio' = `mean_predicted_height'/`mean_predicted_weight'
        levelsof race, local(races)
        foreach r of local races {
            summ `ratio' if race == `r', meanonly
            return scalar ratio`r' = r(mean)
        }  
    end
    
    * Run the bootstrap command
    bootstrap ratio1 = r(ratio1) ratio2 = r(ratio2) ratio3 = r(ratio3), ///
        reps(1000) strata(race) seed(12345): calc_ratio
    matrix M = r(table)
    matrix list M
    Added: The -strata(race)- option is not, strictly speaking, essential here. If you omit it, however, then the race-specific sample sizes will vary from one bootstrap replication to another, and there may be some samples in which some race is not represented at all. So it's a pretty inefficient way to go about it, and I recommend keeping -strata(race)- in.
    Last edited by Clyde Schechter; 04 Mar 2024, 18:57.


    • #3
      I'd add one small thing to Clyde's thorough answer, namely that -bootstrap- by default produces normal-theory CIs. Given that statistics that are ratios commonly don't have Gaussian distributions, I'd prefer to use the percentile-based bootstrap CIs, which don't rely on the normality of the sampling distribution. These can be obtained by putting the command
      Code:
      estat bootstrap, percentile
      right after the -bootstrap- command. The resulting CIs might turn out to not differ much from the normal-theory CIs, but I'd consider them more valid, particularly if the subset samples are small.
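
      As a hedged illustration on a different dataset: the sketch below uses the auto data that ships with Stata and a hypothetical program mean_ratio (neither is part of this thread's data or code). It requests percentile intervals explicitly with the percentile option and then copies them from e(ci_percentile), which, to my understanding, -bootstrap- stores alongside e(ci_normal):

      ```stata
      * Illustrative data shipped with Stata (an assumption, not the thread's data)
      sysuse auto, clear

      * Hypothetical program: ratio of mean weight to mean length
      capture program drop mean_ratio
      program define mean_ratio, rclass
          tempname mean_weight
          summarize weight, meanonly
          scalar `mean_weight' = r(mean)
          summarize length, meanonly
          return scalar ratio = `mean_weight' / r(mean)
      end

      bootstrap ratio = r(ratio), reps(200) seed(12345): mean_ratio

      * Percentile-based CIs instead of the default normal-theory ones
      estat bootstrap, percentile

      * Keep the percentile limits for later use (row 1 = lower, row 2 = upper)
      matrix P = e(ci_percentile)
      matrix list P
      ```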


      • #4
        Thank you both! This worked excellently.

        Follow-up question, though. What if I had, say, a ton of races (so many that it would be too difficult to write out all the ratios), or a different number of races in different datasets? How might I set up the bootstrap to obtain a ratio for every distinct value of race? Would that be another for loop?


        • #5
          If your races are in different data sets, just -append- them all together (being sure to add a variable identifying the race to each data set) and then run as above.

          If the number of races is too large to, for practical purposes, write everything out, fear not. The only place where they are written out is in the -bootstrap- command itself. But that's easy to fix by building a list of arguments in a loop:
          Code:
          * Build "ratio1 = r(ratio1) ratio2 = r(ratio2) ..." for every race
          local arg_list
          levelsof race, local(races)
          foreach r of local races {
               local arg_list `arg_list' ratio`r' = r(ratio`r')
          }
          
          bootstrap `arg_list', ///
              reps(1000) strata(race) seed(12345): calc_ratio
          matrix M = r(table)
          matrix list M
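
          Putting the pieces of this thread together, here is a rough end-to-end sketch that also writes the limits back to each observation, as asked in #1. It reuses the example data and calc_ratio from above; lower_ci and upper_ci are the names from the original question. It assumes that r(table) has rows named ll and ul and that its columns follow the order of the expressions given to -bootstrap-, so treat it as a starting point rather than a definitive implementation:

          ```stata
          clear
          input float(race predicted_weight predicted_height)
          1 67 180
          1 65 175
          1 63 170
          2 62 173
          2 52 169
          2 51 180
          2 77 155
          3 63 170
          3 29 190
          3 18 200
          end

          capture program drop calc_ratio
          program define calc_ratio, rclass
              tempvar mean_predicted_height mean_predicted_weight ratio
              egen `mean_predicted_weight' = mean(predicted_weight), by(race)
              egen `mean_predicted_height' = mean(predicted_height), by(race)
              gen `ratio' = `mean_predicted_height'/`mean_predicted_weight'
              levelsof race, local(races)
              foreach r of local races {
                  summ `ratio' if race == `r', meanonly
                  return scalar ratio`r' = r(mean)
              }
          end

          * One bootstrap expression per race, in levelsof (sorted) order
          levelsof race, local(races)
          local arg_list
          foreach r of local races {
              local arg_list `arg_list' ratio`r' = r(ratio`r')
          }
          bootstrap `arg_list', reps(1000) strata(race) seed(12345): calc_ratio
          matrix M = r(table)

          * Assumed row names of r(table): ll = lower limit, ul = upper limit
          local ll = rownumb(M, "ll")
          local ul = rownumb(M, "ul")

          * Copy each race's limits onto its observations (column j = j-th race)
          gen double lower_ci = .
          gen double upper_ci = .
          local j = 0
          foreach r of local races {
              local ++j
              replace lower_ci = M[`ll', `j'] if race == `r'
              replace upper_ci = M[`ul', `j'] if race == `r'
          }
          list race lower_ci upper_ci, sepby(race)
          ```

          These are the normal-theory limits from r(table); for percentile limits, the same loop could read from e(ci_percentile) instead.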
