
  • Estimating the intersection of two lines

    Dear Statalist,

    I have two datasets, A and B, that correspond to the bootstrapped predictive margins from the interaction term, it_type3##c.Ischemia_Time_Min, in a logistic regression model (where A corresponds to it_type3==0, and B corresponds to it_type3==1). I'm able to compute the intersection of these two lines, (x*,y*). However, I'd like to compute a confidence interval around x*. How can I do this? Here is the code that gets me to the intersection of the two lines. What I'd like to get is the range corresponding to the dotted band in the attached figure.

    Code:
    capture program drop savemargins
    program savemargins, rclass
    logistic aki2 c.Preop_GFR c.log_avl i.it_type3##c.Ischemia_Time_Min i.agecat male i.race i.bmicat i.cci_cat i.renal c.Tm1Sz1 i.clavien_cat
    margins it_type3, at(Ischemia_Time_Min = (10(1)55)) post
    end
    
    bootstrap _b, saving(margins, replace) reps(100): savemargins
    
    marginsplot, ///
    xtitle("Ischemia Time (min)") ytitle("Predicted Probability of AKI") title ("") ///
    plotopts(ylabel(0.1 (0.1) 1) xlabel(10(5)55) lwidth(thick)) ///
    legend(order(1 "Cold Ischemia" ///
    2 "Warm ischemia"))
    
    /* Where do curves overlap */
    logistic aki2 c.Preop_GFR c.log_avl i.it_type3##c.Ischemia_Time_Min i.agecat male i.race i.bmicat i.cci_cat i.renal c.Tm1Sz1 i.clavien_cat
    disp -_b[1.it_type3]/_b[1.it_type3#c.Ischemia_Time_Min]
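    For intuition on that last -disp- line: on the logit scale the two groups' linear predictors differ by _b[1.it_type3] + _b[1.it_type3#c.Ischemia_Time_Min]*x, so the curves cross where that gap is zero, i.e. at x* = -_b[1.it_type3]/_b[1.it_type3#c.Ischemia_Time_Min]. A quick sketch of the algebra in Python, with made-up coefficients (not estimates from this model):

```python
# Made-up coefficients (NOT estimates from the model above), just to show the
# algebra behind the -disp- line: on the logit scale the two groups' linear
# predictors differ by b1 + b3*x, which is zero at x* = -b1/b3.
b1 = 1.5    # stand-in for _b[1.it_type3]
b3 = -0.06  # stand-in for _b[1.it_type3#c.Ischemia_Time_Min]

x_star = -b1 / b3  # 25.0

def gap(x):
    """Difference between the two groups' linear predictors at ischemia time x."""
    return b1 + b3 * x

print(x_star, gap(x_star))  # the gap vanishes exactly at x*
```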
    [Attachment: intersection.png]

  • #2
    A solution might involve replacing your -display- call in the last line with a call to -nlcom-. Except, in so far as I follow your code, that line includes an observation-specific term. So you might have to use -predictnl- and then look at particular observations? [I haven't investigated why the dotted line 'works'.]



    • #3
      Hmm, is it as simple as estimating the regression's variance-covariance matrix with a bootstrap and then calling -nlcom- to calculate the intersection point and its standard error? Do the 95% CIs generated by -nlcom- correspond to the CI around x* (i.e., the dotted line in the figure in #1)?

      Code:
      logistic aki2 c.Preop_GFR c.log_avl i.it_type3##c.Ischemia_Time_Min i.agecat male i.race i.bmicat i.cci_cat i.renal c.Tm1Sz1 i.clavien_cat, coef vce(bootstrap, reps(100))
      
      nlcom (overlap: -_b[1.it_type3] / _b[1.it_type3#c.Ischemia_Time_Min])
      
      overlap: -_b[1.it_type3] / _b[1.it_type3#c.Ischemia_Time_Min]
      
      ------------------------------------------------------------------------------
              aki2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           overlap |   25.52385   4.039919     6.32   0.000     17.60575    33.44194
      ------------------------------------------------------------------------------
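      For reference, -nlcom- obtains that standard error by applying the delta method to e(b) and e(V) (here, the bootstrap-based VCE). A minimal sketch of that computation, using hypothetical numbers rather than the real e(b)/e(V):

```python
import numpy as np

# Hypothetical point estimates and covariance matrix for _b[1.it_type3] and
# _b[1.it_type3#c.Ischemia_Time_Min] -- stand-ins, not the real e(b)/e(V).
b1, b3 = 1.53, -0.06
V = np.array([[ 0.09,  -0.003 ],
              [-0.003,  0.0004]])

ratio = -b1 / b3  # the nlcom expression; 25.5 with these numbers

# Delta method: gradient of f(b1, b3) = -b1/b3
grad = np.array([-1.0 / b3, b1 / b3**2])
se = float(np.sqrt(grad @ V @ grad))

# Normal-based 95% CI, as reported in the nlcom output
ci = (ratio - 1.96 * se, ratio + 1.96 * se)
print(round(ratio, 2), round(se, 2), ci)
```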



      • #4
        I doubt it's as simple as that too. One basic issue is that your nlcom output in post #3 shows an estimate of 25.5, and that value isn't on your graph. (The intersection of the two lines is at around x = 6, and y = 10.) What indeed are the variables plotted in that graph? I suspect you've done a screenshot cut/paste, but it's left out material.

        My earlier suggestion in post #2 was based entirely on your last line in post #1, and note also the caveats I gave. I suggest that it would help if you backed up and simplified your problem to make it easier to understand precisely what you're trying to do. (You are using data we don't have access to; you're using complex models with interactions; etc.) As part of this, I suggest you also explain more clearly and precisely how the lines are defined (conceptually, not simply in terms of Stata code) and, similarly, how "overlap" ("intersection"?) and the confidence interval that you are seeking are defined. [How too does what you want take account of the fact that the marginal effects are calculated at particular values of the predictors?]



        • #5
          Appreciate your responses. The value of 25.5 actually does match up visually; I apologize, as the figure I attached in #1 was only a symbolic representation of my question's goal. I'm attaching a figure of the actual data below.

          We see visually, and confirm objectively, that the two curves that correspond to the interaction of it_type3##Ischemia_Time_Min at it_type3==0 (blue) and it_type3==1 (red) intersect at an x* of 25.5. This can be calculated using the following formula, which I can run after a bootstrap:
          Code:
           
           -_b[1.it_type3] / _b[1.it_type3#c.Ischemia_Time_Min]
          I'm not sure if the 95% CI of that calculation, with values of 17.6-33.4, gives me the 95% range of possible intersection points, which is the goal here.

          I'd like to show where the two curves intersect because at any value greater than x*, there is a higher probability of the adverse outcome, aki2, and the risk of having that outcome can be mitigated by employing cold ischemia (blue curve; it_type3==0).

          Given that I can only be so sure of how well the sample intersection point predicts the population intersection point, I wanted to apply a bootstrap to resample and re-estimate the margins at representative values that were used to generate the curves in the figure. I can do this either by bootstrapping the regression with the vce(bootstrap) option or by bootstrapping the margins themselves (as I did in the program in #1), with slightly different SEs between the two methods. In either case, I'm given an SE for each predictive margin which thus gets translated into the vertical CIs in the graph. Therefore, I have new bootstrapped estimates of the confidence bands. However, the vertical CIs don't interest me, because they don't tell me the 95% CI of where along X that the curves will overlap.

          So while I have the black dot in the figure in #1, I'm not sure if I have the dashed line that corresponds to the estimated point of intersection. That is, unless the nlcom calculation does just that.





          [Attachment: Cold v Warm Ischemia traditional margins w bootstrap.jpg]




          • #6
            I think I understand what you are trying to do. If I'm right, what you need to do is write a short program that wraps the regression and the -nlcom- command and returns the estimated x-coordinate of the intersection point. Then you bootstrap that program. So something like this:

            Code:
            capture program drop intersection
            program define intersection, rclass
                logistic aki2 c.Preop_GFR c.log_avl i.it_type3##c.Ischemia_Time_Min  // OTHER VARS, ETC.
                nlcom -_b[1.it_type3] / _b[1.it_type3#c.Ischemia_Time_Min]
                tempname b
                matrix `b' = r(b)
                return scalar result = `b'[1,1]
                exit
            end
            
            bootstrap x_intersection = r(result), reps(1000) seed(1234): intersection // PERHAPS A -saving()- OPTION TOO
            Note: Not tested. Beware of typos, minor syntax errors.
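             The resample-refit-recompute scheme in the program above can also be sketched outside Stata. Below is a minimal Python illustration on simulated data (ordinary least squares standing in for the logistic model; the coefficients, sample size, and the `x_star` helper are all invented for the sketch), showing how a point estimate and percentile bounds for -b1/b3 fall out of the bootstrap draws:

```python
import numpy as np

rng = np.random.default_rng(1234)

# Simulated stand-in data (NOT the AKI data): a group effect b1 = 1.5 and a
# group-by-x interaction b3 = -0.06, so the two fitted lines cross at
# -b1/b3 = 25, mimicking the structure of the Stata program above.
n = 400
x = rng.uniform(10, 55, n)
g = rng.integers(0, 2, n)
y = 0.2 + 1.5 * g + 0.01 * x - 0.06 * g * x + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), g, x, g * x])

def x_star(X, y):
    """Refit by least squares and return the crossing point -b1/b3."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return -b[1] / b[3]

# Resample cases, refit, recompute: the same scheme as -bootstrap: intersection-.
reps = 2000
boot = np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, n, n)
    boot[r] = x_star(X[idx], y[idx])

# Percentile CI taken straight from the bootstrap draws (cf. -estat bootstrap-).
lo, hi = np.percentile(boot, [2.5, 97.5])
print(round(x_star(X, y), 2), round(lo, 2), round(hi, 2))
```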



            • #7
              Hmm, is that not equivalent to #3? In #3, the regression is bootstrapped first and the intersection point and 95% CI are then estimated from those bootstrapped results. In #6, the regression and the intersection-point estimate are wrapped together, and that result is bootstrapped. When I run your program, Clyde, I get a result that's essentially the same as #3 (assuming I use the same number of reps; pasted below). In both cases, would you agree that we are getting the dotted line from the figure in #1, i.e. the 95% CI of the x-coordinate of the intersection of the resampled curves?

              Code:
              Bootstrap results                               Number of obs      =      1356
                                                              Replications       =       100
              
                    command:  intersection
              x_intersect~n:  r(result)
              
              --------------------------------------------------------------------------------
                             |   Observed   Bootstrap                         Normal-based
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              ---------------+----------------------------------------------------------------
              x_intersection |   25.52385   4.018888     6.35   0.000     17.64697    33.40072
              --------------------------------------------------------------------------------



              • #8
                I'm not surprised that the results of what I proposed are similar to #3, but this way you get a direct calculation of the confidence interval of the statistic you want. I think they are at least roughly equivalent if not theoretically the same. And at least for the one I proposed, I feel confident that it's what you want.

                But a couple of things. First, I would not rely on just 100 replications. Quotients can have very pathological sampling distributions, particularly if the denominator is not strongly bounded away from zero. The difference in slopes between your two curves (which is the denominator of your expression) is fairly small, and I would think that the bootstrap distribution could be highly skewed, and I worry that in 100 replications you have not adequately sampled that upper tail. I would go for at least 1,000 replications, maybe 10,000 or even 50,000 to be safe. For similar reasons, I would not settle for the normal-based confidence interval. If you run -estat bootstrap- after the -bootstrap- itself, you will get percentile-based confidence intervals, which I would put more trust in under these circumstances.



                • #9
                  Many thanks Clyde, and I recognize your concern. To better determine the optimal number of reps in this case, I investigated the sampling distribution produced by the bootstrap estimates using your code. Beginning at reps=500, I found it a bit odd that the resampling begins to introduce extreme outliers (e.g. a min of -568 at n=1000) that drastically increase the normal-based CIs. If you drop those outliers, the distributions mirror a normal distribution very closely at >500-1000 reps. The percentile and bias-corrected bootstrap methods estimate very similar CIs beginning at reps=500. Here are the results from a couple of test runs showing that:

                  Variable         n      Mean    S.D.       Min     .25     Mdn     .75       Max
                  x_intersection   100    25.75    4.02     16.39   23.04   26.20   28.18     35.36
                  x_intersection   500    25.61   17.24    -18.00   22.40   25.35   28.01    390.66
                  x_intersection   1000   25.73   42.35   -568.56   22.59   25.42   28.13   1152.47
                  x_intersection   10000  24.98   30.76  -1053.16   22.48   25.43   28.02   2418.64

                  Reps                      Observed Coef   Bias        SE          95% CI
                  x_intersection (n=100)    25.523846       .2252369    4.0188881   17.64697    33.40072  (N)
                                                                                    16.87671    33.17704  (P)
                                                                                    16.63935    32.19231  (BC)
                  x_intersection (n=500)    25.523846       .087475     17.242698   -8.271222   59.31891  (N)
                                                                                    13.30016    32.98206  (P)
                                                                                    14.87385    33.75624  (BC)
                  x_intersection (n=1000)   25.523846       .2040197    42.353071   -57.48665   108.5343  (N)
                                                                                    12.88861    33.22404  (P)
                                                                                    13.57106    33.6357   (BC)
                  x_intersection (n=10000)  25.523846       -.5406799   30.756666   -34.75811   85.8058   (N)
                                                                                    13.08478    32.99335  (P)
                                                                                    13.37817    33.217    (BC)

                  I wasn't sure what to make of this phenomenon of increasing outliers & CI with increasing reps so I decided to try running the bootstrap on the regression itself and wrote the following program to capture the SEs and plot those results:

                  Code:
                  /* Testing bootstraps */
                   
                   postutil clear
                   capture program drop Accum
                   program Accum
                           postfile results se n xhat using sim, replace
                           forvalues n = 500(500)5000 {
                                   noisily display " `n'" _c
                                   quietly logistic aki2 c.Preop_GFR c.log_avl i.it_type3##c.Ischemia_Time_Min i.agecat male i.race i.bmicat i.cci_cat i.renal c.Tm1Sz1 i.clavien_cat, vce(bootstrap, reps(`n'))
                                   quietly nlcom -_b[1.it_type3] / _b[1.it_type3#c.Ischemia_Time_Min]
                                   local xhat = el(r(b),1,1)       // point estimate of the intersection
                                   local se = sqrt(el(r(V),1,1))   // its delta-method standard error
                                   post results (`se') (`n') (`xhat')
                           }
                           postclose results
                   end
                   
                  set seed 9999
                  Accum
                  use sim, clear
                  scatter se n, xtitle("replications") ytitle("bootstrap standard error")

                  [Attachment: BS_0_5000 vce(boot) SE scatter.jpg]



                  The SEs remain very similar beginning at 500 reps. Since with vce(bootstrap) I'm bootstrapping just the coefficients first and then calculating the quotient, I would think fewer reps are reasonable, maybe 500+. I can't think of a theoretical reason why either method would not get me the 95% CI for the intersection point, but, practically, the latter method seems to produce SEs that are consistent with expectations, i.e. not increasing wildly with increasing reps.

                  Any thoughts as to whether the latter method is reasonable here, using 500+ reps? Or is the recommendation to bootstrap the intersection point and use percentile or bias correction?











                  • #10
                    "If you drop those outliers, the distributions mirror a normal distribution very closely at >500-1000."
                    Yes, but you shouldn't drop outliers. During data cleaning, looking at outliers is sometimes a way to find data errors, and those need to be corrected, not dropped. Here we are not even in that situation. There is no justification for removing outliers. In fact, the reason I suggested using a much larger number of replications was precisely that those "outliers" are potentially very important information that you would fail to pick up with fewer replications.

                    The normal-theory-based standard error will, indeed, be very sensitive to those outliers because their deviations get squared! By contrast, the percentile-based confidence interval is fairly robust in this regard. But normal theory is not really suitable for your situation in any case. The statistic you are bootstrapping is a quotient of two regression coefficients. We can treat each regression coefficient as being approximately normally distributed around the "true" value, due to asymptotics. But the quotient of two normally distributed variables is not itself normally distributed, and if the denominator distribution has substantial mass near zero, it can be quite "pathological." In fact, the ratio of two normally distributed variables that both have mean zero has a Cauchy distribution--which doesn't even have a finite variance. So you have to be pretty cautious here. That is why I recommended using a very large number of replications, to assure that you get at least some sampling of what is potentially a very long, very heavy tail. I also think that the percentile confidence interval makes more sense than the normal-theory-based confidence interval in light of the distributional problems that ratios pose. It is this reasoning that underlies my suggestions in #8.
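                    A small simulation makes the point concrete (illustrative coefficient values, not this model's estimates): when the denominator coefficient has mass near zero, the moments of the ratio are wrecked by occasional near-zero draws, while the central quantiles stay tame, much like the table in #9:

```python
import numpy as np

rng = np.random.default_rng(42)
reps = 100_000

# Treat the two coefficients as approximately normal (illustrative values only):
b1 = rng.normal(1.5, 0.3, reps)     # group main effect (numerator)
b3 = rng.normal(-0.06, 0.02, reps)  # interaction slope (denominator)

x_star = -b1 / b3  # bootstrap-style draws of the intersection point

# Near-zero denominators wreck the moments (the SD is dominated by rare huge draws)...
print("sd:", x_star.std())

# ...but the central quantiles stay tame.
print("quartiles:", np.percentile(x_star, [25, 50, 75]))
```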

                    Note that when you vce(bootstrap) the regression, all you are doing is getting bootstrapped standard errors on the regression coefficients themselves: there is no particular reason to think that these regression coefficients have bizarre distributions that I can see. You are running the regression 500, 1000, etc. times, but only calculating the intersection point once in your loop, and using the bootstrapped regression coefficients' standard errors as input to -nlcom-. (Note that -nlcom- is not using the varying bootstrapped coefficients here.) That is not remotely like bootstrapping the calculation of the intersection point, which will be particularly sensitive to variations in the interaction term coefficient that is your denominator. In fact you aren't really sampling the distribution of the intersection point at all here, except in a very indirect way.

                    When you actually bootstrap the intersection point calculation, I think it is rather reassuring to see that the percentile based confidence intervals are pretty stable as you increase the replications, which gives me more "confidence" (no pun intended) to use them. The fact that the normal-theory based intervals blow up as the replications increase does not surprise me and only reinforces my sense that they should not be used for this application.



                    • #11
                      Thanks again. I was referring to "dropping outliers" only to more clearly display the central bins in the histogram of the sampling distribution. I'll proceed with your suggestions!



                      • #12
                        Julien Dagenais, although you have a solution to your problem now, you still might be interested in reading about the Rate Advancement Period (RAP):

                        -Brenner, Hermann, Olaf Gefeller, and Sander Greenland. "Risk and rate advancement periods as measures of exposure impact on the occurrence of chronic diseases." Epidemiology (1993): 229-236.

                        -Pfahlberg, A., and O. Gefeller. "Re: 'Assessing the impact of classical risk factors on myocardial infarction by rate advancement periods'." American Journal of Epidemiology 154.5 (2001): 486-488.

                        -Discacciati, Andrea, et al. "On the interpretation of risk and rate advancement periods." International Journal of Epidemiology (2015): dyv320.

