
  • Is there an arbitrary scaling factor in logistic regression in Stata?

    Hi,

    Norton et al. state in a recent JAMA guide that the magnitude of the odds ratio from a logistic regression is scaled by an arbitrary factor (equal to the square root of the variance of the unexplained part of the binary outcome). They say that adding more independent explanatory variables to the model will increase the odds ratio of the variable of interest (e.g., treatment), because the coefficient is divided by a smaller scaling factor. They therefore warn that different odds ratios from the same study cannot be compared when the models producing the estimates have different explanatory variables, because each model has a different arbitrary scaling factor.
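    For readers unfamiliar with the argument, it comes from the latent-variable view of the logit model (my sketch of the standard derivation, not taken verbatim from the paper): the model is y* = b1*x1 + b2*x2 + e, with y = 1 if y* > 0 and e standard logistic with variance pi^2/3. Only b/sigma is identified, so omitting x2 folds b2*x2 into the error term and, when x1 and x2 are independent, the short model estimates roughly

    b1 * sigma_e / sqrt(sigma_e^2 + b2^2 * Var(x2)),

    which is attenuated toward zero. Adding x2 back shrinks the residual variance and inflates the coefficient (and hence the odds ratio) even when there is no confounding at all.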

    I ran a simple logistic regression on the dataset below using the code below: the crude odds ratio for hiv status is 2.33 and the adjusted odds ratio is 3.0, both of which exactly match the stratified analysis done without logistic regression. The arbitrary scaling factor does not surface here.

    I would appreciate any thoughts on this issue and whether it is actually a valid concern in Stata. If anyone can share a counterexample dataset (using categorical variables only), that would also be welcome.

    Thanks
    Suhail


    Code:
    logit risky i.hiv [fw=fw], or
    logit risky i.hiv i.nyc [fw=fw], or

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(risky hiv nyc fw)
    1 1 0 25
    1 0 0 75
    0 1 0 10
    0 0 0 90
    1 1 1 75
    1 0 1 25
    0 1 1 50
    0 0 1 50
    end
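    As a cross-check, the same odds ratios can be obtained without any regression using Stata's epidemiological tables commands. This is a sketch; I believe cc and mhodds accept fweights and the by() option as shown, but verify against the help files:

    Code:
    * crude odds ratio for hiv (should match the unadjusted logit, 2.33)
    cc risky hiv [fw=fw]
    * Mantel-Haenszel odds ratio for hiv stratified by nyc
    * (should match the adjusted logit, 3.0)
    mhodds risky hiv [fw=fw], by(nyc)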
    Regards
    Suhail Doi

  • #2
    Norton et al state in a recent JAMA guide
    Many readers will likely want to see this guide to understand more fully what you summarize in your post. Can you post a link to this material, preferably one that does not require payment? As the Statalist FAQ tells us,

    13. How should I give literature references?

    Please give precise literature references. The literature familiar to you will not be familiar to all members of Statalist. Do not refer to publications with just author and date, as in Sue, Grabbit, and Runne (1989).

    References should be in a form that you would expect in an academic publication or technical document. Good practice is to give a web link accessible to all or alternatively full author name(s), date, paper title, journal title, and volume and page numbers in the case of a journal article.



    • #3
      The article in question appears to be

      Norton EC, Dowd BE, Maciejewski ML. Odds Ratios—Current Best Practice and Use. JAMA. 2018;320(1):84–85. doi:10.1001/jama.2018.6971

      I found the article online, not behind JAMA's paywall, at

      https://www.feinberg.northwestern.ed...ce-and-use.pdf

      It in turn justifies the assertion referenced in post #1 with a reference to

      Norton, E. C. and Dowd, B. E. (2018), Log Odds and the Interpretation of Logit Models. Health Serv Res, 53: 859-878. doi:10.1111/1475-6773.12712

      but I have not been able to get access to this paper.



      • #4
        Thanks William, yes, that is the paper

        Just to add to my comments: if I create two variables predictive of "risky" in the dataset I posted previously, as below, I still cannot get the hiv odds ratio to budge much from 3.0 unless the new variable is very highly predictive of risky, and even then the uncertainty in the estimate for hiv increases. So I see no clinical significance in this observation, and the paper's recommendations seem highly overstated, even were the theory to be confirmed. Any thoughts on this would also be appreciated.

        Code:
        expand fw
        gen x = rnormal(risky,1)
        gen y = rnormal(risky,0.5)
        logit risky i.hiv i.nyc x , or
        logit risky i.hiv i.nyc x  y, or
        Last edited by Suhail Doi; 16 Jan 2019, 13:37.
        Regards
        Suhail Doi



        • #5
          There actually has been a great deal of discussion on this topic in various places. For my own take, see

          https://www3.nd.edu/~rwilliam/stats3/Nested01.pdf

          https://www3.nd.edu/~rwilliam/stats3/Nested02.pdf

           Some key takeaways:
           • Comparisons of coefficients across nested models are problematic. Coefficients are not consistently scaled the same way across models. It could be like using income in dollars as your DV in one model and income in thousands of dollars in another -- without realizing you had done so. Potentially, wildly incorrect conclusions can be reached. For example, you might think there are really dramatic suppressor effects when there are no such effects at all. I give a hypothetical example to show this.
          • In practice, I've only found instances where the conclusions were mildly incorrect. But I haven't re-analyzed every data set in the world. Maybe Norton has better real world examples than I do. Then again maybe he is making the problem seem much more serious than it usually is in real world situations.
          • There are potential solutions. For one thing, just don't do comparisons across nested models. Just talk about your final model.
          • If you do want to make such comparisons, the KHB method (mentioned in my handouts) may be the best way to go.
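           To illustrate that last point, the khb package (Kohler, Karlson & Holm, available from SSC) decomposes the total change in a coefficient between nested logit models into a confounding component and a rescaling component. A sketch using the data from this thread follows; the khb syntax here is from memory, so check help khb before relying on it:

           Code:
           ssc install khb
           * decompose the change in the hiv effect when x and y are added:
           * how much is genuine confounding vs. mere rescaling?
           khb logit risky hiv || x y, summary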
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Thanks as always to Professor Williams for a clear explanation of the issue, with handouts not behind a paywall.

            I was sure he'd have something helpful to say, but it didn't occur to me that he'd already said it, if only I'd looked in his invaluable repository on the analysis of categorical data at http://www3.nd.edu/~rwilliam/stats3/.



            • #7
              Hi Richard,

               I agree with your observation that he may be making the problem seem much more serious than it usually is in real-world situations. I believe most of the change seen in adjusted estimates can readily be shown to be due to the stratification, not to any arbitrary scaling factor.

               If we take the dataset in post #1 and create x and y as in post #4, we can dichotomise both x and y at the median to create categorical variables predictive of the outcome. We can then compare logistic regression and stratification as follows:
              Code:
              logit risky i.hiv i.nyc i.x i.y, or
              cs risky hiv, by(nyc x y) or w
               Logistic regression results will show
               Code:
                     risky | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                       hiv |   4.239133   1.678756     3.65   0.000     1.950695    9.212229
                       nyc |    .535027   .1989655    -1.68   0.093     .2581258     1.10897
                        fw |   .9846991   .0074828    -2.03   0.042     .9701418    .9994748
                         x |   4.742803   1.578448     4.68   0.000     2.470288    9.105896
                         y |   51.80127   18.03298    11.34   0.000     26.18312    102.4848
                     _cons |   .1299183   .0787229    -3.37   0.001     .0396179    .4260383
               While stratification will show
               Code:
               nyc  x  y   |    OR     [95% Conf. Interval]
                 0  0  0   |   7.83     1.14     53.75   (Woolf)
                 0  0  1   |    .        .         .     (Woolf)
                 0  1  0   |  11.2       .84    148.13   (Woolf)
                 0  1  1   |    .        .         .     (Woolf)
                 1  0  0   |   1.45      .22      9.28   (Woolf)
                 1  0  1   |   3.51      .68     18.07   (Woolf)
                 1  1  0   |   8         .86     74.21   (Woolf)
                 1  1  1   |   1.73      .14     20.63   (Woolf)
               Crude       |   2.33     1.54      3.51
               M-H combined|   4.92     2.19     11.02
               The stratification can be done for each of the variables, and it is clear that whichever variable we consider, the M-H combined estimate closely mirrors the adjusted estimate from logistic regression. It does not matter how many variables we create or how correlated (or not) they are with the outcome; this remains the situation. Therefore, regardless of whether the models contain different predictors with different correlations with the outcome, what matters more for the variability seen across methods is that the adjusted estimate combines several heterogeneous stratum-specific estimates, and this is what produces the differences between stratification and regression. This variability is not clinically significant enough to justify the paper's conclusions, and thus, if I am right, the following four conclusions in Norton's paper are grossly overstated:

               a) there is no unique odds ratio to be estimated, even from a single study - it may not be unique, but the estimates are close enough to be treated as such

              b) Different odds ratios from the same study cannot be compared when the statistical models that result in odds ratio estimates have different explanatory variables - this is not the case at all

              c) the magnitude of the odds ratio from one study cannot be compared with the magnitude of the odds ratio from another study, because different samples and different model specifications will have different arbitrary scaling factors - no evidence for this above.

              d) the magnitudes of odds ratios of a given association in multiple studies cannot be synthesized in a meta-analysis - clearly no evidence for this either.

              Any thoughts would be appreciated

              Thanks
              Suhail

              Last edited by Suhail Doi; 17 Jan 2019, 11:01.
              Regards
              Suhail Doi

