  • Adjusting p-values after stepwise regression models: how would you do it?

    Dear all,

    I have fitted three stepwise regression models with backward elimination, removing terms with p >= 0.2.

    The three models use the same candidate predictors but three different, related outcomes.

    The results I obtain are as follows.

    Model #1
    Code:
    . xi: stepwise, pr(.2): logit mese  nihss etàstroke  lnagestroke lnimt lnnihss  esussospettoembolicoadorigin
    > esco   pregressoictustia sex fumo ipertensione dislipidemia diabete lnwmh lnvolume lacune, or
    
    Wald test, begin with full model:
    p = 0.9991 >= 0.2000, removing etàstroke
    p = 0.9868 >= 0.2000, removing lnnihss
    p = 0.9062 >= 0.2000, removing lnagestroke
    p = 0.8736 >= 0.2000, removing lnwmh
    p = 0.7877 >= 0.2000, removing pregressoictustia
    p = 0.5234 >= 0.2000, removing diabete
    p = 0.4993 >= 0.2000, removing esussospettoembolicoadoriginesco
    p = 0.3165 >= 0.2000, removing nihss
    p = 0.3173 >= 0.2000, removing lnimt
    p = 0.3219 >= 0.2000, removing lnvolume
    p = 0.2845 >= 0.2000, removing lacune
    p = 0.2334 >= 0.2000, removing fumo
    
    Logistic regression                                     Number of obs =    155
                                                            LR chi2(3)    =  22.90
                                                            Prob > chi2   = 0.0000
    Log likelihood = -60.232057                             Pseudo R2     = 0.1597
    
    ------------------------------------------------------------------------------
            mese | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             sex |   2.406879   1.137686     1.86   0.063     .9530316    6.078565
    dislipidemia |    .173777   .0893694    -3.40   0.001     .0634219    .4761515
    ipertensione |   4.321445   2.217178     2.85   0.004     1.580914    11.81272
           _cons |    4.34823   2.039297     3.13   0.002     1.734217    10.90238
    ------------------------------------------------------------------------------
    Note: _cons estimates baseline odds.
    Model #2
    Code:
    . xi: stepwise, pr(.2): ologit gangli  nihss etàstroke  lnagestroke lnimt lnnihss  esussospettoembolicoadori
    > ginesco   pregressoictustia sex fumo ipertensione dislipidemia diabete lnwmh lnvolume lacune, or
    
    Wald test, begin with full model:
    p = 0.9817 >= 0.2000, removing dislipidemia
    p = 0.9150 >= 0.2000, removing pregressoictustia
    p = 0.8641 >= 0.2000, removing lnagestroke
    p = 0.8459 >= 0.2000, removing lnnihss
    p = 0.6806 >= 0.2000, removing ipertensione
    p = 0.6805 >= 0.2000, removing lnimt
    p = 0.6569 >= 0.2000, removing etàstroke
    p = 0.6656 >= 0.2000, removing lnvolume
    p = 0.6170 >= 0.2000, removing lacune
    p = 0.3279 >= 0.2000, removing fumo
    p = 0.2425 >= 0.2000, removing nihss
    p = 0.2312 >= 0.2000, removing diabete
    p = 0.2364 >= 0.2000, removing sex
    
    Ordered logistic regression                             Number of obs =    155
                                                            LR chi2(2)    =  16.43
                                                            Prob > chi2   = 0.0003
    Log likelihood = -182.07361                             Pseudo R2     = 0.0432
    
    --------------------------------------------------------------------------------------------------
                              gangli | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
    ---------------------------------+----------------------------------------------------------------
                               lnwmh |   1.371675   .1670724     2.59   0.009     1.080373    1.741523
    esussospettoembolicoadoriginesco |   3.767657   1.342956     3.72   0.000     1.873554    7.576637
    ---------------------------------+----------------------------------------------------------------
                               /cut1 |  -.1073645   .2569481                     -.6109734    .3962445
                               /cut2 |   1.694922   .2926389                      1.121361    2.268484
                               /cut3 |   3.879828    .461339                       2.97562    4.784036
    --------------------------------------------------------------------------------------------------
    Note: Estimates are transformed only in the first equation to odds ratios.
    Model #3
    Code:
    . xi: stepwise, pr(.2): ologit semiovali  nihss etàstroke  lnagestroke lnimt lnnihss  esussospettoembolicoad
    > originesco   pregressoictustia sex fumo ipertensione dislipidemia diabete lnwmh lnvolume lacune, or
    
    Wald test, begin with full model:
    p = 0.8844 >= 0.2000, removing dislipidemia
    p = 0.8313 >= 0.2000, removing lacune
    p = 0.8170 >= 0.2000, removing ipertensione
    p = 0.7794 >= 0.2000, removing nihss
    p = 0.6607 >= 0.2000, removing lnimt
    p = 0.5467 >= 0.2000, removing diabete
    p = 0.5989 >= 0.2000, removing lnagestroke
    p = 0.5083 >= 0.2000, removing sex
    p = 0.4098 >= 0.2000, removing lnwmh
    p = 0.4467 >= 0.2000, removing lnnihss
    p = 0.3303 >= 0.2000, removing fumo
    p = 0.3035 >= 0.2000, removing etàstroke
    
    Ordered logistic regression                             Number of obs =    155
                                                            LR chi2(3)    =   5.49
                                                            Prob > chi2   = 0.1394
    Log likelihood = -186.53199                             Pseudo R2     = 0.0145
    
    --------------------------------------------------------------------------------------------------
                           semiovali | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
    ---------------------------------+----------------------------------------------------------------
                   pregressoictustia |   2.072149   1.080448     1.40   0.162     .7457485    5.757707
                            lnvolume |   .8781811   .0819464    -1.39   0.164     .7314005    1.054418
    esussospettoembolicoadoriginesco |   1.636701   .5384571     1.50   0.134     .8588816     3.11893
    ---------------------------------+----------------------------------------------------------------
                               /cut1 |  -3.277187   .4679406                     -4.194333    -2.36004
                               /cut2 |  -.4312944   .2082017                     -.8393622   -.0232265
                               /cut3 |   1.485078   .2430798                      1.008651    1.961506
                               /cut4 |   4.577012   .7287992                      3.148592    6.005432
    --------------------------------------------------------------------------------------------------
    Note: Estimates are transformed only in the first equation to odds ratios.
    My question to you is: how should the obtained p-values be adjusted to account for multiple testing?

    P.S. I know perfectly well that stepwise regression is not recommended. I carried out the analysis at the explicit request of a reviewer; despite my listing all the limitations of this method, there was no way to convince them otherwise.

    Thank you all, as always.

  • #2
    ssc install rwolf2

    Otherwise, you can apply Holm's method by hand, but it does not use bootstrap resampling, so it has less power.
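
    A minimal sketch of Holm's step-down adjustment done by hand in Stata. It uses the three p-values reported for the retained predictors of Model #1 purely as an illustration. Which p-values belong in the family (retained terms only, eliminated terms, one model or all three) is an assumption here, not a recommendation, and the variable name p_holm is made up for this example.

    ```stata
    * Illustrative family: the three predictors retained in Model #1.
    * The choice of family is an assumption made only for this sketch.
    clear
    input str12 term double p
    "sex"          0.063
    "dislipidemia" 0.001
    "ipertensione" 0.004
    end

    * Holm: sort ascending, multiply the k-th smallest p by (m - k + 1), cap at 1.
    sort p
    generate double p_holm = min(1, (_N - _n + 1) * p)

    * Enforce monotonicity: an adjusted p-value may not fall below the previous one.
    * replace works through the observations in order, so this is a running maximum.
    replace p_holm = max(p_holm, p_holm[_n-1]) if _n > 1

    list term p p_holm, noobs
    ```

    With these three inputs, the adjusted values are 0.003 (dislipidemia), 0.008 (ipertensione), and 0.063 (sex). A larger family would raise the multipliers accordingly.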



    • #3
      I am reminded of some of Mike Babyak's comments in his nice article on overfitting. In particular, search for "phantom degrees of freedom". Notice too what he said about the "correct" model df for "stepwise" procedures (see below, emphasis added). (Note that Babyak uses the term "stepwise" in a rather general sense that includes all of the old-school automated variable selection methods--i.e., forward, backward, and stepwise.)
      I should not leave this topic without adding that procedures for correcting these particular overfitting problems have existed for many years (16)—it has been known for some time, for example, that the correct model degrees of freedom for stepwise procedures is really closer to the total number of all candidate predictors— but these corrections are apparently almost uniformly ignored by most researchers in our field.
      I hope this helps, and good luck with that reviewer!
      --
      Bruce Weaver
      Email: [email protected]
      Version: Stata/MP 18.5 (Windows)



      • #4
        Thanks, Maxence Morlet.
        In the by-hand Holm procedure, would you include only the p-values of the variables retained by the model, or also those of the eliminated ones?
        Also, would you perform the procedure three times (once for each of the three models), or would you pool the p-values from all three models?
        Thanks again.
        Last edited by Gianfranco Di Gennaro; 07 Jan 2025, 17:33.
