  • Stepwise logistic regression

    Dear all,

    I want to run a stepwise logit estimation, but after reading the manuals I couldn't find a way to base the selection criterion on BIC or AIC.
    Is that possible, or is choosing a significance level the only way?

  • #2
    Sara:
    just one step aside from your question.
    With stepwise estimation, you are going to obtain a model that, in all likelihood, has nothing to do with your original data and, as a consequence, its results, significant or not, are weakly reliable at best.
    Just to quote one of the most towering members of the "don't do it" party, you may want to take a look at Frank Harrell's Regression Modeling Strategies. 2nd edition. Springer: 67-72.
    However, if you cannot avoid going down that road, you may want to start from the -stepwise- entry in the Stata .pdf manual. Note that -stepwise- does not support the use of AIC or BIC criteria for this dangerously oversold statistical procedure (however, please take a look at http://www.stata.com/support/faqs/st...sion-problems/ ).
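    For what it is worth, here is a minimal sketch of what -stepwise- does offer: selection on p-value thresholds only, with the information criteria available afterwards through -estat ic-. The auto dataset, the candidate variables, and the 0.2 threshold are illustrative assumptions, not a recommendation.

    Code:
    sysuse auto, clear
    * backward elimination: drop terms whose p-value is 0.2 or higher
    stepwise, pr(0.2): logit foreign weight mpg length
    * AIC and BIC of whatever model survived the selection
    estat ic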
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      I agree with Carlo that stepwise selection is usually the work of Satan. Having said that, SJ did recently publish this article on the user-written gvselect command:

      http://www.stata-journal.com/article...article=st0413

      Here is an example:

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . gvselect <term> weight trunk length, nmodels(2): regress mpg <term> i.foreign
      
      Optimal models: 
      
         # Preds        LL       AIC       BIC
               1 -194.1831  394.3661  401.2783
               1 -196.7305  399.4609  406.3731
               2  -192.997  393.9939  403.2102
               2 -193.9518  395.9036  405.1198
               3 -192.9913  395.9827   407.503
      
      predictors for each model:
      
      1 : weight
      1 : length
      2 : weight length
      2 : weight trunk
      3 : weight length trunk
      The "winning" models are those with the lowest BIC (401.2783) and the lowest AIC (393.9939). The "winning" model if you use BIC is

      Code:
      reg mpg weight i.foreign
      estat ic
      For AIC,

      Code:
      reg mpg weight length i.foreign
      estat ic
      They call it "best variables subset selection." I don't know if that is a way of making stepwise regression sound more respectable or if there really are merits to the approach that SW does not have. I am skeptical myself. But if you think that using stepwise is acceptable then using BIC or AIC may be at least as acceptable.
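      As a sketch of doing the same kind of comparison by hand for a logit, you can store each candidate model and tabulate the criteria side by side. The outcome and predictors below are arbitrary picks from the auto data, assumed only for illustration.

      Code:
      sysuse auto, clear
      logit foreign weight
      estimates store m1
      logit foreign weight mpg
      estimates store m2
      * one row per stored model, with log likelihood, AIC and BIC
      estimates stats m1 m2

      Stata computes AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(N). For the first row of the gvselect table above, k = 3 and N = 74 give 388.3662 + 6 = 394.3662 and 388.3662 + 12.9122 = 401.2784, which matches the table up to rounding. Lower is better, and the two criteria need not pick the same model.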
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Incidentally, findit reveals two versions of the gvselect program. The version from the Stata Journal is apparently more current. (I always hate it when that happens; usually I trust the SSC version but occasionally it is not the most current version out there.)
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Thanks for your comments Carlo and Richard!

          Usually I do not go for stepwise selection, but in my current project I have no information about what the variables represent (what the values stand for, or their economic meaning). They are just coded, and I need to find, blindly, the variable subset with the best predictive power.

          Thank you for your hints. I will look through Frank Harrell's Regression Modeling Strategies and gvselect first!



          • #6
            Sara:
            if you're running somehow blind with your project, probably the best approach is reporting different regression models (and discussing their results and possibly practical implications) via a sort of scenario analysis.
            This approach could outperform a stepwise selection procedure as far as dealing with the uncertainty in your dataset is concerned.
            The fact that your variables are simply coded and not explained in their meaning cannot reduce the relevance of the drawbacks that affect stepwise procedure(s).
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              I'd be leery of running any kind of analysis where I had no idea what the variables were! Even with stepwise there should be some logical reason for thinking the variables could be/should be in the model. I wouldn't, for example, include x11 as a possible predictor of x10 if x11 came later in time. Is this a homework problem or something? I'm curious why you would be in this situation, or what you would do with the results once you had them.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam



              • #8
                Sara:
                I do share Richard's curiosity on that point.
                Usually "the best" (whatever that means) regression model is closely related to what others have done in the past in a given research field (also to forestall the risk of re-inventing a perfectly running wheel and being rejected by reviewers!).
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  This project was actually given to me as part of a recruiting process for a data analyst job.
                  I only know that it is claims data from an insurance company and nothing else. I have about 92,000 observations and a binary variable that is 1 if the case is approved for payment and 0 when more information is needed. About 35% of the data is missing. The main aim is to predict the probability of a 1 and to make out-of-sample predictions.
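
                  A minimal sketch of the out-of-sample part, assuming a random 70/30 split. The lbw example data, the predictors, and the seed are stand-ins, not my actual variables.

                  Code:
                  webuse lbw, clear
                  set seed 12345
                  generate byte train = runiform() < 0.7
                  * fit on the training part only
                  logit low age lwt smoke if train
                  * predicted probabilities for every observation, including the holdout
                  predict phat, pr
                  * discrimination (area under the ROC curve) on the holdout part
                  roctab low phat if !train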

                  For the variables, I looked at catplots (frequencies or percentages of the binary dependent variable) with the continuous variables binned, to see how the share of 1s changes from bin to bin.
                  [Image: X10.png]

                  I included variables like this one in my model because the share of 1s varies and shows a trend along X10, but I am still thinking about what to do for the model.

                  Thanks,
                  Sara



                  • #10
                    http://statisticalhorizons.com/predi...ssion-analysis

                    The above might be of interest.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam



                    • #11
                      Sara:
                      if, as you wrote, "About 35% of the data is missing.", this affects your regression, because Stata applies listwise deletion: every observation with a missing value in any variable of the model is dropped.
                      As the missingness might be informative, are you also asked to deal with the missing values?
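
                      A quick way to see how much listwise deletion costs, sketched on the auto data (where rep78 has some missing values) rather than on your variables, which I do not know:

                      Code:
                      sysuse auto, clear
                      * count missing values per variable
                      misstable summarize
                      regress price mpg rep78
                      * estimation sample versus observations in memory
                      display "used " e(N) " of " _N " observations"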

                      If this is the case, the same author quoted by Richard published an interesting (and pleasantly short) textbook on this topic: http://www.sagepub.in/textbooks/Book9419.
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
                        Carlo and Richard,

                        Thank you for your help very much!
                        I am trying to deal with the missing values too. For variables with fewer than 100 missing values, I replaced the missing values with means and modes. For the remaining variables, I am trying chained imputation (mi impute chained), or mi impute pmm with the knn() option.

                        I will look through the articles and books that you shared too.



                        • #13
                          Sara:
                          replacing missing values with means (and, even worse, with modes) is in no way the right approach.
                          Just consider that if you replace missing values with the mean of the existing data, the variance of that variable will unavoidably shrink, leaving you with biased statistics.
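                          To make that concrete: mean imputation leaves the mean and the sum of squared deviations unchanged while the sample size grows, so the sample variance is scaled by (n_obs - 1)/(n - 1), where n_obs counts the observed values and n all values. A small sketch on the auto data (an illustrative choice; rep78 has some missing values):

                          Code:
                          sysuse auto, clear
                          quietly summarize rep78
                          * copy rep78, filling its missing values with the observed mean
                          generate rep78_filled = cond(missing(rep78), r(mean), rep78)
                          * the mean is unchanged, the standard deviation is smaller
                          summarize rep78 rep78_filled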
                          Besides http://www.sagepub.in/textbooks/Book9419, I would recommend that you take a look at http://www.missingdata.org.uk/ which is maintained by Jeremy Bartlett, whose posts appear on this list from time to time.
                          Kind regards,
                          Carlo
                          (Stata 19.0)



                          • #14
                            Carlo,

                            I am actually doing multiple imputation. The variables that I filled in with means and modes have only about 60 missing values on average out of 92,000, and I thought that would not matter much for variance and bias. Is that right? Doing this gave me some complete continuous variables to use for MICE.

                            My Stata has been running it since last night; the code is
                            Code:
                            mi impute chained (pmm, knn(5))
                            I got Paul Allison's book yesterday; I will read it right away!
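
                            For reference, a fuller sketch of that kind of call, using the mheart8s0 example data from the Stata manuals instead of my data. The variable lists, the number of imputations, and the seed are assumptions for illustration.

                            Code:
                            webuse mheart8s0, clear
                            * impute bmi and age by predictive mean matching, 5 nearest neighbours
                            mi impute chained (pmm, knn(5)) bmi age = attack smokes hsgrad female, add(5) rseed(12345)
                            * fit the model on each imputed dataset and combine the results
                            mi estimate: logit attack smokes age bmi hsgrad female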

                            Thanks,



                            • #15
                              Sara:
                              the main problem is that replacing missing values with means or modes (or whatever) is wrong at its roots; moreover, you cannot say how much the bias will affect your results (making them difficult to defend, especially if your research paper will be peer-reviewed).
                              Another point, which Paul Allison covers in his textbook, is the type of missingness in your data: is it informative or not?
                              The answer to this question implies different approaches to dealing with missing data.
                              Kind regards,
                              Carlo
                              (Stata 19.0)
