Stepwise logistic regression

Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#1

Stepwise logistic regression

25 Mar 2016, 04:59

Dear all,

I want to run a stepwise logit estimation, but after reading the manuals I couldn't find a way to base the selection criterion on BIC or AIC.
Is that possible, or is choosing a significance level the only way?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#2

25 Mar 2016, 07:38

Sara:
just one step aside from your question.
With stepwise estimation you will, in all likelihood, obtain a model that has little to do with the process that generated your data, so its results, significant or not, are weakly reliable at best.
To quote one of the most towering members of the "don't do it" party, you may want to take a look at Frank Harrell's Regression Modeling Strategies, 2nd edition, Springer: 67-72.
However, if you cannot avoid going down that road, you may want to start from the -stepwise- entry in the Stata .pdf manual. Note that -stepwise- does not support AIC or BIC as selection criteria for this dangerously oversold statistical procedure (however, please take a look at http://www.stata.com/support/faqs/st...sion-problems/ ).
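For readers who want to see what the significance-level route looks like, here is a minimal sketch. It uses the auto dataset shipped with Stata instead of the poster's data, and the removal threshold of 0.2 and the candidate variables are arbitrary choices made only for illustration. -stepwise- selects on p-values only; -estat ic- afterwards merely reports AIC and BIC for whatever model was retained.

```stata
* Illustration only: auto.dta stands in for the real data
sysuse auto, clear

* Backward elimination: drop terms whose p-value is >= 0.2 (arbitrary threshold)
stepwise, pr(0.2): logit foreign mpg weight length

* AIC and BIC of the retained model (reported, not used for selection)
estat ic
```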

Kind regards,
Carlo
(Stata 19.0)
Richard Williams

Join Date: Apr 2014

Posts: 4994
#3

25 Mar 2016, 17:49

I agree with Carlo that stepwise selection is usually the work of Satan. Having said that, SJ did recently publish this article on the user-written gvselect command:

http://www.stata-journal.com/article...article=st0413

Here is an example:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. gvselect <term> weight trunk length, nmodels(2): regress mpg <term> i.foreign

Optimal models:

   # Preds         LL        AIC        BIC
         1  -194.1831   394.3661   401.2783
         1  -196.7305   399.4609   406.3731
         2   -192.997   393.9939   403.2102
         2  -193.9518   395.9036   405.1198
         3  -192.9913   395.9827    407.503

predictors for each model:

1 : weight
1 : length
2 : weight length
2 : weight trunk
3 : weight length trunk

The "winning" models are those with the lowest BIC (401.2783) and the lowest AIC (393.9939). The "winning" model if you use BIC is

Code:

reg mpg weight i.foreign
estat ic

For AIC,

Code:

reg mpg weight length i.foreign
estat ic

They call it "best variables subset selection." I don't know if that is a way of making stepwise regression sound more respectable or if there really are merits to the approach that stepwise does not have. I am skeptical myself. But if you think that using stepwise is acceptable, then using BIC or AIC may be at least as acceptable.
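If gvselect is not installed, the same two candidate models can be compared by hand with official commands. The sketch below is one way to do it, not a substitute for the best-subsets search; the stored names m_bic and m_aic are labels made up for this example.

```stata
sysuse auto, clear

* Model preferred by BIC in the gvselect output above
quietly regress mpg weight i.foreign
estimates store m_bic

* Model preferred by AIC in the gvselect output above
quietly regress mpg weight length i.foreign
estimates store m_aic

* Side-by-side table of log likelihood, AIC and BIC
estimates stats m_bic m_aic
```

The comparison is only meaningful when both models are fit to the same observations, which holds here because none of these variables has missing values in auto.dta.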

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Richard Williams

Join Date: Apr 2014

Posts: 4994
#4

25 Mar 2016, 18:01

Incidentally, findit reveals two versions of the gvselect program. The version from the Stata Journal is apparently more current. (I always hate it when that happens; usually I trust the SSC version but occasionally it is not the most current version out there.)

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#5

26 Mar 2016, 03:30

Thanks for your comments Carlo and Richard!

Usually I do not go for stepwise selection, but in my current project I have no information about what the variables represent (what the values stand for or their economic meaning; they are just coded), so I need to find, blindly, the best subset of variables with predictive power.

Thank you for your hints. I will first look through Frank Harrell's Regression Modeling Strategies and gvselect!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#6

26 Mar 2016, 04:21

Sara:
if you are working more or less blind on your project, probably the best approach is to report several regression models (and discuss their results and possible practical implications) as a sort of scenario analysis.
This approach could outperform a stepwise selection procedure as far as dealing with the uncertainty in your dataset is concerned.
The fact that your variables are simply coded, with no explanation of their meaning, does not reduce the relevance of the drawbacks that affect stepwise procedure(s).

Kind regards,
Carlo
(Stata 19.0)
Richard Williams

Join Date: Apr 2014

Posts: 4994
#7

26 Mar 2016, 05:41

I'd be leery of running any kind of analysis where I had no idea what the variables were! Even with stepwise there should be some logical reason for thinking the variables could be/should be in the model. I wouldn't, for example, include x11 as a possible predictor of x10 if x11 came later in time. Is this a homework problem or something? I'm curious why you would be in this situation, or what you would do with the results once you had them.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#8

26 Mar 2016, 08:09

Sara:
I do share Richard's curiosity on that point.
Usually "the best" (whatever that means) regression model is closely tied to what others have done in the past in a given research field (also to forestall the risk of re-inventing a perfectly running wheel and being turned down by reviewers!).

Kind regards,
Carlo
(Stata 19.0)
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#9

26 Mar 2016, 12:27

This project was actually given to me as part of a recruiting process for a data analyst job.
I only know that it is claims data from an insurance company and nothing else. I have about 92000 observations and a binary variable that equals 1 if the case is approved for payment and 0 if more information is needed. About 35% of the data is missing. The main aim is to predict the probability of a 1 and to produce out-of-sample predictions.
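Since the stated goal is out-of-sample prediction, one common way to judge candidate models is on a holdout sample instead of by in-sample fit. The following is a minimal sketch on auto.dta, not the poster's data; the 70/30 split, the seed, the predictors, and the variable names train, phat and sqerr are all assumptions made for this example.

```stata
sysuse auto, clear
set seed 2016

* Random split: roughly 70% training, 30% holdout
generate byte train = runiform() < 0.7

* Fit on the training part only
logit foreign mpg weight if train

* predict fills in probabilities for all observations, including the holdout
predict double phat, pr

* Brier score on the holdout: mean squared error of the predicted probability
generate double sqerr = (foreign - phat)^2 if !train
summarize sqerr
```

A lower mean of sqerr on the holdout indicates better predicted probabilities, so the same split can be reused to compare several candidate models.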

For the variables, I looked at catplots (frequencies or percentages of the binary dependent variable) with the continuous variables binned, to see how the share of 1s changes across bins.

For example, I included X10 in my model because the share of 1s varies and shows a trend as X10 changes, but I am still thinking about what to do for the model.

Thanks,
Sara
Richard Williams

Join Date: Apr 2014

Posts: 4994
#10

26 Mar 2016, 12:55

http://statisticalhorizons.com/predi...ssion-analysis

The above might be of interest.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#11

26 Mar 2016, 13:25

Sara:
if "About 35% of the data is missing", this will affect your regression, because Stata applies listwise deletion: every observation with a missing value in any variable in the model is dropped.
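To see how much listwise deletion costs before fitting anything, the missing values can be counted per variable and per observation. This sketch uses auto.dta, where rep78 has a few missing values; -regress- is used to keep the example simple, and -logit- drops incomplete observations in the same way. The variable name nmiss is made up for this example.

```stata
sysuse auto, clear

* Which variables have missing values, and how many
misstable summarize

* Number of missing values per observation across the model variables
egen nmiss = rowmiss(price mpg rep78)
count if nmiss > 0

* The estimation sample excludes every observation counted above
quietly regress price mpg rep78
display "Observations used: " e(N) " of " _N
```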
As the missingness might be informative, are you also required to deal with the missing values?
If so, the same author quoted by Richard published an interesting (and pleasantly short) textbook on this topic: http://www.sagepub.in/textbooks/Book9419.

Kind regards,
Carlo
(Stata 19.0)
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#12

26 Mar 2016, 13:42

Carlo and Richard,

Thank you very much for your help!
I am trying to deal with the missing values too. For variables with fewer than 100 missing values I have replaced them with means and modes, and for the remaining variables I am trying chained mi imputation, or mi impute pmm with the knn() option.

I will look through the articles and books that you shared too.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#13

27 Mar 2016, 02:57

Sara:
replacing missing values with means (and, even worse, with modes) is in no way the right approach.
Just consider that if you replace missing values with the mean of the observed data, the variance of that variable will unavoidably shrink, leaving you with biased statistics.
Besides http://www.sagepub.in/textbooks/Book9419, I would recommend taking a look at http://www.missingdata.org.uk/, which is maintained by Jonathan Bartlett, whose posts appear on this list from time to time.
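The shrinking variance is easy to check directly. Here is a small sketch on auto.dta, where rep78 has a few missing values; rep78_filled is a name invented for this example. Every filled-in value sits exactly at the mean, so it adds nothing to the sum of squared deviations while increasing the number of observations, and the standard deviation drops.

```stata
sysuse auto, clear

* Mean of the observed values
quietly summarize rep78

* Copy of rep78 with missing values replaced by that mean
generate double rep78_filled = cond(missing(rep78), r(mean), rep78)

* Same mean, smaller standard deviation after filling
summarize rep78 rep78_filled
```

With only a handful of missing values the drop is small, but it grows with the share of values that are filled in.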

Kind regards,
Carlo
(Stata 19.0)
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#14

27 Mar 2016, 03:13

Carlo,

I am actually doing multiple imputation. The variables that I filled in with means and modes have only about 60 missing values on average out of 92000, so I thought it would not matter much for variance and bias. Is that not so? Doing this gave me some complete continuous variables to use for MICE.

I have been waiting for Stata to finish this since last night; the code is

Code:

mi impute chained (pmm, knn(5))
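The command above is quoted without its variable list and options. For reference, a complete call could look like the sketch below. It is not the poster's actual specification: it runs on auto.dta with missing values punched in artificially, and the variable choices, the number of imputations in add(5), and the seeds are assumptions made for this example.

```stata
sysuse auto, clear
set seed 2016

* Create artificial missingness so there is something to impute
replace mpg = . if runiform() < 0.15
replace weight = . if runiform() < 0.15

* Declare the mi style and the variables to be imputed
mi set wide
mi register imputed mpg weight

* Chained equations, predictive mean matching with 5 nearest neighbours;
* the outcome of the analysis model (foreign) is included as a predictor
mi impute chained (pmm, knn(5)) mpg weight = length i.foreign, add(5) rseed(2016)

* Fit the analysis model on each imputed dataset and combine the results
mi estimate: logit foreign mpg weight
```

On a dataset with about 92000 observations and many incomplete variables, run time grows with the number of imputed variables and imputations, so trying the specification on a random subsample first may save a long wait.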

I got Paul Allison's book yesterday and will start reading it right now!

Thanks,
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#15

27 Mar 2016, 03:46

Sara:
the main problem is that replacing missing values with means or modes (or whatever) is wrong at its roots; moreover, you cannot say how much the bias will affect your results (making them difficult to defend, especially if your research paper is peer-reviewed).
Another point, which Paul Allison covers in his textbook, is the type of missingness in your data: is it informative or not?
The answer to this question calls for different approaches to dealing with missing data.

Kind regards,
Carlo
(Stata 19.0)