
  • Predictive model building strategies in Stata

    I am building a logistic regression model to describe variables that correlate with my dichotomous outcome. I'm aware that there are no "good" model building strategies, only less bad ones - or that seems to often be the hot take of my statistical colleagues.

    My strategy thus far has been to compare each variable alone with the outcome (my "crude" analysis), and to put those that appear associated (using a generous cutoff of p < 0.20) into a full model, something like this:

    Code:
    logistic outcomevariable covariate1 covariate2 covariate3 covariate4 covariate5 covariate6
    I look over the results, pick out the variable with the worst (highest) p-value, remove it, and run the code again. I continue this until 1) only "significant" (p < 0.05) covariates remain, with the exception that 2) a "non-significant" (p > 0.05) covariate may be kept as a potential confounder if its removal noticeably changes my measures of effect (i.e., I compare the odds ratios of the remaining covariates in the reduced model to those in the previous model; if any change by more than 10-15%, I keep the removed covariate and move on to the next-worst p-value).
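
    (For reference: Stata's built-in -stepwise- prefix can automate the p-value-driven part of this procedure, doing backward elimination at a chosen removal threshold. The 10-15% change-in-OR confounder check is not something -stepwise- does, so that part would still be manual. The variable names below are just the placeholders from the code above.)

    Code:
    stepwise, pr(0.05): logistic outcomevariable covariate1 covariate2 ///
        covariate3 covariate4 covariate5 covariate6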

    Years ago, a statistician I worked with used R (I think? might have been SAS) to design a model that iteratively compared every combination of an initial set of covariates until it came up with the "best" model. I can't recall how it defined "best", but I'm guessing some combination of the pseudo R2, covariate p values, and perhaps something else?

    Does anyone here have any experience with something like this? Can Stata do this? I'm not a statistician, and I don't even know how to begin to search on a topic like this. If anyone would be kind enough to point me in the right direction, I'd be grateful (what search keywords might you recommend? Any good reading that's ideally not too math-heavy? Wishful thinking, I know).

    Thanks in advance. I really appreciate these forums.

  • #2
    Approaches to predictive modeling are controversial, and if you ask 10 statisticians how best to do it you will get a minimum of 12 answers.

    I think there are some general principles that most would nevertheless agree on.
    1. The most important attributes of a model predicting a dichotomous outcome are discrimination and calibration. Selection of a model should be primarily guided by measures of these. Area under the ROC curve is a widely accepted measure of discrimination. There is less consensus on how best to assess calibration.
    2. It is extremely easy to end up overfitting noise in the data. So all predictive models require at the minimum some kind of cross-validation testing (split-sample, leave-one-out, etc.), and ideally should ultimately be tested in an entirely new, independently gathered data set; a minimal split-sample sketch follows below.
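
    To make that concrete, here is a minimal split-sample sketch (the auto data, the 50/50 split, the seed, and the two predictors are all arbitrary illustrative choices): fit the model on a random half of the data, then assess discrimination and calibration in the held-out half.

    Code:
    sysuse auto, clear
    set seed 12345
    gen byte train = runiform() < 0.5            // arbitrary 50/50 split
    
    logistic foreign mpg weight if train         // fit in the training half only
    
    lroc if !train, nograph                      // ROC area (discrimination) in the held-out half
    estat gof if !train, group(10) outsample     // Hosmer-Lemeshow (calibration) in the held-out half
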
    Within those broad guidelines there are many ways to proceed. I'm not going to even try to summarize all those possibilities, as I'm only really familiar with some of them, and even if I were expert in all, there is not enough time to deal with them all.

    Because you are only considering 6 predictor variables, it is certainly feasible to fit models with all 64 possible subsets of this set. The following code illustrates how you can generate all 64 of those models and run them.

    Code:
    clear*
    
    sysuse auto
    
    local cov1 mpg
    local cov2 i.rep78
    local cov3 headroom
    local cov4 weight
    local cov5 length
    local cov6 gear_ratio
    
    forvalues m = 1/`=2^6' {
        // The binary representation of m encodes the subset: covariate i
        // is included on the right-hand side when bit i of m is 1.
        local rhs
        local mm `m'
        forvalues i = 1/6 {
            local in_out`i' = mod(`mm', 2)    // low-order bit: include cov`i'?
            if `in_out`i'' {
                local rhs `rhs' `cov`i''
            }
            local mm = floor(`mm'/2)          // shift to the next bit
        }
        logistic foreign `rhs'
        // SOME POST-ESTIMATION STATISTICS FOR DISCRIMINATION & CALIBRATION
        // MIGHT GO HERE; MAYBE -lroc- AND -estat gof-
        // OR SOME GRAPHS OF, OR BASED ON, PREDICTED & OBSERVED
        // VALUES
        local predictors: subinstr local rhs " " "_", all
        local predictors: subinstr local predictors "i." "", all
        estimates save model_`predictors', replace
    }
    As you work on your project you may decide on other measures you want to assess your models with, or perhaps compare models with tests applicable to pairs of them, so the code also saves the estimates as .ster files in your current working directory. That way you don't have to re-run the logistic regressions themselves: you can just -estimates use- the saved results.
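
    For example, assuming you later want the {mpg, weight} subset back (the loop above will have saved it as model_mpg_weight.ster), you can reload and redisplay it without refitting:

    Code:
    estimates use model_mpg_weight
    estimates store m_mpg_weight            // make it available to -estimates table-, -lrtest-, etc.
    estimates table m_mpg_weight, eform     // odds ratios from the saved model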

    Note that even though this code iterates through all 64 combinations of the 6 predictors, it does not look at alternative specifications for the variables, that is, polynomial terms or inverse terms, splines, or other functional transformations of the variables. Nor does it deal with interaction terms. So even after you have settled on a best model among the 64, you may want to pursue improvements to the selected model or some close "runners up" using these approaches.
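
    Purely as an illustration of the syntax (these particular terms are not a recommendation), factor-variable notation and -mkspline- can be used to explore such refinements of a selected model:

    Code:
    logistic foreign c.mpg##c.mpg weight           // quadratic term in mpg
    logistic foreign c.mpg c.weight##c.length      // weight x length interaction
    
    mkspline wt_sp = weight, cubic nknots(3)       // restricted cubic spline in weight
    logistic foreign mpg wt_sp*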

    I would point out that there is no role for coefficient p-values in assessing these models. And pseudo-R2 after logistic regression is generally not well-regarded as a measure of model fit. I suppose it's better than nothing, but better choices are available. Hosmer-Lemeshow chi square analysis was at one time very widely viewed as the best, although other approaches have been gaining popularity. When you have a pair of models, with one nested in the other, the AIC and BIC statistics give some guidance about avoiding overfitting--but really they are no substitute for cross-validation.
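
    For example, using the .ster files saved by the loop above (the two filenames here correspond to the {mpg, weight} and {mpg, weight, length} subsets), AIC/BIC and a likelihood-ratio test for the nested pair can be obtained without refitting:

    Code:
    estimates use model_mpg_weight
    estimates store small
    estimates use model_mpg_weight_length
    estimates store large
    
    estimates stats small large        // AIC and BIC for both models
    lrtest small large                 // LR test; -small- is nested in -large-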

    I don't intend this response to be comprehensive or definitive. It's really just a very broad outline, with code to help you generate all 64 models. But I do hope and expect that others will contribute their views on how to assess the quality of different models and select among them.
    Last edited by Clyde Schechter; 26 Sep 2024, 14:04.



    • #3
      Hello Matt Price. You might find Babyak's (2004) article on overfitting helpful; as he says, it largely summarizes material in Frank Harrell's book, Regression Modeling Strategies. I cannot speak for Mike Babyak, but I think that if he were revising his article today, he might change his "events-per-variable" (EPV) rule of thumb to say that one should have at least 20 EPV for a binary logit model (if one wishes to reduce the likelihood of overfitting). This is what Harrell currently suggests--e.g., here. He might also clarify that it is really "events per candidate predictor parameter" (EPP), as noted here (see the 1st paragraph in the Moving beyond the 10 events per variable section).
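
      As a rough illustration of that arithmetic (the auto data and the count of 10 candidate parameters are just made-up numbers for the example):

      Code:
      sysuse auto, clear
      quietly count if foreign == 1       // number of events
      display "events = " r(N) "; with 10 candidate parameters, EPV = " %4.1f r(N)/10
      * In this toy example the EPV comes out far below the suggested 20 per parameter.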

      I hope this helps.
      --
      Bruce Weaver
      Email: [email protected]
      Version: Stata/MP 18.5 (Windows)



      • #4
        What about machine learning? Isn't that mainly used for prediction? I have never used it myself, but perhaps someone can share their experience. Some implementations are available; see:

        https://www.stata.com/meeting/uk23/s...23_Cerulli.pdf
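
        (Not from the linked slides, but as one built-in option in this direction: Stata 16+ ships a lasso suite that can be used for prediction. A minimal sketch with the auto data and arbitrary predictors:)

        Code:
        sysuse auto, clear
        set seed 12345
        lasso logit foreign mpg weight length headroom gear_ratio, selection(cv)
        lassocoef                // which predictors were selected
        lassogof                 // in-sample goodness of fit
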
        Best wishes

        (Stata 16.1 MP)
