Variable Selection in a logistic regression model with a small number of outcomes

William Hobart

Join Date: May 2014

Posts: 1
#1

17 May 2014, 19:44

Hello. I am a long-time reader of this forum, and this is my first post.
I am having an issue figuring out what to do with a logistic regression, and am hoping that someone can help me.

I have a relatively small data-set looking at a binary outcome of death after a medical procedure.

There are 102 patients total in the data-set, and 26 deaths.

I am interested in looking at correlates of death. I first calculated univariate odds ratios, and have a list of 11 factors which are statistically significant. Some of them are probably related to each other (such as fluid intake during the procedure, fluid output during the procedure, and the net fluid during the procedure).

Since there are only 26 events, I'm not sure how to best approach choosing which variables to put in a multivariable logistic model. If I'm less conservative I guess I can choose 1 variable per 5 outcomes, but that only allows me to choose 5 variables to put in the model.

I have been exploring various methods of forward/backward selection, but I'm not having much success.

Given the relatively small number of outcomes, is fitting a logistic model inappropriate? If not, how should I approach choosing which variables to include in a logistic model with a small number of outcomes?

Thanks!
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#2

18 May 2014, 07:13

Hi, I can't help answer your question, because I've never wondered, read, or heard, how the number of explanatory variables depends on the number of events in a binary variable. I have only understood that it depends on the total number of observations, but also never had a rule of thumb for it.

However, I have a question of my own after reading your description of the data. You mention that in your sample the number of deaths is 26 out of 102 observations, which implies a sample proportion of deaths (death ratio) of 25.5%, i.e. more than a quarter of the patients in this sample died when having this procedure. Does this reflect the population proportion or is it biased? Remember that the in-sample proportion predicted by the logit model will be identical to the unconditional in-sample proportion, so if it's an inflated proportion, the inference on the coefficients will be invalid. That's why I'm asking. If it's a true reflection of the population proportion please let us know what procedure this is... I want to stay away from it as long as possible!
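
The statement that the predicted and observed in-sample proportions coincide follows from the first-order condition for the intercept, assuming the logit model includes a constant. A short derivation, with the notation ($y_i$, $x_i$, $\beta_0$, $\beta$, $\hat{p}_i$) introduced here purely for illustration:

```latex
\begin{align*}
\ell(\beta_0, \beta) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr],
  \qquad p_i = \frac{1}{1 + e^{-(\beta_0 + x_i'\beta)}} \\
\frac{\partial \ell}{\partial \beta_0} &= \sum_{i=1}^{n} (y_i - p_i) = 0
  \quad\Longrightarrow\quad
  \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{26}{102} \approx 0.255
\end{align*}
```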

Alfonso Sanchez-Penalver
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

18 May 2014, 11:28

William & Alfonso,

The conventional rule of thumb for logistic regression used to be 10 events (death in this case) per covariate. Some more recent work has shown that this could be relaxed to five events (see attached). The reason why it is the number of events and not the total sample size is that it would be very difficult to see any variability in outcome by different covariates if there were only a few events, even if there was a very large sample size.
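
The arithmetic behind these rules of thumb can be sketched as follows. This is only an illustration, not taken from the attached paper; the function name `max_covariates` is hypothetical, and using the less frequent outcome as the limiting count is a common convention assumed here.

```python
def max_covariates(n_events, n_nonevents, events_per_variable):
    """Largest number of covariates permitted by an events-per-variable rule.

    Assumption: the limiting count is the less frequent of the two outcomes.
    """
    limiting = min(n_events, n_nonevents)
    return limiting // events_per_variable


deaths = 26
survivors = 102 - deaths  # 76 patients survived

budget_10 = max_covariates(deaths, survivors, 10)  # traditional rule
budget_5 = max_covariates(deaths, survivors, 5)    # relaxed rule
print(budget_10, budget_5)  # 2 5
```

With 26 deaths, the traditional rule leaves room for two covariates and the relaxed rule for five, which matches the figure William arrived at.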

If you have covariates that are related to each other, you should certainly consider using only one of them, combining them in some clinically meaningful way, or using something like factor analysis to combine them.

Regards,
Joe
Attached Files

AJE_10events.pdf (217.8 KB, 1 view)
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

18 May 2014, 12:28

In my view, it is unlikely that there is a satisfactory approach to predictor selection for your situation that is purely statistical. You have a pretty small data set, and clearly the outcome is over-determined. No purely statistical approach is likely to solve your problem. I think you need to focus on the clinical aspects here.

For example, you mention three predictors that are "probably related" to each other: fluid intake, fluid output, and net fluid. Those are beyond "probably related." They are multi-collinear: net fluid = fluid intake - fluid output. So at most you can use two of them. Now, you don't tell us what this procedure is or the type of patients undergoing it. If your sample is representative of such patients, then this procedure is being performed on desperately ill patients: the fatality rate is ballpark 25%. So what is it about them? If they are rapidly exsanguinating, then perhaps high net fluid indicates successful restoration of intravascular volume and is a very relevant prognostic factor. Or if they have poor cardio-renal function, high net fluid may indicate overload, and would also be very relevant as a negative prognostic factor. If, however, these are hemodynamically stable patients who are severely ill in some other way, you might look for associations between these fluid variables and other prognostic factors that might be driving the outcome, with the fluid variables just secondary indicators "going along for the ride." (For example, all of these variables might just be proxies for the duration of the procedure.) I would try to identify the predictors that are most clinically salient based on understanding the physiology and build the model around those, adding others only if your sample permits. (By the way, the AIC and BIC may give you some sense of when you are overloading your model as you explore.) There are other pragmatic considerations: if you have room for one and only one fluid-related predictor after selecting predictors from other domains, and if net fluid isn't particularly physiologically salient here, you might select fluid intake, because it is something that can be controlled, whereas output and net fluid cannot.
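
To see why only two of the three fluid variables can enter a model, here is a minimal sketch that checks the rank of the corresponding columns. The patient values are made up for illustration, and `matrix_rank` is a small helper written here, not a library function.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a nonzero pivot in this column at or below the current row.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical fluid values (mL) for five patients; not from the actual data.
intake = [2000, 3500, 1500, 4200, 2800]
output = [800, 1200, 900, 1000, 1500]
net = [i - o for i, o in zip(intake, output)]  # net = intake - output

rank_three = matrix_rank([list(row) for row in zip(intake, output, net)])
rank_two = matrix_rank([list(row) for row in zip(intake, output)])
print(rank_three, rank_two)
```

Adding the net-fluid column does not raise the rank, so the third variable carries no information beyond the other two, and a regression including all three cannot estimate separate coefficients for them.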

If you are not sufficiently knowledgeable about the underlying physiology to carry this out, I advise you to consult with somebody who is. In a data set this size with so many candidate predictors, it is far too easy to overfit the noise in the data and end up with a model that neither makes sense, nor holds up in replicate studies. (By the way, my experience is that stepwise variable selection procedures are the surest way to come up with a bad model.)

Hope this helps.

Last edited by Clyde Schechter; 18 May 2014, 12:30.
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#5

19 May 2014, 08:31

Hi William,

Although it will not help with the over-determination issue noted by Joe and Clyde, I have developed a Stata module, miinc (SSC), to assist in variable selection/model averaging. It can be used with logit and offers an option called pip that gives a posterior inclusion probability for each independent variable based on information criteria such as the AIC or BIC.

The pip option may be of some use in your case and will undoubtedly be more reasonable than stepwise selection. Of course, this approach does move toward the purely statistical approach and thus should be used primarily for exploratory purposes, if consulting with someone else or some other approach to excluding independent variables isn't fruitful.
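
For readers unfamiliar with the idea, one common way to turn information criteria into inclusion probabilities is through Akaike weights. The sketch below illustrates that general technique only; it is not a description of how miinc computes its results, and the candidate models and AIC values are hypothetical.

```python
import math


def akaike_weights(aic_by_model):
    """Convert AIC values into normalized model weights."""
    best = min(aic_by_model.values())
    raw = {m: math.exp(-(aic - best) / 2) for m, aic in aic_by_model.items()}
    total = sum(raw.values())
    return {m: r / total for m, r in raw.items()}


def inclusion_probabilities(aic_by_model):
    """Sum the weights of all candidate models that contain each variable."""
    weights = akaike_weights(aic_by_model)
    variables = set().union(*aic_by_model)  # keys are frozensets of names
    return {
        v: sum(w for m, w in weights.items() if v in m)
        for v in sorted(variables)
    }


# Hypothetical AIC values for four candidate logit models (made up).
aics = {
    frozenset(): 118.0,
    frozenset({"age"}): 112.0,
    frozenset({"net_fluid"}): 115.0,
    frozenset({"age", "net_fluid"}): 111.0,
}

weights = akaike_weights(aics)
pip = inclusion_probabilities(aics)
for name, prob in pip.items():
    print(f"{name}: {prob:.3f}")
```

A variable that appears mostly in low-AIC models ends up with an inclusion probability near 1, while one that appears mostly in poorly fitting models ends up near 0.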

- joe

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#6

19 May 2014, 17:34

William might be looking for something like R's glmnet package. I don't think that Stata has such a capability as of yet, at least not officially (I believe that David Airey recently mentioned it, or something like it, on his wishlist for Stata 14).

But in the meantime, William can try using the user-written Stata package lars (SSC) in conjunction with the Hastie-Tibshirani-Friedman "trick" described here.