Hi Statalisters. I'm analyzing some data and wrestling with next steps. If you have the time, interest, and willingness, I would be grateful to hear your thoughts. I'm having some difficulty reaching consensus with my statistically minded colleagues, and thought I would turn to the Wisdom of Statalist.
I am running a regression analysis (the logistic command) to examine correlates of vaccine confidence in a cohort of persons living in rural Africa. This is a descriptive, exploratory analysis; we are not testing any hypotheses. We have a long list of independent variables to consider, including demographics, location, socio-economic status, and household condition. Our outcome of interest is "low confidence in vaccines", and we observed 88 reports of low confidence out of 835 persons interviewed. I had always assumed that with 88 "events" I could consider modeling with 8-9 covariates, but I recently came across (via a helpful thread here on Statalist) a paper suggesting that a better rule of thumb might be closer to 20 events per covariate, which in our case would mean a model with only ~4 covariates (!). I have quite a few more than that to consider…
Here I outline my modeling strategy:
- My goal is to build an etiologic (explanatory) model describing the relationship of important independent variables with our outcome of vaccine confidence/hesitancy. The analysis is exploratory rather than confirmatory.
- Variables were selected for bivariate analysis with the outcome based on the literature, plausibility, and prior knowledge from when we designed our data collection tools.
- Variables significant at p < 0.2 in the bivariate analysis were carried forward into our multivariable analysis. All selected variables are categorical, and most are dichotomous yes/no variables.
- I checked for collinearity among the variables and found nothing of concern.
- Twenty variables met the p < 0.2 threshold (and because some categorical variables are coded as sets of dummy/indicator variables, this came to 23 covariate terms initially considered in the model).
- At this point, I did not apply the "one covariate per 10-20 events" rule of thumb, which would have recommended including only 4-9 covariates, not 23…!
- We adopted an augmented backward elimination strategy (Dunkler et al. 2014), starting with the full model and removing covariates (or sets of covariates, where a categorical variable comprises two or more dummy variables), beginning with the largest p-value.
- At the outset, I had no theory or a priori knowledge to suggest a "main" covariate or to guide removal beyond the p-value, although we knew our communities were very different, so study district (recruitment area) was going to stay in the model regardless of its bivariate p-value.
- As we removed covariates, we reviewed the odds ratios of those remaining. If removing a covariate changed any odds ratio by more than about 10%, we put it back, taking this as evidence of possible confounding.
- This was done manually; I did not use any algorithm to remove variables.
- We observed no evidence of confounding, and our final model includes only covariates significant at p < 0.05. A likelihood-ratio test between our final and full models suggests the full model does not improve the fit, and the final model is adequate to explain our data. However, this still leaves us with 11 predictors (or 15, if each dummy variable is counted individually), more than our 88 events would support under the rule of thumb. We do see some wide confidence intervals, but only for a few covariates, not all of them. (A minimal Stata sketch of this workflow follows this list.)
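For concreteness, here is a minimal Stata sketch of the workflow I describe above. The variable names (lowconf, district, sex, agecat, watersrc) are hypothetical stand-ins for our actual variables:

[CODE]
* 1. Bivariate screening: keep variables with a joint p < 0.20
foreach v in district sex agecat watersrc {
    quietly logistic lowconf i.`v'
    quietly testparm i.`v'            // joint test across all dummies of `v'
    display "`v': joint p = " %6.4f r(p)
}

* 2. Full model with all screened variables
logistic lowconf i.district i.sex i.agecat i.watersrc
estimates store full

* 3. One manual backward step: drop the term with the largest p-value,
*    then compare the remaining odds ratios with the full model; if any
*    OR moves by more than ~10%, put the dropped variable back
logistic lowconf i.district i.sex i.agecat
estimates store reduced

* 4. Likelihood-ratio test: does the full model fit better than the reduced?
lrtest full reduced
[/CODE]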
I considered using principal component analysis (PCA) to combine covariates into an indicator of socioeconomic status (SES), but some covariates did not make sense in their direction of association, and it felt more prudent to review them individually: some factors we might associate with lower SES were associated with vaccine confidence, while others were associated with hesitancy. I thought this was interesting (and maybe a bit confusing, too), and felt it best to leave these variables in their original form.
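For reference, the PCA route I considered would have looked something like this; the asset indicators are hypothetical stand-ins for our household-condition variables:

[CODE]
* Build an SES index from (hypothetical) asset indicators
pca roof_iron floor_cement owns_radio owns_phone owns_bicycle
screeplot                           // inspect how many components to keep
predict ses_pc1, score              // first principal component as SES score
xtile ses_tert = ses_pc1, nq(3)     // cut the index into tertiles
logistic lowconf i.ses_tert
[/CODE]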
One strategy I have heard recommended is bootstrapping the model to mitigate some of the problems of overfitting with too many covariates. I have also come across the LASSO method as a way to help the analyst select variables.
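To make those two ideas concrete, this is roughly what I understand each to look like in Stata (lasso needs Stata 16 or later; variable names are again hypothetical):

[CODE]
* LASSO for variable selection, with cross-validated penalty
lasso logit lowconf i.district i.sex i.agecat i.watersrc, selection(cv) rseed(12345)
lassocoef                           // list the terms that were retained

* Bootstrap the final model for resampling-based standard errors and CIs
bootstrap, reps(500) seed(12345): logistic lowconf i.district i.sex
[/CODE]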
If you’re still with me, I have a few questions if you’re willing to offer your thoughts…
- With regard to categorical variables defined by a set of dummy/indicator variables (e.g., a four-district variable coded district1 through district4, where district2, district3, and district4 are entered into the model and district1 serves as the reference group):
- Does this count as one covariate or three covariates when thinking about events per variable? I could not find this addressed in any of the papers I read, which surprised me a bit.
- Do model-building commands like lasso and stepwise take sets of dummy variables into consideration, i.e., either keep all three district dummies or remove all three, rather than dropping only the "not significant" ones? (See the sketch after these questions.)
- I've read in a few places now that there is no statistical rationale for bivariate screening thresholds to select covariates. This doesn't make intuitive sense to me. For a complex "etiologic" model describing a psycho-social construct like vaccine confidence, I am struggling to limit the number of independent variables based on a priori knowledge and causal diagrams alone; they were not enough to prepare me for analyzing a data set with only 88 events (to be honest, bivariate testing didn't get me there either!). I'm not even sure I have a question here… Perhaps: why not initially pare down an a priori list of covariates (particularly when events are few) with some sort of bivariate testing? The reasons given in Dunkler et al. and in Heinze and Dunkler didn't feel satisfactory (which may be down to my poor math skills…).
- If my initial model (or even my final model) exceeds the rule of thumb of ~10-20 events per variable, are there statistical ways to make this more palatable? Is bootstrapping my final model (as in the sketch above) a way to reassure my audience that it is a reasonable model?
- Related to this, I have always been of the mind that, all else being equal, a parsimonious model is a good model. If I have a model full of covariates that (a) are not associated with my outcome and (b) do not affect the associations of other covariates with the outcome (i.e., confounding), then those covariates are not helping me and are OK to drop. I would confirm this with some assurance that the parsimonious model fits my study data no worse (e.g., a likelihood-ratio test), but I have always thought that going with the parsimonious model is best. Some of the papers I list below discuss this but left me unsatisfied with their explanation. Is my assumption wrong?
- Does Stata have any sort of augmented backward stepwise model-building command? I see the stepwise command, but I don't think it will do what I want (i.e., it seems I would still need to review coefficients manually to decide on inclusion or exclusion of a covariate).
- Given the limitations of stepwise model building, and the seemingly limitless power of computers these days, does Stata have options to compare all combinations of covariates (selected a priori, based on causal diagrams and previous knowledge) and select the model that best predicts the outcome? Would that not be the best option for this kind of exercise? (See the sketch below.)
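To make these last questions concrete, here is what I understand Stata itself offers, sketched with the same hypothetical variable names; please correct me if I have any of this wrong. As I understand it, official stepwise does not accept factor-variable notation, but terms wrapped in parentheses are added or removed as a unit, whereas official lasso selects individual dummy variables rather than whole groups:

[CODE]
* Joint test of a whole set of dummies within a fitted model
logistic lowconf i.district i.sex i.agecat
testparm i.district                 // one p-value for the district set

* stepwise does not allow factor variables, so create the dummies first;
* terms grouped in parentheses enter and leave the model together
tabulate district, generate(dist)   // creates dist1, dist2, dist3, dist4
stepwise, pr(0.05): logistic lowconf (dist2 dist3 dist4) sex owns_phone

* Comparing a handful of candidate models by information criteria
logistic lowconf i.district i.sex
estimates store m1
logistic lowconf i.district i.agecat
estimates store m2
estimates stats m1 m2               // AIC and BIC side by side
[/CODE]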
- Dunkler et al. 2014. Augmented Backward Elimination.
- Heinze & Dunkler 2016. Five Myths About Variable Selection.
- Heinze et al. 2018. Variable selection: a review and recommendations for the practicing statistician (this was a stretch for me, but still helpful, particularly if you are a practicing statistician or aspire to be!).