regression over all possible combination of variables

Sandra AJ Alba

Join Date: Mar 2015

Posts: 15
#1

12 Mar 2015, 08:03

Hi all,

Hopefully the last post of the day

I want to find the best predictive model. I have 10 covariates. I would like to run all multivariate regression models on all possible combinations of my 10 variables. What would be the smartest way of going about this in Stata?

Thanks, Sandra
daniel klein

Join Date: Mar 2014

Posts: 3820
#2

12 Mar 2015, 08:09

What would be the smartest way of going about this in Stata?

There is nothing "smart" about this type of model selection (and I in no way intend to attack you personally!). Here is a nice summary of problems with such approaches. Note especially the statement that

“All possible subsets” regression solves none of these problems.

Also see a recent related discussion on the list.

Best
Daniel
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

12 Mar 2015, 09:31

Hello Sandra,

After the useful tips Daniel gave you, I wish to add just a few comments.

Modeling can be somewhat considered a sort of art. Ok. But that means rationale, background knowledge, checking postestimations, etc. I fear data mining without theoretical rationale risks building an oxymoron, so to speak.

Regarding your particular model, unfortunately you didn't say anything about your dependent variable, the number of observations, the rationale in your field, or the distributions of the outcome variable and the covariates.

This paucity of information makes it hard to give sound recommendations. That said, you may want to read about AIC and BIC and some useful ways to compare models: http://www.stata.com/manuals13/restatic.pdf

Even so, AIC and BIC alone won't replace the need for a well-founded rationale and a careful interpretation of the models before choosing "the best one".

Best,

Marcos

Comment
Sandra AJ Alba

Join Date: Mar 2015

Posts: 15
#4

12 Mar 2015, 13:48

Many thanks for your inputs and useful links. I have quite some experience fitting explanatory models and have never been a fan of stepwise regression, preferring to get a deep feeling for the data and for both causal relations and confounding patterns. However, I am currently working on some predictive models, and my understanding is that my main aim should now be maximising predictive power. I intend to do some cross-validation (leave one observation out or something like it; I only have 32 datapoints), but first I want to get a number of possible models and look at the AIC of each model. I actually already have a multivariate model which I feel quite comfortable about as an explanatory model, but I am concerned that it may not be the best predictive model. That is why I wanted to explore some "all subsets" models... Comments welcome!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#5

12 Mar 2015, 14:02

Seriously? You have 32 data points and 10 variables, and you want to fit 1,024 models in search of the best AIC? To me, a model identified in that way would have zero credibility. (Well, epsilon.) Count me as having joined the chorus of other responders here who find this approach seriously ill-advised.
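
The figure of 1,024 follows from 2^10: each of the 10 covariates is either in or out of a given model. A minimal sketch in Python (not Stata) that enumerates the subsets to confirm the count; the names x1 to x10 are hypothetical, since the thread does not list the actual variables, and the count includes the intercept-only model.

```python
from itertools import combinations

# Hypothetical covariate names; the actual variables are not given in the thread.
covariates = [f"x{i}" for i in range(1, 11)]

# Every subset of the 10 covariates, from the empty (intercept-only)
# model up to the full 10-covariate model.
subsets = [
    combo
    for size in range(len(covariates) + 1)
    for combo in combinations(covariates, size)
]

n_models = len(subsets)   # equals 2 ** 10
n_obs = 32                # sample size mentioned in the thread

print(f"{n_models} candidate models for {n_obs} observations")
```

With these numbers there are 32 candidate models per observation, which is the imbalance the replies object to.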
Comment
daniel klein

Join Date: Mar 2014

Posts: 3820
#6

12 Mar 2015, 14:11

Cross-validation seems a good idea, but with N=32 the possibilities seem very limited. I would never think about some (semi-)automatic fit procedure with such a small sample. Even if your goal is predictive power, I guess it is pretty unlikely that the results you obtain will translate beyond your sample. Also, with N=32 I would not trust a model with more than 3 predictors.

Best
Daniel
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35405
#7

12 Mar 2015, 14:20

I am not saying anything because I am joint author of software that makes this easier but I agree strongly in this case that it is not advisable.

Do gun manufacturers sleep badly?
Comment
Sandra AJ Alba

Join Date: Mar 2015

Posts: 15
#8

12 Mar 2015, 14:53

Thanks for these useful inputs. That actually makes things a bit easier somehow! Although I realise this is now perhaps getting a bit off topic - with regards to cross validation, I wanted to try the "leave one observation out cross validation". The only statistics I see mentioned are R_sq and MSE. Can these be used to assess a logistic regression model?
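
On the MSE part of this question: for a binary outcome, the mean squared difference between predicted probabilities and the observed 0/1 values is usually called the Brier score, and it can be computed under leave-one-out cross-validation. A minimal sketch in Python (not Stata), under loud assumptions: the outcome data are made up, and the "model" is an intercept-only placeholder (the training-fold mean) standing in for a real logistic fit; `loocv_brier`, `fit_mean` and `predict_mean` are hypothetical names.

```python
def loocv_brier(y, fit, predict):
    """Leave-one-out cross-validated mean squared error of predicted
    probabilities against 0/1 outcomes (the Brier score)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        train = [j for j in range(n) if j != i]  # hold out observation i
        model = fit(train)                       # fit on the other n - 1 rows
        p = predict(model, i)                    # predicted probability for row i
        total += (p - y[i]) ** 2
    return total / n

# Made-up binary outcomes, for illustration only.
y = [0, 1, 1, 0, 1, 0, 1, 1]

# Placeholder model: predicts the mean outcome of the training fold.
# A real application would fit a logistic regression here instead.
def fit_mean(train):
    return sum(y[j] for j in train) / len(train)

def predict_mean(model, i):
    return model

score = loocv_brier(y, fit_mean, predict_mean)
print(round(score, 4))
```

Lower values indicate better probability predictions; a real candidate model would replace `fit_mean` and `predict_mean`, and the loop itself stays the same.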
Comment
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#9

12 Mar 2015, 15:31

Sandra,

As the other author of the software that makes this easier I have a somewhat different perspective. "All subsets, then pick the best" is not directly advisable - but can be useful, depending on how you use the all subsets approach (though the potential utility is lower with 32 cases).

My first suggestion is to read a good book on model selection (Burnham and Anderson, 2002 is a favorite of mine; Claeskens & Hjort, 2008 is another good one).

Second, completely agree with the folks above - can you whittle your model space down? All subsets is data intensive and likely not a good approach with so few obs.

Finally, if you really do want to do an all subsets, picking the "best" model is probably not the best approach (here's where the "how" you use the all subsets comes in). Consider model averaging all the models together (the program miinc [SSC] does this with the AIC or BIC and can be used with logistic or logit). This will hedge against issues related to overfitting; Burnham and Anderson have an extended discussion of the similarity of AIC-based and bootstrap-based shrinkage methods in their book. lars (SSC; as noted by Stephen Jenkins on the discussion Daniel linked to above) does something similar with respect to coefficient shrinkage.
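
For readers unfamiliar with AIC-based model averaging, the weights described by Burnham and Anderson are w_i = exp(-d_i/2) / sum_j exp(-d_j/2), where d_i is the difference between model i's AIC and the smallest AIC. A minimal sketch in Python (not Stata) with made-up AIC values and coefficients; `akaike_weights` is a hypothetical helper name, and this is not a description of how any particular Stata package implements the idea. Treating an excluded predictor's coefficient as zero is one averaging convention, assumed here for illustration.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: exp(-d_i / 2) normalised to sum to 1,
    where d_i = AIC_i - min(AIC)."""
    best = min(aic_values)
    rel = [math.exp(-(a - best) / 2) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Made-up AIC values for three candidate models (illustrative only).
aics = [100.0, 102.0, 110.0]
weights = akaike_weights(aics)

# Made-up estimates of one predictor's coefficient in each model;
# 0.0 marks a model that excludes the predictor (an assumed convention).
coefs = [0.50, 0.40, 0.0]
averaged = sum(w * b for w, b in zip(weights, coefs))

print([round(w, 3) for w in weights], round(averaged, 3))
```

Because models that omit the predictor pull its averaged coefficient toward zero, the averaged estimate is shrunk relative to the single "best" model's estimate.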

- joe

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.

Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging (Vol. 330). Cambridge: Cambridge University Press.

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4941
#10

12 Mar 2015, 16:13

Here are some of Paul Allison's thoughts on predictive modeling:

http://www.statisticalhorizons.com/p...ssion-analysis

Whether he might have a little more sympathy for what Sandra is trying to do, I can't tell. I do think he would want more than 32 cases though.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Jesus Pulido

Join Date: Oct 2016

Posts: 30
#11

28 Oct 2016, 13:49

Hi everyone,

On a topic related to Sandra's question, I would like to have your opinion on the use of multivariate regression.

I want to obtain the determinants of adoption of a system of 3 agricultural practices (minimum-till, inorganic fertilizer and modern maize varieties). As dependent variables, farmers can use any possible combination from these practices (8 different possibilities). The independent variables include several economic, demographic and social factors.

My question is whether multivariate regression would allow me to use combinations as dependent variables. I've seen several applications of multivariate regression, but they regress individual agricultural practices on the x's, not combinations of them.

My other option is a multinomial logit model. From what I have seen, this allows me to use combinations as the dependent variable, but it relies on the strong assumption of "independence of irrelevant alternatives", which might not be appropriate when describing farmers' behaviour.

Any advice would be much appreciated, Jesús P.
Comment
Bram Hogendoorn

Join Date: Jun 2017

Posts: 31
#12

19 Jul 2018, 05:27

The question asked by Sandra may be relevant to other Stata users as well. There can always arise a situation in which you would like to mine for the best fit, even though in this particular situation there are definitely not enough degrees of freedom.
Sandra, you may want to check out -stepwise- and -bfit-.
Comment
