Dear all,
I've a few questions and would appreciate it if any of you could share your expertise on these.
How do I get a 'confusion'/classification matrix showing observed vs. predicted categories after multinomial logistic regression (the outcome variable has 3 categories)? I need it to calculate the misclassification rate or overall accuracy. I used the 'crossfold' command, but it gives no option to do that. I found code posted online by a user showing that one can do this by identifying the category with the highest predicted probability for each subject and assigning the subject to that category.
Some other questions in general: I’m trying to build a multinomial logistic regression model for prediction purposes and have been asked to do cross-validation. A set of candidate independent variables (IVs) has been given to me. I used all the IVs and ran the multinomial logistic regression with k-fold cross-validation, and the model performs poorly: the pseudo-R2 values are very low and only a few IVs are significant. So I need to find a better model with a reduced number of predictors/IVs. My question is:
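For illustration only, the highest-probability approach described above might be sketched in Python roughly as follows. This is a minimal sketch, not the code I found online: the function names and the example data are made up, categories are assumed to be coded 0, 1, 2, and in practice the probabilities would be the model's predictions for held-out observations.

```python
def classify_by_max_probability(probabilities):
    """Assign each subject to the category with the highest predicted probability."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in probabilities]


def confusion_matrix(observed, predicted, n_categories):
    """Cross-tabulate observed (rows) against predicted (columns) categories."""
    matrix = [[0] * n_categories for _ in range(n_categories)]
    for obs, pred in zip(observed, predicted):
        matrix[obs][pred] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of subjects on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical data: observed categories and predicted probabilities
# for six subjects (one row per subject, one column per category).
observed = [0, 1, 2, 1, 0, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.6, 0.3],
]

predicted = classify_by_max_probability(probabilities)
matrix = confusion_matrix(observed, predicted, n_categories=3)
accuracy = overall_accuracy(matrix)
misclassification_rate = 1 - accuracy

for row in matrix:
    print(row)
print("accuracy:", round(accuracy, 3))
print("misclassification rate:", round(misclassification_rate, 3))
```

The misclassification rate is simply one minus the overall accuracy, so either can be read off the same matrix.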
1. Each time I try a new model (with a different set of IVs), do I do k-fold cross-validation? Then identify the best model and fit it on the entire data set?
2. Or do I first do these procedures (i.e. trying different models with different sets of predictors) on a particular subset of the entire sample, identify the most relevant/useful model, and then use k-fold cross-validation on the remaining sample only for this model?
3. Or do I complete the steps of finding the final model on the entire sample and then do k-fold cross-validation only for the final model?
I've heard that data used to develop the model should not be used for testing/validating it, because that gives overly “optimistic” results. Under options #1 and #3, some of the same data would end up in both the model ‘development’ and ‘validation’ samples, so in that regard those two options might not be reasonable. What is the right approach?
Would appreciate your opinion. Thanks!
Sincerely
Musarrat