Accuracy and goodness of fit for imbalanced data set

Benjamin Revell

Join Date: Feb 2024

Posts: 12
#1

27 Mar 2024, 13:39

Hi all,

I'm currently trying to appraise my model and would like to:

1. Test the accuracy of its predictions
2. Test the goodness of fit of the model

The dependent variable in my model has just 63 positive observations and over 5000 negative observations, and I have adopted a rare events regression.

When using estat class I find that none of the observations are classified as positive, so none of the positive observations are predicted correctly.

Is this just a feature of the imbalanced data set, where the majority class bias skews the model's ability to predict correctly? Or is there an argument for changing the Pr(D) >= .5 cut-off?
I also tried an ROC curve using lroc and got an area under the ROC curve of 0.8057, but I am unsure how this is interpreted.

Is there a goodness of fit test that is most suited to imbalanced data sets?

Would really appreciate any help, thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

27 Mar 2024, 18:09

Or is there an argument for changing the Pr(D) >= .5 cut-off.

Actually, there really isn't even any argument for using the .5 cut-off in the first place. Whenever you use -estat classification-, whether the event in question is rare or not, you should always specify a cutoff that makes sense in your data. The default to 0.5 is just silly. In fact, usually you will want to run -estat classification- several times with different cutoffs to get a good sense of what cutoff is sensible for your data.
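To make the effect of the cutoff concrete, here is a minimal sketch in Python (not Stata) using made-up predicted probabilities and outcomes; the function name classify_at_cutoff and all the numbers are hypothetical, chosen only to mimic a rare outcome. In Stata itself the corresponding step would be something like -estat classification, cutoff(0.05)-, repeated with different values.

```python
def classify_at_cutoff(probs, outcomes, cutoff):
    """Return (sensitivity, specificity) when Pr(D) >= cutoff is called positive."""
    tp = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities and observed outcomes (1 = positive).
# With a rare outcome, predicted probabilities rarely get anywhere near 0.5.
probs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30]
outcomes = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

for cutoff in (0.5, 0.1, 0.05):
    sens, spec = classify_at_cutoff(probs, outcomes, cutoff)
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

In this toy example the 0.5 cutoff classifies nothing as positive, just as in the original question, while lower cutoffs trade specificity for sensitivity.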

The ROC area of .81 is quite good for most contexts. It means that your model is rather good at distinguishing the positives from the negatives. More specifically, you have a very high two-point forced-choice probability. That means that if I were to randomly select one positive observation, and one negative observation from your data set, and tell Stata the corresponding predictors for the two observations, and ask which one was the positive observation, Stata would pick the right one 81% of the time.
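The forced-choice interpretation can be checked directly by brute force. The following is a small Python sketch (not Stata) with hypothetical scores: it compares every positive with every negative and counts how often the positive has the higher predicted probability, with ties counted as half. The function name pairwise_auc and the data are assumptions for illustration only.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities for positives and negatives.
pos = [0.30, 0.15, 0.05]
neg = [0.01, 0.02, 0.03, 0.04, 0.08, 0.10, 0.20]

print(round(pairwise_auc(pos, neg), 4))
```

Here the positive outranks the negative in 17 of the 21 possible pairs, so the area works out to 17/21, or about 0.81.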

One conventional goodness of fit test for dichotomous outcomes is the Hosmer-Lemeshow procedure (-estat gof- after -logit- or -probit-). It has been criticized, but for many purposes it works well. It will not, however, work very well with your data. In the best case scenario, all 63 of the positive observations will fall in the highest decile of predicted probability. But they will be only 63 out of about 500, so the overall number of expected observations in that decile is going to be dominated by the negative observations that make up the bulk of the decile's data. Ten groups is just too coarse-grained to work well for a data set with so few positives. You can probably improve on it by using -group(100)- instead of -group(10)-. That way you will get a better sense of the fit.
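The arithmetic behind the grouping argument can be sketched in a few lines of Python (not Stata). It assumes exactly 5000 negatives, since the question says only "over 5000", and it takes the best case in which every positive observation ranks above every negative. The function name is hypothetical.

```python
def top_group_positive_share(n_pos, n_neg, groups):
    """Best-case share of positives in the highest-probability group.

    Assumes all positives are ranked above all negatives and that groups
    are of (roughly) equal size, as in the Hosmer-Lemeshow grouping.
    """
    group_size = (n_pos + n_neg) // groups
    pos_in_top = min(n_pos, group_size)
    return pos_in_top / group_size


n_pos = 63
n_neg = 5000  # assumption: the post says only "over 5000"

for groups in (10, 100):
    share = top_group_positive_share(n_pos, n_neg, groups)
    print(f"groups={groups:3d}  best-case positive share in top group={share:.3f}")
```

With 10 groups, even a perfectly ranked top group holds only 63 positives among 506 observations. With 100 groups, the top group has about 50 observations and can consist entirely of positives, so the comparison of observed and expected counts is much more informative.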

You might look into the -calibrationbelt- command, available from Stata Journal. (Run -search calibrationbelt- and then follow the link to the package.) It provides another approach to assessing calibration of models predicting dichotomous outcomes. Once you install it, the help file contains a link to the Stata Journal article that explains how it works.
Benjamin Revell

Join Date: Feb 2024

Posts: 12
#3

28 Mar 2024, 15:33

Seems like the area under the ROC curve would be a great way of appraising my model. I've started looking into the calibration belt, and it's commonly used in the epidemiological field, so I'll try to incorporate this as well.
Thanks very much Clyde.