Hi everyone,
I am using a logit model (shown below) to investigate the effect of borrowers' minority status on the probability of loan approval, but both the Pearson chi2 test and the Hosmer-Lemeshow (HL) test indicate a poor goodness of fit (gof).
So I have the following questions:
1) Is the poor gof caused by the large sample of 2,491,476 observations? I believe the model already includes a rich set of controls in appropriate functional forms, because I followed the controls used in recent studies.
2) Despite the poor gof from the Pearson and HL tests, the model's "percent correctly predicted" is around 87%, which is very high. Can I regard the model as highly predictive even though the Pearson and HL tests indicate a poor gof?
Thanks!
Lei
The test results are as follows.
I used Pearson's chi2 test to examine the gof of the model and got:
Number of observations = 2,491,476
Number of covariate patterns = 1,636,678
Pearson chi2(1636649) = 2.48e+06
Prob > chi2 = 0.0000
which indicates a poor gof for the model.
In addition, I used the HL test to examine the gof and got:
Number of observations = 2,491,476
Number of groups = 10
Hosmer–Lemeshow chi2(8) = 260.64
Prob > chi2 = 0.0000
which also indicates a poor gof. However, in the table below, the observed and expected cell frequencies in each group are in very good agreement, which suggests to me that the model's gof should be good.
Table collapsed on quantiles of estimated probabilities
+-----------------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+--------+----------+--------+----------+--------|
| 1 | 0.6193 | 77335 | 77379.4 | 171813 | 171768.6 | 249148 |
| 2 | 0.7471 | 170945 | 172131.2 | 78204 | 77017.8 | 249149 |
| 3 | 0.8465 | 200800 | 198841.8 | 48346 | 50304.2 | 249146 |
| 4 | 0.8855 | 216880 | 216712.1 | 32268 | 32435.9 | 249148 |
| 5 | 0.9037 | 223495 | 223061.9 | 25652 | 26085.1 | 249147 |
|-------+--------+--------+----------+--------+----------+--------|
| 6 | 0.9166 | 227089 | 226821.4 | 22059 | 22326.6 | 249148 |
| 7 | 0.9275 | 229835 | 229762.4 | 19314 | 19386.6 | 249149 |
| 8 | 0.9378 | 232215 | 232373.8 | 16932 | 16773.2 | 249147 |
| 9 | 0.9492 | 234556 | 235019.0 | 14591 | 14128.0 | 249147 |
| 10 | 0.9900 | 237511 | 238557.9 | 11636 | 10589.1 | 249147 |
+-----------------------------------------------------------------+
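To see how a table that looks this well matched can still produce a large HL statistic, here is a minimal sketch that rebuilds the statistic from the ten rows above. The values are typed in by hand from the table, and the variable and scalar names (grp, contrib, gap_pp, hl) are illustrative only. The rescaling to N/100 assumes the observed and expected proportions stay exactly the same, which ignores sampling variability.

```stata
* Sketch: rebuild the Hosmer-Lemeshow statistic from the 10-group table above.
* Values are copied by hand from the table; names are illustrative.
clear
input double(grp obs1 exp1 obs0 exp0)
1   77335  77379.4 171813 171768.6
2  170945 172131.2  78204  77017.8
3  200800 198841.8  48346  50304.2
4  216880 216712.1  32268  32435.9
5  223495 223061.9  25652  26085.1
6  227089 226821.4  22059  22326.6
7  229835 229762.4  19314  19386.6
8  232215 232373.8  16932  16773.2
9  234556 235019.0  14591  14128.0
10 237511 238557.9  11636  10589.1
end

* Per-group contribution: (O1-E1)^2/E1 + (O0-E0)^2/E0
generate double contrib = (obs1 - exp1)^2/exp1 + (obs0 - exp0)^2/exp0

* Gap between observed and expected approval rates, in percentage points
generate double gap_pp = 100*(obs1 - exp1)/(obs1 + obs0)

* Sum the contributions and compare with a chi2 on 8 degrees of freedom
quietly summarize contrib
scalar hl = r(sum)
display "HL chi2(8)        = " %9.2f hl
display "p-value           = " %6.4f chi2tail(8, hl)

* Same proportions with 1/100 of the observations: the statistic scales with N
display "HL at N/100       = " %9.2f hl/100
display "5% critical, 8 df = " %9.2f invchi2tail(8, 0.05)

list grp gap_pp contrib, noobs
```

The sum should reproduce the reported 260.64 up to rounding in the table. The largest gap between observed and expected approval rates is below one percentage point, so the size of the statistic comes mainly from the roughly 249,000 observations in each group rather than from large discrepancies.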
The following is the logit model, with the approval decision as the outcome variable and a set of explanatory variables that are either dummy or continuous variables. There are no interaction or squared terms:
logit approval income_w dti20 dti20_30 dti30_36 dti36_49 dti50_60 fico680_699 fico700_719 fico720_739 ltv80 ltv80_85 ltv85_90 ltv90_95 origination_2019 refinance minority female age62 lender_top100 shadowbank fintech aus tract_minority_population_percen tract_owner_occupied_units tract_one_to_four_family_homes tract_median_age_of_housing_unit cra fhfa_index
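For question 2, the 87% figure is easier to judge against a naive benchmark. The sketch below computes the overall share of approved loans from the Obs_1 and Obs_0 columns of the HL table, taking those counts at face value. This share is the "percent correct" that a rule predicting approval for every loan would achieve. The names n1, n0 and base are illustrative.

```stata
* Sketch: naive benchmark for "percent correctly predicted", from the HL table counts.
* Counts are copied by hand from the Obs_1 and Obs_0 columns; names are illustrative.
clear
input double(grp obs1 obs0)
1   77335 171813
2  170945  78204
3  200800  48346
4  216880  32268
5  223495  25652
6  227089  22059
7  229835  19314
8  232215  16932
9  234556  14591
10 237511  11636
end

* Total approvals and denials across all ten groups
quietly summarize obs1
scalar n1 = r(sum)
quietly summarize obs0
scalar n0 = r(sum)

* Share approved = accuracy of always predicting "approved"
scalar base = n1/(n1 + n0)
display "Total observations = " %12.0fc n1 + n0
display "Share approved     = " %6.4f base
```

On the full data, running estat classification and lroc after the logit command would report the classification table and the area under the ROC curve, which can be compared with this benchmark.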
Here is the sample data. I divided it into two parts because dataex limits the number of variables:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float approval long income_w float(dti20 dti20_30 dti30_36 dti36_49 dti50_60 fico680_699 fico700_719 fico720_739 ltv80 ltv80_85 ltv85_90 ltv90_95 origination_2019 refinance)
1 208 0 0 1 0 0 1 0 0 0 0 0 0 1 0
1 190 0 0 0 1 0 1 0 0 0 0 0 0 1 0
1 132 0 0 0 1 0 1 0 0 0 0 0 0 1 0
1 127 0 0 0 1 0 1 0 0 0 0 0 0 1 0
1 171 0 0 0 1 0 1 0 0 0 0 0 0 0 0
1 125 0 0 0 1 0 1 0 0 0 0 0 0 1 0
1 152 0 0 0 1 0 1 0 0 0 0 0 0 0 0
1 150 0 0 0 1 0 0 1 0 0 0 0 0 1 0
1 208 0 0 0 1 0 0 1 0 0 0 0 0 1 0
1 208 0 0 0 1 0 1 0 0 0 0 0 0 1 0
end
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(minority female age62 lender_top100 shadowbank fintech aus tract_minority_population_percen) int(tract_owner_occupied_units tract_one_to_four_family_homes) byte tract_median_age_of_housing_unit float cra double fhfa_index
0 0 0 0 0 0 1 46.07 13975 15386 8 0 4.47
0 0 0 0 0 0 1 46.07 13975 15386 8 0 4.47
0 0 0 0 0 0 1 46.07 13975 15386 8 0 4.47
0 0 0 0 0 0 1 46.07 13975 15386 8 0 4.47
0 0 0 0 0 0 1 46.07 13975 15386 8 0 5.11
0 0 0 0 0 0 1 46.07 13975 15386 8 0 4.47
0 0 0 0 0 0 1 46.07 13975 15386 8 0 5.11
0 0 0 0 0 0 1 11.43 6612 7636 12 0 11.99
0 0 0 0 0 0 1 3.55 6004 6742 12 0 5.76
0 1 0 0 0 0 1 34.96 6938 8788 13 0 6.11
end