Dear all,
I am working on a count model (negative binomial) and trying to choose the best model using AIC or BIC.
My concern is that I have read articles saying that these criteria should not be used with clustered or weighted data, which are common in survey data. Because the way I constructed my data differs somewhat from that situation, I would like help determining whether AIC and BIC are appropriate here.
The model combines two data sources. The dependent variable comes from a surveillance database, so neither weights nor sampling design matter for it. The independent variables, however, come from Demographic and Health Survey (DHS) data, which require sample weights. The goal of the analysis is to identify statistically significant independent variables that explain variation in the dependent variable (as usual).
Before running the regression, I prepared the dataset for the independent variables by collapsing it by region, using the sample weights provided with the DHS datasets. I therefore do not use the "[iweight=weight]" option when running the regression, because the independent variables were already weighted during the collapse and the dependent variable needs no weight. The regression and test outputs for one of the models are shown below.
I was wondering whether it would be okay to use AIC or BIC for model comparison in this context.
Thank you.
Jungseok Lee
. xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
i.q3RF1 _Iq3RF1_1-3 (naturally coded; _Iq3RF1_1 omitted)
i.age_grp _Iage_grp_1-5 (naturally coded; _Iage_grp_5 omitted)
i.age_~p*inc~pe _IageXinc_t_# (coded as above)
note: _IageXinc_t_1 omitted because of collinearity
Iteration 0: log likelihood = -228.99003
Iteration 1: log likelihood = -225.47151
Iteration 2: log likelihood = -225.43393
Iteration 3: log likelihood = -225.43391
Generalized linear models                          No. of obs      =        84
Optimization     : ML                              Residual df     =        73
                                                   Scale parameter =         1
Deviance         =  80.20500562                    (1/df) Deviance =  1.098699
Pearson          =  71.00797014                    (1/df) Pearson  =  .9727119

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  5.629379
Log likelihood   = -225.4339137                    BIC             = -243.2446
------------------------------------------------------------------------------
             |                 OIM
     inc1000 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Iq3RF1_2 |  -.5920396   .3099836    -1.91   0.056    -1.199596    .0155171
   _Iq3RF1_3 |   .3788479   .3493202     1.08   0.278     -.305807    1.063503
 _Iage_grp_1 |   .9523037   .4453947     2.14   0.033      .079346    1.825261
 _Iage_grp_2 |  -4.379099   1.337118    -3.28   0.001    -6.999803   -1.758395
 _Iage_grp_3 |  -1.639636   .6600511    -2.48   0.013    -2.933312   -.3459597
 _Iage_grp_4 |  -3.685952   1.512576    -2.44   0.015    -6.650546    -.721358
    inc_type |  -3.278245   .6011499    -5.45   0.000    -4.456478   -2.100013
_IageXinc_~1 |  (omitted)
_IageXinc_~2 |   5.701549   1.390779     4.10   0.000     2.975672    8.427426
_IageXinc_~3 |   2.357005   .7785509     3.03   0.002      .831073    3.882936
_IageXinc_~4 |   3.255374   1.596704     2.04   0.041     .1258912    6.384857
       _cons |   4.277992   .5850281     7.31   0.000     3.131358    5.424626
------------------------------------------------------------------------------
. estat ic
-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |     84           .   -225.4339     11     472.8678    499.6068
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note
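In case it helps readers compare the numbers: the AIC and BIC in the glm header differ from those reported by estat ic, and the gap seems to come from different definitions rather than different fits. The sketch below is plain Python (not part of my Stata session) and only redoes the arithmetic from the output above. It assumes that glm reports AIC divided by the number of observations and a deviance-based BIC, while estat ic uses the usual likelihood-based forms; the variable names are mine.

```python
import math

# Values copied from the glm and estat ic output above.
ll = -225.4339137        # log likelihood
deviance = 80.20500562   # model deviance
n_obs = 84               # No. of obs
k = 11                   # estimated parameters (df in estat ic)
resid_df = n_obs - k     # 73, matches "Residual df"

# Likelihood-based criteria (assumed to be what estat ic reports).
aic_estat = -2 * ll + 2 * k
bic_estat = -2 * ll + k * math.log(n_obs)

# Assumed glm header definitions: AIC scaled by N, BIC built from the deviance.
aic_glm = aic_estat / n_obs
bic_glm = deviance - resid_df * math.log(n_obs)

print(f"estat ic style: AIC = {aic_estat:.4f}, BIC = {bic_estat:.4f}")
print(f"glm style:      AIC = {aic_glm:.6f}, BIC = {bic_glm:.4f}")
```

Under these assumptions the four values agree with the figures printed above to the displayed precision, so whichever version is used for model comparison, the same definition should be used for every candidate model.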