fitting the effect of a continuous var (vs. categorising it) in a gee model

Nafeesa Dhalwani

Join Date: Mar 2015

Posts: 4
#1

03 Mar 2015, 05:56

Hi All
I am using a GEE model for panel data analysis. I have data for 6 waves and I am looking at a dependent variable by wave. I first included wave as a continuous variable, and the output is as follows:

. xi:xtgee depvar wave , eform i(idauniq) fam(bin) link(logit) corr(exchangeable)

Iteration 1: tolerance = .64951125
Iteration 2: tolerance = .013703
Iteration 3: tolerance = .00064495
Iteration 4: tolerance = .00003776
Iteration 5: tolerance = 2.180e-06
Iteration 6: tolerance = 1.338e-07

GEE population-averaged model               Number of obs      =     57126
Group variable:                    idauniq  Number of groups   =     15783
Link:                                logit  Obs per group: min =         1
Family:                           binomial                 avg =       3.6
Correlation:                  exchangeable                 max =         6
                                            Wald chi2(5)       =   3965.99
Scale parameter:                         1  Prob > chi2        =    0.0000

-------------------------------------------------------------------------------
       depvar | Odds Ratio   Std. Err.        z    P>|z|   [95% Conf. Interval]
--------------+----------------------------------------------------------------
         wave |   1.212814   .0044039     53.14   0.000    1.204214    1.221477
        _cons |   .6416439   .0204127    -13.95   0.000    .6028574    .6829258
-------------------------------------------------------------------------------

I then entered it as a categorical variable, with the following output:

. xi:xtgee depvar i.wave , eform i(idauniq) fam(bin) link(logit) corr(exchangeable)
i.wave _Iwave_1-6 (naturally coded; _Iwave_1 omitted)

Iteration 1: tolerance = .55485767
Iteration 2: tolerance = .01454043
Iteration 3: tolerance = .00080836
Iteration 4: tolerance = .0000476
Iteration 5: tolerance = 2.803e-06
Iteration 6: tolerance = 1.711e-07

GEE population-averaged model               Number of obs      =     57126
Group variable:                    idauniq  Number of groups   =     15783
Link:                                logit  Obs per group: min =         1
Family:                           binomial                 avg =       3.6
Correlation:                  exchangeable                 max =         6
                                            Wald chi2(9)       =   4238.64
Scale parameter:                         1  Prob > chi2        =    0.0000

-------------------------------------------------------------------------------
       depvar | Odds Ratio   Std. Err.        z    P>|z|   [95% Conf. Interval]
--------------+----------------------------------------------------------------
     _Iwave_2 |   1.002251   .0199846      0.11   0.910    .9638378    1.042196
     _Iwave_3 |   1.698867   .0335438     26.84   0.000    1.634378      1.7659
     _Iwave_4 |   1.799413   .0354028     29.86   0.000    1.731346    1.870156
     _Iwave_5 |   2.217938   .0444731     39.73   0.000    2.132463    2.306839
     _Iwave_6 |    2.42179   .0493268     43.43   0.000    2.327016    2.520424
        _cons |   .7771827   .0243593     -8.04   0.000    .7308763    .8264229
-------------------------------------------------------------------------------

I want to compare the two models to see whether it is better to fit 'wave' as a continuous or a categorical variable. I know that one could use a likelihood ratio test to decide between the two, but you can't use an LRT within a GEE model. I am trying to use the testparm command to check which model is better, but I am not sure about the syntax and interpretation. Any help will be greatly appreciated.

Thanks,
Nafeesa

Last edited by Nafeesa Dhalwani; 03 Mar 2015, 05:59.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

03 Mar 2015, 08:58

With 57,126 observations on 15,783 groups, I would be hesitant to rely on p-value-based criteria to choose a model, even if a likelihood-ratio test were possible. In so large a sample, even tiny, meaningless differences in model fit will show up as "statistically significant." I would probably go a different route and do something like a Hosmer-Lemeshow calibration analysis: divide the data into deciles (or, in a data set this size, perhaps vingtiles) of predicted probability and then graphically compare the predicted and observed number of successes in each group. (I would not do a chi-square test from this.) I would do this for each model and then make a visual judgment about whether the discrete wave model is a substantially better fit.
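A minimal sketch of the calibration comparison described above, assuming the variables from the original post (depvar, wave, idauniq) are in memory. The names p_lin, p_cat, bin_lin, bin_cat, obs, and pred are made up for illustration, and the sketch plots observed proportions against mean predicted probabilities rather than counts. If wave is the only covariate, the predictions take at most six distinct values, so the deciles will collapse to a handful of groups.

```stata
* Assumes the poster's panel data are in memory (depvar, wave, idauniq).
xtset idauniq

* Model 1: wave entered as continuous
xtgee depvar c.wave, family(binomial) link(logit) corr(exchangeable)
predict double p_lin, mu
xtile bin_lin = p_lin, nquantiles(10)

* Model 2: wave entered as categorical
xtgee depvar i.wave, family(binomial) link(logit) corr(exchangeable)
predict double p_cat, mu
xtile bin_cat = p_cat, nquantiles(10)

* Observed proportion vs. mean predicted probability by bin, one plot per model
foreach m in lin cat {
    preserve
    collapse (mean) obs=depvar pred=p_`m', by(bin_`m')
    twoway (scatter obs pred) (function y = x, range(0 1)), ///
        xtitle("Mean predicted probability") ytitle("Observed proportion") ///
        legend(off) name(calib_`m', replace)
    restore
}
graph combine calib_lin calib_cat
```

Points lying close to the 45-degree line in both panels would suggest that the categorical specification buys little over the linear one.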

As an aside, if you are using current Stata (you're supposed to tell us if you're not), then you should no longer be using -xi-. It has been superseded by factor variable notation (-help fvvarlist-) in almost all estimation commands, and it works brilliantly with -margins-. (Factor variables are also available in Stata 12.)
Richard Williams

Join Date: Apr 2014

Posts: 4945
#3

03 Mar 2015, 09:12

I agree with the points Clyde made. But, in general, if you want/need to do a Wald test of categorical vs continuous, I think you can do something like

Code:

webuse nhanes2f, clear
logit diabete c.health o(1 2).health
testparm i.health

Basically, you include categorical and continuous versions of the same variable and then see whether the less restrictive categorical version gains you anything. Note that you have to drop 2 categories of the categorical variable to avoid perfect multicollinearity.
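Carried over to the GEE model in the original question, the same idea might look like the sketch below. It assumes the poster's data (depvar, wave, idauniq) are in memory and that wave is coded 1 to 6, and it uses factor-variable notation in place of -xi-.

```stata
* Assumes wave is coded 1-6 and the poster's data are in memory.
xtset idauniq

* Linear term plus indicators for waves 3-6
* (two levels omitted to avoid perfect multicollinearity)
xtgee depvar c.wave o(1 2).wave, family(binomial) link(logit) ///
    corr(exchangeable) eform

* Joint Wald test: do the indicators add anything beyond the linear term?
testparm i.wave
```

A small p-value here points to a departure from linearity on the log-odds scale, although in a sample this large the earlier caveat about trivially small differences still applies.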

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Nafeesa Dhalwani

Join Date: Mar 2015

Posts: 4
#4

03 Mar 2015, 09:31

Thank you, Clyde and Richard, for your responses. I will try the Hosmer-Lemeshow calibration analysis. For clarification, Richard: when you run the code in your example, you get a p-value of 0.66. Am I right in assuming that, in this example, the categorical health variable offers no better fit than the continuous one, and so health could be used as a continuous variable?
Richard Williams

Join Date: Apr 2014

Posts: 4945
#5

03 Mar 2015, 11:20

I would also look at the p values for the dummies and see if they are insignificant. In my example it seems reasonable to treat health as continuous.

Richard Williams

Join Date: Apr 2014

Posts: 4945
#6

03 Mar 2015, 11:30

Also, if you run

Code:

logit diabete i.health

the pattern of coefficients for the dummies looks very close to a linear relationship.

I think you can also do something like

Code:

webuse nhanes2f, clear
logit diabetes i.health
test 3.health = 2 * 2.health
test 4.health = 3 * 2.health, accum
test 5.health = 4 * 2.health, accum
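The same accumulated constraints could be written for the six-wave GEE model roughly as follows. This is only a sketch, assuming the poster's data are in memory and that wave is coded 1 to 6 with wave 1 as the base level.

```stata
* Assumes wave is coded 1-6 with wave 1 as the base level.
xtset idauniq
xtgee depvar i.wave, family(binomial) link(logit) corr(exchangeable)

* Under linearity in the log odds, the wave-k coefficient
* equals (k-1) times the wave-2 coefficient.
test 3.wave = 2*2.wave
forvalues k = 4/6 {
    local m = `k' - 1
    test `k'.wave = `m'*2.wave, accumulate
}
```

The final accumulated test is a joint Wald test of all four linearity constraints.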

Nafeesa Dhalwani

Join Date: Mar 2015

Posts: 4
#7

21 Apr 2015, 08:39

Thanks for this, Richard. Is there a way to do a test for trend within a GEE model?
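One possible approach, offered as a sketch rather than a definitive answer: after fitting wave as a factor variable, -contrast- with the orthogonal-polynomial operator p. splits the wave effect into linear, quadratic, and higher-order components. This assumes the poster's data (depvar, wave, idauniq) are in memory, that wave has six levels, and that the Stata version in use includes -contrast- (Stata 12 or later).

```stata
* Assumes the poster's data are in memory and wave has six levels.
xtset idauniq
xtgee depvar i.wave, family(binomial) link(logit) corr(exchangeable)

* Wald tests of the linear, quadratic, ... components of the wave effect
contrast p.wave, noeffects

* Joint test of the nonlinear components only (degrees 2 through 5)
contrast p(2 3 4 5).wave, noeffects
```

The linear component serves as the trend test, and the joint test of the higher-order terms indicates whether the trend departs from linearity on the log-odds scale.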