When I run -margins- without the -expression()- option after a binomial logit -glm- fit to grouped data, the predictive margins and their standard errors are much larger than they would be if the data were ungrouped. Experimenting with different datasets, I noticed that the ratio between the two sets of results always appears to be the same across the model's predictors, which suggests that the difference is some sort of scaling. My question is: how can I compute the scaling factor linking the grouped-data and individual-data margins results when I only have access to the grouped data?
To illustrate what I mean, below is an example in which I take individual data, group it, fit the same model to the individual and the grouped data, and then compare the predictive margins from the two models. In this example, the grouped data's margins and their standard errors are larger than the individual data's by a constant ratio: the individual-data values equal the grouped-data values multiplied by .03338.
The code and a comparison of the results follow:
Code:
webuse nhanes2f, clear
keep if !missing(diabetes, female, black, age, age2)

// Create denominator variable to set up grouped data
egen ycovpatt = group(diabetes female black age age2)
egen d = count(ycovpatt), by(ycovpatt)

// Create outcome variable for grouped data
gen diabetesg = diabetes * d

// Identify duplicate outcome-covariate patterns (so group data is dup == 1)
bysort ycovpatt: gen dup = _n

// Run model with full (ie individual) data and save results and predicted margins
qui glm diabetes i.female i.black c.age c.age2, family(binomial) link(logit)
eststo i
mat lli = `e(ll)'
qui margins female, at(age=(20 40 60))
mat mtabi = r(table)

// Repeat with grouped data
qui glm diabetesg i.female i.black c.age c.age2 if dup == 1, family(binomial d) link(logit)
eststo g
mat llg = `e(ll)'
qui margins female, at(age=(20 40 60))
mat mtabg = r(table)
Code:
. // Compare results and statistics
. esttab i g

--------------------------------------------
                      (1)             (2)
                 diabetes       diabetesg
--------------------------------------------
main
0.female                0               0
                      (.)             (.)

1.female            0.157           0.157
                   (1.66)          (1.66)

0.black                 0               0
                      (.)             (.)

1.black          0.721***        0.721***
                   (5.69)          (5.69)

age              0.132***        0.132***
                   (4.55)          (4.55)

age2           -0.000703*      -0.000703*
                  (-2.55)         (-2.55)

_cons           -8.150***       -8.150***
                 (-10.93)        (-10.93)
--------------------------------------------
N                   10335             345
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** The coefficients and standard errors are the same
. mat li lli

symmetric lli[1,1]
            c1
r1  -1808.5522

. mat li llg

symmetric llg[1,1]
            c1
r1  -1808.5522

. ** The log likelihoods are the same
. mat li mtabi

mtabi[9,6]
            1._at#     1._at#     2._at#     2._at#     3._at#     3._at#
          0.female   1.female   0.female   1.female   0.female   1.female
     b   .00129158  .00151003  .01769071  .02057441  .17840235  .19903016
    se   .00063654  .00074138  .00260439  .00293701  .07697392  .08082237
     z   2.0290627  2.0367778  6.7926462  7.0052315  2.3176986  2.4625627
pvalue    .0424519  .04167232  1.101e-11  2.466e-12  .02046571   .0137948
    ll   .00004398  .00005695   .0125862  .01481798  .02753624  .04062122
    ul   .00253918  .00296311  .02279522  .02633084  .32926845   .3574391
    df           .          .          .          .          .          .
  crit    1.959964   1.959964   1.959964   1.959964   1.959964   1.959964
 eform           0          0          0          0          0          0

. mat li mtabg

mtabg[9,6]
            1._at#     1._at#     2._at#     2._at#     3._at#     3._at#
          0.female   1.female   0.female   1.female   0.female   1.female
     b   .03869134  .04523528   .5299521   .6163378  5.3443137  5.9622514
    se   .01906858  .02220923  .07801851   .0879825  2.3058709  2.4211572
     z   2.0290627  2.0367778  6.7926462  7.0052315  2.3176986  2.4625627
pvalue    .0424519  .04167232  1.101e-11  2.466e-12  .02046571   .0137948
    ll   .00131761  .00170598  .37703864  .44389526  .82488989  1.2168706
    ul   .07606507  .08876458  .68286556  .78878034  9.8637376  10.707632
    df           .          .          .          .          .          .
  crit    1.959964   1.959964   1.959964   1.959964   1.959964   1.959964
 eform           0          0          0          0          0          0

. ** The predictive margins are different but the z values are the same
.
. // Divide the elements of the individual data's margins results matrix by those of the group data's
. mata: A = st_matrix("mtabi")
. mata: B = st_matrix("mtabg")
. mata: A:/B
                 1             2             3             4             5             6
    +-------------------------------------------------------------------------------------+
  1 |  .0333817126   .0333817126   .0333817126   .0333817126   .0333817126   .0333817126  |
  2 |  .0333817126   .0333817126   .0333817127   .0333817126   .0333817126   .0333817126  |
  3 |            1             1   .9999999992             1   .9999999997   .9999999997  |
  4 |  .9999999996   .9999999998   1.000000036   .9999999967   1.000000002   1.000000002  |
  5 |  .0333817127   .0333817127   .0333817126   .0333817126   .0333817126   .0333817126  |
  6 |  .0333817126   .0333817126   .0333817126   .0333817126   .0333817126   .0333817126  |
  7 |            .             .             .             .             .             .  |
  8 |            1             1             1             1             1             1  |
  9 |            .             .             .             .             .             .  |
    +-------------------------------------------------------------------------------------+

. ** As can be seen in rows 1 and 2, the difference between the models' predictive margins and their standard errors is a factor of .03338
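One possible reading of the constant, offered as a sketch rather than a confirmed answer: after -glm- with family(binomial d), the default prediction used by -margins- is mu, the expected number of successes (d times the predicted probability), and -margins- averages it over the observations in the estimation sample. If that reading is right, the grouped-data margin is the sum of d times the predicted probability divided by the number of grouped observations, whereas the individual-data margin is the same sum divided by the number of individuals, so the ratio would be the number of grouped observations over the total of d. This agrees with the N row of the esttab table: 345/10335 is approximately .0333817. The snippet below computes that ratio from the grouped data alone; the scalar names ngroups and nindiv are hypothetical.

```stata
* Run after the grouped-data glm, so that e(sample) marks the grouped observations
qui count if e(sample)
scalar ngroups = r(N)          // hypothetical name: number of grouped observations
qui summarize d if e(sample)
scalar nindiv = r(sum)         // hypothetical name: total individuals represented by the groups
display "scaling factor = " ngroups / nindiv
```

Equivalently, the factor is the reciprocal of the mean of d in the estimation sample, so multiplying the grouped-data margins and standard errors by it should recover the individual-data values under the assumption stated above.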
As an aside, note that following Clyde Schechter's comment on this list from Sept. 20, 2019, specifying the -expression()- option as -expression(predict(mu)/d)- when running -margins- with the grouped data provides predictive margins and delta method standard errors that are very close to those obtained with individual data.
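For concreteness, the call described in this aside might look like the following when run after the grouped-data model in the example. This is only a sketch of that suggestion applied to the variables above; the matrix name mtabgx is hypothetical.

```stata
* Run after the grouped-data glm; d is the denominator variable created earlier
margins female, at(age=(20 40 60)) expression(predict(mu)/d)
mat mtabgx = r(table)          // hypothetical name, for comparison with mtabi
mat li mtabgx
```

A plausible reason the results are close to, rather than identical to, the individual-data margins is that each grouped observation then counts once in the average instead of d times.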