GLM and predicted margins for grouped data: "rescaling" to individual data?

Juan_Gonzalez

01 Jun 2021, 09:46

When -margins- is run without the -expression()- option after -glm- with family(binomial) and link(logit) on grouped data, the predictive margins and their standard errors are much larger than they would be for the same data ungrouped. Trying different datasets, I noticed that the two sets of results always seem to differ by a single factor that is the same across the model's predictors, which suggests the difference is some sort of scaling. My question is: how can I compute the "scaling" factor linking the grouped and individual margins results when I only have access to grouped data?

To illustrate, the example below takes individual data, groups them, fits the same model to the individual and the grouped data, and compares the predictive margins from the two models. In this example, the individual-data margins and their standard errors equal the grouped-data ones multiplied by a constant factor of .03338; that is, the grouped-data results are larger by a factor of about 29.96.

Code:

webuse nhanes2f, clear
keep if !missing(diabetes, female, black, age, age2)

// Create denominator variable to set up grouped data
egen ycovpatt = group(diabetes female black age age2)
egen d = count(ycovpatt), by(ycovpatt)

// Create outcome variable for grouped data
gen diabetesg = diabetes * d

// Number the observations within each outcome-covariate pattern (the grouped data keep one row per pattern: dup == 1)
bysort ycovpatt: gen dup = _n

// Run model with the full (i.e., individual) data and save the results and predictive margins
qui glm diabetes i.female i.black c.age c.age2, family(binomial) link(logit)
eststo i
mat lli = `e(ll)'
qui margins female, at(age=(20 40 60))
mat mtabi = r(table)

// Repeat with grouped data: one row per pattern, with d as the binomial denominator
qui glm diabetesg i.female i.black c.age c.age2 if dup == 1, family(binomial d) link(logit)
eststo g
mat llg = `e(ll)'
qui margins female, at(age=(20 40 60))
mat mtabg = r(table)

Here are the results compared:

Code:

. // Compare results and statistics
. esttab i g

--------------------------------------------
                      (1)             (2)   
                 diabetes       diabetesg   
--------------------------------------------
main                                        
0.female                0               0   
                      (.)             (.)   

1.female            0.157           0.157   
                   (1.66)          (1.66)   

0.black                 0               0   
                      (.)             (.)   

1.black             0.721***        0.721***
                   (5.69)          (5.69)   

age                 0.132***        0.132***
                   (4.55)          (4.55)   

age2            -0.000703*      -0.000703*  
                  (-2.55)         (-2.55)   

_cons              -8.150***       -8.150***
                 (-10.93)        (-10.93)   
--------------------------------------------
N                   10335             345   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** The coefficients and t statistics (and therefore the standard errors) are the same
. mat li lli

symmetric lli[1,1]
            c1
r1  -1808.5522

. mat li llg

symmetric llg[1,1]
            c1
r1  -1808.5522    

. ** The log likelihoods are the same

. mat li mtabi

mtabi[9,6]
            1._at#     1._at#     2._at#     2._at#     3._at#     3._at#
         0.female   1.female   0.female   1.female   0.female   1.female
     b  .00129158  .00151003  .01769071  .02057441  .17840235  .19903016
    se  .00063654  .00074138  .00260439  .00293701  .07697392  .08082237
     z  2.0290627  2.0367778  6.7926462  7.0052315  2.3176986  2.4625627
pvalue   .0424519  .04167232  1.101e-11  2.466e-12  .02046571   .0137948
    ll  .00004398  .00005695   .0125862  .01481798  .02753624  .04062122
    ul  .00253918  .00296311  .02279522  .02633084  .32926845   .3574391
    df          .          .          .          .          .          .
  crit   1.959964   1.959964   1.959964   1.959964   1.959964   1.959964
 eform          0          0          0          0          0          0

. mat li mtabg

mtabg[9,6]
            1._at#     1._at#     2._at#     2._at#     3._at#     3._at#
         0.female   1.female   0.female   1.female   0.female   1.female
     b  .03869134  .04523528   .5299521   .6163378  5.3443137  5.9622514
    se  .01906858  .02220923  .07801851   .0879825  2.3058709  2.4211572
     z  2.0290627  2.0367778  6.7926462  7.0052315  2.3176986  2.4625627
pvalue   .0424519  .04167232  1.101e-11  2.466e-12  .02046571   .0137948
    ll  .00131761  .00170598  .37703864  .44389526  .82488989  1.2168706
    ul  .07606507  .08876458  .68286556  .78878034  9.8637376  10.707632
    df          .          .          .          .          .          .
  crit   1.959964   1.959964   1.959964   1.959964   1.959964   1.959964
 eform          0          0          0          0          0          0

. ** The predictive margins are different but the z values are the same
.
. // Divide the elements of the individual data's margins results matrix by those of the group data's
. mata: A = st_matrix("mtabi")

. mata: B = st_matrix("mtabg")

. mata: A:/B              
                 1             2             3             4             5             6
    +-------------------------------------------------------------------------------------+
  1 |  .0333817126   .0333817126   .0333817126   .0333817126   .0333817126   .0333817126  |
  2 |  .0333817126   .0333817126   .0333817127   .0333817126   .0333817126   .0333817126  |
  3 |            1             1   .9999999992             1   .9999999997   .9999999997  |
  4 |  .9999999996   .9999999998   1.000000036   .9999999967   1.000000002   1.000000002  |
  5 |  .0333817127   .0333817127   .0333817126   .0333817126   .0333817126   .0333817126  |
  6 |  .0333817126   .0333817126   .0333817126   .0333817126   .0333817126   .0333817126  |
  7 |            .             .             .             .             .             .  |
  8 |            1             1             1             1             1             1  |
  9 |            .             .             .             .             .             .  |
    +-------------------------------------------------------------------------------------+

. ** As rows 1 and 2 show, the individual-data predictive margins and standard errors equal the grouped-data ones times .03338 (rows 5 and 6 show the same for the confidence limits)
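
One check that may be relevant: in this example the factor coincides with the ratio of the two sample sizes reported by -esttab- above (345 grouped rows to 10335 individuals), which is one over the mean of d across the grouped rows. The sketch below only compares those numbers on this dataset. It assumes that the default prediction after the grouped -glm- is the expected count (d times the probability) rather than the probability itself, and I would not treat it as a general result without trying other data.

```stata
// Sketch: compare the observed factor (.03338) with the ratio of sample sizes
// Run after the example code above, with d and dup still in memory
display 345/10335

// The same ratio computed from the grouped rows only: 1 / mean(d)
quietly summarize d if dup == 1
display 1/r(mean)
```

If both lines display a value near .03338, the "scaling" in this example would be recoverable from the grouped data alone, since d is available there.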

As an aside, following Clyde Schechter's comment on this list from Sept. 20, 2019, specifying -expression(predict(mu)/d)- when running -margins- on the grouped data gives predictive margins and delta-method standard errors that are very close to those obtained with the individual data.
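
For concreteness, a minimal sketch of that -expression()- call, run right after the grouped-data model from the example. The matrix name mtabe is a placeholder introduced here for illustration; everything else reuses the variables defined above.

```stata
// Sketch: divide the grouped-data prediction by the denominator d
quietly glm diabetesg i.female i.black c.age c.age2 if dup == 1, family(binomial d) link(logit)
margins female, at(age=(20 40 60)) expression(predict(mu)/d)

// Save and list the results for comparison with mtabi (mtabe is a hypothetical name)
matrix mtabe = r(table)
matrix list mtabe
```

Because -margins- here averages over the grouped rows rather than over individuals, some small differences from the individual-data results are to be expected, which fits "very close" rather than identical.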
