I'm trying to make sense of the log-likelihood reported by Stata for a glm with family(gamma), which doesn't match that produced by R/statsmodels and seems to give a strange likelihood ratio test result.
As a minimal example, I am generating some random gamma-distributed response values y, whose mean varies with the predictor variable x:
Code:
. set obs 1000
. generate x = mod(_n, 2)
. set seed 12345
. generate y = rgamma(500, 2)
. replace y = rgamma(500, 2.1) if x == 1
The means differ significantly by x: a t-test is significant, and fitting a Gaussian GLM with a log link gives a significant coefficient for x and a significant overall likelihood ratio test:

Code:
. ttest y, by(x)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     500    1001.813     1.94138    43.41059     997.999    1005.628
       1 |     500    1049.878    2.138717    47.82317    1045.676     1054.08
---------+--------------------------------------------------------------------
combined |   1,000    1025.846    1.631506    51.59274    1022.644    1029.047
---------+--------------------------------------------------------------------
    diff |           -48.06466    2.888437               -53.73277   -42.39655
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -16.6404
Ho: diff = 0                                     degrees of freedom =      998

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. glm y x, family(gaussian) link(log)
[...]
Generalized linear models                         Number of obs   =      1,000
Optimization     : ML                             Residual df     =        998
                                                  Scale parameter =   2085.767
Deviance         =  2081595.82                    (1/df) Deviance =   2085.767
Pearson          =  2081595.82                    (1/df) Pearson  =   2085.767

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   10.48277
Log likelihood   = -5239.383583                   BIC             =    2074702

------------------------------------------------------------------------------
             |                 OIM
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0468623     .002818    16.63   0.000     .0413391    .0523854
       _cons |   6.909567    .0020387  3389.14   0.000     6.905571    6.913563
------------------------------------------------------------------------------

. estimates store m1

. glm y, family(gaussian) link(log)
[...]

. estimates store m2

. lrtest m1 m2

Likelihood-ratio test                                 LR chi2(1)  =    244.87
(Assumption: m2 nested in m1)                         Prob > chi2 =    0.0000
However, curiously, if I fit a gamma GLM, even though the coefficient for x is significant, the likelihood ratio test is very much not:

Code:
. glm y x, family(gamma) link(log)
[...]
Generalized linear models                         Number of obs   =      1,000
Optimization     : ML                             Residual df     =        998
                                                  Scale parameter =   .0019763
Deviance         =  1.968478435                   (1/df) Deviance =   .0019724
Pearson          =  1.972334751                   (1/df) Pearson  =   .0019763

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =      15.87
Log likelihood   = -7932.998033                   BIC             =  -6891.971

------------------------------------------------------------------------------
             |                 OIM
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0468622    .0028138    16.65   0.000     .0413472    .0523772
       _cons |   6.909567    .0019897  3472.72   0.000     6.905667    6.913466
------------------------------------------------------------------------------

. estimates store m1

. glm y, family(gamma) link(log)
[...]

. estimates store m2

. lrtest m1 m2

Likelihood-ratio test                                 LR chi2(1)  =      0.55
(Assumption: m2 nested in m1)                         Prob > chi2 =    0.4587
If I perform the same regression in R or statsmodels, the likelihood ratio test is significant (p < 0.001).

I am wondering whether this has to do with what is on page 33 of the glm manual, which says that, for the gamma family, Stata calculates the log-likelihood for the j-th observation as -(yj/μj) + ln(1/μj).
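If that is the formula being used, summing it at the fitted values should reproduce the reported log likelihood of -7932.998 above. A quick way to check (an untested sketch; mu and llj are just names I have made up):

Code:
// refit the gamma model and evaluate the manual's per-observation expression
glm y x, family(gamma) link(log)
predict double mu, mu                      // fitted means
generate double llj = -(y/mu) - ln(mu)     // -(yj/muj) + ln(1/muj)
quietly summarize llj
display "sum of llj = " %12.6f r(sum) "    e(ll) = " %12.6f e(ll)

If r(sum) comes out equal to e(ll), that would confirm that the reported log likelihood takes no account of the shape parameter.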
It's my understanding that the true log-likelihood for the gamma family, given the shape parameter k, would be -(k*yj/μj) + k*ln(k*yj/μj) - ln(Γ(k)) - ln(yj), which is the formula statsmodels uses. Stata's expression is what this reduces to when k = 1: the ln(yj) terms cancel, leaving -(yj/μj) + ln(1/μj). But because k in this example is much greater than 1 (it is 500), this would suggest that Stata's likelihood ratio statistic is underestimated. If I change the setup so that k is instead close to 0, the opposite happens: Stata's likelihood ratio test indicates significance when R/statsmodels does not, even when I draw the values of y independently of x.
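If I am reading the two formulas correctly, then for a fixed k all the terms that do not involve μj cancel when the log-likelihoods of the two nested models are differenced, so the full-likelihood LR statistic should simply be k times Stata's. Taking k as the reciprocal of the reported scale parameter (1/.0019763 ≈ 506, an assumption on my part), that would turn the 0.55 above into roughly 280, which is much more in line with the significant result from R/statsmodels. A rough sketch of that check (untested; k, mu1, mu0, llf1, llf0 are names I am introducing):

Code:
scalar k = 1/.0019763                        // shape taken from the reported scale parameter (assumption)

glm y x, family(gamma) link(log)             // model with x
predict double mu1, mu
generate double llf1 = -k*y/mu1 + k*ln(k*y/mu1) - lngamma(k) - ln(y)

glm y, family(gamma) link(log)               // intercept-only model
predict double mu0, mu
generate double llf0 = -k*y/mu0 + k*ln(k*y/mu0) - lngamma(k) - ln(y)

quietly summarize llf1
scalar ll_full_1 = r(sum)
quietly summarize llf0
scalar ll_full_0 = r(sum)

display "LR statistic using the full gamma log-likelihood: " 2*(ll_full_1 - ll_full_0)
display "k times Stata's reported LR chi2 of 0.55:          " k*0.55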
Have I missed something here? I could not find any previous discussion about likelihood ratio tests on Stata's gamma GLMs.