
  • log linear model and heteroskedasticity

    Dear statalisters,

    I am using linear regression to investigate factors influencing my right-skewed dependent variable. Because the residuals were heteroskedastic after the regression, I log-transformed the DV, which works much better (I checked graphically with -rvfplot- and additionally used -estat hettest- and -estat imtest-).
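
    As a rough illustration of what I did so far (a minimal sketch with placeholder names y, x1 and x2, not my actual variables):
    Code:
     gen log_y = ln(y)           // log-transform the right-skewed dependent variable
     regress log_y c.x1 i.x2     // refit the model on the log scale
     rvfplot, yline(0)           // residuals vs. fitted values
     estat hettest               // Breusch-Pagan / Cook-Weisberg test
     estat imtest                // information matrix test (incl. heteroskedasticity)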

    Still, I am concerned about heteroskedasticity in the relationship between the residuals and the right-hand-side variables of the model, because heteroskedasticity associated with the regressors can make correct inference from a log-linear model quite problematic (see Blackburn 2007 or Manning/Mullahy 2001).
    The only way I know of to perform such a test is -estat hettest, rhs mtest- or -estat hettest, rhs mtest(b)-.
    They produce quite different results depending on whether or not the p-values are corrected for multiple testing. With the Bonferroni correction I only see heteroskedasticity in one variable (out of 15), which makes me hope that I will get consistent estimates of the marginal effects. But if I don't adjust the p-values, six variables are significant. I am not sure how to interpret this big difference and hope you have suggestions.
    Since I have a lot of binary variables, I assume it makes no sense to investigate this graphically (with something comparable to -rvfplot-)?!
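
    The only rough graphical check I can think of for a binary regressor would be to compare the spread of the residuals across its two levels, e.g. (again a sketch with placeholder names, not my actual variables):
    Code:
     regress log_y c.x1 i.x2
     predict double resid, residuals
     graph box resid, over(x2)   // does the residual spread differ between the two groups?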

    I would prefer not to follow the suggestion in the papers cited above (and in previous Statalist posts) to leave the dependent variable untransformed and use Poisson or GLM with a log link. To my knowledge, because of the asymptotic properties of these models, they are not feasible for my rather small N of 150 cases. Please correct me if I am wrong about this!
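
    For completeness, this is what I understand the suggested alternative to look like (a sketch with placeholder names only):
    Code:
     glm y c.x1 i.x2, family(poisson) link(log) vce(robust)   // y left untransformed, log link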

    I highly appreciate any comments!

    References:
    Blackburn, McKinley L. 2007: Estimating wage differentials without logarithms. Labour Economics 14(1): 73–98.
    Manning, Willard G.; Mullahy, John 2001: Estimating log models: to transform or not to transform? Journal of Health Economics 20(4): 461–494.

  • #2
    Jasmin:
    - did you perform -ovtest- as well?
    - what if you invoke robust standard errors for your predictors?
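    For instance (just a sketch with placeholder names, to be adapted to your model):
    Code:
     regress log_y c.x1 i.x2
     estat ovtest                           // Ramsey RESET test for omitted variables
     regress log_y c.x1 i.x2, vce(robust)   // same model with heteroskedasticity-robust SEs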
    As a general rule, your chance of getting helpful replies is conditional on posting (via code delimiters: # icon among the Advanced editor options) what you typed and what Stata gave you back.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thank you Carlo for your reply (and sorry for the triplicate post)

      Carlo suggested
      what if you invoke robust standard errors for your predictors
      My intention is to interpret linear effects. To achieve this I have to retransform the coefficients from the log-linear model. In order to use simple retransformation (exponentiating the coefficients) and get consistent estimates of the linear effects, there should be no heteroskedasticity associated with the rhs regressors in the model. According to Blackburn (2007), using robust standard errors is not the solution to this; using that option usually just implies that the researcher assumes there is heteroskedasticity to deal with.
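
      What I mean by simple retransformation, roughly (placeholder names again, and assuming the simple exponentiation is valid, which is exactly what heteroskedasticity threatens):
      Code:
       regress log_y c.x1 i.x2
       display exp(_b[x1])                     // multiplicative effect of a one-unit change in x1
       regress log_y c.x1 i.x2, eform(ratio)   // or request exponentiated coefficients directly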

      Carlo also asked
      did you perform -ovtest- as well?
      Thanks for this suggestion. There seems to be no evidence for omitted variables.
      Code:
       estat ovtest
      This resulted in:
      Ramsey RESET test using powers of the fitted values of log_p_dauerf1f84korr
      Ho: model has no omitted variables
      F(3, 121) = 1.37
      Prob > F = 0.2554


      I am sorry I didn't post my syntax and results right away. I hope this is helpful.

      Code:
      reg log_p_dauerf1f84korr i.kwz_1 i.kwz_2 /*kwz_3*/ i.kwz_4 i.kwz_5 i.kwz_6 i.f_migbackzp_v2 i.f_isced_2katv2 c.pf_alterf84##c.pf_alterf84 ///
      i.m_migbackzp_v2 i.m_isced_2katv2 c.pm_alterf84##c.pm_alterf84 ///
      /*p_bezdauerf84*/ i.p_unmarriedf84 ///
      i.p_prozf84v3 ///
      if filter1==1, level(90)

      * for the general test of heteroskedasticity I used:
      Code:
       estat hettest,normal
      Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
      Ho: Constant variance
      Variables: fitted values of log_p_dauerf1f84korr

      chi2(1) = 0.29
      Prob > chi2 = 0.5882

      * the test for the rhs-variables (unadjusted) is
      Code:
       estat hettest, rhs mtest

      Variable                       chi2    df   p

      1.kwz_1                        0.03     1   0.8677  #
      1.kwz_2                        2.39     1   0.1223  #
      1.kwz_4                       10.89     1   0.0010  #
      1.kwz_5                        4.72     1   0.0298  #
      1.kwz_6                        1.44     1   0.2297  #
      1.f_migbac~2                   0.01     1   0.9049  #
      1.f_isced_~2                   2.64     1   0.1043  #
      pf_alterf84                    3.90     1   0.0483  #
      c.pf_alterf84#c.pf_alterf84    3.12     1   0.0774  #
      1.m_migbac~2                   0.06     1   0.8014  #
      1.m_isced_~2                   0.05     1   0.8183  #
      pm_alterf84                    3.98     1   0.0460  #
      c.pm_alterf84#c.pm_alterf84    3.73     1   0.0533  #
      1.p_unmar~84                   2.86     1   0.0910  #
      p_prozf84v3
        Frau < Mann                  0.05     1   0.8269  #
        Frau > Mann                  5.94     1   0.0148  #
        ein Part..)                  1.62     1   0.2038  #
        ein Part..)                  5.31     1   0.0212  #

      simultaneous                  39.90    18   0.0022

      # unadjusted p-values

      * the test for the rhs-variables (Bonferroni-adjusted) is
      Code:
       estat hettest, rhs mtest(b)

      Variable                       chi2    df   p

      1.kwz_1                        0.03     1   1.0000  #
      1.kwz_2                        2.39     1   1.0000  #
      1.kwz_4                       10.89     1   0.0174  #
      1.kwz_5                        4.72     1   0.5370  #
      1.kwz_6                        1.44     1   1.0000  #
      1.f_migbac~2                   0.01     1   1.0000  #
      1.f_isced_~2                   2.64     1   1.0000  #
      pf_alterf84                    3.90     1   0.8686  #
      c.pf_alterf84#c.pf_alterf84    3.12     1   1.0000  #
      1.m_migbac~2                   0.06     1   1.0000  #
      1.m_isced_~2                   0.05     1   1.0000  #
      pm_alterf84                    3.98     1   0.8272  #
      c.pm_alterf84#c.pm_alterf84    3.73     1   0.9603  #
      1.p_unmar~84                   2.86     1   1.0000  #
      p_prozf84v3
        Frau < Mann                  0.05     1   1.0000  #
        Frau > Mann                  5.94     1   0.2670  #
        ein Part..)                  1.62     1   1.0000  #
        ein Part..)                  5.31     1   0.3814  #

      simultaneous                  39.90    18   0.0022

      # Bonferroni-adjusted p-values








      • #4
        Jasmin:
        as far as I can read the syntax you posted (for future messages, the best way to post is via code delimiters, i.e. the # button), your first test of heteroskedasticity seems to support the null hypothesis of no heteroskedasticity. I would probably have stopped my regression postestimation there.
        The opposite results (i.e., heteroskedasticity) that you have obtained with -mtest(b)- may well be due to the high number of categorical predictors included in the right-hand side of your equation.
        Hence, I would think about these two opposite results also in the light of:
        - the theory that underpins your research field;
        - what others have done in dealing with your same research topic.
        Unfortunately, as often occurs, there's no hard and fast rule: the final call about which test to trust is unavoidably yours.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Dear Jasmin,

          I may be a bit biased on this topic, but I think it is much safer to just use Poisson pseudo-maximum likelihood. Your sample is not very large, but as long as you do not use too many regressors you should be OK. Notice that even if you use OLS, the inference you make is generally only asymptotically valid anyway, so you have little to lose. Also, keep in mind that the pattern of heteroskedasticity that matters is for the model in levels, and therefore the results you are getting may be somewhat misleading.

          Therefore, I suggest that you try PPML at least for comparison. If the results you get with the two methods (i.e., OLS in logs and PPML) are similar, that is reassuring; if they are different, then I would trust PPML!
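
          Something along these lines would do (a sketch with placeholder names y, x1, x2, where y is the untransformed dependent variable):
          Code:
           regress log_y c.x1 i.x2, vce(robust)   // OLS in logs
           poisson y c.x1 i.x2, vce(robust)       // Poisson pseudo-ML with y in levels (PPML)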

          All the best,

          Joao




          • #6

            Thank you Carlo for your comment on the number of variables; I will try to pare the number down and check whether that makes any difference. I thought that the difference between the tests might have something to do with the adjustment of the p-values for multiple testing. I am wondering whether the test for heteroskedasticity with adjustment is more conservative in the sense that it needs stronger evidence to be significant?! (but I have to admit this is the first time I am dealing with multiple testing)

            Also many thanks to Joao for suggesting to use PPML despite the small sample and to compare it with the log-linear model. I will try that, even though I had learned that in small-n situations OLS would be the safer choice. Manning/Mullahy (2001) also suggest trying GLM with a log link and family(gaussian/gamma) and using the Park test to decide on the appropriate model.
            I know the question of how many cases with how many variables when using ML or PML is difficult to answer, but does anyone have recommendations for references?



            • #7
              Jasmin:
              as an add-on to formal post-estimation tests, I would also visually inspect the pattern of the regression residuals (via -rvfplot-, for instance) and see whether a peculiar distribution pattern emerges.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Dear Jasmin,

                Be careful with the use of the Park test because it is not valid for that purpose; we show that in The Log of Gravity. About the number of regressors to use, the general rule is that the square of the number of parameters divided by the number of observations should be "small" (Steve Portnoy has some well-known papers on this). So, with 150 observations I would try to have fewer than 10 parameters. Notice that this applies both to linear and to non-linear models.
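
                Just to illustrate the arithmetic: with n = 150 observations, k = 10 parameters gives k²/n = 100/150 ≈ 0.67, whereas k = 15 would already give 225/150 = 1.5.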

                All the best,

                Joao



                • #9
                  Thanks again for those helpful remarks. The paper is helpful, and I have a better idea now of what I have to do next.



                  • #10
                    Dear Statalisters, I have a follow-up question concerning model choice:
                    I compared the log-linear model with exponentiated coefficients (eform) with: glm with family(poisson) and link(log); glm with family(gamma) and link(log); and ppml.

                    Code:
                    glm p_dauerf1f84korr i.f_migbackzp_v2 i.f_isced_2katv2 c.pf_alterf84##c.pf_alterf84 ///
                    i.m_migbackzp_v2 i.m_isced_2katv2 c.pm_alterf84##c.pm_alterf84  ///
                    /*i.p_childf84*/ c.p_bezdauerf84 i.p_unmarriedf84 if filter1==1, level(90) family(gamma) link(log)  vce(robust) eform
                    
                    glm p_dauerf1f84korr i.f_migbackzp_v2 i.f_isced_2katv2 c.pf_alterf84##c.pf_alterf84 ///
                    i.m_migbackzp_v2 i.m_isced_2katv2 c.pm_alterf84##c.pm_alterf84  ///
                    /*i.p_childf84*/ c.p_bezdauerf84 i.p_unmarriedf84 if filter1==1, level(90) family(poisson) link(log)  vce(robust) eform
                    
                    reg log_p_dauerf1f84korr i.f_migbackzp_v2 i.f_isced_2katv2 c.pf_alterf84##c.pf_alterf84 ///
                    i.m_migbackzp_v2 i.m_isced_2katv2 c.pm_alterf84##c.pm_alterf84  ///
                    /*i.p_childf84*/ c.p_bezdauerf84 i.p_unmarriedf84 if filter1==1, level(90) eform(var) vce(robust)
                    
                    ppml p_dauerf1f84korr f_migbackzp_v2 f_isced_2katv2 pf_alterf84 pf_alterf84_sq ///
                    m_migbackzp_v2 m_isced_2katv2 pm_alterf84 pm_alterf84_sq  ///
                    /*i.p_childf84*/ p_bezdauerf84 p_unmarriedf84 if filter1==1
                    There is really no difference between the Poisson model and the ppml, therefore I will go with the glm-Poisson.
                    Comparing the coefficients and p-values of the log-linear model (1), glm-Poisson (2) and glm-gamma (3) shows coefficients of rather similar size and direction, but significance changes from model to model.
                    Code:
                                        (1)            (2)            (3)
                                 log_p_daue~r   p_dauerf1f~r   p_dauerf1f~r
                    
                    main
                    1.f_migbac~2        1.364+         1.319          1.222
                                       (0.094)        (0.169)        (0.212)
                    
                    1.f_isced_~2        1.125          1.107          1.109
                                       (0.425)        (0.454)        (0.420)
                    
                    pf_alterf84         0.892          0.745*         0.772
                                       (0.518)        (0.048)        (0.184)
                    
                    c.pf_alte~84        1.001          1.004+         1.004
                                       (0.657)        (0.085)        (0.273)
                    
                    1.m_migbac~2        0.694          0.823          0.867
                                       (0.119)        (0.372)        (0.447)
                    
                    1.m_isced_~2        0.747+         0.815          0.838
                                       (0.052)        (0.152)        (0.152)
                    
                    pm_alterf84         0.895          0.961          0.944
                                       (0.329)        (0.687)        (0.527)
                    
                    c.pm_alte~84        1.002          1.001          1.001
                                       (0.269)        (0.650)        (0.508)
                    
                    p_bezdaue~84        1.033          1.049*         1.053*
                                       (0.106)        (0.007)        (0.003)
                    
                    1.p_unmar~84        1.589*         1.730*         1.639*
                                       (0.002)        (0.000)        (0.000)
                    
                    N                     156            158            158
                    BIC                 438.9          503.6          469.2
                    ll                 -191.7         -224.0         -206.8
                    Exponentiated coefficients; p-values in parentheses
                    + p<0.10, * p<0.05
                    Joao wrote
                    Therefore, I suggest that you try PPML at least for comparison. If the results you get with the two methods (i.e., OLS in logs and PPML) are similar, that is reassuring; if they are different, then I would trust PPML!
                    Do you have any suggestions on how to decide between glm models with different families (am I right that BIC is not valid because of the different distributions/families)?
                    The GNR test (Gauss-Newton regression test) proposed by Joao in the paper referenced above might be an option, but how can it be performed in Stata (I could not find any ado or syntax)?

                    Thank you!



                    • #11
                      Dear Jasmin,

                      I am glad -ppml- and -glm- with the Poisson option give exactly the same results; that is how it should be (the only difference between the two commands is that -ppml- is likely to converge in cases where -glm- doesn't).

                      About how to choose between the models, you are right in saying that BIC is not useful in this case; the reason is that you are not estimating the models by maximum likelihood, and the BIC is likelihood-based. There are several ways to choose between these models, including the GNR test you mentioned. However, it looks like you lose 2 observations when estimating the log-linear model, so I would start by sorting that out.

                      There is very little to choose between the Poisson- and gamma-based models, so you can almost toss a coin to pick one. One thing you can do is choose the one that gives you smaller standard errors for the coefficient of interest. If you want to be more formal, you can implement the GNR test quite easily; it is based on a simple OLS regression, which is why there is no specific code available.

                      All the best,

                      Joao



                      • #12
                        Thanks again for those helpful remarks.

                        I was searching the literature to back up Joao's argument that Gamma PML and Poisson PML are roughly equivalent. I found a reference for that in Wooldridge (2010, p. 741), who writes that both are consistent and rather robust to misspecification other than of the conditional mean. This might be helpful to others dealing with similar problems. The choice seems to be rather a matter of precision, as Manning/Mullahy (2001) put it.

                        Now in my case the conditional mean is defined by the log link, as I understand it. I chose the log link because I am coming from the log-linear model, which I chose because of the skewness of my dependent variable. In the GLM literature it is generally recommended to check the log link (e.g. Hardin/Hilbe 2012), for example by using the command -linktest-.
                        I performed the linktest and found that _hatsq was significant, which should imply that the link is not correctly specified. For some reason I am reluctant to choose a different link, and I am actually wondering whether I need to check the link at all, since it is motivated by the skewness of my dependent variable. I looked at different papers (Blackburn 2007; Manning/Mullahy 2001) that deal with similar problems; both use the log link and compare different families, but do not test whether the log link is correct.
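
                        For reference, this is roughly how I ran it (placeholder names again, not my actual variables):
                        Code:
                         glm y c.x1 i.x2, family(gamma) link(log) vce(robust)
                         linktest      // a significant _hatsq suggests the link may be misspecified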

                        Does anyone have suggestions on how to proceed, or arguments to motivate a certain choice?

                        References:
                        Wooldridge, Jeffrey M. 2010: Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Mass.: MIT Press.
                        Manning, Willard G.; Mullahy, John 2001: Estimating log models: to transform or not to transform? Journal of Health Economics 20(4): 461–494.
                        Blackburn, McKinley L. 2007: Estimating wage differentials without logarithms. Labour Economics 14(1): 73–98.
                        Hardin, James W.; Hilbe, Joseph M. 2012: Generalized Linear Models and Extensions. 3rd ed. College Station, Tex.: Stata Press.



                        • #13
                          Hello again,

                          I do not know how you performed the link test, but make sure you use a method that is valid under heteroskedasticity/clustering.
                          Anyway, you do not have to abandon the log link; you may be able to just use a more flexible specification of the index function, for example including cross-products and squares of the regressors.
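
                          For example (a sketch; x1 and x2 stand in for your continuous regressors, d1 for a binary one):
                          Code:
                           glm y c.x1##c.x1 c.x2##c.x2 c.x1#c.x2 i.d1, family(poisson) link(log) vce(robust)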

                          Joao



                          • #14
                            Dear Joao,

                            I am trying to perform the Gauss-Newton regression test to evaluate the adequacy of PPML, as you do in "The Log of Gravity". I already have the RESET test with 0.12 (done as The Log of Gravity says), but I would really like to perform the GNR test; could you help me to do it, please? I have already performed the Park-type test on the OLS.

                            Thank you very much.


                            Felipe



                            • #15
                              Dear Felipe,

                              The test is performed as described in the paper; I am not sure how I can help you further. Anyway, 12 years after having written the paper, that test does not look that important.

                              Best wishes,

                              Joao

