Proper model for highly skewed dependent variable.

Steve Barry

Join Date: Nov 2020

Posts: 10
#1

26 Dec 2020, 09:22

Hello,

I previously asked a similar question, but it was only partly answered, so I'm narrowing the scope of this question to two specific parts.
A bit of background to my questions:
I come from the field of strategic management, where the prevalent technique in large-sample analysis is fixed-effects or random-effects regression whenever the structure of the data allows.
Given that note, I see many papers in "A Journals" that treat dependent variables such as R&D intensity or R&D expenditure with either fixed-effects or random-effects regression.
My specific questions are as follows:

1. How "wrong" is it, technically, to use such models for positive dependent variables? Specifically, in the case of R&D intensity (R&D expenditure divided by sales), the value rarely goes above 2, so I think xtreg should give a somewhat incorrect estimate?

2. How "correct" is it, technically, to use xtpoisson or glm with a log link for such dependent variables? I've compared the results of xtpoisson and glm with the Poisson family against xttobit (with ul and ll defined), and they're highly consistent, which signals to me that these (xtpoisson and glm) are far better estimators than xtreg.

My main concern, leading me to ask the above questions, is that some relationships are significant using xtreg, but not xtpoisson, and vice versa. I even see the same authors picking methods for the same DV arbitrarily in different papers, which is very confusing for me as a junior and inexperienced researcher.
Your answers are much appreciated in advance, and any reference on the "correctness" or "incorrectness" of such scenarios would be highly appreciated.

Thanks.

Last edited by Steve Barry; 26 Dec 2020, 09:30.
Tags: fixed effects, GLM, panel data, regression, xtpoisson
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#2

26 Dec 2020, 21:51

Steve: I think fixed effects Poisson makes the most sense. I would apply it to R&D with log(sales) as an explanatory variable. This is the same as using R&D/sales if you include log(sales) as an explanatory variable. Contrast this with the linear model, where they wouldn’t be the same.

I’ve seen examples with nonnegative outcomes, usually with a spike at zero, where the linear model gives crazy results and the exponential model using xtpoisson gives sensible results. The functional form can matter a lot here. Of course, you should always use robust standard errors.
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#3

26 Dec 2020, 23:12

Originally posted by Steve Barry

. . . some relationships are significant using xtreg, but not xtpoisson, and vice versa. I see even the same authors picking methods for the same DV in different papers arbitrarily . . .

Maybe you can write to the various authors and ask them why they chose one model type over another on different occasions. For the sake of the field of strategic management, I hope that their answers don't make it look like you've just answered your own question.
Steve Barry

Join Date: Nov 2020

Posts: 10
#4

27 Dec 2020, 19:02

Thank you so much for your input.

I actually ran two models for one of m: one with xtpoisson and the other with xtreg fixed-effects. The residual sum of squares for xtpoisson is about 128, while it is about 432 for xtreg. The sample size is about 15600.

If I may follow up with another question, as I'm mostly concerned about reviewers being very much used to OLS: Is there any reference discussing the use of the Poisson family with continuous data?
Also, how can I communicate to reviewers that using OLS is indeed less efficient? Maybe by running OLS and checking diagnostics?

I'd highly appreciate your thoughts and insights.
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#5

27 Dec 2020, 20:59

Originally posted by Steve Barry

Is there any reference discussing the use of the Poisson family with continuous data?

Yes, there's one that I'm aware of.

What Jeff recommends is described in some detail on the StataCorp blogsite here.

Where Jeff says, "include log(sales) as an explanatory variable", I suggest using the exposure(varname) option of the xtpoisson, fe command to do so, which includes ln(varname) with its coefficient constrained to 1. That will give you the R&D/sales ratio that Jeff mentions.
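
To make the difference between the two suggestions concrete, here is a minimal sketch on simulated data. The variable names (id, year, rd, sales, lsales, x1) and all parameter values are made up for illustration and are not from Steve's data.

```stata
* Simulated panel; all names and parameter values are hypothetical
clear
set seed 2020
set obs 500
gen id = _n
gen alpha = rnormal()                  // unit-level heterogeneity
expand 6                               // 6 periods per unit
bysort id: gen year = _n
gen x1 = rnormal() + 0.5*alpha         // regressor correlated with alpha
gen sales = exp(3 + rnormal())
gen rd = sales * exp(-3 + 0.3*x1 + 0.5*alpha + 0.4*rnormal())
gen lsales = ln(sales)
xtset id year

* (a) log(sales) as a regressor with a free coefficient
xtpoisson rd x1 lsales i.year, fe vce(robust)
test lsales = 1                        // is the ratio specification rejected?

* (b) exposure(sales): ln(sales) enters with its coefficient fixed at 1,
*     so the exponential mean is effectively for rd/sales
xtpoisson rd x1 i.year, fe exposure(sales) vce(robust)
```

Version (a) nests version (b), so the test of a unit coefficient on lsales is one way to check whether modeling the ratio directly is too restrictive.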
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#6

27 Dec 2020, 22:00

I proved consistency of the Poisson FE estimator without any assumptions other than the conditional mean is correct in my 1999 Journal of Econometrics paper, “Distribution-Free Estimation of Some Nonlinear Panel Data Models.” The leading case is Poisson FE with an exponential mean function.

You’re on the right track with your goodness-of-fit measures. I’ll have some other suggestions, too.
Steve Barry

Join Date: Nov 2020

Posts: 10
#7

28 Dec 2020, 05:08

Thanks so much Joseph and Jeff! I'm really grateful.

Jeff, I will look up your publication right away, thanks so much for mentioning it. Are the other suggestions you mentioned covered in the same paper, or are they something you will be adding in this thread?

Thanks so much again.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#8

28 Dec 2020, 20:23

Hi Steve:

The only real issue is functional form of the mean -- given that you're controlling for the same explanatory variables and you are using FE estimation in both cases. This is a rare case where comparing sums of squared residuals makes sense when using FE because these are the only two estimators I know of where putting in dummy variables is the same as using an argument to remove the heterogeneity. Only linear FE and Poisson FE have this feature as far as I know.
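
As a rough illustration of that comparison, the sketch below computes both sums of squared residuals on simulated data. All names (id, year, y, x1) and the data-generating process are made up for illustration. The Poisson fitted values rely on the fact that the dummy-variable Poisson estimate of each unit effect equals the unit's total outcome divided by the unit's total of exp(xb).

```stata
* Simulated panel; all names and parameter values are hypothetical
clear
set seed 2020
set obs 500
gen id = _n
gen alpha = rnormal()
expand 6
bysort id: gen year = _n
gen x1 = rnormal() + 0.5*alpha
gen y = exp(-1 + 0.3*x1 + 0.5*alpha + 0.4*rnormal())
xtset id year

* Linear FE: the "e" residual nets out xb and the estimated unit effect
xtreg y x1 i.year, fe vce(cluster id)
predict double e_lin if e(sample), e
gen double e_lin2 = e_lin^2
quietly summarize e_lin2
display "SSR, linear FE  = " r(sum)

* Poisson FE: rebuild fitted values as c_i*exp(xb),
* with c_i = sum_t(y) / sum_t(exp(xb)) within each unit
xtpoisson y x1 i.year, fe vce(robust)
predict double xb_p if e(sample), xb
gen double exb = exp(xb_p)
bysort id: egen double sum_y = total(y) if !missing(xb_p)
bysort id: egen double sum_exb = total(exb)
gen double yhat_p = exb * sum_y / sum_exb
gen double e_poi2 = (y - yhat_p)^2
quietly summarize e_poi2
display "SSR, Poisson FE = " r(sum)
```

Both sums should be taken over the same observations; xtpoisson, fe drops units whose outcomes are all zero, so with real data the linear model may need to be restricted to the Poisson estimation sample.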

I would use a kind of RESET test, properly adjusted for FE estimation. The key is not to include the estimated fixed effects in the nonlinear functions (as that cannot be justified). I'm assuming your N is pretty large and T not so large.

In the linear case, put in square and cubic terms and test joint significance:

Code:

xtset id year
xtreg y x1 ... xK i.year, fe vce(cluster id)
predict xbhat, xb
gen xbhatsq = xbhat^2
gen xbhatcu = xbhat^3
xtreg y x1 ... xK i.year xbhatsq xbhatcu, fe vce(cluster id)
test xbhatsq xbhatcu

Joint significance of the two terms indicates functional form misspecification.

The same may work with xtpoisson but I need to check how predict works in that case.
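
A possible Poisson FE analogue is sketched below on simulated data. It assumes that predict with the xb option after xtpoisson, fe returns the linear index without the fixed effect, which has not been confirmed in this thread; the variable names and parameter values are hypothetical.

```stata
* Simulated panel; all names and parameter values are hypothetical
clear
set seed 2020
set obs 500
gen id = _n
gen alpha = rnormal()
expand 6
bysort id: gen year = _n
gen x1 = rnormal() + 0.5*alpha
gen x2 = rnormal()
gen y = exp(-1 + 0.3*x1 - 0.2*x2 + 0.5*alpha + 0.4*rnormal())
xtset id year

* Step 1: Poisson FE with the exponential mean
xtpoisson y x1 x2 i.year, fe vce(robust)

* Step 2: linear index only (assumed to exclude the fixed effect)
predict double xbhat_p, xb
gen double xbhat_psq = xbhat_p^2
gen double xbhat_pcu = xbhat_p^3

* Step 3: add the powers of the index and test them jointly
xtpoisson y x1 x2 i.year xbhat_psq xbhat_pcu, fe vce(robust)
test xbhat_psq xbhat_pcu
```

As in the linear case, joint significance of the two added terms would point to misspecification of the exponential mean.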

Jeff
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#9

29 Dec 2020, 02:16

Dear Steve Barry,

Jeff already provided the most important advice, but if you are still looking for an example of the use of Poisson regression with continuous data as mentioned in #4, I suggest you have a look at

Santos Silva, J.M.C. and Tenreyro, S. (2006), The Log of Gravity, The Review of Economics and Statistics, 88(4), pp. 641-658,

which has become a standard reference for this.

Best wishes,
Joao
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#10

29 Dec 2020, 06:37

Steve: One correction on terminology: you have a corner solution outcome, not a continuous outcome. The latter would not have mass at zero.
Steve Barry

Join Date: Nov 2020

Posts: 10
#11

29 Dec 2020, 07:14

Dear Joao,

Thank you very much for the reference. I was indeed looking for such a reference (I think it's very important to show it to reviewers). Many thanks.

Dear Jeff,
Thanks so much for the advice and the code on comparing these approaches. It sure helps a lot.
Also (maybe not an important point), my understanding of corner solution outcomes is rather limited, but R&D intensity does not have a lot of values at zero. In fact, more than 99% of my firms have spent some positive amount on R&D. It's just that the upper bound is about 1.3 for about 99% of observations. Does this description not fit a continuous limited variable? I've actually used xttobit to compare the results, and they're highly consistent with the results produced by xtpoisson, yet again signaling that xtreg is indeed not the best pick in my case.

Thank you very much again.

Last edited by Steve Barry; 29 Dec 2020, 07:19.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#12

29 Dec 2020, 07:28

Steve: I see. I must be remembering a different thread, where something like 10 or 15 percent were at zero.

In my view, if there is no natural upper bound -- such as a proportion being bounded above by one -- then you do not do anything special. You really shouldn't let the data determine the bounds. In every data set there are lower and upper bounds!

So calling it continuous in your case is essentially correct.

BTW, just so you are aware, Joao's paper does not consider panel data. When using the fixed effects version of Poisson quasi-MLE it's not as easy as saying "It can be easily extended to panel data." The argument for consistency is more complicated in the panel data case with fixed effects.
Steve Barry

Join Date: Nov 2020

Posts: 10
#13

29 Dec 2020, 07:48

Thanks Jeff.

I see. And you're right, there's an upper and lower bound for every dataset.

So, [my final question!], given that there's no natural upper bound here, do you still recommend using fixed-effects Poisson over OLS?
If so, are you aware of any application of panel regression Poisson to continuous variables?
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#14

30 Dec 2020, 02:06

Dear Jeff Wooldridge,

To be precise, we consider the panel data case in a footnote, and we are able to say that "it can be easily extended to panel data" because you had proved that result and we could just cite your 1999 paper.

BTW, your paper on the fractional regression is also cited and that was a big inspiration for us, so thank you for that.

Best wishes,

Joao