xtreg re (and xtreg fe) with many different dummy types (i.firm, i.industry, i.month, i.year)

Richard Williams

Join Date: Apr 2014

Posts: 5026
#16

12 Oct 2014, 17:54

I'd be reluctant to impute 6/7ths of my data. Why is it missing? Is it because the info doesn't change that much within a year so no need to report it? Or does it change a lot so they only check it every several months? Would it be reasonable to view the info as being missing at random, so listwise deletion wouldn't be so bad? Knowing more about why the data are missing might help with the decision on what to do about it.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://academicweb.nd.edu/~rwilliam/
Victoria Rogers

Join Date: Oct 2014

Posts: 138
#17

12 Oct 2014, 19:19

@Richard, the industry information for mutual funds is only given for some quarters, and I'm using monthly returns. So at most 4 of every 12 monthly observations can have industry data. However, many fund-quarters don't have an industry code (Lipper classification) at all on WRDS, the well-known economic data site. Funds rarely switch industries; sometimes, though, the names of the industries are changed. So the code probably isn't reported because it doesn't change much. Therefore, I tried to find a way to fill in the missing values if the neighboring non-missing values (of the same fund) are the same.

I found another imperfect but decent way to do it. It's not exactly what I wanted, though, because it doesn't restrict filling to missing values whose neighboring non-missing values are the same, and it isn't restricted by the beginning or end of a year either. However, it might be better to use that imperfect method, because now I have about 90,000 observations with an industry code instead of 20,000; using the above-mentioned restrictions would decrease that number to about 60,000, I assume. It's probably not a problem to use the imperfect method, because the companies rarely switch industries.

Code:

* carry forward
gen forward = episode
bysort id (time) : replace forward = forward[_n-1] if missing(forward)

* carry backward
gen backward = episode
gsort id -time
by id : replace backward = backward[_n-1] if missing(backward)

So, if someone knows the command I asked about in my previous post, I'd like to know how it works, just out of curiosity.

(now I only have 2 big problems left, 1 relatively easy and 1 very difficult)
(http://www.statalist.org/forums/foru...versus-areg-r2)
(difficult one: http://www.statalist.org/forums/foru...rror-code-2000)

Last edited by Victoria Rogers; 12 Oct 2014, 19:44.
Sebastian Kripfganz

Join Date: May 2014

Posts: 2611
#18

13 Oct 2014, 12:39

Victoria: Regarding the robust Hausman test, you can run a "regression-based Hausman test" as should be described somewhere in Wooldridge (2010): "Econometric Analysis of Cross Section and Panel Data". (I do not have the book at hand at the moment to check the page numbers.) You should also be able to find lecture slides on the regression-based test with google.

The idea is to create time averages of your variables and to add them to the regression, which can then be estimated with the RE estimator and robust standard errors. Simple example with a time-varying regressor x and a time-invariant regressor z (for instance gender):

Code:

by personID: egen xbar = mean(x)
xtreg y x z xbar, re vce(cluster firm)

The regression-based Hausman test is then a simple Wald test on the (joint) significance of the coefficient(s) for xbar. If the coefficient is significant, you would reject the null hypothesis of the random effects assumption.

You have to generate and add a separate variable xbar for all time-varying regressors x that you are using in the regression. Also note that xbar is added but not zbar because zbar would be perfectly collinear with z. The test therefore implicitly assumes that the unobserved effects are orthogonal to the time-invariant regressors (after controlling for the xbar).
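As a minimal sketch of the steps above with two time-varying regressors, here is a version that can be run as is. The dataset (Stata's nlswork example data, loaded with webuse) and the variable names tenure_bar and age_bar are my assumptions for illustration only, not the original poster's data. The keep line restricts the data to complete cases so that the time averages are computed over the same observations that enter the regression.

Code:

```stata
* illustration only: nlswork stands in for the poster's data
webuse nlswork, clear
xtset idcode year

* keep complete cases so the means match the estimation sample
keep if !missing(ln_wage, tenure, age, race)

* one time average per time-varying regressor (hypothetical names)
bysort idcode: egen tenure_bar = mean(tenure)
bysort idcode: egen age_bar = mean(age)

* RE regression augmented with the time averages, cluster-robust SEs
xtreg ln_wage tenure age i.race tenure_bar age_bar, re vce(cluster idcode)

* joint Wald test on the averages: the regression-based Hausman test
test tenure_bar age_bar
```

A small p-value from the final test command would speak against the random-effects assumption, as described above.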

Last edited by Sebastian Kripfganz; 13 Oct 2014, 12:42.

https://www.kripfganz.de/stata/
Clyde Schechter

Join Date: Apr 2014

Posts: 30194
#19

13 Oct 2014, 13:03

I hate to be nihilistic, but we have a dataset here where a key variable had missing values for 6/7 of the observations. We have a very sketchy description of this missingness mechanism, and the original poster has applied an imputation in which she seems to have only minimal confidence. Now, instead of 6/7 of the observations having a key missing variable, 6/7 of the observations have a potentially misclassified key variable (or are still missing)! In this context, worrying about fixed vs random effects and adjustments to R2 feels like rearranging the deck chairs on RMS Titanic. To be honest, I can't think of any analysis I would apply to a data set with 6/7 of the observations being specious.
I realize that this work has been assigned to you by your boss. But your boss might be well advised to consider Tukey's warning: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." If the boss really needs an answer to this question badly enough, perhaps he/she might be persuaded to procure better quality data from other sources.

Last edited by Clyde Schechter; 13 Oct 2014, 13:06.
Mark Schaffer

Join Date: Mar 2014

Posts: 324
#20

14 Oct 2014, 09:54

Victoria/Sebastian: the regression-based Hausman-like test for fixed vs. random effects is reported by xtoverid after estimation by xtreg,re. xtoverid can be installed from SSC in the usual way.
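A rough sketch of that workflow, again using Stata's nlswork example data as a stand-in (my assumption, not the original poster's data). The install line is commented out because it only needs to run once. As far as I know, xtoverid does not accept factor-variable notation and may need other SSC packages such as ivreg2 and ranktest, so treat those details as assumptions to check against its help file.

Code:

```stata
* one-time install from SSC (uncomment if needed)
* ssc install xtoverid

* illustration only: nlswork stands in for the poster's data
webuse nlswork, clear
xtset idcode year

* plain variables only, no i. prefixes, as a precaution for xtoverid
xtreg ln_wage tenure age grade, re vce(cluster idcode)

* regression-based test of fixed vs. random effects
xtoverid
```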
Victoria Rogers

Join Date: Oct 2014

Posts: 138
#21

15 Oct 2014, 14:32

Originally posted by Clyde Schechter

I hate to be nihilistic, but we have a dataset here where a key variable had missing values for 6/7 of the observations. We have a very sketchy description of this missingness mechanism, and the original poster has applied an imputation in which she seems to have only minimal confidence. Now, instead of 6/7 of the observations having a key missing variable, 6/7 of the observations have a potentially misclassified key variable (or are still missing)! In this context, worrying about fixed vs random effects and adjustments to R2 feels like rearranging the deck chairs on RMS Titanic. To be honest, I can't think of any analysis I would apply to a data set with 6/7 of the observations being specious.
I realize that this work has been assigned to you by your boss. But your boss might be well advised to consider Tukey's warning: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." If the boss really needs an answer to this question badly enough, perhaps he/she might be persuaded to procure better quality data from other sources.

A very nice remark (I don't mean that sarcastically). I haven't worried about fixed vs random for a very long time. I also see the adjusted R-squared as something extra, even though it's still something I really would like to add, if that's even possible. I wouldn't call it specious, at least not for 6/7 of the whole dataset, because the mutual funds rarely change the classification of their funds.

I've tried several methods to get the lowest bias, but none of them work exactly as I want.
If someone knows a better method than the one I use now, I would greatly appreciate that suggestion of course.

Wanted method: for example, if the industry (classification code) of a fund is small-cap growth funds in February 2000 and also in June 2002, with missing values between those months, then chances are very high that the classification code wasn't changed to e.g. large-cap value fund on 1 January 2001 and back to small-cap growth funds on 1 May 2002, so it's quite safe and unbiased to convert the missing values into the code small-cap growth funds.

Current method: for example, if the industry (classification code) of a fund is small-cap growth funds in February 2000 and a different code (e.g. large-cap) in June 2002, with missing values between those months, then currently all those missing values get the code small-cap growth funds, even though it's unknown in which month exactly the fund changed its small-cap code to something else, e.g. large-cap.

I sent an e-mail to the source of the Lipper Classification codes a few weeks ago; however, I still haven't received a reply. So, if someone knows the wanted method, I would like to learn it.
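One possible sketch of the wanted method, building on the carry-forward and carry-backward variables from post #17: keep a filled-in value only where the two agree. The toy data and the names negtime and industry_filled are hypothetical, and episode is assumed to hold a numeric classification code. Gaps before a fund's first reported code or after its last one stay missing, because only one neighbor exists there.

Code:

```stata
* toy data (hypothetical): fund 1 switches from code 5 to 7
clear
input id time episode
1 1 5
1 2 .
1 3 5
1 4 .
1 5 7
2 1 .
2 2 3
end

* carry forward within fund
gen forward = episode
bysort id (time): replace forward = forward[_n-1] if missing(forward)

* carry backward within fund, using reversed time order
gen negtime = -time
gen backward = episode
bysort id (negtime): replace backward = backward[_n-1] if missing(backward)

* fill only where both neighboring codes agree
gen industry_filled = forward if forward == backward
sort id time
list id time episode forward backward industry_filled, sepby(id)
```

In the toy data, the gap between the two 5s is filled, while the gap between 5 and 7 stays missing because the month of the switch is unknown.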