
  • Cluster using xtreg or vce

    Hello,

    I am analysing a dataset to understand the relationship between a variable ind and the outcome y, which may involve clustering at the state level. I am confused about the difference between using xtreg, re and xtreg, re vce(cluster state). When I only use xtreg, re (see below), the SEs are smaller than those from OLS with cluster-robust SEs using vce(cluster state), which I understood should be the opposite. Here's my code:

    regress y ind smoke i.edu i.inc sex years, vce(cluster state)
    xtset state
    xtreg y ind smoke i.edu i.inc sex years

    Should I be including vce(cluster state) again after my xtreg?
    NOTE: this is not panel data; it is cross-sectional.

    Thank you,
    Hania
    Last edited by Hania ElBanhawi; 04 Jan 2022, 08:50.

  • #2
    Hania:
    welcome to this forum.
    Are you dealing with a cross-sectional or a panel dataset? Only the latter requires -xtset-ting your data and then going -xtreg-.
    Conversely, if your dataset is cross-sectional, you should go -regress- (assuming that your regressand is continuous).
    In addition, both commands allow clustered standard errors (which are in fact cluster-robust under -xtreg-); however, you should have at least 30 clusters for them to work properly.
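    For concreteness, a minimal sketch of both routes with clustered standard errors (just a sketch, reusing the variable names from the code in #1):
    Code:
    * cross-sectional route: OLS with SEs clustered at the state level
    regress y ind smoke i.edu i.inc sex years, vce(cluster state)

    * panel-style route: declare the grouping variable, then RE with cluster-robust SEs
    xtset state
    xtreg y ind smoke i.edu i.inc sex years, re vce(cluster state)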
    The usual aside is to read and act on the FAQ when posting.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Hello Carlo,

      Thank you for your response! I am dealing with a cross-sectional dataset with clustering (50 US states).
      Thank you as well for the warm welcome and the pointer to the FAQs.

      Hania



      • #4
        Hania:
        you should go -regress- with standard errors clustered on US States:
        Code:
        regress <depvar> <indepvars> <potential_controls>, vce(cluster US_states)
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Just to clarify: As a general rule, it is permissible to use random effects estimators when you have a cluster structure that is not a panel data structure. In fact, the random effects variance-covariance matrix is a bit more believable because there cannot be serial correlation as in a panel data setting. But if one does this, it is only fair to compare cluster-robust standard errors in both cases. In my experience, using vce(cluster id) for OLS but the nonrobust standard errors for RE leads to just what Hania found: the nonrobust standard errors from RE are smaller. It doesn't have to be that way, but assuming all of the GLS assumptions are true often biases the standard errors downward. So xtreg y x1 ... xK, re vce(cluster id) should be used with panel data or other clustering.
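          In Stata terms, one way to make that comparison explicit (a sketch only, reusing the variable names from #1) is:
          Code:
          * OLS with cluster-robust standard errors
          regress y ind smoke i.edu i.inc sex years, vce(cluster state)
          estimates store ols_cl

          * RE (GLS) with its default, nonrobust standard errors
          xtset state
          xtreg y ind smoke i.edu i.inc sex years, re
          estimates store re_default

          * RE with cluster-robust standard errors -- the fair comparison
          xtreg y ind smoke i.edu i.inc sex years, re vce(cluster state)
          estimates store re_cl

          * coefficients and standard errors side by side
          estimates table ols_cl re_default re_cl, b(%9.4f) se(%9.4f)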

          Having said that, if you have a large cross section with not so many states -- in the U.S., G = 50 states -- then you shouldn't use random effects. Ideally, you would include state fixed effects, but if you're studying a policy that changes only at the state level then you can't include state fixed effects. In the end, you should use Carlo's suggestion. But be cautioned that G > 30 might not be enough for clustering to work well if you have large group sizes. How many individuals do you have per state, on average?
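          If the fixed-effects route is feasible (i.e., ind varies within states), a sketch would look like this, again borrowing the variable names from #1:
          Code:
          * pooled OLS with state dummies and cluster-robust SEs
          regress y ind smoke i.edu i.inc sex years i.state, vce(cluster state)
          * caution: a regressor that varies only at the state level is collinear with i.state and will be dropped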



          • #6
            Thank you Carlo and Jeff for your insights!

            I am using a cross-sectional dataset for just one year, with the average cluster size being 2,955.7. I have 50 states - would you say that's not enough to use random effects?


            Just so I know (even if I'm not using RE here): you mention that xtreg y x1 ... xK, re vce(cluster id) should be used... should vce(cluster id) always be added to xtreg? Is that the SE adjustment?

            Thank you again



            • #7
              Hania:
              1) with, on average, 2,955.7 observations per cluster, I would go -regress- with standard errors clustered at state level;
              2) under -xtreg-, the -vce(cluster clusterid)- option for SE takes heteroskedasticity and/or autocorrelation into account. In your case (assuming that you want to apply an -xt- command to a cross-sectional dataset), I would go -vce(cluster clusterid)-.
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                That makes sense, thank you Carlo!
                In general, is there a number of individuals per cluster, or a number of clusters (or a ratio between the two) that makes RE contraindicated?

                Thanks again!



                • #9
                  Hania:
                  It seems that you mixed up cluster-robust standard errors with the -re- specification.
                  Could you please clarify? Thanks.
                  Kind regards,
                  Carlo
                  (StataNow 18.5)



                  • #10
                    Hi Carlo--
                    Thank you for the note. Essentially I see evidence of heteroskedasticity (so wanted to use cluster robust SEs) but I also wanted to explore if there is any unobserved heterogeneity at state-level using RE.



                    • #11
                      Hania:
                      1) if you detected heteroskedasticity in a cross-sectional study and your regressand is continuous, you should go -regress- with -robust- standard errors;
                      2) if you suspect that your cross-sectional study suffers from correlated errors within clusters (I surmise this is what you mean by unobserved heterogeneity), you should go -regress, vce(cluster state)-;
                      3) if you detected heteroskedasticity in a panel dataset and your regressand is continuous, you should go -xtreg- with -robust- or -vce(cluster clusterid)- standard errors (see the sketch below).
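                      A compact sketch of the three cases (variable and cluster names are placeholders):
                      Code:
                      * 1) cross-section, heteroskedasticity only
                      regress y x1 x2, vce(robust)

                      * 2) cross-section with correlated errors within states
                      regress y x1 x2, vce(cluster state)

                      * 3) panel data with heteroskedasticity (under -xtreg-, vce(robust) is cluster-robust)
                      xtset panel_id time_var
                      xtreg y x1 x2, re vce(cluster panel_id)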
                      Kind regards,
                      Carlo
                      (StataNow 18.5)



                      • #12
                        Carlo,
                        That's very clear! Thank you so much.
                        Hania



                        • #13
                          You have almost 3,000 observations per cluster and only 50 clusters. I doubt that the clustered standard errors work very well. If you want to allow for heteroskedasticity then just use vce(robust), as Carlo said.



                          • #14
                            Hi Jeff,
                            Thank you for the helpful explanation. Understood. May I ask, with a total of ~150,000 observations, how do we determine that 50 clusters is not enough?



                            • #15
                              Originally posted by Jeff Wooldridge (#5 above)
                              Dear Prof. Wooldridge, I have a similar question. My data cover 10 years for 300 firms. A Hausman test indicated that random effects (RE) is appropriate, and intuitively RE also makes sense for my variables. However, if I cluster by firm id, I get a lot of significant results compared to when I do not cluster. How does one justify the use of clustering by firm id? What if I had only 3 years of data? Please share your views.

