How to estimate cluster robust errors in the presence of fixed effects for groups within cross sectional data?

Laiy Kho

Join Date: Oct 2022

Posts: 48
#1

27 Nov 2022, 12:17

Hello,

I am working with cross-sectional data in which the observations fall into groups. The dataset comes from an online firm, registered in one country, that allows multiple banks from different countries to sell the loans originated on their platforms. Each row in the dataset is one observation (one loan) with several characteristics: the start date, the interest rate, whether it has been paid back, etc. Each observation is therefore unique, as it represents a different loan originated by one of several banks. I do not regard the data as time series, because a time series would follow the same loans over repeated observations (daily, monthly, and so on). I do not consider the data a panel either, because I am assessing the performance of one online business and there is only one wave of data. Am I correct in my inference?

Secondly, since banks from different countries sell their loans, I need to control for bank fixed effects. I therefore included some characteristics of each bank, such as size and age, as control variables. I also control for the location of the banks. I have over 40 banks in the data, spanning 20 countries. Instead of including country, I classified the banks by geographical region, such as Asia, Africa, etc. I have 4 regions in my data; I did this because including country as a factor variable instead of geographical region causes specification errors. I want to control for serial correlation by using cluster-robust standard errors. Which variable should I specify as the cluster variable in vce(cluster clustervar)? I usually see people using geography. Is it okay to choose another bank characteristic that I already control for, for example size in my case? Size is different for every bank.
George Ford

Join Date: Aug 2014

Posts: 3148
#2

27 Nov 2022, 13:09

If you've only got four groups, then you can't use cluster errors (or, if you do, you need to -boottest- after). You have enough countries for clustering. Need at least 10 clusters (some say more).

look at xtpcse and newey2, which allows AR corrections for panel data.

ivreg2 with dkraay(#) might work too.
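
For readers following along, the few-clusters caveat above can be sketched roughly as below. This is only an illustration on simulated data, not the poster's model: the variables region, x and y are made up, and boottest is a user-written command that has to be installed first (ssc install boottest). Webb weights are chosen here on the assumption that they are the usual suggestion when there are very few clusters.

```stata
* Sketch only: simulated loans in 4 regions; all variable names are hypothetical.
* boottest is user-written; install it once with: ssc install boottest
clear
set seed 12345
set obs 4
gen region = _n
* Shock shared by every loan in the same region
gen u_region = rnormal()
* 500 loans per region
expand 500
gen x = rnormal()
gen y = 1 + 0.3*x + u_region + rnormal()

* Conventional cluster-robust errors with only 4 clusters
regress y x, vce(cluster region)

* Wild cluster bootstrap test of H0: coefficient on x equals 0
boottest x, weighttype(webb)
```

With so few clusters, the bootstrap p-value can differ noticeably from the one in the regress output.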

Last edited by George Ford; 27 Nov 2022, 13:34.
Laiy Kho

Join Date: Oct 2022

Posts: 48
#3

27 Nov 2022, 13:56

Thank you George! I have a few questions.

If you've only got four groups, then you can't use cluster errors

Can I not cluster at the bank level by using the bank size variable? It is different for every bank in the sample, forming 46 clusters. Is it necessary to cluster on country? As I mentioned earlier, including the country variable causes specification errors, which is why I stuck to geographic region.

look at xtpcse and newey2, which allows AR corrections for panel data.

Are you suggesting that my data is panel data instead of cross sectional?

I would appreciate your response.
Laiy Kho

Join Date: Oct 2022

Posts: 48
#4

28 Nov 2022, 08:48

George Ford Thank you George! I have a few questions.

If you've only got four groups, then you can't use cluster errors

Can I not cluster at the bank level by using the bank size variable? It is different for every bank in the sample, forming 46 clusters. Is it necessary to cluster on country? As I mentioned earlier, including the country variable causes specification errors and extremely high multicollinearity, which is why I stuck to geographic region.

look at xtpcse and newey2, which allows AR corrections for panel data.

Are you suggesting that my data is panel data instead of cross sectional?

I would appreciate your response.
George Ford

Join Date: Aug 2014

Posts: 3148
#5

28 Nov 2022, 08:51

Could cluster at bank.

If AR, then you've got to do something. If not, then not.
Laiy Kho

Join Date: Oct 2022

Posts: 48
#6

28 Nov 2022, 09:29

George Ford Thank you George.

If AR, then you've got to do something. If not, then not.

Sincere apologies, but I am not sure if I am following you correctly. By AR you are referring to autoregressive errors, right? My data are cross-sectional, so how can the errors have an AR structure? Do you have any reason to believe that my data are a panel instead (I described the nature of the data in the original question)?

How do I test for AR in my data in Stata? I apologize, I do not have a strong background in econometrics.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2159
#7

28 Nov 2022, 12:40

It makes no sense to cluster on a variable such as bank size. If the key explanatory variable varies mostly at the bank level, then cluster at the bank level. No need to worry about serial correlation.
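
A minimal sketch of what clustering at the bank level might look like, on simulated data. The names bank_id, bank_size, x and y are hypothetical, and the data-generating process is an assumption made only so the example runs. The cluster variable is a bank identifier rather than a bank characteristic, and it does not have to appear among the regressors.

```stata
* Sketch only: simulated data; bank_id, bank_size, x and y are hypothetical names.
clear
set seed 2022
set obs 46
gen bank_id = _n
gen bank_size = rnormal()
* Bank-level component of the key regressor and a bank-level error component
gen x_bank = rnormal()
gen u_bank = rnormal()
* 100 loans per bank
expand 100
* x varies mostly at the bank level
gen x = x_bank + 0.2*rnormal()
gen y = 1 + 0.3*x + 0.2*bank_size + u_bank + rnormal()

* Default standard errors, for comparison
regress y x bank_size

* Standard errors clustered on the bank identifier; no bank dummies are needed
regress y x bank_size, vce(cluster bank_id)
```

The coefficients are identical in the two regressions; only the standard errors change.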
George Ford

Join Date: Aug 2014

Posts: 3148
#8

28 Nov 2022, 12:40

Sorry. Thought panel. Should be good clustering on bank.
Laiy Kho

Join Date: Oct 2022

Posts: 48
#9

28 Nov 2022, 13:50

Jeff Wooldridge,

Thank you! I actually could not cluster at the bank level, because when I add a factor variable to represent all banks, Stata omits a few banks due to collinearity. Another widely used variable for clustering errors is "country", but including the country factor variable also causes specification errors and high multicollinearity. This is why I added a few observed characteristics such as bank size, age, geographic region, etc. to control for heterogeneity across the banks in the data. This approach seemed to fit the data without any specification errors or multicollinearity. Is this a correct approach? I checked for serial correlation using the command "estat vce, corr". Only the interaction term (log-transformed bank age x loanportfolio) seems to have a high correlation coefficient, ranging from 0.60 to 0.80.

Another reason I want to include bank-level characteristics instead of just a bank-level dummy variable is that my research revolves around one unique independent variable, X. I first estimate which factors determine X for each bank, and I add bank characteristics to that equation. Then I estimate the impact of X on interest rates and default, so I add bank characteristics here as well, because omitting them would cause endogeneity problems. In addition, I found evidence in the literature for the impact of bank-level characteristics such as age and size on interest rates and default.

I am asking which variable to use in vce(cluster clustervar) because I suspect that this regression and the following ones have serial correlation, as there are groups (banks) within the data. I am confused about which variable I should choose for clustering the errors.
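
A side note on the command mentioned in this post, with a rough sketch on simulated data. As far as I know, estat vce, correlation displays the correlation matrix of the coefficient estimates, so a large entry for an interaction term reflects how the estimates move together and is not a test of serial correlation in the errors. The variable names below (bank_id, ln_age, portfolio, y) are hypothetical stand-ins, not the poster's actual variables.

```stata
* Sketch only: simulated data; all variable names are hypothetical.
clear
set seed 99
set obs 46
gen bank_id = _n
gen bank_age = runiform(1, 50)
* 40 loans per bank
expand 40
gen ln_age = ln(bank_age)
gen portfolio = rnormal()
gen y = 1 + 0.2*ln_age + 0.3*portfolio + 0.1*ln_age*portfolio + rnormal()

* Main effects plus interaction, errors clustered on the bank identifier
regress y c.ln_age##c.portfolio, vce(cluster bank_id)

* Correlation matrix of the coefficient estimates (not of the errors)
estat vce, correlation
```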
Laiy Kho

Join Date: Oct 2022

Posts: 48
#10

01 Dec 2022, 06:43

Dear Professor Jeff Wooldridge,

If the key explanatory variable varies mostly at the bank level, then cluster at the bank level.

The number of observations per bank varies across the sample. The banks with the fewest observations often get dropped automatically by Stata due to collinearity, and the ones with the most observations inflate the VIF of my key explanatory variable to 1000. The same happens when I include country as a factor variable to control for fixed effects. Some of my banks are owned by one group but differ in firm size and country. Do you think it would be correct to treat those banks as one bank and add bank size and geographical region as separate control variables?
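
One possible reading of the omitted banks, sketched on simulated data: a characteristic that is constant within a bank (size, age) is an exact linear combination of a full set of bank dummies, so Stata has to omit something when both are included. Whether this is what happens in the actual dataset is an assumption; the names bank_id, bank_size, x and y are hypothetical.

```stata
* Sketch only: simulated data; all variable names are hypothetical.
clear
set seed 7
set obs 46
gen bank_id = _n
* bank_size takes one value per bank
gen bank_size = rnormal()
* 30 loans per bank
expand 30
gen x = rnormal()
gen y = 1 + 0.3*x + 0.2*bank_size + rnormal()

* Bank dummies together with a bank-level control: Stata flags one term
* as omitted because of collinearity
regress y x bank_size i.bank_id
scalar rank_fe = e(rank)

* Bank-level controls without dummies, errors clustered on the bank identifier
regress y x bank_size, vce(cluster bank_id)
```

The second regression shows that clustering on bank_id does not require bank dummies in the model.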