Difference between clustering and fixed effects/adding control variables into the regression

Penelope Smart

Join Date: Oct 2021

Posts: 22
#1

09 Jun 2022, 13:50

I have observational data that was collected within the same year but on different days and in different European countries. To analyse the causal effect of a certain event (the treatment) on the dependent variable, I want to run a regression.

In addition to the treatment variable, I also want to control for individual characteristics and for country-dependent effects (e.g. previous events such as an election) that might bias the coefficient of the event of interest.

Now, my questions about this approach:

1. For the individual characteristics I create a macro containing the single variables. I add the macro to the regression just like the treatment variable. For the country-dependent effect: should I just add a macro with the countries to the regression, or should I use clustering (vce cluster or absorb)? Is there any difference between these two approaches?
2. If individual effects are added to the regression as described in 1., do you still have to add the ", fe" option to the regression command?
3. Online I have seen different explanations of panel models. Mostly they say that panel data is a combination of cross-sectional and time-series data. Since my data was collected in a single year but on different days and in different countries, is this still seen as panel data (different countries, different days of data collection)? And how do you best control for the differences in collection day? I was thinking about controlling for weekday and month of the year. Would this be a good solution?

Last edited by Penelope Smart; 09 Jun 2022, 14:00.
Tags: clustering, controls, econometrics, panel data, regression
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#2

09 Jun 2022, 14:12

Penelope:
Were the same ids measured on the very same set of variables at (theoretically) equally spaced time intervals?
If you have a panel dataset, do you plan to go -regress- or -xtreg,fe-? (The latter approach is better if you actually have panel data.)
Finally, please note that:
1) you have to -xtset- your data only if you go -xtreg,fe-;
2) you do not have to add -fe- after comma if you go -regress-;
3) please, follow the FAQ in your future posts, sharing what you typed and what Stata gave you back. Thanks.

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#3

09 Jun 2022, 14:25

1. Assuming you are planning to do a linear regression, you can use either -regress- and include the country variable in the varlist, or you can use -xtreg, fe- after you -xtset- the data with the country variable as the panel var. Note well that I say country variable, not variables. For either approach, you should not use separate "dummy" variables for each country. You should have a single numeric variable that identifies which country the observation pertains to. If you use -regress- then you can include it in the list of predictor variables as i.country (don't forget the i. prefix). If you use -xtreg, fe-, then you don't mention it at all: the -xtset- command that precedes it tells Stata to use it automatically.

The difference between these two methods, from the user perspective, is that -regress- will give you a regression coefficient for each country (except one, as a reference level), whereas -xtreg- takes the country effects into account but does not provide you with the corresponding coefficients. For most purposes, those country coefficients are not needed anyway. But if you need them, then only -regress- will give them to you directly. (There is a way to coax them out of -xtreg, fe-, but it's not worth the trouble.)

In either case you should probably use -vce(cluster country)- as long as we are dealing with enough different countries to make this OK. There is no universally agreed upon rule for how many you need, but if you are under 10 countries you probably should not use clustering, and if you are over 25 you almost certainly should. In between is kind of a gray area.

Notice that clustered standard errors and country-level effects are not mutually exclusive. In fact, they tend to go together.

I emphasize that this advice applies only to linear regressions. If you plan to estimate non-linear models, things are different.
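The two approaches in point 1 can be compared side by side with a minimal sketch. It runs on simulated data, so the variable names (y, treatment, age, country), the 30 countries, and the coefficients are all assumptions made for illustration, not the poster's data.

```stata
* Simulated data: every variable name and number here is hypothetical
clear
set seed 12345
set obs 3000
* 30 countries with 100 respondents each
generate country   = mod(_n, 30) + 1
generate treatment = runiform() < 0.5
generate age       = runiformint(18, 80)
generate y = 0.5*treatment + 0.02*age + 0.1*country + rnormal()

* Approach A: country enters the varlist with the i. prefix
regress y treatment age i.country, vce(cluster country)
scalar b_lsdv = _b[treatment]

* Approach B: declare country as the panel variable, then absorb it
xtset country
xtreg y treatment age, fe vce(cluster country)
scalar b_fe = _b[treatment]

* The treatment coefficient should agree across the two approaches
display "regress: " scalar(b_lsdv) "   xtreg, fe: " scalar(b_fe)
```

The point estimates for treatment and age should coincide. The clustered standard errors can differ slightly between the two commands because they apply different degrees-of-freedom corrections.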

2. I'm not sure I understand this question. As long as you have multiple observations per country, you should probably be accounting for that with -xtreg, fe- or -regress ... i.country- and should probably be using -vce(cluster country)-. Whether the model also includes individual level variables, or which ones, doesn't affect that. If at some point you do an analysis in which you have only one observation per country, that's a different story.

3. Strictly speaking, you have panel data if a bunch of entities (people, countries, firms, households) as units of sampling and analysis are repeatedly assessed over time. The key here is that the same entities are resampled. Also, to be, strictly speaking, panel data, each entity should only be observed once at each time. If you have multiple observations of the same entity at the same time then you are not dealing with either of these types of data. I cannot tell from your description whether this is what you have or not.

If different entities are resampled at different times, then it is not panel data: it is cross-sectional time-series data. Your situation is a little complicated: it sounds like you have repeated surveys carried out in, but not on, different countries. So perhaps the survey was carried out in the same countries at each observation time, but each round of surveys sampled different people within those countries. If you are analyzing person-level data, then this is not panel data. It is cross-sectional time-series data. If, however, the same people were surveyed each time, then it is a multi-level data set that one might want to analyze using multi-level statistical models like -mixed- (rather than -regress- or -xtreg-) if you are not working in a discipline that is unreceptive to such models. If you aggregate the data up to the country level and use only one observation per country per survey round, then it is country-level panel data.

It is also possible to interpret your description as having had each country surveyed only once, with the different countries having been surveyed on different dates. In that case, you have neither panel nor any kind of time-series data: you just have survey data in different countries, and the time of data collection is confounded with the country of collection. If this is the case and you are analyzing person-level observations, then you just have observations nested within countries. You would still either include country (or time, but not both, as they are completely confounded) as a variable, or go the -xtset-/-xtreg, fe- route, and cluster your standard errors.

To get the day of week from the date variable, use the -dow()- function. To extract the calendar month (1 to 12) from the date variable, use the -month()- function; -mofd()- instead returns the monthly date, i.e. year and month together. See -help datetime functions- for details. All of this, of course, presumes you are starting with a bona fide Stata internal format date variable. If not, you will need to convert what you have first.
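A small sketch of the date handling, assuming the interview date arrives as a string in year-month-day order. The variable names (datestr, interview_date, weekday, monthnum, ym) and the three dates are made up for illustration. It also shows how -month()- and -mofd()- differ: the former returns the calendar month, the latter a monthly date.

```stata
* Hypothetical example data: three interview dates stored as strings
clear
input str10 datestr
"2021-03-15"
"2021-07-04"
"2021-11-28"
end

* Convert the string to a Stata internal daily date
generate interview_date = date(datestr, "YMD")
format interview_date %td

* Day of week: 0 = Sunday, ..., 6 = Saturday
generate weekday = dow(interview_date)
* Calendar month: 1 to 12
generate monthnum = month(interview_date)
* Monthly date (year and month together), shown with a %tm format
generate ym = mofd(interview_date)
format ym %tm

list datestr interview_date weekday monthnum ym
```

The resulting variables could then enter a regression as i.weekday and i.monthnum. With all data collected in a single year, monthnum and ym distinguish the same months.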

Added: Crossed with #2.

Last edited by Clyde Schechter; 09 Jun 2022, 14:30.
Penelope Smart

Join Date: Oct 2021

Posts: 22
#4

11 Jun 2022, 13:44

Thanks a lot for the explanations.

To question 2: I have individual characteristics variables such as age, gender, employment status etc. My question is whether it is necessary to include these control variables in the regression, since I understand that running a fixed effects model means that you already control for such variables. When using the vce and fe options, which control variables does "fe" control for?

To question 3: Yeah, it seems complicated. I have two waves of survey data that were conducted in the same year. Each wave was conducted (almost) simultaneously in multiple countries. Is this country-level panel data then? I am just getting very confused since I associate panel data with a time component (time series). (Note: The two waves were conducted close in time and we are not interested in the development of the dependent variable.) The sources I found explaining panel data always have the time component (and sometimes also a country component) in them.

In addition: If it's not panel data, why can I -xtset- the country variable as the panel variable?
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

11 Jun 2022, 14:54

I have individual characteristics variables such as age, gender, employment status etc. My question is whether it is necessary to include these control variables in the regression, since I understand that running a fixed effects model means that you already control for such variables. When using the vce and fe options, which control variables does "fe" control for?

First I'll get my pedantic credentials out. There is no such thing as "controlling" for variables in observational studies. This is an abuse of language. We can adjust for covariates. But only experimental designs truly can control for things. I understand that pretty much everybody speaks of "control" variables, and I cannot change that. But I just want to be sure you are not under any illusions about what can be done.

The use of -xtreg, fe- causes Stata to calculate a model in which all time invariant attributes of the studied entities are automatically adjusted for. The variable used as panel variable in -xtset- defines what these entities are. If you -xtset panelvar- and run -xtreg, fe-, any aspect of an instance of panelvar that is constant over time is automatically adjusted for (and, by the way, the effects of such attributes are not estimable with -xtreg, fe-). Sex would be a typical example of an attribute of people that is constant over time. So if your -xtset- panel variable is a person-level identifying variable, you would not need to do anything more to adjust for sex. Age varies over time, so it is not automatically controlled for. Employment status, also, varies over time and is not automatically controlled for.

All of that said, you still have not disclosed whether the same people were re-surveyed at both waves, or whether the second round recruited a new sample of people. If the same people were re-surveyed at both waves, you would be able to -xtset personid- and then things like sex are automatically adjusted for. But if the two waves involve different samples of people, then you do not have person-level panel data. You have serial cross sections of countries. -xtset personid- would be inappropriate in this case. You could -xtset country- so that Stata knows that your observations are nested within countries, and in that case unchanging attributes of the countries would automatically be adjusted for: things like what unit of currency they use, official language(s), variables describing their geographic locations, etc. (Yes, countries occasionally change their currency or languages, but that rarely happens, and I'm assuming it didn't happen during the brief timespan of your data.)
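To make the point about time-invariant attributes concrete, here is a minimal sketch on a simulated person-level panel, i.e. the case where the same people are surveyed in both waves. The variable names (pid, wave, female, employed, treatment, y) and all numbers are assumptions for illustration only.

```stata
* Simulated two-wave panel of 200 people (all names and values hypothetical)
clear
set seed 42
set obs 200
generate pid    = _n
* sex is fixed within person
generate female = runiform() < 0.5
* duplicate each person so that there are two waves
expand 2
bysort pid: generate wave = _n
* employment status and treatment may change between waves
generate employed  = runiform() < 0.7
generate treatment = runiform() < 0.5
generate y = 0.5*treatment + 0.3*female + 0.2*employed + rnormal()

xtset pid wave
* female is constant within pid, so -xtreg, fe- cannot estimate its effect
xtreg y treatment employed female, fe
```

In the output, female should be reported as omitted because of collinearity, while the time-varying employed keeps its own coefficient.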

I have two waves of survey data that were conducted in the same year. Each wave was conducted (almost) simultaneously in multiple countries. Is this country-level panel data then?

The date of the wave serves as a time variable. Or you could equally just use a variable that designates wave 1 vs wave 2 as the time variable. But that would not make this country-level panel data, because you have person-level observations nested within countries. So this is three-level data. If the same people were surveyed in both waves, then you could treat this as person-level panel data. If different people were surveyed in the two waves, then it is not panel data at any level: it is serial cross sections of countries over time.
Penelope Smart

Join Date: Oct 2021

Posts: 22
#6

11 Jun 2022, 16:52

Thanks for the thorough explanation, Clyde.

And thanks for pointing out my misuse of language. That's very helpful to know for the future. I will spread the knowledge!

Regarding the sample: for each wave a new sample is recruited. I guess it is repeated cross-sectional data then, and:

Code:

xtset country
*adjusts for country-characteristics, but not individual characteristics since -xtset country- and not -xtset pid-
xtreg dependent treatment, fe
*adjusts also for individual characteristics of any kind added as variable
xtreg dependent treatment `individual', fe

should be correct?

Follow up question:

1. Is there still a need to include the option -vce(cluster country)-?
2. I assume that clustering at the country level also adjusts for major events (e.g. elections, not the treatment) that might have happened within a country at a certain time before the interview?

My goal is to find out whether there is a causal effect of a certain type of event on someone's opinion on/attitude towards e.g. discrimination. Hence, I want to make sure that if I e.g. find a significant effect, I can rule out that it is due to some other event.
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#7

11 Jun 2022, 18:09

First, I would amend your summary of my explanation. If you -xtset country- and then do -xtreg dependent treatment, fe-, you automatically adjust for characteristics of the countries that do not change over time. It would not adjust for changing characteristics such as unemployment rate, population, or GDP. If you run -xtreg dependent treatment i.pid, fe- after -xtset country-, then you would additionally adjust for any characteristics of the individual people that do not change over time. But, if I understand you correctly, each person is only surveyed once. So your -xtreg dependent treatment i.pid, fe- will just collapse and you will get no results at all. You could only do this regression if the same people were surveyed repeatedly.

Regarding question 1, ordinarily you would use -vce(cluster country)-, but with only two countries, the cluster robust standard errors are not valid. So just skip that. -vce(cluster whatever)- requires a sufficiently large number of clusters. There is no universal agreement on how many is enough, but nobody thinks 2 is.

Regarding question 2, even if you had enough countries to use -vce(cluster country)-, it does not adjust for anything. The purpose of -vce(cluster country)- is to account for the possibility that the residuals within a country are not independent of each other. It is a different way of calculating standard errors that does not rely on the assumption of independence of observations and allows them to be correlated within the clusters.
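A minimal sketch of this distinction, again on simulated data with hypothetical names. To isolate what clustering does, it assumes a treatment assigned at the country level, a shared country-level shock, and no country effects in the model. The same regression is run with default and with cluster-robust standard errors.

```stata
* Simulated data: 30 countries, 100 people each (all names hypothetical)
clear
set seed 2022
set obs 3000
generate country = mod(_n, 30) + 1
* one shock per country, shared by everyone in it
bysort country: generate shock = rnormal() if _n == 1
bysort country: replace shock = shock[1]
* assumption for this sketch: treatment varies only between countries
generate treatment = mod(country, 2)
generate y = 0.5*treatment + shock + rnormal()

* default standard errors, which assume independent observations
regress y treatment
scalar b_iid  = _b[treatment]
scalar se_iid = _se[treatment]

* cluster-robust standard errors, allowing correlation within country
regress y treatment, vce(cluster country)
scalar b_cl  = _b[treatment]
scalar se_cl = _se[treatment]

display "coefficient: " scalar(b_iid)  " vs " scalar(b_cl)
display "std. error:  " scalar(se_iid) " vs " scalar(se_cl)
```

The coefficient is the same in both runs; only the standard error changes. Nothing is adjusted for by the vce() option itself.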

My goal is to find out whether there is a causal effect of a certain type of event on someone's opinion on/attitude towards e.g. discrimination. Hence, I want to make sure that if I e.g. find a significant effect, I can rule out that it is due to some other event.

Any claim for identifying a causal effect from observational data must be made with great modesty and caution. When certain assumptions are met, some analytic approaches such as difference-in-differences or instrumental variables or propensity score methods might succeed. But only when those assumptions are met, and those assumptions are often quite difficult or impossible to verify. In your description, you do not propose any of these methods, nor any other method that purports to identify causal effects. Nor can I think of any approach that would do that with this data design.

What you can do is exclude the possibility that whatever effect you identify is attributable to country attributes that do not change over time. If you have additional variables measuring attributes of individuals that you wish to adjust for, then you can add those variables to the regression to do that. This still leaves the possibility that your results will be confounded by other individual attributes that you have not adjusted for.
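Put together, a regression of this kind might look like the sketch below. The data are simulated, and the variable names (dependent, treatment, age, female, employed, country), the local macro name individual, and the assumption of 30 countries are all hypothetical. With very few countries the vce(cluster country) option would be dropped.

```stata
* Simulated repeated cross-section (all names and values hypothetical)
clear
set seed 99
set obs 3000
generate country   = mod(_n, 30) + 1
generate treatment = runiform() < 0.5
generate age       = runiformint(18, 80)
generate female    = runiform() < 0.5
generate employed  = runiform() < 0.7
generate dependent = 0.5*treatment + 0.02*age + 0.2*employed + 0.1*country + rnormal()

* individual-level covariates collected in a local macro
local individual age i.female i.employed

* country fixed effects absorb time-invariant country attributes;
* the macro adds the measured individual attributes
xtset country
xtreg dependent treatment `individual', fe vce(cluster country)
```

A local macro exists only within the do-file or interactive session that defines it, so the -local- line and the -xtreg- line need to be run together.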

Last edited by Clyde Schechter; 11 Jun 2022, 18:16.
