How to first difference a panel data set with many dummy variables?

Alex Lukassen

Join Date: Aug 2015

Posts: 8
#1

24 Aug 2015, 17:14

Dear all,

I am analyzing the impact of (1) 3rd and 4th division soccer teams, (2) their stadiums, and (3) their affiliation with 1st and 2nd division teams versus independence on per capita GDP at the county level. My data set (strongly balanced) covers 266 counties from 1995 to 2012, with around 30 independent variables (many of them dummies). I am using a linear reduced-form model:
y_it = β₁ X_it + β₂ Z_it + ϑ_i + μ_t + ε_it

y_it is the per capita GDP in county i at time t
X_it is a vector of local market variables for county i at time t; β₁ is the corresponding vector of parameters to be estimated
Z_it is a vector of third and fourth league team and stadium variables in county i at time t; β₂ is the corresponding vector of parameters to be estimated
ϑ_i is a county-specific fixed effect
μ_t is a time-specific fixed effect
ε_it is a random disturbance

Since the data are heteroskedastic and autocorrelated, show contemporaneous correlation, and the model includes a lagged dependent variable, I thought that taking first differences would eliminate the autocorrelation, the explicit fixed effects, and the correlation of the lagged dependent variable with the disturbances. I would then run xtpcse, which, I think, accounts for heteroskedasticity and contemporaneous correlation. Since first-differencing (and then simplifying) the model above doesn't change the parameters, I would interpret them just as before first-differencing.

Questions:
(a) Is there anything objectionable about my approach from an econometrics (and/or statistics) point of view?
(b) Can first-differencing be done with binary variables? Intuitively, this isn't as easy as it seems. I did some research but couldn't find an entirely satisfying answer.
(c) What are the Stata commands to get first differences? Everything I found seems to ignore the boundaries between panels; i.e., the last year of county 1 seems to be subtracted from the first year of county 2, and so on.
(d) Concerning xtpcse, which of the options, correlation(ar1) or correlation(psar1), is suitable for which type of data? The Stata manual wasn't much help to me here.

Best regards,
Alex

Note: I am using Stata 12.

Last edited by Alex Lukassen; 24 Aug 2015, 17:20.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#2

24 Aug 2015, 21:37

Several comments:

1. Why are you including a lagged dependent variable? If you really want to do this, you should use xtabond or xtabond2.
2. Dummy variables are treated as all other variables. If you believe the equation written above, just use the differencing operator on the entire equation. Everything gets differenced. (I remain to this day puzzled as to why researchers think there is a problem differencing dummy variables.)
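
To spell out point 2, here is a sketch of the differenced equation in the notation of post #1, assuming that model holds and contains no lagged dependent variable. D_it is a label introduced here for any single dummy inside X_it or Z_it; it does not appear in the original post.

```latex
\begin{align*}
y_{it}        &= \beta_1 X_{it} + \beta_2 Z_{it} + \vartheta_i + \mu_t + \varepsilon_{it} \\
y_{i,t-1}     &= \beta_1 X_{i,t-1} + \beta_2 Z_{i,t-1} + \vartheta_i + \mu_{t-1} + \varepsilon_{i,t-1} \\
\Delta y_{it} &= \beta_1 \Delta X_{it} + \beta_2 \Delta Z_{it} + \Delta\mu_t + \Delta\varepsilon_{it}
\end{align*}
% A differenced dummy takes only three values:
\[
\Delta D_{it} = D_{it} - D_{i,t-1} \in \{-1,\, 0,\, 1\}
\]
```

The county effect ϑ_i drops out, β₁ and β₂ are unchanged, and Δμ_t is simply a new set of year effects. A differenced dummy equals 1 in the year it switches on, -1 in the year it switches off, and 0 otherwise, so counties whose dummy never changes contribute no variation to that coefficient.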

Here is what I would do, assuming no lagged dependent variables:

Code:

xtset countryid year
xi: reg D.(y x1 ... xK z1 ... zK i.year), cluster(countryid)

Let the clustering account for any remaining serial correlation or heteroskedasticity.
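
On question (c), the following is a minimal sketch of how the D. time-series operator behaves once the data have been declared with xtset. It uses a made-up toy panel: the variable names (countyid, y, stadium) and all values are hypothetical and only serve to illustrate the panel boundaries.

```stata
* Toy panel: 2 counties x 3 years (all values invented for illustration)
clear
input countyid year y stadium
1 1995 10 0
1 1996 12 1
1 1997 15 1
2 1995 20 1
2 1996 21 0
2 1997 23 0
end

* Declare the panel structure; D. and L. then operate within countyid
xtset countyid year

* First differences of a continuous variable and of a dummy
generate d_y = D.y
generate d_stadium = D.stadium

* Inspect the result, one block per county
list countyid year y d_y stadium d_stadium, sepby(countyid)
```

In this sketch the first year of each county is missing in d_y and d_stadium, rather than being differenced against the last year of the previous county.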
Alex Lukassen

Join Date: Aug 2015

Posts: 8
#3

25 Aug 2015, 10:59

Thanks for your help!

1. The lagged dependent variable is meant to capture the self-perpetuating tendencies of local economies. I read that some researchers view the use of a lagged dependent variable as theoretically tenuous, which leaves me unsure whether to include it or not.

2. I continued to work on the data set and came across another problem. I wanted to run the unit root test developed by Im, Pesaran, and Shin. Since the Stata manual recommended the demean option in case of cross-sectional dependence, I tried to compute Pesaran's CD test. However, I received an error message. The same thing happened when I used the Friedman or Frees test.

Code used for unit root test:

Code:

xtunitroot ips DepVar, lags(aic 4)

Since the data is serially correlated, I specified the lags option. Yet, I am not sure if AIC is the right criterion to use and if it makes sense to use 4 as the number of lags. As I said, I would add the demean option, if there is cross-sectional dependence.

Error message when trying to check for CD (Pesaran):

Code:

xtcsd, pes
Error: The panel is highly unbalanced.
Not enough common observations across panel to perform Pesaran's test.
insufficient observations
r(2001);

When I specified the panel structure with xtset, Stata reported the data as strongly balanced.

Error message when trying to check for CD (Friedman and Frees):

Code:

xtcsd, friedman
no observations
r(2000);

xtcsd, frees
no observations
r(2000);

If my comments are not well explained or further information is needed, please let me know.

Kind regards,
Alex

Last edited by Alex Lukassen; 25 Aug 2015, 11:53.