
  • How to first difference a panel data set with many dummy variables?

    Dear all,

    I am analyzing the impact of 3rd and 4th division soccer teams (1), their stadiums (2), and their affiliation to 1st & 2nd division teams vs. independence (3) on per capita GDP at the county level. My data set (strongly balanced) includes 266 counties from 1995-2012 with around 30 independent variables (many of them dummies). I am using a linear reduced form model:
    y_it = β1 X_it + β2 Z_it + ϑ_i + μ_t + ε_it

    y_it is the per capita GDP in county i at time t
    X_it is a vector of local market variables for each county i at time t; β1 is the corresponding vector of parameters to be estimated
    Z_it is a vector of third and fourth league team as well as stadium variables in county i at time t; β2 is the corresponding vector of parameters to be estimated
    ϑ_i is a county i specific fixed effect
    μ_t is a time t specific fixed effect
    ε_it is a random disturbance

    Since the data are heteroskedastic and autocorrelated, show contemporaneous correlation, and the model includes a lagged dependent variable, I thought that taking first differences would eliminate autocorrelation, explicit fixed effects and the correlation of the lagged dependent variable with the disturbances. Then I would run the command xtpcse which, I think, accounts for heteroskedasticity and contemporaneous correlation. As first differencing (and then simplifying) the model above doesn't change the parameters, I would just interpret them as before first-differencing.

    Questions:
    (a) Is there anything objectionable about my approach from an econometrics (and/or statistics) point of view?
    (b) Can first-differencing be done with binary variables? Intuitively, this isn't as easy as it seems. I did some research but couldn't find an entirely satisfying answer.
    (c) What are the Stata commands to get first-differences? All I found seems to violate the boundaries of each panel; i.e. the last year of county 1 seems to be subtracted from the first year of county 2 and so on.
    (d) Concerning the command xtpcse, which of the options (correlation(ar1) and correlation(psar1)) is suitable for which type of data? The Stata manual wasn't much help to me here.

    Best regards,
    Alex


    Note: I am using Stata 12.
    Last edited by Alex Lukassen; 24 Aug 2015, 18:20.

  • #2
    Several comments:

    1. Why are you including a lagged dependent variable? If you really want to do this, you should use xtabond or xtabond2.
    2. Dummy variables are treated as all other variables. If you believe the equation written above, just use the differencing operator on the entire equation. Everything gets differenced. (I remain to this day puzzled as to why researchers think there is a problem differencing dummy variables.)
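
    Written out for the model in the first post, differencing the whole equation gives the following (a sketch in LaTeX notation; the county effect drops out because it does not vary over time, and a differenced 0/1 dummy simply takes the values -1, 0, or 1):

    ```latex
    \Delta y_{it} = \beta_1 \Delta X_{it} + \beta_2 \Delta Z_{it} + \Delta\mu_t + \Delta\varepsilon_{it},
    \qquad \Delta y_{it} \equiv y_{it} - y_{i,t-1},
    \qquad \vartheta_i - \vartheta_i = 0 .
    ```

    The coefficients β1 and β2 are the same as in the levels equation, so their interpretation does not change.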

    Here is what I would do, assuming no lagged dependent variables:

    Code:
    xtset countryid year
    xi: reg D.(y x1 ... xK z1 ... zK i.year), cluster(countryid)
    Let the clustering account for any remaining serial correlation or heteroskedasticity.
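
    On question (c): once the data are xtset, the D. operator differences within each panel and never across panel boundaries. A minimal sketch, assuming hypothetical variable names (countyid, gdp_pc, and a 0/1 dummy called team) that are not from the original post:

    ```stata
    * Hypothetical variable names: countyid, year, gdp_pc, team (a 0/1 dummy)
    xtset countyid year

    * D. differences within each panel; the first year of every county is missing
    generate d_gdp_pc = D.gdp_pc
    generate d_team   = D.team      // takes the values -1, 0, or 1

    * The same thing by hand (equivalent when there are no gaps in the years);
    * the by-group keeps the first year of one county from using the last
    * year of the previous county
    bysort countyid (year): generate d_gdp_pc2 = gdp_pc - gdp_pc[_n-1]
    ```

    In both versions the first observation of each county comes out missing, which is how you can tell the panel boundary was respected.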



    • #3
      Thanks for your help!

      1. The lagged dependent variable is meant to capture the self-perpetuating tendencies of local economies. I read that some researchers view the use of a lagged dependent variable as theoretically tenuous, which leaves me a bit puzzled now as to whether to include it or not.

      2. I continued to work on the data set and came across another problem. I wanted to run the unit root test developed by Im, Pesaran, and Shin. Since the Stata manual recommended the demean option in case of cross-sectional dependence, I tried to compute Pesaran's CD test. However, I received an error message. The same thing happened when I used the Friedman or Frees test.

      Code used for unit root test:
      Code:
      xtunitroot ips DepVar, lags(aic 4)
      Since the data is serially correlated, I specified the lags option. Yet, I am not sure if AIC is the right criterion to use and if it makes sense to use 4 as the number of lags. As I said, I would add the demean option, if there is cross-sectional dependence.
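
      For reference, the variant with the demean option would look like the sketch below. It only adds demean to the call above; whether it is appropriate depends on the outcome of the cross-sectional dependence test.

      ```stata
      * Same IPS test as above, with cross-sectional means subtracted first
      xtunitroot ips DepVar, lags(aic 4) demean
      ```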

      Error message when trying to check for CD (Pesaran):
      Code:
      xtcsd, pes
      Error: The panel is highly unbalanced.
      Not enough common observations across panel to perform Pesaran's test.
      insufficient observations
      r(2001);
      When specifying the panel data structure (xtset), the data was considered as strongly balanced.

      Error message when trying to check for CD (Friedman and Frees):
      Code:
      xtcsd, friedman
      no observations
      r(2000);
      
      xtcsd, frees
      no observations
      r(2000);
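
      For context, xtcsd is documented as a postestimation command for use after xtreg, fe or xtreg, re, so it works on the sample of the preceding regression. A sketch of the calling sequence, assuming hypothetical regressors x1 and x2 and a panel identifier countyid (none of which are from the output above); this is not a diagnosis of the errors shown:

      ```stata
      * Hypothetical names: countyid, x1, x2
      xtset countyid year

      * Fit the panel model first, then test its residuals
      xtreg DepVar x1 x2, fe
      xtcsd, pesaran abs
      ```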
      If my comments are not well explained or further information is needed, please let me know.

      Kind regards,
      Alex
      Last edited by Alex Lukassen; 25 Aug 2015, 12:53.
