  • Panel data and regression with vs. without clustering standard errors

    I am using panel data for a difference-in-differences model in which I attempt to estimate the effect the London Olympics had on house prices in London. To do this I am comparing house price differentials between host boroughs of London (boroughs that hosted an Olympic event) and non-host boroughs over the period 2009-2015 (with the Olympics taking place in 2012), using dummy variables: Host, indicating which boroughs are hosts, and Games, indicating whether the time period is before or after 2012. I have annual data on housing transactions for each borough, and therefore thousands of observations for each borough in each year, with many transactions at the same price in the same borough and year creating duplicates. My first issue is that when I try to tell Stata that I am using panel data with xtset, it says "repeated time values within panel". I have read that this is due to the duplicates and the use of annual data, and I was wondering whether it is still possible to use this data and run regressions without using xtset.

    My second issue is that I attempted running regressions without xtset, and initially the coefficients were all very significant. However, when I cluster the standard errors at the borough level, the p-values become very large and the coefficients become largely insignificant. I was wondering why this is, and whether it is essential to cluster the standard errors at the borough level?
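    For concreteness, the specification I have in mind is along these lines (a sketch: year is the transaction year in my data, and I am treating 2012 onward as the post-Games period):

    Code:
    * difference-in-differences setup (sketch; year = transaction year)
    gen games = (year >= 2012)       // post-Games indicator (2012 onward treated as "after")
    gen hostgames = host * games     // DiD interaction: host borough x post-Games period
    gen lnprice = ln(price)
    regress lnprice host games hostgames, vce(cluster borough_id)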

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long price float borough_id byte host
    225500 1 1
    226000 1 1
    343000 1 1
    365000 1 1
    375000 1 1
    390000 1 1
    570000 1 1
    151000 2 0
    151500 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152500 2 0
    152500 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153333 2 0
    153599 2 0
    154000 2 0
    154000 2 0
    154000 2 0
    154000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155500 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    157000 2 0
    157000 2 0
    157500 2 0
    157500 2 0
    157500 2 0
    157500 2 0
    158000 2 0
    158000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159500 2 0
    159800 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    end
    Many thanks

  • #2
    Sami:
    welcome to this forum.
    1) You cannot run panel-data regressions (with -xt- commands, at least) without -xtset-ing your data beforehand. If you are sure that you do not have genuine duplicates (ie, mistaken data entries), you can simply -xtset- your data with the -panelid- only, as sketched below. However, this fix does not allow you to use time-series operators, such as lags and leads.
    2) Without seeing your results, the trivial advice is that cluster-robust standard errors (which account for both heteroskedasticity and autocorrelation) need a non-negligible number of clusters to work properly.
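    For instance, with the variable names from your -dataex- excerpt (a minimal sketch; -duplicates report- simply counts observations that are exact copies of one another):

    Code:
    * check whether any observations are exact copies across all variables
    duplicates report
    * declare the panel structure with the panel identifier only (no time variable)
    xtset borough_id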
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      Thanks for your help Carlo,

      I am fairly sure that the duplicates are real transactions rather than data-entry errors. Just to double-check: when you say -panelid-, does this refer to the time variable? For example, would the command be xtset year?

      When I run the regression using the -reg- command with vce(cluster borough), the output indicates that the standard errors are adjusted for 30 clusters in borough, yet I have 31 boroughs. Is this considered non-negligible, and if not, would I be able to draw conclusions from the results without accounting for heteroskedasticity?

      Many Thanks,
      Sami

      • #4
        Sami:
        1) Not quite: -panelid- is the panel identifier (eg, borough_id).
        Code:
        xtset borough_id
        2) Thirty out of 31 clusters (probably there is a missing-values issue in one of your boroughs) are more than enough to invoke vce(cluster panelid).
        Kind regards,
        Carlo
        (StataNow 18.5)

        • #5
          Thank you Carlo,

          I have managed to -xtset- it as panel data now, thanks.

          So, since I have enough clusters, I should cluster the standard errors?
          When I run the regression with vce(cluster borough_id), the p-values for my estimates all increase, with many of them becoming insignificant. Is this simply how the results come out, or does it indicate that it is not necessary to cluster the standard errors?

          Here are my results with and without vce(cluster borough_id):

          Without:

          Code:
          . xtreg lnprice crime traffic population stadium host games hostgames terraced det
          > ached new old
          note: old omitted because of collinearity
          
          Random-effects GLS regression                   Number of obs     =    145,756
          Group variable: borough_id                      Number of groups  =         30
          
          R-sq:                                           Obs per group:
               within  = 0.1133                                         min =          7
               between = 0.0281                                         avg =    4,858.5
               overall = 0.0197                                         max =     12,717
          
                                                          Wald chi2(10)     =   18495.56
          corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
          
          ------------------------------------------------------------------------------
               lnprice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 crime |  -.0000191   .0000314    -0.61   0.543    -.0000806    .0000424
               traffic |   .0000883   .0000558     1.58   0.114    -.0000211    .0001977
            population |   9.03e-06   2.55e-07    35.44   0.000     8.53e-06    9.53e-06
               stadium |  -.0163524   .0091394    -1.79   0.074    -.0342653    .0015604
                  host |   .2058706   .1896693     1.09   0.278    -.1658743    .5776156
                 games |   .1784945   .0044132    40.45   0.000     .1698449    .1871442
             hostgames |  -.0619856   .0070678    -8.77   0.000    -.0758382    -.048133
              terraced |   -.107063   .0036134   -29.63   0.000    -.1141452   -.0999807
              detached |   .4398813    .006853    64.19   0.000     .4264497     .453313
                   new |   .0815015   .0116493     7.00   0.000     .0586692    .1043337
                   old |          0  (omitted)
                 _cons |    10.8621   .1866905    58.18   0.000     10.49619      11.228
          -------------+----------------------------------------------------------------
               sigma_u |   .3305107
               sigma_e |  .54424537
                   rho |  .26942885   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------

          With:

          Code:
          . xtreg lnprice crime traffic population stadium host games hostgames terraced detac
          > hed new old, vce(cluster borough_id)
          note: old omitted because of collinearity
          
          Random-effects GLS regression                   Number of obs     =    145,756
          Group variable: borough_id                      Number of groups  =         30
          
          R-sq:                                           Obs per group:
               within  = 0.1133                                         min =          7
               between = 0.0281                                         avg =    4,858.5
               overall = 0.0197                                         max =     12,717
          
                                                          Wald chi2(10)     =    1627.89
          corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
          
                                      (Std. Err. adjusted for 30 clusters in borough_id)
          ------------------------------------------------------------------------------
                       |               Robust
               lnprice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 crime |  -.0000191   .0000953    -0.20   0.841    -.0002058    .0001676
               traffic |   .0000883   .0002914     0.30   0.762    -.0004828    .0006595
            population |   9.03e-06   1.12e-06     8.06   0.000     6.83e-06    .0000112
               stadium |  -.0163524   .0205717    -0.79   0.427    -.0566722    .0239673
                  host |   .2058706     .31171     0.66   0.509    -.4050697    .8168109
                 games |   .1784945   .0232975     7.66   0.000     .1328322    .2241568
             hostgames |  -.0619856   .0284953    -2.18   0.030    -.1178354   -.0061358
              terraced |   -.107063   .0482859    -2.22   0.027    -.2017015   -.0124244
              detached |   .4398813   .0340163    12.93   0.000     .3732106    .5065521
                   new |   .0815015   .0494967     1.65   0.100    -.0155102    .1785131
                   old |          0  (omitted)
                 _cons |    10.8621   .6206309    17.50   0.000     9.645684    12.07851
          -------------+----------------------------------------------------------------
               sigma_u |   .3305107
               sigma_e |  .54424537
                   rho |  .26942885   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          Many Thanks

          • #6
            Sami,

            I suspect you will need to cluster the standard errors (it is highly probable that housing prices within a borough are correlated). In fact, the difference between default and cluster-robust standard errors can be very large. The standard errors inflate more the greater the correlation within a cluster (imagine tracking people's salaries over time: it is very likely their salary in year 2 will be strongly correlated with their salary in year 1). The intuition is that each additional observation for a given person provides less than an independent piece of new information.

            Cameron and Trivedi's book Microeconometrics Using Stata shows how to calculate the standard-error inflation factor (see pp. 250-253).

            The standard-error inflation factor is F = (1 + Pu * Px * (T - 1))^(1/2)

            Where:
            • Pu is the intraclass correlation of the error (in your case, how much home prices within a borough are correlated with each other from year to year),
            • Px is the intraclass correlation of the regressor (i.e. for your variable crime, how much is crime correlated within a borough from year to year) and
            • T is the number of years of data in your panel.
            In Cameron & Trivedi's example, they have 7 years of data and a within-person correlation for salary of about 0.8 (a time-invariant variable like education has Px = 1). So for the regressor education, the standard-error inflation factor is F = (1 + 0.8 * 1 * 6)^(1/2) = 5.8^(1/2) = 2.41 when the standard errors are clustered by individual.
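            In Stata, that back-of-the-envelope calculation is just:

            Code:
            display sqrt(1 + 0.8 * 1 * (7 - 1))    // = 2.41, the inflation factor for education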

            Code:
            * To find the intraclass correlation of home prices within a borough:
            * (xtset your data first; the lag operator L. needs both panel and time variables)
            quietly regress lnprice crime traffic population stadium host games hostgames terraced detached new old, vce(cluster borough_id)
            predict uhat, residuals
            
            forvalues j = 1/6 {
                 quietly corr uhat L`j'.uhat
                 display "Autocorrelation at lag `j' = " %6.3f r(rho)
            }
            
            In the Cameron & Trivedi example (from p. 252) they got:
            Autocorrelation at lag 1 = 0.884
            Autocorrelation at lag 2 = 0.838
            Autocorrelation at lag 3 = 0.811
            Autocorrelation at lag 4 = 0.786
            Autocorrelation at lag 5 = 0.750
            Autocorrelation at lag 6 = 0.729
            The average of those 6 values is about 0.80
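            As an alternative to averaging the lagged correlations, Stata's built-in -loneway- command reports a one-way intraclass correlation directly (a sketch, reusing the uhat residuals generated above):

            Code:
            * intraclass correlation of the residuals within boroughs (Pu)
            loneway uhat borough_id
            * intraclass correlation of a regressor within boroughs (Px), e.g. crime
            loneway crime borough_id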
            Last edited by David Benson; 27 Apr 2019, 19:51.

            • #7
              Sami:
              In addition to David's helpful insight, you should not decide about clustering or not in light of the p-values of your coefficients: rather, you should cluster if you detect heteroskedasticity and/or autocorrelation in the distribution of your idiosyncratic error.
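              A minimal sketch of how you might check for both (xtserial and xttest3 are community-contributed commands from SSC; -xtserial- also requires the data to be -xtset- with a time variable):

              Code:
              * Wooldridge test for serial correlation in panel data
              ssc install xtserial
              xtserial lnprice crime traffic population stadium host games hostgames terraced detached new
              * modified Wald test for groupwise heteroskedasticity after -xtreg, fe-
              ssc install xttest3
              xtreg lnprice crime traffic population stadium host games hostgames terraced detached new, fe
              xttest3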
              Kind regards,
              Carlo
              (StataNow 18.5)

              • #8
                Thank you David and Carlo for your help!
