  • Panel data and regression with vs. without clustering standard errors

    I am using panel data for a difference-in-differences model in which I attempt to estimate the effect the London Olympics had on house prices in London. To do this I am comparing house price differentials between host boroughs of London (boroughs that hosted an Olympic event) and non-host boroughs over the period 2009-2015 (with the Olympics taking place in 2012), using dummy variables: Host, indicating which boroughs are hosts, and Games, indicating whether the time period is before or after 2012. I have annual data on housing transactions for each borough, and therefore thousands of observations for each borough in each year, with many transactions at the same price in the same borough and year creating duplicates. My first issue is that when I try to tell Stata that I am using panel data with xtset, it says "repeated time values within panel". I have read that this is due to the duplicates and the use of annual data, and I was wondering whether it is still possible to use this data and run regressions without using xtset.

    My second issue is that I attempted running regressions without xtset, and initially the coefficients were all very significant. However, when I cluster the standard errors at the borough level, the p-values become very large and the coefficients become largely insignificant. I was wondering why this is, and whether it is essential to cluster the standard errors at the borough level?
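    For concreteness, the specification I have in mind is along these lines (a sketch: year is the transaction year in my data, and I am treating 2012 onward as the post-Games period):

    Code:
    * difference-in-differences setup (sketch; year = transaction year)
    gen games = (year >= 2012)       // post-Games indicator (2012 onward treated as "after")
    gen hostgames = host * games     // DiD interaction: host borough x post-Games period
    gen lnprice = ln(price)
    regress lnprice host games hostgames, vce(cluster borough_id)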

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long price float borough_id byte host
    225500 1 1
    226000 1 1
    343000 1 1
    365000 1 1
    375000 1 1
    390000 1 1
    570000 1 1
    151000 2 0
    151500 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152000 2 0
    152500 2 0
    152500 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153000 2 0
    153333 2 0
    153599 2 0
    154000 2 0
    154000 2 0
    154000 2 0
    154000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155000 2 0
    155500 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    156000 2 0
    157000 2 0
    157000 2 0
    157500 2 0
    157500 2 0
    157500 2 0
    157500 2 0
    158000 2 0
    158000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159000 2 0
    159500 2 0
    159800 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    160000 2 0
    end
    Many thanks

  • #2
    Sami:
    welcome to this forum.
    1) You cannot run panel-data regressions (with -xt- commands, at least) without -xtset-ing your data beforehand. If you are sure that you do not have genuine duplicates (ie, mistaken data entries), you can simply -xtset- your data with the -panelid- only, as sketched below. However, this fix does not allow you to use time-series operators, such as lags and leads.
    2) Without seeing your results, the trivial advice is that cluster-robust standard errors (which account for both heteroskedasticity and autocorrelation) need a non-negligible number of clusters to work properly.
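    For instance, with the variable names from your -dataex- excerpt (a minimal sketch; -duplicates report- simply counts observations that are exact copies of one another):

    Code:
    * check whether any observations are exact copies across all variables
    duplicates report
    * declare the panel structure with the panel identifier only (no time variable)
    xtset borough_id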
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      Thanks for your help Carlo,

      I am fairly sure that the duplicates are real transactions rather than data-entry errors. Just to double-check: when you say -panelid-, does this refer to the time variable? For example, would the command be xtset year?

      When I run the regression using the -reg- command with vce(cluster borough), the output indicates that the standard errors are adjusted for 30 clusters in borough, yet I have 31 boroughs. Is this considered non-negligible, and if not, would I be able to draw conclusions from the results without accounting for heteroskedasticity?

      Many Thanks,
      Sami

      • #4
        Sami:
        1) Not quite: -panelid- is the panel identifier (eg, borough_id).
        Code:
        xtset borough_id
        2) Thirty out of 31 clusters (probably there is a missing-values issue in one of your boroughs) are more than enough to invoke vce(cluster panelid).
        Kind regards,
        Carlo
        (StataNow 18.5)

        • #5
          Thank you Carlo,

          I have managed to -xtset- it as panel data now, thanks.

          So, since I have enough clusters, I should cluster the standard errors?
          When I run the regression with vce(cluster borough_id), the p-values for my estimates all increase, with many of them becoming insignificant. Is this simply how the results come out, or does it indicate that it is not necessary to cluster the standard errors?

          Here are my results with and without vce(cluster borough_id):

          Without:

          Code:
          . xtreg lnprice crime traffic population stadium host games hostgames terraced det
          > ached new old
          note: old omitted because of collinearity
          
          Random-effects GLS regression                   Number of obs     =    145,756
          Group variable: borough_id                      Number of groups  =         30
          
          R-sq:                                           Obs per group:
               within  = 0.1133                                         min =          7
               between = 0.0281                                         avg =    4,858.5
               overall = 0.0197                                         max =     12,717
          
                                                          Wald chi2(10)     =   18495.56
          corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
          
          ------------------------------------------------------------------------------
               lnprice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 crime |  -.0000191   .0000314    -0.61   0.543    -.0000806    .0000424
               traffic |   .0000883   .0000558     1.58   0.114    -.0000211    .0001977
            population |   9.03e-06   2.55e-07    35.44   0.000     8.53e-06    9.53e-06
               stadium |  -.0163524   .0091394    -1.79   0.074    -.0342653    .0015604
                  host |   .2058706   .1896693     1.09   0.278    -.1658743    .5776156
                 games |   .1784945   .0044132    40.45   0.000     .1698449    .1871442
             hostgames |  -.0619856   .0070678    -8.77   0.000    -.0758382    -.048133
              terraced |   -.107063   .0036134   -29.63   0.000    -.1141452   -.0999807
              detached |   .4398813    .006853    64.19   0.000     .4264497     .453313
                   new |   .0815015   .0116493     7.00   0.000     .0586692    .1043337
                   old |          0  (omitted)
                 _cons |    10.8621   .1866905    58.18   0.000     10.49619      11.228
          -------------+----------------------------------------------------------------
               sigma_u |   .3305107
               sigma_e |  .54424537
                   rho |  .26942885   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------

          With:

          Code:
          . xtreg lnprice crime traffic population stadium host games hostgames terraced detac
          > hed new old, vce(cluster borough_id)
          note: old omitted because of collinearity
          
          Random-effects GLS regression                   Number of obs     =    145,756
          Group variable: borough_id                      Number of groups  =         30
          
          R-sq:                                           Obs per group:
               within  = 0.1133                                         min =          7
               between = 0.0281                                         avg =    4,858.5
               overall = 0.0197                                         max =     12,717
          
                                                          Wald chi2(10)     =    1627.89
          corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
          
                                      (Std. Err. adjusted for 30 clusters in borough_id)
          ------------------------------------------------------------------------------
                       |               Robust
               lnprice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 crime |  -.0000191   .0000953    -0.20   0.841    -.0002058    .0001676
               traffic |   .0000883   .0002914     0.30   0.762    -.0004828    .0006595
            population |   9.03e-06   1.12e-06     8.06   0.000     6.83e-06    .0000112
               stadium |  -.0163524   .0205717    -0.79   0.427    -.0566722    .0239673
                  host |   .2058706     .31171     0.66   0.509    -.4050697    .8168109
                 games |   .1784945   .0232975     7.66   0.000     .1328322    .2241568
             hostgames |  -.0619856   .0284953    -2.18   0.030    -.1178354   -.0061358
              terraced |   -.107063   .0482859    -2.22   0.027    -.2017015   -.0124244
              detached |   .4398813   .0340163    12.93   0.000     .3732106    .5065521
                   new |   .0815015   .0494967     1.65   0.100    -.0155102    .1785131
                   old |          0  (omitted)
                 _cons |    10.8621   .6206309    17.50   0.000     9.645684    12.07851
          -------------+----------------------------------------------------------------
               sigma_u |   .3305107
               sigma_e |  .54424537
                   rho |  .26942885   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          Many Thanks

          • #6
            Sami,

            I suspect you will need to cluster the standard errors (it is highly probable that housing prices within a borough are correlated). In fact, the difference between default and cluster-robust standard errors can be very large. The standard errors inflate more the greater the correlation within a cluster (imagine tracking people's salaries over time: it is very likely their salary in year 2 will be strongly correlated with their salary in year 1). The intuition is that each additional observation for a given person provides less than an independent piece of new information.

            Cameron and Trivedi's book Microeconometrics Using Stata shows how to calculate the standard-error inflation factor (see pp. 250-253).

            The standard-error inflation factor is F = (1 + Pu * Px * (T - 1))^(1/2)

            Where:
            • Pu is the intraclass correlation of the error (in your case, how much home prices within a borough are correlated with each other from year to year),
            • Px is the intraclass correlation of the regressor (i.e. for your variable crime, how much is crime correlated within a borough from year to year) and
            • T is the number of years of data in your panel.
            In Cameron & Trivedi's example, they have 7 years of data and a within-person correlation for salary of about 0.8 (a time-invariant variable like education has Px = 1). So for the regressor education, the standard-error inflation factor is F = (1 + 0.8 * 1 * 6)^(1/2) = 5.8^(1/2) = 2.41 when the standard errors are clustered by individual.
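            In Stata, that back-of-the-envelope calculation is just:

            Code:
            display sqrt(1 + 0.8 * 1 * (7 - 1))    // = 2.41, the inflation factor for education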

            Code:
            * To find the intraclass correlation of home prices within a borough:
            * (xtset your data first; the lag operator L. needs both panel and time variables)
            quietly regress lnprice crime traffic population stadium host games hostgames terraced detached new old, vce(cluster borough_id)
            predict uhat, residuals
            
            forvalues j = 1/6 {
                 quietly corr uhat L`j'.uhat
                 display "Autocorrelation at lag `j' = " %6.3f r(rho)
            }
            
            In the Cameron & Trivedi example (from p. 252) they got:
            Autocorrelation at lag 1 = 0.884
            Autocorrelation at lag 2 = 0.838
            Autocorrelation at lag 3 = 0.811
            Autocorrelation at lag 4 = 0.786
            Autocorrelation at lag 5 = 0.750
            Autocorrelation at lag 6 = 0.729
            The average of those 6 values is about 0.80
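            As an alternative to averaging the lagged correlations, Stata's built-in -loneway- command reports a one-way intraclass correlation directly (a sketch, reusing the uhat residuals generated above):

            Code:
            * intraclass correlation of the residuals within boroughs (Pu)
            loneway uhat borough_id
            * intraclass correlation of a regressor within boroughs (Px), e.g. crime
            loneway crime borough_id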
            Last edited by David Benson; 27 Apr 2019, 19:51.

            • #7
              Sami:
              In addition to David's helpful insight, you should not decide about clustering or not in light of the p-values of your coefficients: rather, you should cluster if you detect heteroskedasticity and/or autocorrelation in the distribution of your idiosyncratic error.
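              A minimal sketch of how you might check for both (xtserial and xttest3 are community-contributed commands from SSC; -xtserial- also requires the data to be -xtset- with a time variable):

              Code:
              * Wooldridge test for serial correlation in panel data
              ssc install xtserial
              xtserial lnprice crime traffic population stadium host games hostgames terraced detached new
              * modified Wald test for groupwise heteroskedasticity after -xtreg, fe-
              ssc install xttest3
              xtreg lnprice crime traffic population stadium host games hostgames terraced detached new, fe
              xttest3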
              Kind regards,
              Carlo
              (StataNow 18.5)

              • #8
                Thank you David and Carlo for your help!
