
  • Clustering Standard Errors

    Dear all,

    I have a question regarding clustering standard errors on industry.

    I have a cross-sectional dataset of 94 observations (firms) with variables such as EBIT one year before the deal, EBIT one year after the deal, etc.

    Furthermore, I added industry and year dummies. The industry dummies are based on the NACE Rev. 2 industry division (for example industry division C consists of NACE Rev.2 codes between 1000-3300).

    When running the multivariate OLS regressions, I want to cluster standard errors on industry to prevent industry shocks from influencing the standard errors (using vce(cluster variable)). However, does this need to be done at the four-digit NACE Rev. 2 level (around 45-55 clusters, depending on the dependent variable measure) or at the industry division level (as with the dummies; 13 clusters)? I have read that one of the problems of using few clusters is that OLS leads to “overfitting”, with estimated residuals systematically too close to zero compared to the true error terms. This leads to a downward-biased cluster-robust variance matrix estimate.

    Kind Regards,

    Arno Meijer

  • #2
    I want to cluster standard errors on industry to prevent industry shocks from influencing the standard errors (using vce(cluster variable)).
    That is not what clustered standard errors do. They adjust the standard errors to allow for the within-industry correlation of the residuals. They have nothing to do with industry shocks. Industry shocks are accounted for by industry-level fixed effects, or, since any firm stays in the same industry at all times, those shocks in a firm-level panel data set are accounted for by the firm-level fixed effects.

    However, does this need to be done at the four-digit NACE Rev. 2 level (around 45-55 clusters, depending on the dependent variable measure) or at the industry division level (as with the dummies; 13 clusters)?
    It is unwise to use cluster robust standard errors with a small number of clusters. The results are typically less valid than just using ordinary standard errors. There is no firm consensus on how many clusters suffice. 13 is rather borderline; some reviewers will accept it and others will criticize it.

    In general, you want to choose the clustering variable so that the errors are independent between clusters, but not necessarily within. So ordinarily if you have a hierarchy, you would choose the clustering variable at the top of the hierarchy. In this case, that puts you into risky territory in terms of the number of clusters being small.

    I think I would try it both ways and see if the results differ much. If you are lucky, they won't.
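
A minimal sketch of what "trying it both ways" looks like, written in plain Python rather than Stata so the arithmetic is visible. Everything here is an assumption for illustration: the data are simulated, there is a single regressor instead of the full model, and the names `slope_and_se`, `code4`, and `division` are hypothetical. The finite-sample factor G/(G-1) * (N-1)/(N-K) is the usual one for clustered OLS.

```python
import math
import random


def slope_and_se(x, y, clusters=None):
    """OLS slope of y on x (with intercept) and its standard error.

    clusters=None gives the conventional standard error; otherwise a
    cluster-robust one with the finite-sample factor
    G/(G-1) * (N-1)/(N-K), where K = 2 (intercept and slope).
    """
    n = len(x)
    k = 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xd = [xi - x_bar for xi in x]  # x with the intercept partialled out
    sxx = sum(d * d for d in xd)
    beta = sum(d * (yi - y_bar) for d, yi in zip(xd, y)) / sxx
    alpha = y_bar - beta * x_bar
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]

    if clusters is None:
        sigma2 = sum(u * u for u in resid) / (n - k)
        return beta, math.sqrt(sigma2 / sxx)

    # Sum the scores (x - x_bar) * residual within each cluster, then square.
    scores = {}
    for d, u, g in zip(xd, resid, clusters):
        scores[g] = scores.get(g, 0.0) + d * u
    n_g = len(scores)
    meat = sum(s * s for s in scores.values())
    correction = n_g / (n_g - 1) * (n - 1) / (n - k)
    return beta, math.sqrt(correction * meat) / sxx


# Simulated stand-in for the data: 94 firms, 42 four-digit codes nested in 12 divisions.
random.seed(12345)
n_firms = 94
code4 = [i % 42 for i in range(n_firms)]   # hypothetical four-digit codes
division = [c % 12 for c in code4]         # each code belongs to exactly one division
div_x = [random.gauss(0, 1) for _ in range(12)]
div_u = [random.gauss(0, 1) for _ in range(12)]  # division-level error component
x = [div_x[d] + random.gauss(0, 1) for d in division]
y = [1.0 + 0.5 * xi + div_u[d] + random.gauss(0, 1) for xi, d in zip(x, division)]

results = {
    "conventional": slope_and_se(x, y),
    "cluster: division (12)": slope_and_se(x, y, division),
    "cluster: four-digit (42)": slope_and_se(x, y, code4),
}
for label, (b, se) in results.items():
    print(f"{label:26s} slope = {b:.4f}   se = {se:.4f}")
```

The slope is identical in all three rows; only the standard error moves with the choice of clustering variable, which is the comparison suggested above.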
    Last edited by Clyde Schechter; 01 Sep 2017, 12:01.



    • #3
      Dear Clyde,

      Thanks for the clear explanation regarding cluster robust standard errors.

      In the attachments, I have added the results of the EBIT margin regressions (standard, clustered on NACE Rev. 2 division, and clustered on NACE Rev. 2 four-digit codes). The only things missing from the pictures are the industry dummies and the constant, but they are included in the regressions. As the results show, the standard errors differ quite a bit between the industry division and four-digit clustering. What would be the explanation for this, and what would be wise to implement given these results?

      Standard

      [Attached image: Regression standard.jpg]


      Cluster NACE Rev. 2 industry division
      [Attached image: Regression cluster Nace Rev. 2 industry division.jpg]


      Cluster NACE Rev. 2 four digits (here 42 clusters)
      [Attached image: Regression cluster Nace Rev. 2 Four-digit.jpg]


      I hope these results provide some clarification of the issue.

      Kind regards,

      Arno Meijer



      • #4
        Well, clustering the VCE never affects the regression coefficients: they always remain unchanged. All of the difference is in the standard errors (and the confidence intervals, t-statistics and p-values, all of which derive from the standard errors.)
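
To make this concrete, here are the textbook formulas in standard notation (the symbols are not from the posts above and are introduced only for illustration: X is the regressor matrix, y the outcome, û_g and X_g the OLS residuals and regressors for cluster g, G the number of clusters, N the number of observations, K the number of estimated coefficients). The leading factor is the usual finite-sample adjustment for clustered OLS.

```latex
\hat{\beta} = (X'X)^{-1} X'y
\qquad
\widehat{V}_{\text{cluster}}(\hat{\beta})
  = \frac{G}{G-1}\,\frac{N-1}{N-K}\,
    (X'X)^{-1}
    \Bigl( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \Bigr)
    (X'X)^{-1}
```

The cluster variable enters only through the grouping in the middle sum, so changing it changes the variance estimate while the coefficient formula on the left is untouched.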

        It turns out that you actually have only 12 clusters, not 13, when you cluster on industry division. So that pushes things a bit further away from accepting that analysis.

        Looking at the clustered outputs, I would actually say that the results are mostly similar. There are a few variables where the standard errors are more than just a little different between the two clustered analyses, but most of them are about the same. You have never said in this thread what the goal of your analysis is. I assume that most of the variables are included simply to adjust for their potential confounding effects and that only one or two of the variables are actually of direct interest. If the standard errors are largely similar for the variables of direct interest, that is all that matters. Nobody cares about the standard errors of the variables that were included just for adjustment purposes. So, if I were you, I would focus my attention on the focal variables and ignore the others.

        Probably the way I would approach this is to look at the confidence interval(s) for the focal variable(s). Ask yourself: would anybody care, or do anything differently in the world, if it were one of these confidence intervals instead of the other? If the answer to that is no, then you can regard the two sets of results as essentially equivalent, and you could report either and simply say that you also did it the other way and the results were not materially changed.



        • #5
          Arno:
          as an aside to Clyde's helpful advice, I suspect that you're asking too much of your regression model.
          The rule of thumb states that 20 observations per predictor are advised for multiple linear regression (Katz MH. Multivariable Analysis. Second Edition. NY: Cambridge University Press, 2006: 81), even though 10 observations per predictor may sound wise enough.
          In your case, you have, at best, two observations per predictor.

          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Thanks for the clarification, Clyde and Carlo.

            PEI firm rank, earlier acquisitions, industry/total deals, cash and secondary buyouts are the focal variables in this analysis; the other variables are control variables.

            I have another question: is it necessary that the industry dummies are defined at the same level as the variable on which the standard errors are clustered? Or is it fine to have 12 industry dummies and to cluster the standard errors at a four-digit or two-digit level? In my analysis I use industry dummies on the basis of industry divisions (12 clusters in total).

            Kind regards,

            Arno



            • #7
              Arno:
              clustering on the four-digit code brings about an imbalance between the number of observations and the number of clusters (as you can discover yourself by clicking on the missing F-test).
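
One possible reading of a missing F statistic under clustering, offered as a sketch and not as a diagnosis of this particular output. The notation is assumed for illustration: X_g and û_g are the regressors and OLS residuals for cluster g, and G is the number of clusters. Because the OLS residuals are orthogonal to the regressors, the cluster scores sum to zero, which caps the rank of the middle term of the cluster-robust variance matrix:

```latex
\sum_{g=1}^{G} X_g'\hat{u}_g = X'\hat{u} = 0
\;\Longrightarrow\;
\operatorname{rank}\Bigl( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \Bigr) \le G - 1
```

Under this reading, a joint test of more than G - 1 coefficients cannot be computed from the clustered variance matrix, so a model with many dummies relative to the number of clusters would show a missing overall F.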
              Kind regards,
              Carlo
              (Stata 19.0)

