
  • How do I fix collinearity with fixed effects, and how do I fix the missing F statistic?

    Using Stata I ran regressions with n=1101 observations. ID is a unique firm identifier and Year is a year variable.

    I ran two different regressions: a simple OLS, and a model with firm and year fixed effects and clustered standard errors.

    regress Y1 Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3 PROBLEM1 COUNTRY
    reghdfe Y1 Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3 PROBLEM1 COUNTRY, absorb(ID Year) cluster(ID Year)

    If I run the simple OLS regression, everything looks fine.
    But when I include fixed effects and standard error clustering, Stata warns me about:
    1. dropped 6 singleton observations
    2. PROBLEM1 is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
    3. COUNTRY is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
    4. Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller applied.
    5. missing F statistic; dropped variables due to collinearity or too few clusters
    1. How can I fix the design to include firm and year fixed effects?
    PROBLEM1 is a variable equal to 1 if the firm is audited by a Big 4 firm, which usually does not change over time.
    COUNTRY is equal to 1 if the firm is located in a specific country. Of the 1101 observations, 420 are coded COUNTRY=1. How can I fix my design so that these variables are not excluded?

    2. Is it a big issue that I included so many dummy variables, apart from interpretation?

    3. Alternatively, I could include industry fixed effects, but I would love to include Year and Firm fixed effects/Understand why my model right now is not working.

    Best regards
    Last edited by Luca Haseney; 27 Dec 2022, 12:42.

  • #2
    1. How can I fix the design to include firm and year fixed effects?
    Well, you've already done that. You're just not happy with the results. And you're hoping that you can also get estimates for the PROBLEM1 and COUNTRY variables' effects on Y1. Sadly, you can't do that, at least not within the framework of a fixed effects model. It is mathematically impossible to estimate the effects of variables that do not vary within groups of observations defined by the fixed effects.

    So you should stop and think about why you want PROBLEM1 and COUNTRY in the model in the first place. If they are truly variables whose effects are important to your research, then you will have to abandon fixed-effects modeling. You might analyze the data using a mixed-effects model. On the other hand, if PROBLEM1 and COUNTRY are being included only because you want to account for their confounding effects on the relationships between the important independent variables and Y1, then the fixed effects model does that for you automatically--not only is it impossible to include them in the fixed-effects model, there is no need to do so because the fixed effects themselves carry that information and the results are thereby already adjusted for PROBLEM1 and COUNTRY effects.
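
    Whether a regressor varies within firms at all can be checked directly before fitting anything. The following is a minimal sketch, assuming the poster's data are in memory with the variable names used in the thread; sd_problem1 and firm_tag are hypothetical helper names introduced only for illustration.

    ```stata
    * Declare the panel structure: firm identifier and year.
    xtset ID Year

    * xtsum splits each variable's variation into between-firm and within-firm parts.
    * A within standard deviation of zero means the firm fixed effects absorb the variable.
    xtsum PROBLEM1 COUNTRY

    * Count the firms whose auditor indicator ever changes during the sample period.
    bysort ID: egen double sd_problem1 = sd(PROBLEM1)
    egen byte firm_tag = tag(ID)
    count if firm_tag & sd_problem1 > 0 & !missing(sd_problem1)
    ```

    If the final count is zero, no firm changes PROBLEM1 over the sample, which would be consistent with the collinearity warning reported above.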

    Is it a big issue that I included so many dummy variables, apart from interpretation?
    I'm not sure what you're asking here. If you are responding to the "too few clusters" to calculate an overall F-statistic message, then, indeed, this is happening because the number of variables exceeds the number of clusters, thereby using up all the degrees of freedom. But it's not because they are indicator ("dummy") variables: the same thing happens with any kind of independent variable. Each independent variable uses 1 df, and the available df are limited by the number of clusters in the model. That said, why do you care about the overall model F-statistic? That statistic just tests the null hypothesis that all of the coefficients of the independent variables are simultaneously zero. Is testing that hypothesis one of your research goals? Usually it isn't, because usually these models include at least some variables that we don't really care about but include because we need to adjust for their confounding effects. So normally we have no interest in whether their coefficients are zero or not. So the overall F test is usually irrelevant, and its absence is not a problem. Only if that omnibus null hypothesis test is part of your research goals would it matter.
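
    To see how many clusters are actually available, the distinct firms and years can be counted. This is a hedged sketch, assuming the variable names from the original post; one_per_firm and one_per_year are hypothetical helper names, and treating the six dummies as the regressors of interest is an assumption made only for illustration. As far as I know, with two-way clustering the available degrees of freedom are governed by the dimension with fewer clusters, so a short panel leaves very few.

    ```stata
    * Count distinct firms and distinct years, i.e. the clusters in each dimension.
    egen byte one_per_firm = tag(ID)
    egen byte one_per_year = tag(Year)
    count if one_per_firm
    count if one_per_year

    * Same fixed effects as the original model, but clustered on firm only.
    reghdfe Y1 Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3, absorb(ID Year) vce(cluster ID)

    * A joint test of a chosen subset of coefficients, in place of the omnibus F test.
    test Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6
    ```

    Whether clustering on firm alone is appropriate is a substantive choice; the sketch only shows how the cluster counts and a targeted joint test could be obtained.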

    Alternatively, I could include industry fixed effects, but I would love to include Year and Firm fixed effects/Understand why my model right now is not working.
    So what you really have is a model with more than just two levels. You have yearly observations nested within firms, which are, in turn, nested within industries and countries, and the industries are, I assume, crossed with countries. So an attempt to do this as an ordinary fixed-effects regression is a mis-specification of the data generating process. This is a problem one frequently runs into because fixed-effects models cannot represent more than two levels of nesting and in economics and finance there is a strong preference for fixed-effects models. But if you want to get effects of time-invariant variables we have already established that it cannot be done with fixed-effects models. So now you have two reasons to prefer a mixed-effects model. The fact that you have two good reasons won't necessarily protect you from the criticism that mixed-effects models may provide inconsistent estimates if the random effects are correlated with the predictors. So, in the end, there may be no totally satisfactory way to model this data. A compromise position like using industry fixed-effects is still wrong, but it might, in fact, be the most useful model you can get. But if I were in your shoes, I'd first look at a correlated random effects analysis (Mundlak model) here, retaining firm as the fixed effect. The -xthybrid- command, available from SSC estimates these. It will provide you with within-firm estimates for the time-varying variables that are the same as you would get from a pure fixed effects model, but it will also give you between-firm effect estimates for the time-invariant ones.
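
    The correlated random effects (Mundlak) idea can also be sketched by hand with built-in commands. This is only an illustration under stated assumptions: the variable names come from the original post, the mean_ prefix is a hypothetical naming choice, and the xthybrid call is left commented out because its options (clusterid() and cre) are written from memory of the help file and should be checked with -help xthybrid-.

    ```stata
    * Time-varying regressors, as named in the original post.
    local tv Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3

    * Firm-level means of each time-varying regressor (the Mundlak terms).
    * Ideally these are computed over the estimation sample only.
    local means
    foreach v of local tv {
        bysort ID: egen double mean_`v' = mean(`v')
        local means `means' mean_`v'
    }

    * Random intercept for firm. With the firm means included, the coefficients on the
    * time-varying regressors are driven by within-firm variation, while the
    * time-invariant PROBLEM1 and COUNTRY are identified from between-firm variation.
    mixed Y1 `tv' `means' PROBLEM1 COUNTRY i.Year || ID:

    * The SSC command mentioned above (syntax is an assumption; see its help file):
    * ssc install xthybrid
    * xthybrid Y1 `tv' PROBLEM1 COUNTRY, clusterid(ID) cre
    ```

    In an unbalanced panel the hand-built version will not reproduce the fixed-effects estimates exactly unless the means are taken over the same observations that enter the model.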



    • #3
      Cross-posted at https://stats.stackexchange.com/ques...ng-f-statistic

      Please note our policy on cross-posting, which is that you are asked to tell us about it. https://www.statalist.org/forums/help#crossposting



      • #4
        Dear Mr. Schechter,

        thank you very much for the explanations!

        Originally posted by Clyde Schechter:
        Well, you've already done that. You're just not happy with the results.
        You are completely right about this.
        To make it more concrete: my reason for including COUNTRY was to test the hypothesis that the judicial setting has an impact on the outcome variable.
        And PROBLEM1 is a variable that indicates whether the company is audited by a specific auditor.

        I am still not sure why other studies can include the auditor variable. Could a possible reason be that these studies use a longer time horizon and some of the firms change their auditor within this timeframe? This would actually be plausible, because auditor tenure is 5 years and my study covers only 5 years.

        Originally posted by Clyde Schechter:
        So the overall F test is usually irrelevant, and its absence is not a problem. Only if that omnibus null hypothesis test is part of your research goals would it matter.
        Thank you again for the explanation, my question was foggy but you are right.
        I am hesitant to include a table with a missing F statistic. So far I have learned that it describes the "goodness" of the whole model, so if it is missing, must that mean my model is useless? If I understand you correctly, the F-statistic is irrelevant and I can still interpret my independent variable statistics?

        Originally posted by Clyde Schechter:
        The -xthybrid- command, available from SSC estimates these. It will provide you with within-firm estimates for the time-varying variables that are the same as you would get from a pure fixed effects model, but it will also give you between-firm effect estimates for the time-invariant ones.
        I tried xthybrid and it also omitted the COUNTRY and PROBLEM1 variables. And to be honest, I have problems interpreting the output.

        I have now tried different models, but I am unsure which one to choose. Please comment if my thoughts are flawed:

        1. Include Time and Firm fixed effects, clustered by industry.
        This would require me to drop both COUNTRY and PROBLEM1 because of collinearity with the fixed effects. Also, the F statistic would be missing because the degrees of freedom are used up. This is my least favorite option, as I have to exclude COUNTRY and PROBLEM1.

        2. Include Industry and Year fixed effects, clustered by firms.
        This allows me to have a large enough number of clusters and to include COUNTRY and BIGN (PROBLEM1). On the other hand, clustering by firm is the lowest level I can use, and I have read that one should actually cluster at the highest possible level.

        3. Include Country and Year fixed effects, clustered by firms.
        This seems like a rather stupid model, because of course I want to test for the country influence, and removing it via the fixed effects seems counterproductive.

        4. Completely abandon fixed effects and use a simple OLS regression.
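
        The four candidate specifications listed above could be fitted and placed side by side roughly as follows. This is a sketch under assumptions: Industry is a hypothetical name for an industry identifier that the thread mentions but never names, and the stored-estimate names are arbitrary.

        ```stata
        * Time-varying regressors, as named in the original post.
        local x Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3

        * 1. Firm and year fixed effects, clustered by industry
        *    (COUNTRY and PROBLEM1 are left out because the firm effects absorb them).
        reghdfe Y1 `x', absorb(ID Year) vce(cluster Industry)
        estimates store fe_firm

        * 2. Industry and year fixed effects, clustered by firm.
        reghdfe Y1 `x' PROBLEM1 COUNTRY, absorb(Industry Year) vce(cluster ID)
        estimates store fe_industry

        * 3. Country and year fixed effects, clustered by firm (COUNTRY itself is absorbed).
        reghdfe Y1 `x' PROBLEM1, absorb(COUNTRY Year) vce(cluster ID)
        estimates store fe_country

        * 4. Pooled OLS without fixed effects.
        regress Y1 `x' PROBLEM1 COUNTRY
        estimates store ols

        * Coefficients and standard errors of all four models in one table.
        estimates table fe_firm fe_industry fe_country ols, b(%9.3f) se stats(N)
        ```

        Comparing the coefficients on the time-varying regressors across columns gives a rough sense of how sensitive the results are to the choice of fixed effects.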






        • #5
          I am still not sure why other studies can include the auditor variable. Could a possible reason be that these studies use a longer time horizon and some of the firms change their auditor within this timeframe? This would actually be plausible, because auditor tenure is 5 years and my study covers only 5 years.
          That is almost certainly the case. Or the analyses may not have been based on fixed-effects models.

          If I understand you correctly, the F-statistic is irrelevant and I can still interpret my independent variable statistics?
          Yes.

          Please comment if my thoughts are flawed
          Your understanding of the various models you propose is correct. Look, there is no ideal solution to your problem. You have to choose whatever approach you find to be least bad. The problem arises because you have (at least) 3-level data, and there are no identifiable 3-level fixed-effects models. So no matter what you do, something will be misspecified. You can escape this problem by using a random-effects model, but that has drawbacks of its own. So you will have to weigh the pros and cons and decide which compromise with these limitations you can best live with.

          By the way, I would not go to a simple OLS regression model here. An OLS model is subject to the same inconsistency issues as a random effects model, but does not account for the hierarchical structure of your data. So you pay the price of a mixed effects model and gain none of its benefits. So if you are going to step outside the confines of fixed-effects models, go to a random-effects model. (If the results of that model suggest that the random effects are ignorable, then, yes, go back to OLS for its simplicity, but that is unlikely to happen.)
          Last edited by Clyde Schechter; 30 Dec 2022, 09:08.
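
          Whether the firm-level random effects look ignorable can be examined with built-in commands. The following is a minimal sketch, assuming the variable names from the original post and default (non-robust) standard errors; it is one possible check, not necessarily the one intended in the reply above.

          ```stata
          * Declare the panel structure: firm identifier and year.
          xtset ID Year

          * Time-varying regressors, as named in the original post.
          local x Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 X1 X2 X3

          * Random-effects model with year indicators; time-invariant regressors stay in.
          xtreg Y1 `x' PROBLEM1 COUNTRY i.Year, re

          * Breusch-Pagan Lagrange multiplier test of the null that the variance of the
          * firm-level random effect is zero (i.e. that pooled OLS would be adequate).
          xttest0
          ```

          A clear rejection would suggest that the firm-level effects are not ignorable and that pooled OLS is not an adequate simplification.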



          • #6
            Dear Mr. Schechter,

            what I am just wondering about is the three-level data. Does not every econometric study have 3 level data?
            I mean, many different studies examine companies across countries, and these are always crossed with/part of industries. Still, firm and year fixed effects are the most common choice. Is it that all the models are misspecified then? Or does this problem mainly arise because I use the country variable in my data and want to include it as an independent variable?



            • #7
              Does not every econometric study have 3 level data?
              I mean, many different studies examine companies across countries, and these are always crossed with/part of industries. Still, firm and year fixed effects are the most common choice. Is it that all the models are misspecified then?
              As I don't follow the econometrics literature, I don't know what fraction of studies are international. Suffice it to say that if a study collects repeated data on firms, and if the firms are in turn nested within countries and industries, but a two-level analysis is done, then, yes, the analysis is a mis-specification of the real world data generating process. If, in reality, the country and industry effects are negligible, then the mis-specification is not a problem. And even in the face of substantial country or industry effects, a model that (implicitly) averages over them and provides an automatic, though silent, adjustment for them, though still wrong, may nevertheless be useful. Or not, depending on the use to which results will be put.

              But as you note in your final sentence, the problem becomes glaringly obvious when one of the research goals is to estimate the higher-level effects. Then one is forced to make difficult choices and not simply sweep the difficulties under the rug.

