
  • How to deal with multicollinearity when adding fixed-effects dummies in a regression with cross-sectional data?

    Hello,

    I have cross-sectional data with 26 groups. I estimated probit and fracreg regressions for my two research questions. Since my key explanatory variable varies at the group level, I added group dummies and clustered the standard errors at the group level as well. I then estimated VIFs by running the -reg- command with the same variables I used in the probit and fracreg regressions. The VIF for my key explanatory variable is very high (around 28,000) when I add the group dummies, but when I remove them it is within the acceptable range (around 4). Why is the multicollinearity so high when I add the dummies, and how can I fix the issue?
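    For intuition, here is a minimal sketch (in Python/numpy, with made-up data, not the actual dataset or Stata code from this thread) of what the VIF measures and why a regressor that is nearly constant within groups gets an enormous VIF once group dummies are added:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: 26 groups, and a regressor that varies (almost) only
# at the group level, as described in the post.
n_groups, per_group = 26, 30
group = np.repeat(np.arange(n_groups), per_group)
x = rng.normal(size=n_groups)[group] + rng.normal(scale=0.01, size=group.size)

def vif(x, controls):
    """VIF of x = 1/(1 - R^2), with R^2 from regressing x on the controls."""
    X = np.column_stack([np.ones(len(x)), controls])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    r2 = 1 - (x - X @ beta).var() / x.var()
    return 1.0 / (1.0 - r2)

dummies = (group[:, None] == np.arange(1, n_groups)).astype(float)  # base group omitted

print(vif(x, rng.normal(size=(group.size, 3))))  # unrelated controls: VIF near 1
print(vif(x, dummies))                           # group dummies absorb x: VIF explodes
```

    The group dummies reproduce x almost perfectly, so the auxiliary R^2 is essentially 1 and the VIF blows up; that is exactly the mechanism behind a VIF like 28,000.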

  • #2
    There are two views on the so-called multicollinearity problem.

    1) That it is not a problem, because nothing is false or misleading in a regression with high multicollinearity -- the coefficient estimates have the usual properties, the standard errors and covariances of the estimates are correct, everything is correct.

    2) That it is a problem because you cannot estimate anything precisely (that is, your variables of interest are not significant in your multicollinear regression). Then the only solution is to go out and collect more data that is not as multicollinear as your current data.


    • #3
      Thank you Joro Kolev
      Do you have any references to the first option so that I can back my decision with evidence?


      Thank you


      • #4
        Originally posted by Laiy Kho View Post
        Thank you Joro Kolev
        Do you have any references to the first option so that I can back my decision with evidence?


        Thank you
        The references are

        1) Gueorgui I. Kolev said so on Statalist. Everyone has to trust science and the experts; if an expert such as Gueorgui I. Kolev said it, it must be true :P.

        2) There is a fun reference in the late Arthur Goldberger's textbook "A Course in Econometrics", where Prof. Goldberger mocks the imaginary "problem of multicollinearity" by calling it "the problem of micronumerosity". In short, when you have a small sample and you try to estimate, say, the mean (linear regression is just that: it estimates the conditional mean), you have the problem of micronumerosity. The problem of micronumerosity is that your standard errors are large. When you have a sample of one observation, the problem is severe: the standard error of the estimate is not even defined.

        3) You can read any econometrics textbook that has a section/chapter on the "problem of multicollinearity" and just observe that nobody claims it causes any other problems, apart from your standard errors being "too big" -- too big relative to what you want them to be, not relative to the correct ones. That is, the standard errors are just as big as they are supposed to be, but you, the econometrician, do not like this fact.
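        A quick simulation (a Python/numpy sketch of my own, not anything from the thread) makes point 1) concrete: with two heavily collinear regressors, OLS is still unbiased and the classical standard errors still match the true sampling variability -- they are just large:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
betas, ses = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)   # heavily collinear with x1
    y = 1.0 + x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - 3)              # classical sigma^2 estimate
    cov = s2 * np.linalg.inv(X.T @ X)         # classical OLS covariance matrix
    betas.append(beta[1])
    ses.append(np.sqrt(cov[1, 1]))

print(np.mean(betas))                # close to the true value 1: no bias
print(np.mean(ses), np.std(betas))  # reported SE tracks the actual sampling SD
```

        The standard errors are large (because of the collinearity), but they are not wrong: on average they equal the true spread of the estimates across samples.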


        • #5
          Since my key explanatory variable varies at the group level, I added group dummies and clustered the standard errors at the group level as well.
          You don't show the exact code you ran, but this description sounds like a fatal error in the analysis. Look closely at your -probit- and -fracreg- output. Before the coefficient table, do you see a message saying that some value(s) of the group variable were omitted because of collinearity? (If not, there is no need to read the rest of this post.)

          If so, your analysis is invalid with respect to the group variables themselves and any variable that varies at the group level, including your key explanatory variable. This is because if you have a variable that varies at the group level and you also include the indicators ("dummies") for the groups themselves, then you have not multicollinearity but exact collinearity, which means the model is not identifiable. When confronted with such a situation, Stata (or any other statistical package) will identify the model by imposing some constraint(s) to remove the exact collinearity. Usually this is done by omitting one of the variables involved, which is equivalent to imposing the constraint that its coefficient is zero. But the choice of constraint is arbitrary, and the coefficients of all the remaining variables are meaningless artifacts of that particular choice of constraint. Had some other variable been omitted, the results could be entirely different. In fact, mathematically, it is possible to devise a constraint that will result in your key variable's coefficient being any value whatsoever.

          If you have experience with fixed-effects regression, you know that you cannot estimate a group-level variable's effect in a fixed-effects regression (with group-level fixed effects). When you use the -xt- commands for this, they never fiddle with the group-level fixed effects: they just drop the group-level explanatory variable(s) for you. Since everybody looks in the output for the key variable, you can't help but notice that something has gone wrong. But when you use -probit- and -fracreg- with i.group variables, the problem is still there, yet you are not protected in that way: Stata just finds an exact collinearity among the key variable and the group indicators and picks one or more variables to remove. It tells you it is doing so, but if you do not pay explicit attention to all the details, it is easy to overlook the fact that Stata has just handed you meaningless results for your key variable.
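          The exact collinearity is easy to see in miniature (a Python/numpy sketch with toy numbers, not the poster's data): a full set of group indicators already spans every group-level column, so the design matrix loses rank:

```python
import numpy as np

n_groups, per_group = 5, 4
group = np.repeat(np.arange(n_groups), per_group)
group_var = np.array([1.0, 2.0, 3.0, 4.0, 5.0])[group]  # varies only across groups

dummies = (group[:, None] == np.arange(n_groups)).astype(float)  # full set of indicators
X = np.column_stack([np.ones(group.size), group_var, dummies])

# The dummies span every group-constant column, including the intercept and
# group_var, so X has fewer independent columns than it has columns:
print(X.shape[1], np.linalg.matrix_rank(X))  # 7 columns, rank 5
```

          Any column that is constant within groups lies in the span of the dummies, which is exactly why the estimator must drop something, and why which thing it drops is arbitrary.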


          • #6
            Clyde Schechter Thank you

            Before the coefficient table, do you see a message saying that some value(s) of the group variable were omitted due to collinearity?
            Yes, you are right. I did have some dummies dropped due to collinearity. However, I added the group-level dummies after reading the following post:

            https://www.statalist.org/forums/for...sectional-data

            If the key explanatory variable varies mostly at the bank level, then cluster at the bank level. No need to worry about serial correlation.
            Also,

            https://www.statalist.org/forums/for...ata-regression

            If you have many region/industry categories I would find a way to obtain a unique identifier and then use it in xtreg or reghdfe. Or, you can construct the dummies and use a (long) regression:
            From my understanding, both of these posts suggest including a factor variable for group to account for group-level differences (please correct me if I'm wrong). I have 28 groups.

            This is because if you have a variable that varies at the group level and you also include the indicators ("dummies") for the groups themselves, then you have not multicollinearity but exact collinearity, which means the model is not identifiable.
            However, your point makes sense. Do you suggest I remove the factor variable for groups?


            • #7
              From my understanding, both these posts suggest including a factor variable for group to account for group level differences (please correct me if I'm wrong).
              I don't have time today to review those posts and see whether you are misinterpreting them, taking those sentences out of context, or whether the posts themselves are wrong. Suffice it to say that if you want to adjust, in one stroke, for all group-level variables, the way to do that is to include group-level indicators ("dummies"). BUT once you do that, you cannot also include particular group-level variables and get meaningful results.

              Do you suggest I remove the factor variable for groups?
              It depends on your specific research goals. You described the variables in question as key explanatory variables, so I infer that estimating their effects is necessary to achieve your research goals. If that is the case, then yes, you must remove the factor variable for groups. On the other hand, if these variables' effects are not important to your research goals, I would leave the factor variable for groups in but remove those other variables.
