  • Var != 0 predicts failure perfectly (Logit-Model)

    Dear Stata community,

    I am running a logit model, and I am facing a problem that has already been discussed here: variable != 0 predicts failure perfectly. However, the earlier discussions have not solved my problem.

    When I include country fixed effects, I run into this problem, and it costs me nearly half of my observations. Some answers in this forum suggested using -firthlogit- or -exlogistic-; however, these commands do not work in my case because I am using time-series operators and factor variables. Other answers suggested keeping the model as it is, since the data do not provide enough information about how changes in the predictor variables are associated with changes in the outcome.

    Before this model, I ran three other models (starting with a bivariate model, then adding control variables, then time fixed effects). In these three models, I do not face the problem and no observations are dropped for this reason. I want to compare my results across these models, and I am wondering whether this is possible at all given that so many observations were dropped.

    Here are the numbers of observations:
    Bivariate model: 18488 | Model with all controls: 9981 | Time fixed effects: 9840 | Time and country fixed effects: 4353

    And here is the "error" message from Stata (as an example):
    note: 41.country_id != 0 predicts failure perfectly; 41.country_id omitted and 29 obs not used.
    Country_id is a unique ID designed for every country.

    Notably, a few years have also been deleted: note: 1901.year != 0 predicts failure perfectly; 1901.year omitted and 35 obs not used.

    My idea right now is to include country&time fixed effects for all models so I have a comparable observation number for all models.
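
    One way to hold the sample fixed is to flag the estimation sample of the most restrictive model and re-fit the simpler models on it. This is only a sketch: outcome, x, c1 and c2 are hypothetical names standing in for my outcome, main predictor and controls.

    ```stata
    * Hypothetical variable names: outcome (0/1), x, c1, c2
    * Fit the most restrictive model first (time and country fixed effects)
    logit outcome x c1 c2 i.year i.country_id

    * e(sample) equals 1 for the observations actually used in that fit
    generate byte in_fe_sample = e(sample)

    * Re-fit the simpler models on the same observations
    logit outcome x              if in_fe_sample
    logit outcome x c1 c2        if in_fe_sample
    logit outcome x c1 c2 i.year if in_fe_sample
    ```

    The observation counts of the four models should then agree, which is worth confirming from the reported N after each fit.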

    Thank you very much for your help in advance!

  • #2
    I think I am getting closer to the problem/solution. I just realised that it is not an explanatory variable that predicts failure perfectly; it is my country identifier as well as my year identifier. Maybe this points to some issue in my dataset?


    • #3
      I am still struggling with this problem. Is there anyone who can help me?


      • #4
        What you are discovering in your data is that for certain countries in certain years, all of the outcomes were 0. There are several possible reasons for this:
        1. For real structural reasons, outcome 1 never can occur in those countries in those years. This problem would persist even if you had infinite data at your disposal.
        2. Outcome 1 can occur in those countries in those years, but perhaps due to limited sampling, your data set just didn't include any such instances.
        3. Your data are incorrect.
        Situation 1 is not fixable, and requires a different approach to modeling this data. Logistic regression is estimated by maximum likelihood, and for these countries and years the likelihood keeps increasing as the corresponding coefficient goes to negative infinity, so no finite maximum likelihood estimate exists. Stata (and all other statistical packages I know of) anticipates that there is no possibility for the estimation to converge, and seeks out problems like this before estimating. It resolves them by removing the offending cases. This simply means that the logistic model does not apply to those country-year combinations. And no model is needed for those country-year combinations because the very simple model of "outcome = 0" is a sufficient and perfectly accurate model for them. It should also be pointed out that the loss of data in the estimation sample is not actually a problem, because the excluded observations are uninformative about the regression coefficients anyway. You are losing data, but not losing information.
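
        To see which countries and years are affected before fitting anything, a sketch along these lines may help. It assumes the binary outcome is stored in a variable called outcome (a hypothetical name; country_id and year are taken from post #1).

        ```stata
        * Hypothetical outcome variable: outcome (coded 0/1)
        * Within each country, record whether the outcome is ever 1
        bysort country_id: egen byte ever1_country = max(outcome)
        * Within each year, record whether the outcome is ever 1
        bysort year: egen byte ever1_year = max(outcome)

        * List the countries and years in which the outcome is never 1
        tabulate country_id if ever1_country == 0
        tabulate year       if ever1_year == 0

        * Rough count of observations the fixed-effects logit would set aside
        count if ever1_country == 0 | ever1_year == 0
        ```

        The count is only approximate, because the model's estimation sample also excludes observations with missing values on any covariate.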

        Situations 2 and 3 are, ideally, fixed by improving the data set. Of course, in reality this may be infeasible. More data may simply not be available, and erroneous outcome assessments may not be recognizable. For situation 2, at least, you can get around the problem by using -firthlogit- or -exlogistic-, which do not rely on ordinary maximum likelihood estimation. -firthlogit- uses penalized maximum likelihood, which pulls the otherwise infinite coefficient estimate to a finite value. -exlogistic- uses a combinatorial approach that is analogous to the Fisher exact method of analyzing contingency tables. The fact that you are using factor-variable notation is only a minor problem: you can use the mostly-obsolete -xi- command to create your own indicator variables. Similarly, for time-series operators, you can simply create your own lagged and lead variables as separate variables. It's clumsy and inconvenient, and you will not be able to use -margins- for post-estimation, but you will still get results from -firthlogit-.
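
        The workaround described above might look roughly like this. -firthlogit- is a community-contributed command (available from SSC); outcome and x are hypothetical variable names, and the sketch assumes the data are one observation per country and year so that -xtset- succeeds.

        ```stata
        * ssc install firthlogit      // community-contributed; run once

        * Hypothetical names: outcome (0/1), x; country_id and year as in post #1
        xtset country_id year

        * Replace the time-series operator L.x with an ordinary variable
        generate x_lag1 = L.x

        * Let -xi- create the country and year indicator variables (named _I*)
        xi i.country_id i.year

        * Penalized-likelihood logit on the hand-made variables
        firthlogit outcome x_lag1 _I*
        ```

        With many indicator variables and several thousand observations this can take a while to fit.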

        Given the size of your data set, I do not recommend using -exlogistic- here. It is likely to "blow up" in terms of the amount of memory and computation required, and it is questionable whether your great-grandchildren will live to see it reach results. (I'm being hyperbolic, or maybe not!)

        If you are in situation 3 and you are confident that very little of your data is incorrect, you can pretend you are in situation 2, act accordingly, and just acknowledge that your model is based on flawed data and this is the best you can do. Most data sets, after all, contain some erroneous data; we just aren't aware of exactly where the errors are.


        • #5
          Thank you, Clyde Schechter. Your explanation helps me understand the problem better. I am facing situation 1, so it makes sense that in my data outcome 1 never occurs in some countries and years. However, what I do not understand is why the "perfect failure prediction" only occurs when I include fixed effects for country and year. When I run the logit model without fixed effects, I do not face this issue. This is why I am thinking of running my models without fixed effects first, and then including fixed effects as a robustness check. Would you agree with this approach?


          • #6
            Well, the variables that are doing perfect prediction, the combinations of year and country wherein success is impossible, simply don't appear until you include them in the model, i.e. when you introduce fixed effects.

            I can't really advise you about the suitability, or not, of running the models without fe as a robustness check. In general, ignoring heterogeneity among the countries or over time is not a good idea, and the "robustness check" can indeed end up telling you that your findings are very sensitive to fe vs non-fe analysis. In fact, it is not even a legitimate robustness check, because plain coefficients from logistic regression and fixed-effects logistic regression estimate two different things. In fixed-effects regression you get estimates of effects purely within country: if a country has two different values of X at two different points in time, what is the expected difference in the probability of a success outcome? By contrast, there are also between-country effects: if country A and country B have different values of X, what is the expected difference in the probability of a success outcome? These two effects can be different--very different, and even opposite in sign. Plain logistic regression coefficients estimate a blend of the between- and within-country effects. So -fe- is not a good robustness check.
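
            As a rough illustration of the two estimands, the pooled and within-country estimators can be fitted side by side. The names outcome and x are hypothetical, and -xtlogit, fe- (conditional fixed-effects logit) is used here as one way to get a within-country estimate; it is not the indicator-variable logit from post #1.

            ```stata
            * Hypothetical names: outcome (0/1), x
            xtset country_id year

            * Pooled logit: blends between- and within-country variation
            logit outcome x, vce(cluster country_id)
            estimates store pooled

            * Conditional fixed-effects logit: uses within-country variation only
            * (countries whose outcome never varies are dropped here as well)
            xtlogit outcome x, fe
            estimates store within

            * Put the two sets of coefficients and standard errors next to each other
            estimates table pooled within, b se
            ```

            A large gap between the two coefficients on x, or a change of sign, would be a sign that between-country differences drive the pooled estimate.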

            Whether a non-fe analysis of this data is of any use at all depends on what the context is, what the variables mean, and what the specific research question is. It would be an uncommon situation where a non-fe and an fe analysis are both appropriate to the research goals.


            • #7
              Thank you very much for all the explanations. It helps me a lot!
