  • Interpretation of Dummy Variables as a predictor variable

    Hi,

    I'm running a logistic regression and both my DV and main IV are dummy variables.
    My dependent variable is coded 0 when there is an improvement in performance, and 1 when there is a failure to improve.

    My predictor variable represents past failures (coded 1) and past successes (coded 0).

    What I'm trying to see is how past failures affect future failures.

    I attached my results and hope someone can help me understand how to read them. Thank you!
    Attached Files

  • #2
    The interpretation of these results is made complicated by two factors: 1) you have output the coefficients instead of the odds ratios, and 2) this is a conditional fixed effects regression. If you re-run these models with the -or- option specified, you will get the odds ratios instead of the coefficients. Since you didn't, we can calculate them by applying the exp() function. OR = exp(coefficient).

    You don't say which of your variables is the predictor of interest--you have several predictors and the names are not suggestive. To illustrate the approach, in the first model I'll assume it's the first variable listed in the output, previouslea~3. When we exponentiate the coefficient and the lower and upper confidence limits, we get an OR of 0.79 with a 95% CI from 0.67 to 0.93 (all rounded to 2 decimal places). Now, with a conditional logistic regression, the inferences are strictly within the grouping variable. So given a single idistituzione, the odds of failure for those with previouslea~3 = 1 are estimated to be 0.79 times the odds of failure for those with previouslea~3 = 0. Note that with a conditional fixed-effects model you cannot obtain any estimate of the actual odds in either condition, only the odds ratio between the two conditions. With a 95% CI of 0.67 to 0.93, we can say that this is a random interval generated by an algorithm that produces an interval containing the true population odds ratio in 95% of random samples from that population.
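    To verify that arithmetic, here is the exponentiation step in Python, with an illustrative coefficient and confidence limits chosen so that they reproduce the rounded figures above (these are stand-ins, not the actual values from the attachment):

```python
import math

# Illustrative log-odds estimates (stand-ins, not the attached output):
coef, lo, hi = -0.236, -0.400, -0.073

# OR = exp(coefficient); the same transformation applies to each confidence limit
or_est, or_lo, or_hi = (round(math.exp(x), 2) for x in (coef, lo, hi))
print(or_est, or_lo, or_hi)  # → 0.79 0.67 0.93
```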

    An analogous interpretation applies to each of the variables in either model. To save yourself the trouble of calculating the exponentials, I suggest you re-run both models with the -or- option specified to get the odds ratios and their confidence intervals directly.

    Note also that while you can exponentiate coefficients and confidence limits to get the OR and its confidence limits, exponentiating the standard error, the z-statistic, or the p-value will not give you the corresponding statistic for the odds ratio. The z-statistic and p-value are the same for both coefficient and OR. The standard error of the odds ratio is a bit more complicated to calculate, and I won't go into it here since Stata will do it for you when you re-run with -or- specified.
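    For the curious, the standard error Stata reports for an odds ratio comes from the delta method: SE(OR) ≈ OR × SE(coefficient). A minimal sketch with an illustrative coefficient and standard error (again stand-ins, not values from the attachment):

```python
import math

# Illustrative values on the log-odds scale (stand-ins):
b, se_b = -0.236, 0.084

or_est = math.exp(b)     # odds ratio
se_or = or_est * se_b    # delta-method standard error of the OR

# Exponentiating the standard error directly gives a different, meaningless number:
print(round(se_or, 3), round(math.exp(se_b), 3))  # → 0.066 1.088
```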



    • #3
      Thank you, Clyde. I'll run all the regressions again with -or- added at the end; that will be much better. Your explanation of how to interpret the dummy variable in a conditional fixed-effects model was great; now I finally understand.
      This blog is the most precious resource for a Ph.D. student!



      • #4
        OK, now with OR it is much better, and yes, my predictor variables are previous3 and previous1 (I'm using different lags).
        Do you think I should prefer the predictions of the model with the better log likelihood? (Which should be the highest one, if I've understood correctly.)

        In the two examples I attached I have:

        1) In one case the log likelihood is -1413.69 and the prob>chi2 = 0.0015, with an LR chi2 of 52.54.

        2) In the other case the log likelihood is lower, -1504.87, but the prob>chi2 = 0.0000, with an LR chi2 of 186.74 (which seems to be the value that changed the most).

        Should I prefer number one even if, in number two, prob>chi2 = 0.0000?

        Attached Files



        • #5
          The comparison of log-likelihoods between different models is not meaningful, except when those models are nested, which yours aren't. (Even then, I wouldn't select a model just based on the log-likelihoods.)
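          For nested models, the right tool is a likelihood-ratio test rather than eyeballing log-likelihoods (in Stata, -lrtest- does this after storing estimates). A minimal sketch of the arithmetic, with made-up log-likelihoods and one added parameter (not the models in this thread):

```python
import math

# Made-up log-likelihoods for two *nested* models (not from this thread):
ll_reduced = -1510.2   # model without the extra predictor
ll_full = -1504.9      # same model plus one extra predictor

lr_chi2 = 2 * (ll_full - ll_reduced)   # likelihood-ratio statistic
# p-value for a chi-square with 1 degree of freedom (one added parameter)
p = math.erfc(math.sqrt(lr_chi2 / 2))
print(round(lr_chi2, 1), p < 0.05)  # → 10.6 True
```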

          In any case, if your purpose is to select the model in which previouslearn* has the largest effect, then the statistic to look at is its associated odds ratio. The second model has an OR of about 0.45, whereas the first has an OR of 0.79. So in the second model, previouslearn* has more of an effect (OR further from 1) than the first. Not only that, if you look at the associated confidence intervals, each confidence interval excludes the estimated OR in the other model. That's a pretty clear signal that the difference between those ORs is not just noise.

          So, if that's your goal, I'd go with the second model. Now, of course, that isn't necessarily your goal. I just guessed that based on the context of your post and the earlier ones in this thread. If the purpose at hand is different, post back explaining it, and I'll try to guide you accordingly.



          • #6
            Thank you, Clyde!

            Sorry for not being clear about the model: I'm trying 3 models, which represent 3 different lags; I previously attached just two of them. In the first I want to explore how previous experience of failure at t-1 affects experience at time t. In the second the independent variable is lagged to t-2, and in the third to t-3.
            You wrote "not only that, if you look at the associated confidence intervals, each confidence interval excludes the estimated OR in the other model. That's a pretty clear signal that the difference between those ORs is not just noise" which sounds really important but I'm not sure I got it. Could you explain it to me?
            Many thanks as usual!
            Attached Files



            • #7
              I attached a table of results (table of results 3) complete with all the models but with no OR, and then the 3 models (each with a different lag) with OR.



              • #8
                So, the first thing I see is that within each of those separate files, the coefficient for the lagged variable doesn't change much even as you change the other variables in the model. That simplifies things a bit, because we don't have to pay much attention to those other variables and can just pick one version and compare the different lags.

                Clearly the odds ratio of 0.42 using t-1 is the most different from 1. The t-2 odds ratio of 0.87 is much closer to 1, and the t-3 odds ratio of 0.79 is in the same ballpark as the t-2 odds ratio. So if the purpose of the comparison is to identify which lag is most influential, it looks like t-1 wins.

                The point about the confidence intervals is this. The confidence interval around the 0.42 OR is 0.36 to 0.49. The confidence interval around the 0.87 OR is 0.76 to 1.0. Notice that 0.42 is not between 0.76 and 1.0; nor is 0.87 between 0.36 and 0.49. So to the extent that we can think of confidence intervals as showing us a range of imprecision around our estimates, we see that neither of these ORs is within the range of uncertainty of the other. So it's not just that the models provide only vague, fuzzy estimates of odds ratios that might really not differ. The estimates are pretty precise, and the difference between the estimates is appreciably larger than the imprecision of the estimates. Exactly the same thing can be said about the contrast between the 0.42 OR estimate and the 0.79 OR estimate from the t-3 model.
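                Spelled out in code, using only the rounded figures quoted above (nothing here beyond the numbers already in this post):

```python
# ORs and 95% CIs for the lagged predictor, as quoted above (rounded)
or_t1, ci_t1 = 0.42, (0.36, 0.49)
or_t2, ci_t2 = 0.87, (0.76, 1.00)

# Each point estimate lies outside the other model's confidence interval:
print(ci_t2[0] <= or_t1 <= ci_t2[1])  # → False: 0.42 is below 0.76
print(ci_t1[0] <= or_t2 <= ci_t1[1])  # → False: 0.87 is above 0.49
```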

                So I think you are on pretty solid ground in concluding that the t-1 lag provides stronger prediction than the t-2 or t-3 lag. And, as noted in the first paragraph here, this remains the case regardless of which other variables you include in the model.



                • #9
                  Hi Clyde, hope you're well!

                  I have a question about the fixed-effects regression model I showed you in this post (see above). The referee asked me to run a post-estimation check, and I was wondering if you could help me choose the appropriate one. Thank you so much!



                  • #10
                    Did the reviewer say what kind of post-estimation check he/she wants you to do? I really don't know what he/she has in mind.



                    • #11
                      He wrote to choose something like "area under the ROC curve, false negatives, false positives, etc.", but which one is supposed to be the best?
                      Thank you!



                      • #12
                        Well, none of those things really applies after a fixed-effects regression. The main issue is that a fixed-effects regression does not predict the probability of the outcome. Rather, within each group, it predicts, for each observation, the probability that this particular observation, as opposed to the others in the group, will have a 1 outcome, conditional on there being the actually observed total number of positive outcomes in the group. So, while you could do something like an ROC curve calculation, it would not really have the same meaning that it does when applied to an unconditional logistic regression.

                        If you look at the options for -predict- available after -xtlogit, fe-, you will see that outcome probability is not among them--precisely because it is not estimable from this kind of model. You can choose from two probabilities: pc1 and pu0. pc1 is the probability of a positive outcome conditional on the group having exactly one positive outcome (which in your data is probably counterfactual). And pu0 is the probability of a positive outcome conditional on the group's fixed effect being zero. This could be used to do an ROC-like calculation (i.e. -predict pu0, pu0- followed by -roctab outcome pu0-). What it would give you is a measure of the ability of the model to discriminate positive from negative outcomes within a single group. But it would say nothing about cross-group prediction.

                        So it's ROC-like, but it doesn't have the usual interpretation of an ROC curve. For example, the two-point forced-choice probability interpretation of the ROC area would not apply, except in the limited circumstance where you know the two "points" come from the same group.
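                        To make the forced-choice interpretation concrete: the ROC area equals the probability that a randomly chosen positive observation receives a higher predicted value than a randomly chosen negative one. A minimal sketch with made-up predicted probabilities standing in for the pu0 predictions (not real output from any model in this thread):

```python
def roc_area(scores_pos, scores_neg):
    # Probability that a random positive outscores a random negative,
    # counting ties as half a win (the Mann-Whitney form of the ROC area).
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up predicted probabilities within one group:
pos = [0.90, 0.80, 0.55]   # observations with outcome = 1
neg = [0.70, 0.40, 0.30]   # observations with outcome = 0
print(roc_area(pos, neg))  # → 0.888... (8 of the 9 pairs are ordered correctly)
```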



                        • #13
                          Thank you so much! Is there a way I can cite you in the paper and to the referee? Is it possible to cite the helpful things coming out of this blog?



                          • #14
                            It is, of course, possible to cite a blog. The journal's "Information for Authors Page" should provide you with instructions regarding the format for that. Most journals require that you obtain the consent of the person you wish to acknowledge, and different people have different preferences about this sort of thing. I will respond to you by a private message regarding this particular circumstance.

