  • Interpreting the constant and very high SE in logistic regression

    Dear all,

    I'm using cross-section logistic regression to model whether an organization signs a UN initiative (= 1) or not (= 0), given a variety of independent variables, some binary and some continuous. Here is a screenshot of my results for the final year of the analysis:
    [screenshot statalist.png: regression output for the final year]

    The model fares well in terms of the area under the ROC curve, linktest, and estat gof, which are the ways of testing the goodness of fit and specification of a logistic model known to me. However, the _cons estimate is very high. It is similar to the above in years 2-5 of the sample. In year one it looks different, like this:
    [screenshot statalist2.png: regression output for year 1]
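    For reference, the diagnostics mentioned above are run along these lines (the dependent variable and regressors are placeholders):

        logit signed x1 x2 x3          // yearly cross-section model
        lroc                           // area under the ROC curve
        linktest                       // specification test
        estat gof, group(10) table     // Hosmer-Lemeshow goodness-of-fit test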

    I can't find an explanation of the significance of the _cons estimate that I can understand. Why are the values so high? Does it matter? What does it mean? What does it mean that the value is significant in all years aside from year 1? Also, why might the SEs be so very high in the first year, although they are all very similar and low in the other years?
    And should I be worried about the note "98 failures and 2 successes completely determined." that only appeared in year 1? Are the constant, the SEs, and the note related? If so, how?

    I would really appreciate any (possibly simple) explanations, or a pointer to something I can read that will help me understand the high _cons and the one rogue year. I can upload any other information you need, summary statistics etc.
    Many many thanks in advance
    Sue

  • #2
    The constant is the predicted value when all the X variables = 0. This may not even be possible: e.g., you can't weigh 0 pounds, and you can't get a score of zero on a scale that runs from 400 to 1200. If you want a more sensible/possible value, you might do something like center all the continuous X variables, i.e. subtract the mean from each case. Then the constant would represent the predicted score for a person who had average values on all the Xs. In general, you don't need to worry much about the value of the constant unless you have good reason to think it is implausible.
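    For instance, a minimal sketch of centering in Stata (variable names are hypothetical):

        summarize assets, meanonly
        generate c_assets = assets - r(mean)    // center at the sample mean
        logit signed i.binaryvar c_assets       // _cons is now the log-odds at mean assets (and binaryvar = 0)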
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      I agree with Richard.

      A good reason for not saying much about the intercept is that it is often not of substantive interest. Recall that the prediction of the response would be based on the intercept alone if and only if all predictors were set to 0. Without knowing all of your variables, it is entirely possible that this case is way outside the range of the data. (Do all the predictors ever attain 0?) Also, the intercept is on an inverse logit scale, so a large negative value corresponds to a probability very near zero: I get that invlogit(-50) is about 2e-22, i.e. minute. At that end of the scale, the SE could reflect minor numerical difficulties as much as anything substantive.
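      For example:

          display invlogit(-50)    // about 2e-22, i.e. essentially zero
          display invlogit(0)      // .5
          display invlogit(-5)     // about .0067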

      Otherwise put, it's a feature that the model (almost) predicts zero probability for extreme predictor values.

      I'd worry much, much more about finding a more parsimonious model, but that's largely statistical taste.

      Last edited by Nick Cox; 15 Oct 2015, 11:35.



      • #4
        Okay, there is no possibility of all the x variables being 0; in every case some variables are always nonzero (there are no organizations with zero assets under management, and none with a GLOBE score of zero). So the constant is of little relevance?

        I do have a reduced model in which the SE inconsistency disappears, part of the output is here:
        [reduced model for statalist.png: output from the reduced model]

        Is there any way I can find out the reason for the high SEs in the full model? I use the full model, in which every predictor corresponds to an element in my theoretical framework (the predictors are proxies for sources of salience), and then the reduced model as a robustness test for the full model, keeping only the variables that were significant in the full model's output in the last year of the data. I've been told by my supervisor that a reduced model can serve as a robustness test for a full model in that way. I've had real trouble finding robustness tests for logistic regression, and that's all I've got for the time being. Any advice or pointers to resources on that would also be extremely helpful.
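        In Stata terms, the procedure looks roughly like this (placeholder variable names):

            * full model: every predictor from the theoretical framework
            logit signed x1 x2 x3 x4 x5
            estimates store full
            * reduced model: only the predictors significant in the full model
            logit signed x1 x3
            estimates store reduced
            * compare coefficients and standard errors side by side
            estimates table full reduced, b se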

        best wishes,
        Sue



        • #5
          Dear Sue,

          I agree with the comments above, but I also note that you have some regressors with very large positive coefficients. I do not know what these regressors are, so it may just be that you have regressors on an inappropriate scale, or a model that generates some predictions that are essentially zero and others that are essentially one.
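          One quick way to check this is to inspect the fitted probabilities after estimation, e.g. (placeholder names):

              logit signed x1 x2 x3
              predict phat, pr           // fitted probabilities
              summarize phat, detail     // a pile-up near 0 and 1 goes with very large coefficients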

          All the best,

          Joao



          • #6
            To echo Joao, the SEs themselves may just reflect the different magnitudes and units of measurement of the regressors. Just as in linear regression, this is one reason for not worrying too much about them, and it is why z (cf. t) values are there to wash all that out for you.



            • #7
              There is a discussion of how to interpret the constant in logistic regression here: http://www.stata-journal.com/sjpdf.h...iclenum=st0251
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Dear all,
                okay, thank you Maarten for the link. Having read up on the constant, it appears that it shouldn't be meaningful in this case, as 0 is not a meaningful value for all the regressors, and most are factor variables.

                However, I would like to ask some questions about the large coefficients. In the earliest year (2007), the IVs with the large coefficients are factor variables, which rules out different magnitudes as the explanation. Moreover, these variables happen to be time-invariant, so they do not change between the years, yet the coefficients are normal in all years except 2007. Why could this be?

                Nick, thank you for drawing my attention to the z values. I have found this explanation of the z values in logistic regression: http://logisticregressionanalysis.co...ic-regression/
                Is it accurate? If so, do I understand correctly that the z value is another measure of significance, with a value below -2 or above 2 meaning that the regressor is significant? And are you effectively saying that the problematic variables appear to have no explanatory power anyway, so the z values should 'wash them out' of the target, more parsimonious model?

                Joao, could you perhaps elaborate on the suggestion that the large coefficients may be due to predictions that are essentially 0 or 1? What does that mean exactly? For example, two regressors with very high coefficients are the leftvotes and greenvotes variables: the percentage of votes in a country cast for its left-wing party and its green party, respectively. They are continuous variables measured in percentages. Although positive outcomes (dependent variable = 1) strongly tend to occur in cases where leftvotes = 0, not all positive outcomes occur when leftvotes = 0; some also occur at 30% and even 50% leftvotes. The same goes for greenvotes. There is no inflated VIF score. In this case, why are the coefficients so high compared with all the other ones in the model?

                Again, thank you all for your answers, and sorry it took me a few days to process what you said, read up on things, and come back with more questions.
                Best wishes,
                Sue



                • #9
                  I strongly recommend working through any text on logistic regression by authors with a solid reputation that you find congenial.
                  Last edited by Nick Cox; 19 Oct 2015, 12:02.



                  • #10
                    Nick, that's definitely a fair point! Can I just ask you to clarify what you mean by the z values 'washing it all out'? Is it that the problematic variables appear to have no explanatory power anyway, so the z values should 'wash them out' of the target, more parsimonious model?



                    • #11
                      I meant this: As estimates and their SEs are on the same scale, the ratios estimate/SE wash out any side-effects of using particular units, or equivalently any particular conventions about reporting magnitudes. (An example of the latter is whether you use proportions (0-1) or percents (0-100) for reporting something when there is a choice.)

                      This is all on a par with 10 feet/5 feet = 3.04 m/1.52 m = 2. If you are comparing heights or lengths, you need not worry about units of measurement if you look at ratios. That is (much of) the rationale [pun intended] for z and t statistics.
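                      A quick illustration with Stata's bundled auto data:

                          sysuse auto, clear
                          logit foreign weight               // weight in pounds
                          generate weight_t = weight/2000    // pounds -> (short) tons
                          logit foreign weight_t             // coefficient and SE are 2000 times those on weight...
                          * ...but z = coefficient/SE is identical in the two runs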
                      Last edited by Nick Cox; 19 Oct 2015, 12:33.



                      • #12
                        Okay, so I did understand; I just can't explain it as expertly as you. Thank you again, Nick!

                        I've managed to scale down the coefficients by generating greenvotes and leftvotes variables in which the percentage values are multiplied by 100.

                        I've also managed to fix the rogue variables in year 2007. It appears that the issue was that the two variables described the vast majority of the data in that year, and the remaining observations were largely uniform. So I split noncorpgovcombined into two different variables and used one of them, which still explains what I need it to but gives me more comparable coefficients.



                        • #13
                          Dear Sue,

                          From what I understand, the large coefficients were caused by the scale of the regressors and the problem is solved, right?

                          Joao



                          • #14
                            Hello Joao,
                            yes, they were caused by two different things in the two cases. One was indeed the scale, and thank you all for pointing me to that one! The other was an unfortunate distribution of outcomes across the variables I chose and didn't choose, which I've now also found a solution to.
                            Thank you all again for your inputs.

                            best wishes,
                            Sue

