  • Multinomial logistic regression

    Hi all,
    I am working with a data set (sample below) that contains, among other things, four binary variables (stage1 - stage4) that indicate whether or not a particular theoretical construct occurred. These are my dependent variables. It also contains sex, a binary variable (0 = male, 1 = female); the nominal variable country (1 = USA, 2 = Korea, 3 = Japan, 4 = Jamaica, 5 = El Salvador); and the ordinal variable ageGroup, which codes four age groups (values 1 - 4) that correspond to the theory. I will not be using age as a continuous variable, as this would deviate from the theory.
    Here is an example of my command:
    Code:
    mlogit stage1 i.country i.sex ageGroup
    and the output:
    Code:
    Iteration 0:   log likelihood = -562.03133  
    Iteration 1:   log likelihood =  -503.4806  
    Iteration 2:   log likelihood = -502.02042  
    Iteration 3:   log likelihood = -502.01685  
    Iteration 4:   log likelihood = -502.01685  
    
    Multinomial logistic regression                 Number of obs     =        949
                                                    LR chi2(6)        =     120.03
                                                    Prob > chi2       =     0.0000
    Log likelihood = -502.01685                     Pseudo R2         =     0.1068
    
    ------------------------------------------------------------------------------
          stage1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    0            |  (base outcome)
    -------------+----------------------------------------------------------------
    1            |
         country |
              2  |   .0360518   .2247894     0.16   0.873    -.4045272    .4766309
              3  |  -.1831853   .3148928    -0.58   0.561    -.8003637    .4339932
              4  |   .4990308   .2364217     2.11   0.035     .0356527    .9624088
              5  |   -.282688   .3400538    -0.83   0.406    -.9491813    .3838052
                 |
           1.sex |  -.4441405   .1561065    -2.85   0.004    -.7501036   -.1381774
        ageGroup |  -1.033842   .1158706    -8.92   0.000    -1.260944   -.8067396
           _cons |   2.442984   .3803291     6.42   0.000     1.697553    3.188416
    ------------------------------------------------------------------------------
    My questions are:
    1) How do I run the model without using the USA and the first age group as the base? I want to be able to report the coefficients for all countries and age groups, not just what they are based on the "1" category.
    2) Does this test fit? When working with a binary predictor and multiple nominal predictors, multinomial logistic regression came up, but am I going down the wrong road?

    Thanks!


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte age str6 gender str11 location byte(stage1 stage2 stage3 stage4) str21 know float(ageGroup country sex knowSci)
    10 "Female" "NC" 1 1 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 1 1 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 1 0 0 0 "No"      3 1 1  0
    10 "Female" "NC" 1 0 1 0 "NO DATA" 3 1 1 -1
    10 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    10 "Female" "NC" 0 1 0 0 "Yes"     3 1 1  1
    10 "Female" "NC" 1 1 1 0 "Yes"     3 1 1  1
    10 "Female" "NC" 0 0 1 0 "Yes"     3 1 1  1
    10 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    10 "Female" "NC" 1 0 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 0 0 0 1 "Yes"     3 1 1  1
    10 "Female" "NC" 0 0 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 1 0 1 0 "Yes"     3 1 1  1
    10 "Female" "NC" 0 0 0 1 "Yes"     3 1 1  1
    10 "Female" "NC" 1 0 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 0 1 1 1 "Yes"     3 1 1  1
    10 "Female" "NC" 0 1 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 1 0 1 "Yes"     3 1 1  1
    11 "Female" "NC" 1 1 1 1 "Yes"     3 1 1  1
    11 "Female" "NC" 0 1 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 1 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 1 0 0 "No"      3 1 1  0
    11 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 0 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 1 0 "No"      3 1 1  0
    11 "Female" "NC" 1 0 0 1 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 0 1 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 1 0 1 "No"      3 1 1  0
    11 "Female" "NC" 1 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 1 0 0 0 "Yes"     3 1 1  1
    11 "Female" "NC" 0 1 0 0 "Yes"     3 1 1  1
    12 "Female" "NC" 0 1 0 1 "Yes"     3 1 1  1
    12 "Female" "NC" 0 1 0 1 "No"      3 1 1  0
    12 "Female" "NC" 0 1 0 1 "No"      3 1 1  0
    12 "Female" "NC" 0 1 1 0 "Yes"     3 1 1  1
    12 "Female" "NC" 1 0 1 1 "Yes"     3 1 1  1
    12 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    12 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    12 "Female" "NC" 0 0 1 0 "Yes"     3 1 1  1
    12 "Female" "NC" 0 0 1 1 "Yes"     3 1 1  1
    12 "Female" "NC" 0 1 1 0 "Yes"     3 1 1  1
    12 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    12 "Female" "NC" 1 1 1 0 "No"      3 1 1  0
    12 "Female" "NC" 1 0 0 0 "No"      3 1 1  0
    12 "Female" "NC" 1 0 1 0 "No"      3 1 1  0
    12 "Female" "NC" 0 0 0 0 "No"      3 1 1  0
    12 "Female" "NC" 0 0 0 1 "No"      3 1 1  0
    12 "Female" "NC" 1 0 1 0 "Yes"     3 1 1  1
    12 "Female" "NC" 0 0 0 0 "No"      3 1 1  0
    13 "Female" "NC" 1 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    13 "Female" "NC" 0 0 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "NO DATA" 3 1 1 -1
    13 "Female" "NC" 1 0 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    13 "Female" "NC" 1 1 1 1 "Yes"     3 1 1  1
    13 "Female" "NC" 0 0 0 0 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 0 0 1 "Yes"     3 1 1  1
    13 "Female" "NC" 0 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "NO DATA" 3 1 1 -1
    13 "Female" "NC" 0 0 1 0 "Yes"     3 1 1  1
    13 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 0 0 "Yes"     3 1 1  1
    13 "Female" "NC" 0 1 1 0 "No"      3 1 1  0
    13 "Female" "NC" 0 1 1 1 "No"      3 1 1  0
    13 "Female" "NC" 0 0 1 0 "No"      3 1 1  0
    14 "Female" "NC" 0 0 1 0 "No"      4 1 1  0
    14 "Female" "NC" 0 0 0 0 "No"      4 1 1  0
    14 "Female" "NC" 1 0 0 0 "No"      4 1 1  0
    14 "Female" "NC" 0 1 0 1 "No"      4 1 1  0
    14 "Female" "NC" 0 0 1 0 "No"      4 1 1  0
    14 "Female" "NC" 0 1 0 0 "Yes"     4 1 1  1
    14 "Female" "NC" 0 1 1 1 "Yes"     4 1 1  1
    14 "Female" "NC" 0 1 0 0 "No"      4 1 1  0
    14 "Female" "NC" 0 0 1 0 "No"      4 1 1  0
    14 "Female" "NC" 0 0 0 0 "No"      4 1 1  0
    14 "Female" "NC" 0 0 1 0 "No"      4 1 1  0
    14 "Female" "NC" 0 1 1 1 "No"      4 1 1  0
    14 "Female" "NC" 0 1 1 1 "NO DATA" 4 1 1 -1
    14 "Female" "NC" 1 1 1 1 "No"      4 1 1  0
    14 "Female" "NC" 0 1 1 1 "No"      4 1 1  0
    14 "Female" "NC" 0 1 0 1 "No"      4 1 1  0
    end
    Last edited by Lee Kenneth; 10 Jan 2020, 15:48. Reason: I realized ageGroup was ordinal, not nominal

  • #2
    1) How do I run the model without using the USA and the first age group as the base? I want to be able to report the coefficients for all countries and age groups, not just what they are based on the "1" category.

    You can't. It doesn't even make any sense. The complete set of country indicators plus the constant term form a collinear set. Consequently the model including all of these is not identified. The omission of one of these breaks the collinearity, resulting in an identified model, but the coefficients you get are just arbitrary numbers that depend on which one you choose to omit. None of these coefficients has any meaning other than relative to the others. You can create the appearance of having some information about every country if you specify the -noconstant- option, but it is just an illusion. If you want to get predicted probabilities of the outcome for every country, you can do that with the -margins- command, even though one country is omitted from the regression itself. But there is no meaningful sense in which you can get a coefficient for every country.
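
    A minimal sketch of the -margins- route described above, assuming the variable names from #1 (stage1, country, sex, ageGroup). The base level picked with ib3. is arbitrary and only for illustration; the fitted model, and therefore the predicted probabilities from -margins-, are the same whichever country is the base.

    ```stata
    * Refit with Japan (country == 3) as the base level instead of the USA
    logit stage1 ib3.country i.sex ageGroup

    * Average predicted probability of stage1 == 1 for each of the five
    * countries, including the one omitted from the coefficient table
    margins country
    ```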


    2) Does this test fit? When working with a binary predictor and multiple nominal predictors, multinomial logistic regression came up, but am I going down the wrong road?
    Well, I wouldn't say you're going down a completely wrong path, but I don't see the point of using -mlogit- here. You can get the same results more simply and cleanly with just -logit-. You have four separate dichotomous ("binary") outcome variables. From your data it is clear that they are not simply four levels of a single categorical variable: they are four different outcome variables. So since your outcome is dichotomous, the simplest and most natural approach is with -logit-, not -mlogit-. (If you try it you will see that -logit- gives you the same results, but the output is more parsimonious.)
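
    As a sketch, assuming the four outcomes are stored next to each other in the dataset as stage1 through stage4 (as in the -dataex- listing in #1), the same specification can be fit to each outcome with -logit- in a loop:

    ```stata
    * Fit one binary logistic regression per outcome variable
    foreach y of varlist stage1-stage4 {
        display _newline "Outcome: `y'"
        logit `y' i.country i.sex ageGroup
    }
    ```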


    • #3
      Thanks, Clyde.
      Can you clarify a little or help me interpret? Sorry if it is naive, but I am looking at similar studies and they report their findings in a format shown below. How is this data generated?
      I would create this table (or similar) for each of the four binary constructs.
      [Image attachment: 2020-01-11_10-07-57.png]
      Last edited by Lee Kenneth; 10 Jan 2020, 18:11.


      • #4
        There are many ways of showing the results of a logistic regression. To get the kind you refer to in #3, use -logistic- instead of -logit- and the output will look like that (plus a z-statistic and p-value in the columns between standard error and 95% CI).

        As for putting stars on the ones that are "significant at the p < 0.05 level" I strongly discourage you from doing that. The American Statistical Association has recommended that the concept of statistical significance be abandoned altogether. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr. Even if you want to ignore that advice, the use of significance stars is a particularly egregious practice that really emphasizes all of the worst aspects of the statistical significance concept and offers none of its positive aspects. If you want to include p-values in your output table, that is far preferable to significance stars: at least they convey a little bit of useful information. And, as already noted, the p-values will be there in the -logistic- output. So just don't delete them.


        • #5
          Originally posted by Clyde Schechter
          There are many ways of showing the results of a logistic regression. To get the kind you refer to in #3, use -logistic- instead of -logit- and the output will look like that (plus a z-statistic and p-value in the columns between standard error and 95% CI).
          Awesome, thank you. When I run this, it only shows countries 2 - 5, meaning the USA is not shown. How do I address this? Gender only shows female, but that makes sense to me (if male is .64 more likely, then females are .64 less likely, correct?).

          Code:
          . logistic stage1 i.country i.sex ageGroup
          
          Logistic regression                             Number of obs     =        949
                                                          LR chi2(6)        =     120.03
                                                          Prob > chi2       =     0.0000
          Log likelihood = -502.01685                     Pseudo R2         =     0.1068
          
          ------------------------------------------------------------------------------
                stage1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
               country |
                    2  |    1.03671   .2330413     0.16   0.873     .6672922    1.610639
                    3  |   .8326139   .2621841    -0.58   0.561     .4491656    1.543408
                    4  |   1.647124   .3894159     2.11   0.035     1.036296    2.617995
                    5  |   .7537549   .2563172    -0.83   0.406     .3870578    1.467859
                       |
                 1.sex |   .6413753   .1001228    -2.85   0.004     .4723176    .8709441
              ageGroup |    .355638    .041208    -8.92   0.000     .2833864    .4463108
                 _cons |   11.50733   4.376572     6.42   0.000     5.460568    24.24997
          ------------------------------------------------------------------------------
          Note: _cons estimates baseline odds.


          • #6
            Ok, I did some more reading on dummy variables. I created five new variables (countryDummy) and set each to 1 if the observation was from that country (zero otherwise). Does this make more sense? Also, do you know why El Salvador was omitted?
            Code:
            logit stage1 i.usaDummy i.koreaDummy i.japanDummy i.jamaicaDummy i.esDummy i.sex ageGroup
            
            note: 1.esDummy omitted because of collinearity
            Iteration 0:   log likelihood = -562.03133  
            Iteration 1:   log likelihood =  -503.4806  
            Iteration 2:   log likelihood = -502.02042  
            Iteration 3:   log likelihood = -502.01685  
            Iteration 4:   log likelihood = -502.01685  
            
            Logistic regression                             Number of obs     =        949
                                                            LR chi2(6)        =     120.03
                                                            Prob > chi2       =     0.0000
            Log likelihood = -502.01685                     Pseudo R2         =     0.1068
            
            --------------------------------------------------------------------------------
                    stage1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            ---------------+----------------------------------------------------------------
                1.usaDummy |    .282688   .3400538     0.83   0.406    -.3838052    .9491813
              1.koreaDummy |   .3187399   .3797812     0.84   0.401    -.4256175    1.063097
              1.japanDummy |   .0995028   .4420406     0.23   0.822    -.7668809    .9658864
            1.jamaicaDummy |   .7817188   .3848565     2.03   0.042      .027414    1.536024
                 1.esDummy |          0  (omitted)
                     1.sex |  -.4441405   .1561065    -2.85   0.004    -.7501036   -.1381774
                  ageGroup |  -1.033842   .1158706    -8.92   0.000    -1.260944   -.8067396
                     _cons |   2.160296   .4593863     4.70   0.000     1.259916    3.060677
            --------------------------------------------------------------------------------


            • #7
              For your first question, please re-read my response in #2. The only way you can get that is if you add the -noconstant- option, but then what you have is something of an illusion rather than a reality. Those would not be odds ratios. They would be odds, and they would only apply when sex = 0 and ageGroup = 0, so not very useful. It is a mathematical impossibility to have a set of odds ratios for all of the countries. If you think you have seen that somewhere in the literature, then either you are misreading what you saw or what was shown was misrepresented.


              Gender only shows female, but that makes sense to me (if male is .64 more likely, then females are .64 less likely, correct?).
              This is sloppy use of language at best, so I can't tell if your intended meaning is correct or not--I suspect it isn't. The correct interpretation here is that the odds of stage1 for a female are only 0.64 times the odds of stage1 for a male, all else being equal. Another way of saying this is that the odds of stage 1 for a female are 36 percent lower than the odds of stage 1 for a male, all else equal.
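
              The arithmetic behind these figures can be checked with -display-, using the 1.sex coefficient reported in the -mlogit- and -logit- output above (-.4441405). The values in the comments are rounded.

              ```stata
              * Odds ratio, female vs. male: exponentiate the logit coefficient
              * (about 0.641, matching the 1.sex row of the -logistic- table)
              display exp(-.4441405)

              * Percent by which the odds are lower for females (about 36)
              display 100*(1 - exp(-.4441405))

              * Odds ratio, male vs. female, is the reciprocal (about 1.56)
              display 1/exp(-.4441405)
              ```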

              Added: Crossed with #6. The above responds to #5. My response to #6 is below:

              As I have said starting in #2, one of the country indicators ("dummies") or the constant term must always be omitted. It is mathematically impossible to keep them all in. This is not some peculiarity of Stata; it is linear algebra and there is no getting around it. As for why, in particular, it was esDummy that got omitted, I cannot say for sure. Stata makes up its own mind about that. In general it seems to usually drop the last one in the list of predictor variables, but I have seen exceptions to that behavior. If you want to control which one gets dropped when using factor variable notation, you can do so with the ib#. notation (ibn. specifies no base level at all). (Read -help fvvarlist- for details.) But you need to give up on the idea of keeping them all. You can also drop the constant by using the -noconstant- option, but then what you have will be odds, not odds ratios, for the countries, and, again, odds that are only correct for sex and ageGroup both 0.
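
              A sketch of the options discussed here, assuming the variables from #1. In factor-variable notation, ib#. selects which level serves as the base and ibn. requests no base level at all. Which country to use as the base is an arbitrary choice.

              ```stata
              * El Salvador (country == 5) as the base: same parameterization as
              * the hand-made dummies in #6
              logit stage1 ib5.country i.sex ageGroup

              * Any pairwise contrast is recoverable whatever the base, e.g.
              * Korea vs. El Salvador after fitting with the default base (USA)
              logit stage1 i.country i.sex ageGroup
              lincom 2.country - 5.country

              * All five indicators, no constant: the country terms are then log
              * odds at sex == 0 and ageGroup == 0, not comparisons between countries
              logit stage1 ibn.country i.sex ageGroup, noconstant
              ```

              For instance, the Korea coefficient in #6 (.3187399) is the difference between the Korea and El Salvador coefficients in #1: .0360518 - (-.282688) = .3187398, equal up to rounding.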

              Last edited by Clyde Schechter; 10 Jan 2020, 18:42.


              • #8
                Ok, so here is an example study I am looking at. In there, they have four races (Latino, White, Black, Asian). Maybe I'm thinking about it wrong, but wouldn't that be similar to Country in my model? Why are they able to display for all races but I cannot for all countries? This is still the one thing I'm having trouble wrapping my head around...
                [Image attachment: 2020-01-11_10-41-43.png]


                • #9
                  Yes, this might be analogous to countries. If so, either there is a missing fifth race/ethnic group (perhaps corresponding to "other" or "no response") not shown in the table, or it is just plain wrong.

                  Another possibility is that these race/ethnicity variables are not indicators for a single categorical variable. For example, in some classifications Latino is not considered a race but an ethnicity that is separate from race. In that case the non-Latino category is missing, and some fourth race other than White, Black, or Asian must also be missing (again, that last category might be other/unknown).


                  • #10
                    Okay, I think I get it. Thanks again for all the help!
