
  • Help with binary logistic regression using Logit

    Dear Statalisters,

    I want to perform a binary logistic regression for a dataset where people have been split into 3 groups (grp), with binary outcome (outcome) and several explanatory variables, some of which are binary, and some continuous (x1, x2, c1, c2...); also 'age', 'sex' and 'proc' (procedure). I have used multilevel models before (-xtmixed- and -mixed-), but not done a binary logistic regression. My questions are at the end.
    I started out with the following command:

    Code:
    logit outcome age sex x1 c1 x2 x3 c2 proc i.grp
    The output is below:

    Code:
    note: proc != 1 predicts failure perfectly
          proc dropped and 37 obs not used
    
    Iteration 0:   log likelihood = -31.403149  
    Iteration 1:   log likelihood = -27.613462  
    Iteration 2:   log likelihood =  -26.88484  
    Iteration 3:   log likelihood = -26.874952  
    Iteration 4:   log likelihood = -26.874938  
    Iteration 5:   log likelihood = -26.874938  
    
    Logistic regression                             Number of obs     =        113
                                                    LR chi2(9)        =       9.06
                                                    Prob > chi2       =     0.4321
    Log likelihood = -26.874938                     Pseudo R2         =     0.1442
    
    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0382949    .032096     1.19   0.233    -.0246122    .1012019
             sex |   1.076406   1.239829     0.87   0.385    -1.353614    3.506426
              x1 |  -.0144402   1.173195    -0.01   0.990    -2.313859    2.284979
              c1 |   .0438674   .0333587     1.32   0.189    -.0215144    .1092492
              x2 |  -1.639493   1.058371    -1.55   0.121    -3.713863    .4348764
              x3 |   2.111303    1.22703     1.72   0.085    -.2936309    4.516237
              c2 |   .0033934   .0177173     0.19   0.848    -.0313318    .0381186
            proc |          0  (omitted)
                 |
             grp |
              2  |  -.2179836   1.343467    -0.16   0.871     -2.85113    2.415163
              3  |   1.477582   1.018119     1.45   0.147    -.5178936    3.473058
                 |
           _cons |  -8.467525   3.412397    -2.48   0.013     -15.1557    -1.77935
    ------------------------------------------------------------------------------

    As 'proc' perfectly predicts failure, I tried re-running without including it as a variable:

    Code:
    logit outcome age sex x1 c1 x2 x3 c2 i.grp
    To prevent the post from being too long I have omitted the output, but this time 150 observations are used.

    Having spoken to one of my colleagues (who uses SPSS), I was advised to perform a stepwise selection of variables, e.g. by removing those with significance level > 0.1. Noting some of the problems with stepwise analysis
    HTML Code:
    http://www.stata.com/support/faqs/st...ems/index.html
    , I nevertheless decided to try as I believe it is still used by many non-statisticians (in the medical field). I therefore ran the following:

    Code:
    stepwise, pr(0.1): logit outcome age sex x1 c1 x2 x3 c2 proc grp
    As you can see, I am unable to specify 'grp' as a categorical variable here, because -stepwise- does not accept factor-variable notation such as i.grp.

    The output:

    Code:
    note: proc dropped because of estimability
    note: o.proc dropped because of estimability
    note: 37 obs. dropped because of estimability
                          begin with full model
    p = 0.8570 >= 0.1000  removing c2
    p = 0.7386 >= 0.1000  removing x1
    p = 0.3661 >= 0.1000  removing sex
    p = 0.1929 >= 0.1000  removing age
    p = 0.3303 >= 0.1000  removing grp
    p = 0.2421 >= 0.1000  removing x2
    p = 0.2371 >= 0.1000  removing x3
    p = 0.2081 >= 0.1000  removing c1
    
    Logistic regression                             Number of obs     =        113
                                                    LR chi2(0)        =       0.00
                                                    Prob > chi2       =          .
    Log likelihood = -31.403149                     Pseudo R2         =     0.0000
    
    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |  -2.447166   .3474572    -7.04   0.000     -3.12817   -1.766163
    ------------------------------------------------------------------------------
    There don't appear to be any significant predictors.

    My questions therefore are:

    1. Does my first -logit- command tell me all that I need to know?
    2. Is it unsafe to remove 'proc' as a variable, as I have done in the second command, so that more observations are used? That is, is this quick fix for complete separation likely to render the results inaccurate?
    3. Does the -stepwise- command I have used support the results of the first -logit- command?
    4. Whichever command I use, do I then need to go on to re-run the model with a more limited number of variables to produce a 'final model'?

    thanks

    Jem

  • #2
    Jem:
    1a) your model seems to give you back all you asked it for. Perfect prediction is, in brief, a synonym for "no variation in your data" as far as that predictor is concerned: hence, there's nothing you can do;
    1b) the lack of statistical significance of your coefficients may well be due to a limited sample size;
    3) as per Frank Harrell's concerns on -stepwise- (which is usually seen as an avoidable evil on this forum), see https://www.stata.com/support/faqs/s...ion-problems/;
    2) and 4): as always, the main issue is to give a fair and true view of the data-generating process: the literature in your research field can help you out in this respect.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Dear Carlo,

      many thanks for this. To give a little more detail about the data set - the hypothesis is that there are no differences in 'outcome' between groups 1,2 and 3 (based on treatment). This is therefore the variable of most interest. The other variables mentioned above could also be explanatory re: 'outcome'. Below is a table showing the numbers in each group, and the total number of people (201).

      Code:
                 |        outcome
             grp |         0          1 |     Total
      -----------+----------------------+----------
               1 |        41          4 |        45 
               2 |        48          3 |        51 
               3 |       99          6 |       105 
      -----------+----------------------+----------
           Total |       188         13 |       201

      Could I ask which of the methods below you recommend as the best way to investigate possible explanatory variables with these data:

      i. start with univariable analysis (e.g. logit outcome age, then logit outcome i.grp, etc.), and then carry forward only those variables with p<0.1 into a multivariable model (as below)
      ii. stepwise model as in my first post
      Code:
      stepwise, pr(0.1): logit outcome age sex x1 c1 x2 x3 c2 proc grp
      iii. start straight off with multivariable model
      Code:
      logit outcome age sex x1 c1 x2 x3 c2 proc i.grp
      Jem

      Comment


      • #4
        Jem:
        as per my previous reply, I would avoid -stepwise- altogether (unless you're forced down that road for some research purpose).
        That said, I find your last code OK; I would only wonder whether some interactions (say
        Code:
        i.sex##i.grp
        ) could make any sense on the grounds of the literature in your research field.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          With only 13 successes (1s) I don't think there is enough information in the data to include any control variables, let alone interaction effects. I even doubt that you have enough information for a model with just your group variable.

          This is not fun advice, but with such a rare outcome you are probably better off trying to collect new data. In that case I would recommend a case-control design.
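          Maarten's concern can be put in rough numbers with the common events-per-variable (EPV) heuristic, which asks for something like 10 outcome events per candidate predictor (a rule of thumb, not a hard law). A minimal sketch in Python (used here purely as a scratchpad), with the counts from the table above:

```python
# Events-per-variable (EPV) heuristic for logistic regression:
# a common rule of thumb asks for roughly 10 outcome events per
# candidate predictor before coefficient estimates become stable.
events = 13       # observations with outcome == 1 in the table above
candidates = 9    # age, sex, x1, c1, x2, x3, c2, proc, grp

epv = events / candidates       # about 1.4 events per predictor
max_predictors = events // 10   # the heuristic supports only 1 predictor

print(f"EPV = {epv:.2f}, supported predictors = {max_predictors}")
```

          By this yardstick the full model is being fed about 1.4 events per predictor, consistent with the point that there is barely enough information for the group variable alone.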
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          Comment


          • #6
            Dear Carlo and Maarten,

            thank you for your replies. Carlo, in response to your comment re: possible interaction, I don't think this is likely for these data.

            Maarten, I agree the number of outcome=1 is small for each group, and also overall (at 13 in total). It would be preferable to have a larger data set, but time constraints prevent this. I would like to analyse in the best way, given the small number of outcomes. Are you suggesting that if the variable of most interest is 'grp' (group), then rather than trying to perform a binary logistic regression/multivariable analysis using -logit-, I use 'Fisher's exact' instead:

            Code:
            tabulate outcome grp, chi2 exact
            
            Enumerating sample-space combinations:
            stage 3:  enumerations = 1
            stage 2:  enumerations = 2
            stage 1:  enumerations = 0
            
                       |               grp
               outcome |         1          2          3 |     Total
            -----------+---------------------------------+----------
                     0 |        41         48         99 |       188 
                     1 |         4          3          6 |        13 
            -----------+---------------------------------+----------
                 Total |        45         51        105 |       201 
            
                      Pearson chi2(2) =   0.4730   Pr = 0.789
                       Fisher's exact =                 0.804

            Jem

            Comment


            • #7
              Let's assume that the pattern you found in the data is real and that we want to be able to detect it (i.e. reject the null hypothesis of independence). You can then ask what the chance is of detecting it when you draw a random sample. You can estimate this by drawing many random samples and counting the number of draws in which you reject the null hypothesis. You should obviously do such a power analysis before collecting your data, but if the number is low, that is an indication that the whole project is just not feasible.

              Code:
              . clear all
              
              . set seed 1423456
              
              . input grp outcome freq
              
                         grp    outcome       freq
                1. 1 0 41
                2. 1 1  4
                3. 2 0 48
                4. 2 1  3
                5. 3 0 99
                6. 3 1  6
                7. end
              
              .
              . expand freq
              (195 observations created)
              
              . gen w = 1
              
              .
              . program sim
                1.         bsample , weight(w)
                2.         tab outcome grp [fw=w], exact
                3. end
              
              .
              . simulate p=r(p_exact), reps(20000) nodots: sim
              
                    command:  sim
                          p:  r(p_exact)
              
              
              .
              . count if p < .05
                2,105
              
              . di r(N)/_N
              .10525
              So with this type of data, sample size, and test you would have only an 11% chance of detecting that pattern at the 5% significance level. A typical lower bound would be about 80%, so you are not even close. I realize that this is not the outcome you would like to see, but if the information is not there then there is pretty much nothing we can do.
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------

              Comment


              • #8
                Originally posted by Jem Lane View Post
                Maarten, I agree the number of outcome=1 is small for each group, and also overall (at 13 in total). It would be preferable to have a larger data set, but time constraints prevent this.
                Increasing the sample size is one solution, but for such rare outcomes it is usually much more feasible to use a case-control design instead. This means that you draw separate samples from your cases and your controls. That way you can control the number of cases in your data and avoid this problem. You obviously can't say anything about the probabilities anymore based on these data, but the odds ratio is identified. So you can use logistic regression on such data. The constant will mean nothing, but the rest of the coefficients are fine.
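                The claim that the odds ratio survives case-control (outcome-dependent) sampling can be checked with a toy 2x2 table; the counts below are invented purely for illustration, and Python is used only as a scratchpad:

```python
# Hypothetical population counts: cases and controls by exposure.
cases_exp, controls_exp = 40, 9960       # exposed group (made up)
cases_unexp, controls_unexp = 10, 9990   # unexposed group (made up)

def odds_ratio(a, b, c, d):
    """(a/b) / (c/d) for a cases/controls 2x2 table."""
    return (a / b) / (c / d)

pop_or = odds_ratio(cases_exp, controls_exp, cases_unexp, controls_unexp)

# Case-control sampling: keep every case, draw only 1% of the controls.
f = 0.01
cc_or = odds_ratio(cases_exp, controls_exp * f,
                   cases_unexp, controls_unexp * f)

# The control-sampling fraction f cancels out of the ratio, so the
# odds ratio is unchanged; only the baseline odds (the constant in a
# logistic regression) are shifted, by log(1/f).
print(pop_or, cc_or)
```

                This is why the slope coefficients from -logit- remain interpretable under a case-control design while the constant does not.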

                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------

                Comment


                • #9
                  Dear Maarten,

                  thank you for your suggestions. Firstly, I would just like to say that our hypothesis is that there is no difference between groups, i.e. we expect the null hypothesis to be true, but want to test it. This is the main focus of the study, rather than the other explanatory variables.
                  Secondly, thanks for the suggestion about a case-control study. But are you able to advise on the Stata code for this (I am using Stata/IC 14.2)?

                  Jem

                  Comment


                  • #10
                    As to point 1: that is even worse. Not rejecting the null can mean either that the difference is not there or that you don't have enough data, and we don't know which applies. This is especially problematic in extremely low-power studies like yours, as the latter is now very likely and the results you get now mean exactly nothing.

                    as to point 2: just use logistic regression, interpret the odds ratios, but don't interpret the constant and don't use marginal effects.
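                    With the counts Jem posted in #3, the group odds ratios can be read straight off the table; a small arithmetic check (Python used only as a scratchpad; -logistic outcome i.grp- in Stata would report the same odds ratios, with group 1 as the base):

```python
# Counts by group from the table in #3: (outcome == 0, outcome == 1).
counts = {1: (41, 4), 2: (48, 3), 3: (99, 6)}

def odds(zeros, ones):
    return ones / zeros

base = odds(*counts[1])          # odds in group 1 (the reference group)
or_2 = odds(*counts[2]) / base   # odds ratio, group 2 vs group 1
or_3 = odds(*counts[3]) / base   # odds ratio, group 3 vs group 1

print(round(or_2, 3), round(or_3, 3))   # 0.641 0.621
```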
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------

                    Comment


                    • #11
                      OK, thanks for your help Maarten. Will try this.

                      Jem

                      Comment
