
  • Help with binary logistic regression using Logit

    Dear Statalisters,

    I want to perform a binary logistic regression for a dataset where people have been split into 3 groups (grp), with binary outcome (outcome) and several explanatory variables, some of which are binary, and some continuous (x1, x2, c1, c2...); also 'age', 'sex' and 'proc' (procedure). I have used multilevel models before (-xtmixed- and -mixed-), but not done a binary logistic regression. My questions are at the end.
    I started out with the following command:

    Code:
    logit outcome age sex x1 c1 x2 x3 c2 proc i.grp
    The output is below:

    Code:
    note: proc != 1 predicts failure perfectly
          proc dropped and 37 obs not used
    
    Iteration 0:   log likelihood = -31.403149  
    Iteration 1:   log likelihood = -27.613462  
    Iteration 2:   log likelihood =  -26.88484  
    Iteration 3:   log likelihood = -26.874952  
    Iteration 4:   log likelihood = -26.874938  
    Iteration 5:   log likelihood = -26.874938  
    
    Logistic regression                             Number of obs     =        113
                                                    LR chi2(9)        =       9.06
                                                    Prob > chi2       =     0.4321
    Log likelihood = -26.874938                     Pseudo R2         =     0.1442
    
    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0382949    .032096     1.19   0.233    -.0246122    .1012019
             sex |   1.076406   1.239829     0.87   0.385    -1.353614    3.506426
              x1 |  -.0144402   1.173195    -0.01   0.990    -2.313859    2.284979
              c1 |   .0438674   .0333587     1.32   0.189    -.0215144    .1092492
              x2 |  -1.639493   1.058371    -1.55   0.121    -3.713863    .4348764
              x3 |   2.111303    1.22703     1.72   0.085    -.2936309    4.516237
              c2 |   .0033934   .0177173     0.19   0.848    -.0313318    .0381186
            proc |          0  (omitted)
                 |
             grp |
              2  |  -.2179836   1.343467    -0.16   0.871     -2.85113    2.415163
              3  |   1.477582   1.018119     1.45   0.147    -.5178936    3.473058
                 |
           _cons |  -8.467525   3.412397    -2.48   0.013     -15.1557    -1.77935
    ------------------------------------------------------------------------------

    As 'proc' perfectly predicts failure, I tried re-running without including it as a variable:

    Code:
    logit outcome age sex x1 c1 x2 x3 c2 i.grp
    To prevent the post from being too long I have omitted the output, but this time 150 observations are used.

    Having spoken to one of my colleagues (who uses SPSS), I was advised to perform a stepwise selection of variables, e.g. by removing those with significance level > 0.1. Noting some of the problems with stepwise analysis
    HTML Code:
    http://www.stata.com/support/faqs/st...ems/index.html
    , I nevertheless decided to try as I believe it is still used by many non-statisticians (in the medical field). I therefore ran the following:

    Code:
    stepwise, pr(0.1): logit outcome age sex x1 c1 x2 x3 c2 proc grp
    As you can see, I am unable to specify 'grp' as a categorical variable here, because -stepwise- does not accept factor-variable notation such as i.grp.

    The output:

    Code:
    note: proc dropped because of estimability
    note: o.proc dropped because of estimability
    note: 37 obs. dropped because of estimability
                          begin with full model
    p = 0.8570 >= 0.1000  removing c2
    p = 0.7386 >= 0.1000  removing x1
    p = 0.3661 >= 0.1000  removing sex
    p = 0.1929 >= 0.1000  removing age
    p = 0.3303 >= 0.1000  removing grp
    p = 0.2421 >= 0.1000  removing x2
    p = 0.2371 >= 0.1000  removing x3
    p = 0.2081 >= 0.1000  removing c1
    
    Logistic regression                             Number of obs     =        113
                                                    LR chi2(0)        =       0.00
                                                    Prob > chi2       =          .
    Log likelihood = -31.403149                     Pseudo R2         =     0.0000
    
    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |  -2.447166   .3474572    -7.04   0.000     -3.12817   -1.766163
    ------------------------------------------------------------------------------
    There don't appear to be any significant predictors.

    My questions therefore are:

    1. Does my first -logit- command tell me all that I need to know?
    2. Is it unsafe to remove 'proc' as a variable, as I have done in the second command, so that more observations are used? That is, is this quick fix for complete separation likely to render the results inaccurate?
    3. Does the -stepwise- command I have used support the results of the first -logit- command?
    4. Whichever command I use, do I then need to go on to re-run the model with a more limited number of variables to produce a 'final model'?

    thanks

    Jem

  • #2
    Jem:
    1a) your model seems to give you back all you asked it for. Perfect prediction is, in brief, a synonym for "no variation in your data" as far as that predictor is concerned: hence, there's nothing you can do;
    1b) the lack of statistical significance of your coefficients may well be due to a limited sample size;
    3) as per Frank Harrell's concerns on -stepwise- (which is usually seen as an avoidable evil on this forum), see https://www.stata.com/support/faqs/s...ion-problems/;
    2) and 4): as always, the main issue is to give a fair and true view of the data-generating process: the literature in your research field can help you out in this respect.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Dear Carlo,

      many thanks for this. To give a little more detail about the data set - the hypothesis is that there are no differences in 'outcome' between groups 1,2 and 3 (based on treatment). This is therefore the variable of most interest. The other variables mentioned above could also be explanatory re: 'outcome'. Below is a table showing the numbers in each group, and the total number of people (201).

      Code:
                 |        outcome
             grp |         0          1 |     Total
      -----------+----------------------+----------
               1 |        41          4 |        45 
               2 |        48          3 |        51 
               3 |       99          6 |       105 
      -----------+----------------------+----------
           Total |       188         13 |       201

      Could I ask which of the methods below you recommend as the best way to investigate possible explanatory variables with these data:

      i. start with univariable analysis (e.g. logit outcome age, then logit outcome i.grp, etc.), and then carry forward only those variables with p<0.1 into a multivariable model (as below)
      ii. stepwise model as in my first post
      Code:
      stepwise, pr(0.1): logit outcome age sex x1 c1 x2 x3 c2 proc grp
      iii. start straight off with multivariable model
      Code:
      logit outcome age sex x1 c1 x2 x3 c2 proc i.grp
      Jem

      Comment


      • #4
        Jem:
        as per my previous reply, I would avoid -stepwise- altogether (unless you're forced down that road for some research purpose).
        That said, I find your last code OK; I would only wonder whether some interactions (say
        Code:
        i.sex##i.grp
        ) could make any sense on the grounds of the literature in your research field.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          With only 13 successes (1s) I don't think there is enough information in the data to include any control variables, let alone interaction effects. I even doubt that you have enough information for a model with just your group variable.

          This is not fun advice, but with such a rare outcome you are probably better off trying to collect new data. In that case I would recommend a case-control design.
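          Maarten's concern can be put in rough numbers with the common events-per-variable (EPV) heuristic, which asks for something like 10 outcome events per candidate predictor (a rule of thumb, not a hard law). A minimal sketch in Python (used here purely as a scratchpad), with the counts from the table above:

```python
# Events-per-variable (EPV) heuristic for logistic regression:
# a common rule of thumb asks for roughly 10 outcome events per
# candidate predictor before coefficient estimates become stable.
events = 13       # observations with outcome == 1 in the table above
candidates = 9    # age, sex, x1, c1, x2, x3, c2, proc, grp

epv = events / candidates       # about 1.4 events per predictor
max_predictors = events // 10   # the heuristic supports only 1 predictor

print(f"EPV = {epv:.2f}, supported predictors = {max_predictors}")
```

          By this yardstick the full model is being fed about 1.4 events per predictor, consistent with the point that there is barely enough information for the group variable alone.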
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          Comment


          • #6
            Dear Carlo and Maarten,

            thank you for your replies. Carlo, in response to your comment re: possible interaction, I don't think this is likely for these data.

            Maarten, I agree the number of outcome=1 is small for each group, and also overall (at 13 in total). It would be preferable to have a larger data set, but time constraints prevent this. I would like to analyse in the best way, given the small number of outcomes. Are you suggesting that if the variable of most interest is 'grp' (group), then rather than trying to perform a binary logistic regression/multivariable analysis using -logit-, I use 'Fisher's exact' instead:

            Code:
            tabulate outcome grp, chi2 exact
            
            Enumerating sample-space combinations:
            stage 3:  enumerations = 1
            stage 2:  enumerations = 2
            stage 1:  enumerations = 0
            
                       |               grp
               outcome |         1          2          3 |     Total
            -----------+---------------------------------+----------
                     0 |        41         48         99 |       188 
                     1 |         4          3          6 |        13 
            -----------+---------------------------------+----------
                 Total |        45         51        105 |       201 
            
                      Pearson chi2(2) =   0.4730   Pr = 0.789
                       Fisher's exact =                 0.804

            Jem

            Comment


            • #7
              Let's assume that the pattern you found in the data is real and that we want to be able to detect it (i.e. reject the null hypothesis of independence). You can then ask what the chance is of detecting it when you draw a random sample. You can estimate this by drawing many random samples and counting the number of draws in which you reject the null hypothesis. You should obviously do such a power analysis before collecting your data, but if the number is low, that is an indication that the whole project is just not feasible.

              Code:
              . clear all
              
              . set seed 1423456
              
              . input grp outcome freq
              
                         grp    outcome       freq
                1. 1 0 41
                2. 1 1  4
                3. 2 0 48
                4. 2 1  3
                5. 3 0 99
                6. 3 1  6
                7. end
              
              .
              . expand freq
              (195 observations created)
              
              . gen w = 1
              
              .
              . program sim
                1.         bsample , weight(w)
                2.         tab outcome grp [fw=w], exact
                3. end
              
              .
              . simulate p=r(p_exact), reps(20000) nodots: sim
              
                    command:  sim
                          p:  r(p_exact)
              
              
              .
              . count if p < .05
                2,105
              
              . di r(N)/_N
              .10525
              So with this type of data, sample size, and test you would have only an 11% chance of detecting that pattern at the 5% significance level. A typical lower bound would be about 80%, so you are not even close. I realize that this is not the outcome you would like to see, but if the information is not there then there is pretty much nothing we can do.
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------

              Comment


              • #8
                Originally posted by Jem Lane View Post
                Maarten, I agree the number of outcome=1 is small for each group, and also overall (at 13 in total). It would be preferable to have a larger data set, but time constraints prevent this.
                Increasing the sample size is one solution, but for such rare outcomes it is usually much more feasible to use a case-control design instead. This means that you draw separate samples from your cases and your controls. That way you can control the number of cases in your data and avoid this problem. You obviously can't say anything about the probabilities anymore based on these data, but the odds ratio is identified. So you can use logistic regression on such data. The constant will mean nothing, but the rest of the coefficients are fine.
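                The claim that the odds ratio survives case-control (outcome-dependent) sampling can be checked with a toy 2x2 table; the counts below are invented purely for illustration, and Python is used only as a scratchpad:

```python
# Hypothetical population counts: cases and controls by exposure.
cases_exp, controls_exp = 40, 9960       # exposed group (made up)
cases_unexp, controls_unexp = 10, 9990   # unexposed group (made up)

def odds_ratio(a, b, c, d):
    """(a/b) / (c/d) for a cases/controls 2x2 table."""
    return (a / b) / (c / d)

pop_or = odds_ratio(cases_exp, controls_exp, cases_unexp, controls_unexp)

# Case-control sampling: keep every case, draw only 1% of the controls.
f = 0.01
cc_or = odds_ratio(cases_exp, controls_exp * f,
                   cases_unexp, controls_unexp * f)

# The control-sampling fraction f cancels out of the ratio, so the
# odds ratio is unchanged; only the baseline odds (the constant in a
# logistic regression) are shifted, by log(1/f).
print(pop_or, cc_or)
```

                This is why the slope coefficients from -logit- remain interpretable under a case-control design while the constant does not.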

                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------

                Comment


                • #9
                  Dear Maarten,

                  thank you for your suggestions. Firstly, I would just like to say that our hypothesis is that there is no difference between groups, i.e. we expect the null hypothesis to be true, but want to test it. This is the main focus of the study, rather than the other explanatory variables.
                  Secondly, thanks for the suggestion about a case-control study. But are you able to advise on the Stata code for this (I am using Stata/IC 14.2)?

                  Jem

                  Comment


                  • #10
                    As to point 1: that is even worse. Not rejecting the null can mean either that the difference is not there or that you don't have enough data, and we don't know which applies. This is especially problematic in extremely low-power studies like yours, as the latter is now very likely and the results you get now mean exactly nothing.

                    as to point 2: just use logistic regression, interpret the odds ratios, but don't interpret the constant and don't use marginal effects.
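                    With the counts Jem posted in #3, the group odds ratios can be read straight off the table; a small arithmetic check (Python used only as a scratchpad; -logistic outcome i.grp- in Stata would report the same odds ratios, with group 1 as the base):

```python
# Counts by group from the table in #3: (outcome == 0, outcome == 1).
counts = {1: (41, 4), 2: (48, 3), 3: (99, 6)}

def odds(zeros, ones):
    return ones / zeros

base = odds(*counts[1])          # odds in group 1 (the reference group)
or_2 = odds(*counts[2]) / base   # odds ratio, group 2 vs group 1
or_3 = odds(*counts[3]) / base   # odds ratio, group 3 vs group 1

print(round(or_2, 3), round(or_3, 3))   # 0.641 0.621
```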
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------

                    Comment


                    • #11
                      OK, thanks for your help Maarten. Will try this.

                      Jem

                      Comment
