
  • Continuous vs binary independent variable in logistic regression

    I'm new to statistics and coding, so I apologize if this is a dumb question, but here it goes:
    I'm running a logistic regression model with a binary dependent variable. When I use a continuous variable as an independent variable (min 0, max 2,330), I get an OR of 1. When I categorize this continuous variable into 2 categories and run the regression model again using this new binary variable as the independent variable, I get an OR of 22, p<0.001.
    What is the most appropriate way to run it, using a binary or a continuous variable? And why could it be that there is such a big difference?

    Thanks!

  • #2
    There are infinitely many ways to categorize a continuous variable into 2 categories, and in general they will produce different results, and none of them should be trusted. The best you can hope for when dichotomizing a continuous variable is that you will only lose precision due to all the information that is being thrown away. More often, you get results that are potentially misleading. Dichotomizing can only be justified when what is being measured has a natural cutoff that defines a true real-world discontinuity. Otherwise put, it is justifiable when there really are two clearly distinct real-world categories and the distributions of the continuous variable in question in the two categories do not overlap at all (or hardly at all). Even then, information about the effect of the continuous variable within those groups on the study outcome is being discarded.

    That said, given the extreme change in OR you are seeing, and assuming that you did not pick an extreme cutoff for the dichotomization that leaves only a handful of observations in one of the groups, this kind of result suggests that the log-odds of the outcome is not linear in the continuous variable. In your continuous-variable model, how is the goodness of fit? I suspect not very good. You can get a good sense of what is happening by fitting a linear spline to the continuous variable (-help mkspline-) and seeing how much the OR varies across the segments of the spline. Or, if you're not comfortable with that, split the continuous variable into a larger number of categories (just how many depends on how large your sample is--you'd like each category to have 30 or more observations in it, but you also don't want a zillion categories) and see how the odds ratio varies across categories.
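    For concreteness, here is a minimal sketch of both suggestions, assuming a binary outcome y and continuous predictor x (placeholder names; adjust the number of knots and groups to your sample size):

    Code:
    * linear spline with knots at the 25th/50th/75th percentiles (see -help mkspline-)
    mkspline xs 4 = x, pctile
    logit y xs1-xs4, or nolog

    * or: split x into 5 equal-frequency groups and see how the OR varies across them
    egen xcat = cut(x), group(5)
    logit y i.xcat, or nolog

    If the ORs on the spline segments (or category indicators) differ substantially, that is evidence the log-odds is not linear in x.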



    • #3
      Actually, I can't tell if there is a big difference in the results or not. In the first approach you have a variable that ranges from 0 to 2,330. In the second, the variable is probably coded 0/1. A one-unit change in X is no doubt vastly bigger in the second case, hence the effect per unit is bigger too.

      By way of analogy, suppose you had 2 measures of income -- income in pennies, and income in thousands of dollars. The effect of a 1 penny increase may seem all but infinitesimal -- the odds ratio might be .9999999999999999 and get rounded to 1. But, a change of 100 thousand pennies ($1,000) will likely be far more noticeable.

      Hence income in pennies may seem to have no effect, while income in thousands of dollars can have a very noticeable effect. But such differences reflect differences in scaling: 1 penny may seem to do nothing, but the effect of 100,000 pennies is the same as the effect of $1,000.

      The scaling of X shouldn't affect the p-value, though. But you never say what the original p-value was. Was it also highly significant? I suspect yes. In any event, the odds ratio for a one-unit change is not a good way of assessing the effect of a variable that has such a huge range.

      I don't know what X is, but I would suggest dividing it by 100 or something like that. Again, that would be like measuring income in thousands of dollars rather than in dollars.

      Now, it may indeed be that the original X is problematic. Like Clyde says, do splines, have more categories, maybe do logs or other transformations.

      But, don't do any of that until you are sure there is a problem. Just because the ORs are 1 -- or, more specifically, round off to 1 -- does not mean that the variable has no effect. It may just be that the scaling of X makes it very difficult to see what these effects are.

      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam

      Comment


      • #4
        Here is an example of what I mean. Substantively, it doesn't matter how weight is scaled. Fit, p-values, and predicted values are the same with each approach. But interpreting results may be easier with some scalings than with others.

        Code:
        . webuse nhanes2f, clear
        
        . gen weightX100 = weight * 100
        
        . gen weightZ100 = weight /100
        
        . logit diabetes weightX100, or nolog
        
        Logistic regression                                     Number of obs = 10,335
                                                                LR chi2(1)    =  47.26
                                                                Prob > chi2   = 0.0000
        Log likelihood = -1975.4386                             Pseudo R2     = 0.0118
        
        ------------------------------------------------------------------------------
            diabetes | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
        -------------+----------------------------------------------------------------
          weightX100 |   1.000193   .0000272     7.08   0.000     1.000139    1.000246
               _cons |   .0121865   .0025911   -20.73   0.000     .0080333    .0184867
        ------------------------------------------------------------------------------
        Note: _cons estimates baseline odds.
        
        . logit diabetes weight, or nolog
        
        Logistic regression                                     Number of obs = 10,335
                                                                LR chi2(1)    =  47.26
                                                                Prob > chi2   = 0.0000
        Log likelihood = -1975.4386                             Pseudo R2     = 0.0118
        
        ------------------------------------------------------------------------------
            diabetes | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
        -------------+----------------------------------------------------------------
              weight |   1.019443   .0027724     7.08   0.000     1.014023    1.024891
               _cons |   .0121865   .0025911   -20.73   0.000     .0080333    .0184867
        ------------------------------------------------------------------------------
        Note: _cons estimates baseline odds.
        
        . logit diabetes weightZ100, or nolog
        
        Logistic regression                                     Number of obs = 10,335
                                                                LR chi2(1)    =  47.26
                                                                Prob > chi2   = 0.0000
        Log likelihood = -1975.4386                             Pseudo R2     = 0.0118
        
        ------------------------------------------------------------------------------
            diabetes | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
        -------------+----------------------------------------------------------------
          weightZ100 |    6.85923   1.865376     7.08   0.000     4.025224    11.68855
               _cons |   .0121865   .0025911   -20.73   0.000     .0080333    .0184867
        ------------------------------------------------------------------------------
        Note: _cons estimates baseline odds.
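        As a check on the arithmetic: rescaling a predictor by a factor of k raises its odds ratio to the k-th power, because the log-odds coefficient is multiplied by k. The three ORs above are mutually consistent, which you can verify directly:

        Code:
        * rescaling weight by k raises the OR to the k-th power
        display 1.019443^100       // ~6.86: the weightZ100 odds ratio (up to rounding)
        display 1.019443^(1/100)   // ~1.000193: the weightX100 odds ratio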
