
  • Is there an arbitrary scaling factor in logistic regression in Stata?

    Hi,

    Norton et al. state in a recent JAMA guide that the magnitude of the odds ratio from a logistic regression is scaled by an arbitrary factor (equal to the square root of the variance of the unexplained part of the binary outcome). They say that adding more independent explanatory variables to the model will increase the odds ratio of the variable of interest (e.g., treatment), because the coefficient is divided by a smaller scaling factor. They therefore warn that different odds ratios from the same study cannot be compared when the models producing the estimates have different explanatory variables, because each model has a different arbitrary scaling factor.
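    For readers unfamiliar with the argument, it comes from the latent-variable view of the logit model (my sketch of the standard derivation, not taken verbatim from the paper): the model is y* = b1*x1 + b2*x2 + e, with y = 1 if y* > 0 and e standard logistic with variance pi^2/3. Only b/sigma is identified, so omitting x2 folds b2*x2 into the error term and, when x1 and x2 are independent, the short model estimates roughly

    b1 * sigma_e / sqrt(sigma_e^2 + b2^2 * Var(x2)),

    which is attenuated toward zero. Adding x2 back shrinks the residual variance and inflates the coefficient (and hence the odds ratio) even when there is no confounding at all.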

    I ran a simple logistic regression on the dataset below using the code below: the crude odds ratio for hiv status is 2.33 and the adjusted odds ratio is 3.0, both of which exactly match the stratified analysis done without logistic regression. The arbitrary scaling factor does not surface here.

    I would appreciate any thoughts on this issue and whether it is actually a valid concern in Stata. If anyone can share a counterexample dataset (using categorical variables only), that would also be welcome.

    Thanks
    Suhail


    Code:
    logit risky i.hiv [fw=fw], or
    logit risky i.hiv i.nyc [fw=fw], or

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(risky hiv nyc fw)
    1 1 0 25
    1 0 0 75
    0 1 0 10
    0 0 0 90
    1 1 1 75
    1 0 1 25
    0 1 1 50
    0 0 1 50
    end
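    As a cross-check, the same odds ratios can be obtained without any regression using Stata's epidemiological tables commands. This is a sketch; I believe cc and mhodds accept fweights and the by() option as shown, but verify against the help files:

    Code:
    * crude odds ratio for hiv (should match the unadjusted logit, 2.33)
    cc risky hiv [fw=fw]
    * Mantel-Haenszel odds ratio for hiv stratified by nyc
    * (should match the adjusted logit, 3.0)
    mhodds risky hiv [fw=fw], by(nyc)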
    Regards
    Suhail Doi

  • #2
    Norton et al state in a recent JAMA guide
    Many readers will likely want to see this guide to understand more fully what you summarize in your post. Can you post a link to this material, preferably one that does not require payment? As the Statalist FAQ tells us,

    13. How should I give literature references?

    Please give precise literature references. The literature familiar to you will not be familiar to all members of Statalist. Do not refer to publications with just author and date, as in Sue, Grabbit, and Runne (1989).

    References should be in a form that you would expect in an academic publication or technical document. Good practice is to give a web link accessible to all or alternatively full author name(s), date, paper title, journal title, and volume and page numbers in the case of a journal article.



    • #3
      The article in question appears to be

      Norton EC, Dowd BE, Maciejewski ML. Odds Ratios—Current Best Practice and Use. JAMA. 2018;320(1):84–85. doi:10.1001/jama.2018.6971

      I found the article online, not behind JAMA's paywall, at

      https://www.feinberg.northwestern.ed...ce-and-use.pdf

      It in turn justifies the assertion referenced in post #1 with a reference to

      Norton, E. C. and Dowd, B. E. (2018), Log Odds and the Interpretation of Logit Models. Health Serv Res, 53: 859-878. doi:10.1111/1475-6773.12712

      but I have not been able to get access to this paper.



      • #4
        Thanks William, yes, that is the paper

        Just to add to my comments: if I create two variables predictive of "risky" in the dataset I posted previously, as below, I still cannot get the hiv odds ratio to budge much from 3.0 unless the new variable is very highly predictive of risky, and even then the uncertainty in the estimate for hiv increases. So I see no clinical significance in this observation, and the paper's recommendations seem highly overstated, even were the theory to be confirmed. Any thoughts on this would also be appreciated.

        Code:
        expand fw
        gen x = rnormal(risky,1)
        gen y = rnormal(risky,0.5)
        logit risky i.hiv i.nyc x , or
        logit risky i.hiv i.nyc x  y, or
        Last edited by Suhail Doi; 16 Jan 2019, 13:37.
        Regards
        Suhail Doi



        • #5
          There actually has been a great deal of discussion on this topic in various places. For my own take, see

          https://www3.nd.edu/~rwilliam/stats3/Nested01.pdf

          https://www3.nd.edu/~rwilliam/stats3/Nested02.pdf

           Some key takeaways:
           • Comparisons of coefficients across nested models are problematic. Coefficients are not consistently scaled the same way across models. It could be like using income in dollars as your DV in one model and income in thousands of dollars in another -- without realizing you had done so. Potentially, wildly incorrect conclusions can be reached. For example, you might think there are really dramatic suppressor effects when there are no such effects at all. I give a hypothetical example to show this.
          • In practice, I've only found instances where the conclusions were mildly incorrect. But I haven't re-analyzed every data set in the world. Maybe Norton has better real world examples than I do. Then again maybe he is making the problem seem much more serious than it usually is in real world situations.
          • There are potential solutions. For one thing, just don't do comparisons across nested models. Just talk about your final model.
          • If you do want to make such comparisons, the KHB method (mentioned in my handouts) may be the best way to go.
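           To illustrate that last point, the khb package (Kohler, Karlson & Holm, available from SSC) decomposes the total change in a coefficient between nested logit models into a confounding component and a rescaling component. A sketch using the data from this thread follows; the khb syntax here is from memory, so check help khb before relying on it:

           Code:
           ssc install khb
           * decompose the change in the hiv effect when x and y are added:
           * how much is genuine confounding vs. mere rescaling?
           khb logit risky hiv || x y, summary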
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Thanks as always to Professor Williams for a clear explanation of the issue, with handouts not behind a paywall.

            I was sure he'd have something helpful to say, but it didn't occur to me that he'd already said it, if only I'd looked in his invaluable repository on the analysis of categorical data at http://www3.nd.edu/~rwilliam/stats3/.



            • #7
              Hi Richard,

               I agree with your observation that he may be making the problem seem much more serious than it usually is in real-world situations. I believe most of the change seen in adjusted estimates can readily be shown to be due to the stratification, not to any arbitrary scaling factor.

               If we take the dataset in post #1 and create x and y as in post #4, we can dichotomise both x and y at the median to create categorical variables predictive of the outcome. We can then compare logistic regression and stratification as follows:
              Code:
              logit risky i.hiv i.nyc i.x i.y, or
              cs risky hiv, by(nyc x y) or w
               Logistic regression results will show
               Code:
                     risky | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                       hiv |   4.239133   1.678756     3.65   0.000     1.950695    9.212229
                       nyc |    .535027   .1989655    -1.68   0.093     .2581258     1.10897
                        fw |   .9846991   .0074828    -2.03   0.042     .9701418    .9994748
                         x |   4.742803   1.578448     4.68   0.000     2.470288    9.105896
                         y |   51.80127   18.03298    11.34   0.000     26.18312    102.4848
                     _cons |   .1299183   .0787229    -3.37   0.001     .0396179    .4260383
               While stratification will show
               Code:
               nyc  x  y   |    OR     [95% Conf. Interval]
                 0  0  0   |   7.83     1.14     53.75   (Woolf)
                 0  0  1   |    .        .         .     (Woolf)
                 0  1  0   |  11.2       .84    148.13   (Woolf)
                 0  1  1   |    .        .         .     (Woolf)
                 1  0  0   |   1.45      .22      9.28   (Woolf)
                 1  0  1   |   3.51      .68     18.07   (Woolf)
                 1  1  0   |   8         .86     74.21   (Woolf)
                 1  1  1   |   1.73      .14     20.63   (Woolf)
               Crude       |   2.33     1.54      3.51
               M-H combined|   4.92     2.19     11.02
               The stratification can be done for each of the variables, and it is clear that whichever variable we consider, the M-H combined estimate closely mirrors the adjusted estimate from logistic regression. It does not matter how many variables we create or how correlated (or not) they are with the outcome; this remains the situation. Therefore, regardless of whether the models contain different predictors with different correlations with the outcome, what matters more for the variability seen across methods is that the adjusted estimate combines several heterogeneous stratum-specific estimates, and this is what produces the differences between stratification and regression. This variability is not clinically significant enough to justify the paper's conclusions, and thus, if I am right, the following four conclusions in Norton's paper are grossly overstated:

               a) there is no unique odds ratio to be estimated, even from a single study - it may not be unique, but the estimates are close enough to be treated as such

              b) Different odds ratios from the same study cannot be compared when the statistical models that result in odds ratio estimates have different explanatory variables - this is not the case at all

              c) the magnitude of the odds ratio from one study cannot be compared with the magnitude of the odds ratio from another study, because different samples and different model specifications will have different arbitrary scaling factors - no evidence for this above.

              d) the magnitudes of odds ratios of a given association in multiple studies cannot be synthesized in a meta-analysis - clearly no evidence for this either.

              Any thoughts would be appreciated

              Thanks
              Suhail

              Last edited by Suhail Doi; 17 Jan 2019, 11:01.
              Regards
              Suhail Doi

