Logistic regression and categorical variable interactions

Frank Shann

Join Date: Jan 2018
Posts: 15

Logistic regression and categorical variable interactions

19 Jan 2018, 18:37

In a randomised trial, a drug had no effect on mortality in low-birth-weight babies. However, if the drug *had* been effective, how would I use Stata 15 (Windows) to test whether the effect of the drug in babies weighing <1000 g (baseline) differed from (1) the effect of the drug in babies weighing 1000-1499 g and (2) from the effect of the drug in babies weighing 1500-1999 g?

The Stata file has variables:
died: 0 (survived), 1 (died)
drug: 0 (control), 1 (treatment)
wtgp: 0 (0-999g), 1 (1000-1499g), 2 (1500-1999g).

Code:

. cs died drug, by(wtgp)
           wtgp |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
          0-999g |    .8643293     .7181096   1.040322     32.19632
      1000-1499g |    1.034783      .831774   1.287339         57.5
      1500-1999g |    .9549525     .6873575   1.326725     34.01655
------------------------------------------------------------------------------

Using the Excel spreadsheet for ratios from http://www.statpages.info/Confidence...0P-values.xlsx
wtgp 0 vs 1: 0.86 (0.72-1.04) vs 1.03 (0.83-1.29) gives p = 0.22.
wtgp 0 vs 2: 0.86 (0.72-1.04) vs 0.95 (0.69-1.33) gives p = 0.60.

Code:

. logistic died i.drug##i.wtgp
------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.drug |   .5721154   .2068869    -1.54   0.123      .281627    1.162232
        wtgp |
          1  |   .0979067   .0287701    -7.91   0.000     .0550408    .1741566
          2  |   .0188347   .0056566   -13.23   0.000     .0104549    .0339312
   drug#wtgp |
        1 1  |   1.832185   .7197436     1.54   0.123     .8483771    3.956852
        1 2  |   1.663846   .6716229     1.26   0.207     .7542577    3.670342
       _cons |   3.764706    1.02721     4.86   0.000     2.205355    6.426634
------------------------------------------------------------------------------
 
. margins wtgp, dydx(drug) pwcompare(effects)
Pairwise comparisons of conditional marginal effects
Model VCE    : OIM
Expression   : Pr(died), predict()
dy/dx w.r.t. : 1.drug
------------------------------------------------------------------------------
             |   Contrast Delta-method    Unadjusted           Unadjusted
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0.drug       |  (base outcome)
-------------+----------------------------------------------------------------
1.drug       |
        wtgp |
     1 vs 0  |   .1165643    .074964     1.55   0.120    -.0303625    .2634911
     2 vs 0  |   .1042139   .0693238     1.50   0.133    -.0316582    .2400861
     2 vs 1  |  -.0123504   .0323953    -0.38   0.703     -.075844    .0511433
------------------------------------------------------------------------------

P = 0.22 (from Excel) differs from p = 0.120 (margins)
P = 0.60 (from Excel) differs from p = 0.133 (margins).

What is the correct way to compare the effect on drug of wtgp 0 vs 1, and wtgp 0 vs 2?
I apologise if this problem is very basic, but I’ve spent many hours trying to solve it.

Last edited by Frank Shann; 19 Jan 2018, 18:40.

Tags: categorical, interaction, logistic regression

Clyde Schechter

Join Date: Apr 2014

Posts: 29947
#2

20 Jan 2018, 00:33

Well, with the full admission that I have a strong prejudice here, I do not take statistical analyses calculated in Excel seriously. Certainly in earlier versions of Excel, many of them were flat-out wrong. I am told that in more modern versions of Excel the errors have been fixed. Perhaps so. But even so, what you do in Excel leaves no audit trail, so you can never adequately explain or defend your analyses. So I tend to dismiss them out of hand. Others may beg to differ. And I realize this is more of an ad "hominem" attack on Excel than a critique of the actual analysis (the implementation of which I know nothing about.) Somebody who is familiar with this particular Excel routine might want to chime in and defend it.

To me, the more interesting question here is whether you should be looking at the interaction coefficients in the logistic regression output, or at the contrasts of the marginal effects on outcome probability in the -margins- output. This subject comes up fairly regularly here and remains controversial. It is very far from being basic. When viewed through the distorting lens of p-values, one gets an additional layer of complexity because the arbitrary dichotomization of statistical significance at the arbitrary threshold of .05 (or any other arbitrary threshold) can make results that are substantively very similar appear to be qualitatively different, especially since most people tend to forget that the difference between statistically significant and statistically insignificant is not, itself, statistically significant.

So let's leave issues of statistical significance out of it and just discuss whether the logistic regression output or the margins output is more salient. Logistic regression is, of course, a non-linear model. As a result, even in a model that contains no explicit interaction terms, the marginal effect of any predictor on the outcome probability will not be constant but will depend on the value of that predictor, and it will also depend on the values of other variables. In other words, even without explicit interaction terms, the logistic model creates interaction when you look at effects in the probability metric. So there are a few ways to look at it.

If you really believe that the odds ratios are the true effects, not the probability differences, then you would rely on the logistic regression output and ignore the marginal effects on probabilities. On the other hand if you believe the outcome probabilities are the real deal and that the logistic model and odds ratios are just an indirect way to get at them, necessitated by the inconvenient properties of linear probability models, then you would rely on the marginal effects on probabilities.

In my own work, I am inclined to think of these models as preludes to decision analysis, where the marginal effects on outcome probabilities are all-important and odds ratios don't really figure into it. So I tend to focus on those. On the other hand if you are focused on causal modeling, the case can be made that the odds ratio is actually the better representation of the treatment effect.

Much ink has been spilled over this issue, both here on this Forum and in the literature. Notwithstanding all the sturm und drang, and strongly expressed opinions (mea culpa), I would also note that if we put aside artificial examples created to demonstrate a theoretical point, I have never seen a real-world application where you would really come to different conclusions one way or the other (unless you allow statistical significance to obfuscate the issue). In your own example, the odds ratios for 1 vs 0, 2 vs 0, and 2 vs 1 are, rounding to 2 places, 1.83, 1.66, and 1.66/1.83 = about 0.91. So you could characterize the first two interaction effects as being moderate to large positive, and the last one as being small negative. If you look at the marginal effect on predicted probability contrasts, the corresponding figures are 0.12, 0.10, and -.01. In these terms, you would also characterize the first two interactions as moderate to large positive, and the third as small negative.
2 likes
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1116
#3

20 Jan 2018, 06:23

I think Frank's point is that in randomized trials (and in cohort studies) with dichotomous outcomes, the RR is (usually) the preferred measure of effect size. In that case, one of the approaches discussed on this UCLA page might be useful.
https://stats.idre.ucla.edu/stata/fa...ohort-studies/

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4373
#4

20 Jan 2018, 07:16

Well, then, in that case, can't Frank use glm died i.drug##i.wtgp, family(binomial) link(log)?
Comment

Frank Shann

Join Date: Jan 2018
Posts: 15

20 Jan 2018, 07:21

Thank you very much indeed Clyde and Bruce for your prompt and thoughtful replies. I do indeed want RR rather than odds ratios. In addition to the link Bruce gave, I also found very helpful Clyde's posts of August 2016 at www.statalist.org/forums/forum/general-stata-discussion/general/1353874-relative-risk-ratio-calculation-help

Stata's binreg command gives me risk ratios (rather than odds ratios) and confidence intervals and p values identical to http://www.bmj.com/content/bmj/326/7382/219.full.pdf
Do you think binreg is suitable?

Code:

. binreg died i.drug##i.wtgp, rr
------------------------------------------------------------------------------
             |                 EIM
        died | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.drug |   .8643293   .0817298    -1.54   0.123     .7181096    1.040322
        wtgp |
          1  |   .3408592   .0334547   -10.97   0.000     .2812103    .4131605
          2  |   .0837999   .0109297   -19.01   0.000      .064897    .1082088
   drug#wtgp |
        1 1  |   1.197209   .1749585     1.23   0.218     .8990347    1.594275
        1 2  |   1.104848   .2127638     0.52   0.605     .7575027    1.611465
       _cons |   .7901235   .0452467    -4.11   0.000     .7062372    .8839736
------------------------------------------------------------------------------

I understand Clyde's concerns about Excel calculations but, just before he died, John Pezzullo (of statpages.info) wrote an Excel routine for me that calculates the ratio of two relative risks (or means - it is important to choose the correct tab) from just the ratios and 95% CI. This is invaluable when the raw data are not available (e.g. when reading a paper). It is available at statpages.info/Confidence%20Intervals%20to%20P-values.xlsx. It gives results that are identical to Stata’s binreg command for the data I posted and the BMJ article (and for five other datasets that I have tested)...

Code:

wtgp == 0 vs wtgp == 1  
Group A
Group B




Ratio:
0.8643293
1.034783
(rate/odds/risk/hazard ratio, relative risk)

Lower Conf. Limit:
0.7181096
0.831774




Upper Conf. Limit:
1.040322
1.287339




Confidence Level:
95.0%
95.0%




Std Err of Ln(Ratio):
0.094558716
0.111423427




p-value for Ho: Ratio=1
0.12309439
0.758947544
(testing whether each true ratio is 1)

For Comparing Two Groups:





Ratio of Ratios (B/A):
1.197209212

Ratio A/B:
0.835275898


Std Err of Ln(B/A)
0.146138738





Desired Conf. Level:
95.0%





Lower 95.0% Conf. Limit:
0.899034857

Lo CI for A/B:
0.627243877


Upper 95.0% Conf. Limit:
1.594276224

Up CI for A/B:
1.112303925


p-value for Ho: A=B:
0.218076238
(testing whether both groups have same true ratio)


wtgp == 0 vs wtgp==2  
Group A
Group B




Ratio:
0.8643293
0.9549525
(rate/odds/risk/hazard ratio, relative risk)

Lower Conf. Limit:
0.7181096
0.6873575




Upper Conf. Limit:
1.040322
1.326725




Confidence Level:
95.0%
95.0%




Std Err of Ln(Ratio):
0.094558716
0.167761818




p-value for Ho: Ratio=1
0.12309439
0.783503192
(testing whether each true ratio is 1)

For Comparing Two Groups:





Ratio of Ratios (B/A):
1.104848002

Ratio A/B:
0.905101877


Std Err of Ln(B/A)
0.192575643





Desired Conf. Level:
95.0%





Lower 95.0% Conf. Limit:
0.757498627

Lo CI for A/B:
0.620550001


Upper 95.0% Conf. Limit:
1.61147369

Up CI for A/B:
1.320134407


p-value for Ho: A=B:
0.604626435
(testing whether both groups have same true ratio)

Again, my profuse thanks to both of you for your advice. StataList is a wonderful resource!

Frank

Comment

Frank Shann

Join Date: Jan 2018
Posts: 15

20 Jan 2018, 07:29

Thank you Joseph. I get slightly different results with glm

Code:

. glm died i.drug##i.wtgp, family(binomial) link(log)
------------------------------------------------------------------------------
             |                 OIM
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.drug |  -.1458015   .0945587    -1.54   0.123    -.3311331    .0395301
        wtgp |
          1  |  -1.076286   .0981481   -10.97   0.000    -1.268653    -.883919
          2  |  -2.479323   .1304281   -19.01   0.000    -2.734958   -2.223689
   drug#wtgp |
        1 1  |   .1799929   .1461387     1.23   0.218    -.1064337    .4664194
        1 2  |   .0997078   .1925756     0.52   0.605    -.2777335    .4771491
       _cons |  -.2355661   .0572654    -4.11   0.000    -.3478041    -.123328
------------------------------------------------------------------------------

binreg died i.drug##i.wtgp, rr
seems to be preferable for my analysis.

Frank

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4373
#7

20 Jan 2018, 08:13

If you want to match binreg, then you might want to look into the eform and vce(eim) options.

Code:

glm died i.drug##i.wtgp, family(binomial) link(log) eform vce(eim)
Comment

Frank Shann

Join Date: Jan 2018
Posts: 15

20 Jan 2018, 08:15

glm is fine with the eform option

Code:

. glm died i.drug##i.wtgp, family(binomial) link(log) eform
------------------------------------------------------------------------------
             |                 OIM
        died | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.drug |   .8643293   .0817298    -1.54   0.123     .7181096    1.040322
        wtgp |
          1  |   .3408592   .0334547   -10.97   0.000     .2812103    .4131605
          2  |   .0837999   .0109299   -19.01   0.000     .0648967    .1082092
   drug#wtgp |
        1 1  |   1.197209   .1749585     1.23   0.218     .8990346    1.594276
        1 2  |   1.104848   .2127668     0.52   0.605     .7574987    1.611474
       _cons |   .7901235   .0452467    -4.11   0.000     .7062372    .8839736
------------------------------------------------------------------------------

Comment

Frank Shann

Join Date: Jan 2018
Posts: 15

20 Jan 2018, 08:28

Thanks again Joseph. We both posted at about the same time - I'm getting so much good advice that I can't keep up.

Code:

glm died i.drug##i.wtgp, family(binomial) link(log) eform

gives the same result as binreg (see above)...

...but adding vce(eim) causes an error

Code:

. glm died i.drug##i.wtgp, family(binomial) link(log) eform vce(eim)
Iteration 0:   log likelihood =  -1533.168  
Iteration 1:   log likelihood = -1106.9587  
Iteration 2:   log likelihood = -1090.0395  
Iteration 3:   log likelihood = -1087.0655  
Iteration 4:   log likelihood = -1086.9353  
Iteration 5:   log likelihood =  -1086.935  
Iteration 6:   log likelihood =  -1086.935  
matrix has missing values
r(504);

Announcement

Logistic regression and categorical variable interactions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment