"Multivariate logistic regression"-?

Piotr Lewczuk

Join Date: Apr 2016

Posts: 59
#1

"Multivariate logistic regression"-?

25 Aug 2016, 07:02

Good afternoon,
I wonder if something like "multivariate logistic regression" exists, and if it can be analyzed in Stata. What I mean is a kind of analogy to mvreg/manova, but with dichotomous (and not continuous) outcomes. For example, we may have collected data on three outcome variables in every patient: diabetes (yes/no), stroke (yes/no), chronic inflammation (yes/no), and some predictive variables (age, sex, number of children, smoker(yes/no), etc.). Is there a method to test if given set of predictors is associated stronger with one of the three outcome variables?

Many thanks in advance for all comments/suggestions.
Best greetings,
Piotr Lewczuk
Tags: None
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#2

25 Aug 2016, 07:45

You might want to take a look at this post.
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1133
#3

25 Aug 2016, 08:51

Hi Piotr. Have you considered using -gsem- to estimate your model?

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment
Adriana Peci

Join Date: Nov 2014

Posts: 32
#4

25 Aug 2016, 11:07

Hi Piotr,
Yes you can run a multinomial logistic regression with three outcomes in stata . Please see the code below:
mlogit if the function in Stata for the multinomial logistic regression model.

To explain this a bit in more detail:

1-First you have to transform you outcome variable in a numeric one in which all categorise are ranked as 1, 2, 3. For your specific outcome you would have to have something like gen outcome_present=.
replace outcome_present=1 if outcome=="diabetes"
replace outcome_present=2 if outcome=="chronic_inflamation"
replace outcome_present=3 if outcome=="stroke"

2-Then you have to create all your predicted variables to dummy variables as in every logistic regression analysis.
For example for age this will be like: tab age, gen(age_cat) and you will have age_cat1, age_cat2, age_cat3, age_cat4 and so on and so forth. For sex you will be doing the same for example tab sex, gen(gender) and gender=1 will be male and gender2 =female. Note that numeric variables do not need to be transformed into a 3-dummy variable.

3-You have to have a hypothesis before throwing everything in the model:
For example if you think that younger people and male are more likely to have diabetes than inflammation or stroke then you could chose stroke as your reference variable (base=3) and diabetes (base=1) and chronic inflammation (base=2) will be the comparison ones. Younger people will be your reference for the age and gender1=male will be your reference for gender.
4-After you have created the outcome variable and all your dummy variables you could throw all variables in the model using the "mlogit" function as below:

For example:

mlogit age_group2 age_group3 age_reg4 gender2, base(1)
mlogit rrr,

Note above that base(1) is the outcome variable which I chose to be diabetes and age_cat2 and age_cat3 , age_cat4 are older group of people than age_group1 (which is my reference for age). Gender2 is the female category versus gender1=male category. So in the final model you include the comparison categorise not the reference categorise.

5-The output you receive is OR in case control studies/cross sectional studies and Relative Risk in prospective cohort studies or randomize control trials.
Different from dichotomous logistic regression your output will have two parts one comparing those who have or not have stroke against those who have or not have diabetes and the other one comparing those who have or not have inflammation against diabetes. If this is not your outcome of interest you can chose another category for reference but remember that all other comparison categorise will compare to the reference one not in between. If this is not enough please try to read multinomial logistic regression model. There is a lot of information about it in Stata page.

Hope it is helpful,

Adriana
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1133
#5

25 Aug 2016, 11:59

Hello Adriana. I believe Piotr has 3 binary dependent (outcome) variables. So using your example, diabetes, chronic inflammation and stroke would all be Yes/No variables; and therefore, there are 8 possible combinations, or 8 outcome categories if you convert it to a multinomial logistic regression problem. But based on what I read here, I'm not convinced that's a good way to go. I suggested using -gsem- to estimate the model, because it should allow for 3 separate but correlated binary outcome variables.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
1 like
Comment
Piotr Lewczuk

Join Date: Apr 2016

Posts: 59
#6

26 Aug 2016, 02:11

Joseph,
Thank you very much for pointing at this post.

Bruce,
Thanks! Your interpretation of the problem is correct (8 possibilities, not 3), and your advice very helpful.

Adriana,
Thanks for discussion but there are two issues with your approach:
1. First, your transformation of the output variable neglects possibilities of having two or three conditions simultaneously (as Bruce said): what about cases having both diabetes and stroke, etc.? In my, rather artificial, example I wanted to analyze three outcome variables, but what about if there were ten? You'd have to create 1024 levels and, if your cohort is not enormously large, most of them would be empty.
2. Second, I'm not sure if I understand your model. If you write "mlogit age_group2 age_group3 age_reg4 gender2, base(1)" then you seem to define age_group2 as an outcome variable-??? And you model age_group2 as a function of the remaining age groups-???

Next question: Why do you want to recategorize age (continuous variable by definition) into categorical (ordinal) variable? Just to have a reference category? I don't think it is a good idea, I'd rather use margins, at(age=(<value that I'm interested in>)), instead.

By the way, in the second line the coma should be before, not after, rrr (mlogit, rrr).

Thanks anyway, and I would be happy to discuss further.

Greetings,
Piotr Lewczuk
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1133
#7

26 Aug 2016, 06:56

Piotr, here are two examples that might be helpful with respect to using -gsem-:
http://www.stata.com/manuals13/semex...#semexample33g

http://www.stata.com/manuals13/semex...#semexample34g

The second example is similar to your case, except that you would have Bernoulli-logit for all of your outcomes. It does look as though you might not get any conventional multivariate tests (i.e., showing the effects of the explanatory variables on some "best" linear combination of all the outcome variables) using this approach. Rather, you might get multiple "univariate" (but simultaneously estimated) sets of results. This may still be useful to you.

HTH.

Last edited by Bruce Weaver; 26 Aug 2016, 07:27. Reason: I omitted the comment on simultaneously esimtated "univariate" results in the original version.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment
Piotr Lewczuk

Join Date: Apr 2016

Posts: 59
#8

26 Aug 2016, 07:24

Thank you, Bruce; this is very helpful!
Comment

Bruce Weaver

Join Date: May 2014
Posts: 1133

26 Aug 2016, 12:36

Hi Piotr. I found some time to play around with this a bit today, and unfortunately, it's not going to work out the way I hoped it might. I've pasted my code below so that you can see what I've tried.

Cheers,
Bruce

Code:

* Estimate two univariate logistic regression models
* simultaneously via -gsem-.
* This example is based on the example found here:
* http://www.stata.com/manuals13/semexample34g.pdf#semexample34g

use http://www.stata-press.com/data/r13/gsem_lbw, clear
describe
summarize ptl
generate byte anyptl = (ptl > 0)
label var anyptl "Any history of premature labor"
tabulate ptl anyptl

* First, estimate each model separately.
logit low age smoke ht lwt i.race ui
logit anyptl age smoke ht lwt i.race ui

* Second, estimate both models simultaneously, but
* with no correlation between the two outcome variables.
gsem (low anyptl <- age smoke ht lwt i.race ui, logit)
estat eform low anyptl // Display the odds ratios

* Now add a covariance term to allow the outcome variables to be correlated.
****************************************************************************
* I have commented out the next -gsem- command for now, because it causes
* an error, and thus causes DO file processing to terminate.    
****************************************************************************
*gsem (low anyptl <- age smoke ht lwt i.race ui, logit), cov(e.low*e.anyptl)
* This does not work--it generates an error message as follows:
// invalid covariance specification;
// e.anyptl does not identify an error for an uncensored gaussian response with
// the identity link
* Try with Normal error distribution and identity link:
gsem ///
 (low anyptl <- age smoke ht lwt ui, family(gaussian) link(identity)), ///
 cov( e.low*e.anyptl) nocapslatent
* Okay, one can have correlated errors for the model with normal errors
* & identity link, but not for bernoulli errors & logit link.

* CONCLUSION:  It appears that one gets the same results when estimating two
* logit models separately as when estimating them simultaneously via -gsem-.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Comment

Piotr Lewczuk

Join Date: Apr 2016

Posts: 59
#10

27 Aug 2016, 07:01

Hi Bruce,
Many thanks for your excellent explanations! As a matter of fact, I came to the same conclusion: two separate logit models give the same result as a common gsem, so indeed this is not much helpful. Moreover, whereas I was able to reproduce some examples from the Stata help (one of them being the same you used here), I have problems to use it on my "real" data (although separate logit's work). So, all in all, I agree - this is not the best approach for my problem. Nevertheless I've learned a lot thanks to this discussion.
Best greetings,
Piotr
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#11

28 Aug 2016, 04:59

Isn't the problem that there is no multivariate logistic distribution? If one uses a probit rather than logit link, then progress can be made: multivariate probit modelling, in this case trivariate probit, because there is a trivariate normal distribution. There is no closed form expression for this, but it can be calculated using simulation methods. I suggest having a look at -mvprobit- on SSC and SJ, and the associated article in SJ 2003 (by Cappellari & Jenkins), a free download. See also their article in SJ 2006, vol 6(2), also a free download. These models can also be fitted using -cmp- by David Roodman (on SSC)
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#12

28 Aug 2016, 19:52

Originally posted by Piotr Lewczuk View Post

Is there a method to test if given set of predictors is associated stronger with one of the three outcome variables?

It's not clear to me how an analogue of MANOVA will allow you to test whether a set of explanatory variables is more strongly associated with one of three outcomes.

You could fit three separate logistic models to your set of explanatory variables and use an index of model fit (log-likelihood, pseudo R²) to make an informal or semiformal (AIC, BIC) judgment as to relative strength of association, but that wouldn't be formal hypothesis testing, either.

If you want to test for a difference in relative strength of association of a set of explanatory variables with the three (or more) outcomes, then consider fitting a generalized structural equation model, and test for differences in magnitude between the factor loadings of the three outcome indicator variables on the latent variable (factor) that is predicted by the set of explanatory variables. An example is shown below (start after the "Begin here" comment).

.ÿ
.ÿversionÿ14.1

.ÿ
.ÿclearÿ*

.ÿsetÿmoreÿoff

.ÿsetÿseedÿ1354310

.ÿ
.ÿ/*ÿcollectedÿdataÿonÿthreeÿoutcomeÿvariablesÿinÿeveryÿpatient:ÿdiabetesÿ(yes/no),
>ÿÿÿÿstrokeÿ(yes/no),ÿchronicÿinflammationÿ(yes/no),ÿandÿsomeÿpredictiveÿvariables
>ÿÿÿÿ(age,ÿsex,ÿnumberÿofÿchildren,ÿsmoker(yes/no)ÿ*/
.ÿtempnameÿCorr

.ÿmatrixÿdefineÿ`Corr'ÿ=ÿJ(5,ÿ5,ÿ0.5)ÿ+ÿI(5)ÿ*ÿ0.5

.ÿquietlyÿdrawnormÿdiabetesÿstrokeÿinflammationÿsexÿsmoker,ÿdoubleÿ///
>ÿÿÿÿÿcorr(`Corr')ÿn(250)

.ÿforeachÿvarÿofÿvarlistÿ_allÿ{
ÿÿ2.ÿÿÿÿÿÿÿÿÿquietlyÿreplaceÿ`var'ÿ=ÿ`var'ÿ>ÿ0.5
ÿÿ3.ÿ}

.ÿgenerateÿintÿageÿ=ÿruniformint(45,ÿ65)

.ÿgenerateÿbyteÿchildrenÿ=ÿruniformint(0,ÿ6)

.ÿ
.ÿ*
.ÿ*ÿBeginÿhere
.ÿ*
.ÿgsemÿ(diabetesÿstrokeÿinflammationÿ<-ÿF,ÿlogit)ÿ(Fÿ<-ÿageÿsexÿchildrenÿsmoker),ÿ///
>ÿÿÿÿÿnocnsreportÿnodvheaderÿnolog

GeneralizedÿstructuralÿequationÿmodelÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobsÿÿÿÿÿ=ÿÿÿÿÿÿÿÿ250
Logÿlikelihoodÿ=ÿ-380.04226

---------------------------------------------------------------------------------
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ|ÿÿÿÿÿÿCoef.ÿÿÿStd.ÿErr.ÿÿÿÿÿÿzÿÿÿÿP>|z|ÿÿÿÿÿ[95%ÿConf.ÿInterval]
----------------+----------------------------------------------------------------
diabetesÿ<-ÿÿÿÿÿ|
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿFÿ|ÿÿÿÿÿÿÿÿÿÿ1ÿÿ(constrained)
ÿÿÿÿÿÿÿÿÿÿ_consÿ|ÿÿ-1.597736ÿÿÿ1.656494ÿÿÿÿ-0.96ÿÿÿ0.335ÿÿÿÿ-4.844405ÿÿÿÿ1.648933
----------------+----------------------------------------------------------------
strokeÿ<-ÿÿÿÿÿÿÿ|
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿFÿ|ÿÿÿ.7016929ÿÿÿ.2152096ÿÿÿÿÿ3.26ÿÿÿ0.001ÿÿÿÿÿ.2798898ÿÿÿÿ1.123496
ÿÿÿÿÿÿÿÿÿÿ_consÿ|ÿÿ-1.467324ÿÿÿ1.176234ÿÿÿÿ-1.25ÿÿÿ0.212ÿÿÿÿÿÿ-3.7727ÿÿÿÿÿ.838053
----------------+----------------------------------------------------------------
inflammationÿ<-ÿ|
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿFÿ|ÿÿÿ.8192105ÿÿÿ.2661413ÿÿÿÿÿ3.08ÿÿÿ0.002ÿÿÿÿÿ.2975831ÿÿÿÿ1.340838
ÿÿÿÿÿÿÿÿÿÿ_consÿ|ÿÿ-1.456878ÿÿÿ1.337811ÿÿÿÿ-1.09ÿÿÿ0.276ÿÿÿÿ-4.078939ÿÿÿÿ1.165183
----------------+----------------------------------------------------------------
Fÿ<-ÿÿÿÿÿÿÿÿÿÿÿÿ|
ÿÿÿÿÿÿÿÿÿÿÿÿageÿ|ÿÿ-.0281287ÿÿÿ.0295425ÿÿÿÿ-0.95ÿÿÿ0.341ÿÿÿÿÿ-.086031ÿÿÿÿ.0297736
ÿÿÿÿÿÿÿÿÿÿÿÿsexÿ|ÿÿÿÿ2.06479ÿÿÿ.5815782ÿÿÿÿÿ3.55ÿÿÿ0.000ÿÿÿÿÿ.9249172ÿÿÿÿ3.204662
ÿÿÿÿÿÿÿchildrenÿ|ÿÿÿ.2009465ÿÿÿ.1005151ÿÿÿÿÿ2.00ÿÿÿ0.046ÿÿÿÿÿ.0039406ÿÿÿÿ.3979525
ÿÿÿÿÿÿÿÿÿsmokerÿ|ÿÿÿ2.058773ÿÿÿ.5390329ÿÿÿÿÿ3.82ÿÿÿ0.000ÿÿÿÿÿ1.002288ÿÿÿÿ3.115258
----------------+----------------------------------------------------------------
ÿÿÿÿÿÿÿÿvar(e.F)|ÿÿÿ3.249945ÿÿÿ1.626703ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ1.21849ÿÿÿÿ8.668223
---------------------------------------------------------------------------------

.ÿ
.ÿ/*ÿIsÿthereÿaÿmethodÿtoÿtestÿifÿgivenÿsetÿofÿpredictorsÿisÿassociatedÿstronger
>ÿÿÿÿwithÿoneÿofÿtheÿthreeÿoutcomeÿvariables?ÿ*/
.ÿtestÿ_b[stroke:F]ÿ=ÿ1,ÿnotest

ÿ(ÿ1)ÿÿ[stroke]Fÿ=ÿ1

.ÿtestÿ_b[inflammation:F]ÿ=ÿ1,ÿaccumulate

ÿ(ÿ1)ÿÿ[stroke]Fÿ=ÿ1
ÿ(ÿ2)ÿÿ[inflammation]Fÿ=ÿ1

ÿÿÿÿÿÿÿÿÿÿÿchi2(ÿÿ2)ÿ=ÿÿÿÿ1.94
ÿÿÿÿÿÿÿÿÿProbÿ>ÿchi2ÿ=ÿÿÿÿ0.3787

.ÿ
.ÿexit

endÿofÿdo-file

.
Comment
Piotr Lewczuk

Join Date: Apr 2016

Posts: 59
#13

12 Sep 2016, 08:01

Thank you, Stephen and Joseph, and sorry for doing it so late, but I was off for two weeks without access to the net.
Best regards,
Piotr Lewczuk
Comment

Announcement

"Multivariate logistic regression"-?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment