
  • Repeated measurements with a binary outcome

    Dear all,
    I have a database from which the following variables are of interest for my analysis:
    id (patient id), time (the months at which biomarkers were measured: 1, 2, 3, 6, 9 and 12), csa_tac (the medication administered during the initial 12 months, csa or tac, binary), CKD_L (a biomarker measured at 1, 2, 3, 6, 9 and 12 months), PU_L (another biomarker measured at the same time points), C0_L (the level of the csa drug), T0_L (the level of the tac drug), location (where the patient was investigated, binary), diabetes (the presence of diabetes, binary, 0 - no, 1 - yes), outcome_30 (the outcome of interest, binary, 0 - no, 1 - yes)

    Code:
    id  time csa_tac CKD_L PU_L C0_L T0_L location diabetes outcome_30
    1    1      1     64    0    210   .      2        0      0
    1    2      1     52    0    244   .      2        0      0
    1    3      1     32   .9      .   .      2        0      0
    1    6      1     44    0      .   .      2        0      0
    1    9      1     63    0      .   .      2        0      0
    1   12      1     63    0      .   .      2        0      0
    2   1       1     54   .3    236   .      1        0      0
    2   2       1     58    0      .   .      1        0      0
    2   3       1     58    0      .   .      1        0      0
    2   6       1     34    0      .   .      1        0      0


    I have the following research questions:
    1. Do the CKD_L measurements influence the outcome?
    2. Is the effect of CKD_L on the outcome influenced by location, the presence of diabetes, or by PU_L?
    3. Does the level of the medication (csa or tac) influence CKD_L (I suppose I have to run this analysis separately for csa and tac), and what level could be associated with deleterious effects on the outcome?

    I have read different things (I even took an online course) about multilevel modelling, but I still don't know how to perform these analyses.

    Thank you
    Last edited by Dimitrie Siriopol; 02 Feb 2017, 09:33.

  • #2
    Dimitrie, what's the question?

    If you are looking for how to set up the code, that's relatively easy. You can use generalized estimating equations (GEE), or you can use either the xtlogit command or the mixed-effects logit command (melogit). They may use slightly different estimation routines, but fundamentally I think they are pretty similar. However, xtlogit only allows one random effect; melogit allows random intercepts and slopes at multiple levels.

    The basic setup for xtgee would be something like this:

    Code:
    xtset id time
    xtgee outcome_30 time i.csa_tac CKD_L PU_L i.diabetes, family(binomial) corr(exch)
    For mixed-effects logit, the first command below is essentially equivalent to the xtgee command, the second adds a second random intercept for location (i.e. as if patients were nested within clinics), and the third adds a random slope for time to the first:

    Code:
    melogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id:
    melogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || location: || id:
    melogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id: time
    The coefficient on time tells you the change per month in the log odds of the outcome (exponentiate it to get an odds ratio). The coefficient on csa_tac tells you the change in the log odds of the outcome for whatever group was coded as 1 (relative to the one coded 0). You may actually want to interact csa_tac and time, which gets you a sense of the change in trajectories, but if you do this, I'd urge you to use the margins post-estimation command to show the effects in probability terms (it's probably a good idea anyway when dealing with probability models).
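
    To make the interaction idea concrete, here is a sketch (the at() time points are just illustrative, not a recommendation):

    Code:
    melogit outcome_30 i.csa_tac##c.time CKD_L PU_L i.diabetes || id:
    margins csa_tac, at(time=(1 3 6 12)) predict(mu)
    marginsplot
    margins then reports predicted probabilities for each drug group at each time point, which is usually easier to present than raw coefficients.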

    If you want to see whether the outcome is associated with location (your question 2), you probably want to add a fixed effect for location. Often in healthcare research, though, the specific location would be treated as a random intercept.

    Code:
    melogit outcome_30 time i.csa_tac CKD_L PU_L i.location || id:
    One problem is that you have many missing values in C0_L and T0_L in the sample data you posted. I'm not sure if that reflects the real data. If it does, you will need to think about how to handle the missingness. If you ran either estimation command, I believe it would drop all time points with missing data, which would be a problem. As to your question 3, I'd urge you to think about the basic biology of the drugs. What do clinical trial data say about adverse effects and dosing? There are a lot of substantive questions here that you are probably in a better position to answer than any of us.
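
    To see how much listwise deletion would cost you, a quick sketch (variable names taken from your post):

    Code:
    misstable summarize CKD_L PU_L C0_L T0_L
    egen nmiss = rowmiss(CKD_L PU_L C0_L T0_L)
    tabulate nmiss
    Rows with nmiss > 0 are the observations an estimation command would drop if all four variables entered the model.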
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thank you very much for this very comprehensive answer.
      I have Stata 12, so I tried the xtmelogit command (after the xtset command). I read and read, and although I wasn't very sure I understood this multilevel analysis, I succeeded in writing a command somewhat similar to your first one (I didn't include time). Using either of these commands, the final output says:

      Code:
      Refining starting values:

      Iteration 0:  log likelihood = -179.29978
      Iteration 1:  log likelihood = -107.54816  (not concave)
      Iteration 2:  log likelihood = -107.54816  (backed up)

      Performing gradient-based optimization:

      initial values not feasible
      r(1400);


      This error is related to some numerical overflow... What should I do to overcome this issue?

      I have missing values in all the biomarker level data (it's a retrospective analysis). For the C0_L variable, the values are missing when the patient is on the tac drug, and for T0_L they are missing when the patient is on the csa drug.

      Thank you again,
      Dimi



      • #4
        Dimitrie, the error you have is related to the likelihood function. I hope people with more technical knowledge will correct me if my explanation below is wrong.

        Many likelihood functions are pretty easy to optimize, but some are not. It looks like the likelihood for the multi-level logit is hard to maximize. The specific error you got meant that the parameter values that Stata used as a starting point did not give the optimizer a plausible place to take its next iteration.

        You should post the specific command you typed. But one solution that others have suggested is to fit a simpler model, save the estimated parameters (i.e. the coefficient estimates), and then fit your final model, telling it to start optimizing from the simpler model's parameter values. Drop one random effect from your model; if your model only had one random effect (i.e. a random intercept only), then fit a plain logit model.

        For example:

        Code:
        xtmelogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id:
        matrix b = e(b)
        xtmelogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || location: || id:, from(b)
        Sources:

        https://faustusnotes.wordpress.com/2...ling-in-stata/
        http://www.stata.com/statalist/archi.../msg00906.html
        http://www.stata.com/statalist/archi.../msg00560.html

        If even that fails, use GEE. In fact, if you don't understand the model you are fitting, I would urge you to either start with GEE, which is simpler, and/or to speak to a statistician in person. It's impossible to give comprehensive help over the Internet.



        • #5
          Thank you again for your response. As you correctly noticed, I'm not a statistician (I'm actually a physician). Unfortunately, we don't have a skilled statistician in our department to ask for advice.
          My first code was
          Code:
            xtmelogit outcome_30 time i.csa_tac CKD_L || id:
          and I still got the same output.

          I have also read the help for maximize. It says there that if you have this kind of problem (not concave), one could change the standard ml algorithm to a different one... The difficult option seems appropriate for overcoming this issue (at least from the description I've read there).

          Would this be the right code for using this different algorithm?

          Code:
          xtmelogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id:, difficult

          Thank you,
          Dimi
          Last edited by Dimitrie Siriopol; 03 Feb 2017, 01:14.



          • #6
            I have also tried the xtgee command, and the model is still not converging.
            What I didn't say (and maybe this is the cause of these error messages) is that my outcome variable (outcome_30) does not change during the follow-up: it remains the same for each patient at each visit.
            Thank you



            • #7
              Originally posted by Dimitrie Siriopol View Post
              I have also tried the xtgee command and the model is still not converging.
              What I didn't say (and maybe this is the cause of these error messages) is that my outcome variable (outcome_30) is not changing during the follow-up. It remains the same for each patient at each visit.
              Thank you
              This is very likely the problem. It's a bit puzzling that the xtmelogit command didn't give you the same error. But I once did this to myself: I was running a GEE model with healthcare spending as the dependent variable, and I had accidentally set up the dataset so that every observation contained the same amount (or some similar error). The model refused to converge, but the problem resolved immediately once I looked at my data and found it. My methods professor also warned us that convergence problems can sometimes be due to data errors.

              Nonetheless, I think others have complained about the mixed-effects logit command and maximization before, so the advice about diagnosing the maximization process still stands. Back to the line you quoted from the Stata manual: you only need to worry about the "not concave" and "backed up" messages if they appear in the last iteration of the maximizer. It sounded to me like the difficult option was for the "not concave" message rather than "backed up". The difficult option is also not guaranteed to work.

              As to seeking advice from a skilled statistician, I am not a "real" statistician myself, so the better advice would be to find anyone with at least some experience with GEE or mixed models (also known as random effects or hierarchical models) to talk to. It may be better to start with GEE or a mixed logit with only one random intercept first.

              From there, the Stata manuals and their examples can be helpful in understanding what you are fitting. It may help to go to the xtmixed command manual and work through the examples: that is the linear mixed-effects model. Ignore any examples that involve crossed random effects at first; just work on the basic examples.

              A note on crossed random effects: I am assuming that all patients had one assigned location. If the same patient could be tested at multiple locations, that would need a crossed random effect. Personally, I found crossed random effects confusing, so if you don't have this situation, I would ignore it for now: first understand the basic mixed model, then the mixed model with a random slope, then more than one level of clustering. Then go on to the mixed logit model; by then, if you know regular logit, you should be OK.

              A last couple of notes. I don't know what you mean when you ask if the effect of CKD_L on the outcome is influenced by location. Is there a reason to think it would be? Go back to the code you posted and imagine your model is now a linear model, so your outcome is continuous. If you include a fixed effect for each location, you're merely shifting the intercept up or down a bit. Each location is still assumed to have the same slope (i.e. you are estimating the overall change in the outcome per unit time; everybody has the same slope). If you do actually think location matters a lot, you would probably fit an interaction between time and the dummy for location (i.e. each location now has a different slope).

              If you think that patients within each clinic are likely to be more similar to one another and you just want to treat that as noise, because you aren't specifically interested in the clinics, then you are better off using a random effects model.
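
              To make that distinction concrete, here is a purely hypothetical linear sketch (treating CKD_L as the continuous outcome, as in the thought experiment above):

              Code:
              * common slope, location-specific intercepts
              xtmixed CKD_L c.time i.location || id:
              * location-specific slopes via an interaction
              xtmixed CKD_L c.time##i.location || id:
              The first model shifts each location's intercept; only the second lets the time trend differ by location.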

              And lastly, when you add a random effect to a model, you can do a likelihood ratio test to see if it improves things. I'm going to acknowledge Carlo Lazzaro for this code:

              Code:
               
               xtmelogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id:
               estimates store a
               xtmelogit outcome_30 time i.csa_tac CKD_L PU_L i.diabetes || id: time
               estimates store b
               lrtest a b
              In this case, if the likelihood ratio test does not reject, then both models fit the data equally well, so adding the random slope did not improve model fit. In practice, it would mean that there isn't a lot of heterogeneity in patient trajectories. If the random effect you added was a clinic-level random intercept, it would mean that outcomes didn't differ that much between clinics.



              • #8
                Thank you for these very extensive explanations - when someone puts it in simple words (and not only coefficients, residuals, variances and so on...), it doesn't look that scary anymore.
                Using difficult (or any other maximization option) didn't work. The model gives me the same errors (not concave and backed up). What I am now trying to understand is where the problem lies: in my model/commands or in my database.
                When I read the examples for the xtmixed command for repeated measurements (not from the Stata manual, but from an online course, https://www.cmm.bris.ac.uk/lemma/login/index.php), the dependent variable was also measured at different time points (so a repeated variable itself). In my case, the outcome of interest is constant for each patient. So my first question: is the xtmelogit command appropriate for my analysis? My first guess is that it is....

                Another issue I thought could be related to these errors is sample size. In the initial wide form I have 271 patients, each with 6 repeated measures (for the CKD_L, PU_L, C0_L and T0_L variables), but only 25 positive outcomes (outcome_30 has 25 "1"s and 246 "0"s). Is this number of outcomes too low for appropriate estimation? And if so, are there any methods to overcome this?

                Thank you,
                Dimi



                • #9
                  Originally posted by Dimitrie Siriopol View Post

                  Using difficult (or any other maximization option) didn't work. The model gives me the same errors (not concave and backed up). What I am now trying to understand is where the problem is: in my model/commands or in my database.
                  When I've read the examples for the xtmixed command for repeated measurements (not from the stata manual, but from an online course https://www.cmm.bris.ac.uk/lemma/login/index.php) the dependent variable was also measured/estimated at different interval times (so a repeated variable itself). In my example, the outcome of interest is constant for each patient. So my first question: is xtmelogit command appropriate for my analysis? My first guess is that it is....

                  Another issue that I thought it could be related to these errors is that of the sample size. In the initial wide form I have 271 patients with 6 repeated measures for each one (for the CKD_L, PU_L, C0_L, T0_L variables), but only 25 positive outcomes (the outcome_30 has 25 "1"s and 246 "0"s). Is this number of outcomes too low for an appropriate estimation? And if so, are there any methods to overcome this?

                  Thank you,
                  Dimi
                  The number of positive outcomes is definitely not too low.

                  I clearly misunderstood you earlier. When you said that the outcome does not vary by patient, I thought you might have coded all patients' outcomes as 0. That would definitely mess up the likelihood estimation. But I think you mean that a person is coded as 1 at all time points or 0 at all time points.
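
                  A quick sketch to confirm that pattern in your data (assuming the long format you posted earlier):

                  Code:
                  bysort id (time): gen byte changed = outcome_30 != outcome_30[1]
                  egen byte any_change = max(changed), by(id)
                  tabulate any_change
                  If any_change is 0 for every patient, the outcome truly never varies within person.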

                  Does a GEE model or your first mixed model (the one with just a random intercept) estimate without entering time as a covariate? If those don't estimate, does a plain logit model?

                  More fundamentally, if the outcome doesn't change with time, then is there a need for repeated measurements over time? When you say repeated measures and when I'm given a dataset with measurements over a period of several months, I usually assume that the outcome could change at each measurement. I think most people would say the same. In this case, say a patient had a CKD level of 64 at 1 month, and they have had the outcome. If you didn't know that the outcome could not change with time, then that's an interesting discovery. If you knew it couldn't (e.g. it's something irreversible), then is there a need for any further measurements?



                  • #10
                    The number of positive outcomes is definitely not too low.

                    I clearly misunderstood you earlier. When you said that the outcome does not vary by patient, I thought you might have coded all patients' outcomes as 0. That would definitely mess up the likelihood estimation. But I think you mean that a person is coded as 1 at all time points or 0 at all time points.

                    Does a GEE model or your first mixed model (the one with just a random intercept) estimate without entering time as a covariate? If those don't estimate, does a plain logit model?

                    More fundamentally, if the outcome doesn't change with time, then is there a need for repeated measurements over time? When you say repeated measures and when I'm given a dataset with measurements over a period of several months, I usually assume that the outcome could change at each measurement. I think most people would say the same. In this case, say a patient had a CKD level of 64 at 1 month, and they have had the outcome. If you didn't know that the outcome could not change with time, then that's an interesting discovery. If you knew it couldn't (e.g. it's something irreversible), then is there a need for any further measurements?
                    The logit model works; xtgee and xtmelogit don't... However, how do I account for the individual slopes? I don't know if I can explain it very well, but one of my research questions is whether the evolution of CKD during the follow-up differs between groups (medication, location, comorbidities) and has an effect on the outcome. Should I use different interaction terms between time and the confounders?

                    I assumed, probably wrongly, that if one of the independent variables changes during the follow-up I should use this type of analysis. I am very sorry for the misunderstanding.

                    Dimi



                    • #11
                      Dimitrie, sorry for the slight delay in replying. Your question is a bit complicated!

                      Easy answer first: I believe that when time-varying covariates are involved, you want a mixed model rather than GEE, as per advice from the biostatistics professor who taught me longitudinal modeling. I was told that GEE is best for pure between-cluster effects (i.e. comparing effects between different people in your situation; a within-cluster effect here would be the effect of a change within each individual). I should have noticed that and said so earlier; my apologies for the confusion. So, your last sentence is correct.

                      Think about what "individual slopes" means in this context. If the outcome were continuous and you were in a linear mixed model, you would be using a random slope to account for heterogeneous individual trajectories in how the outcome evolved:

                      Code:
                      xtmixed CKD_L i.drug time X || id: time
                      xtmixed CKD_L i.drug##c.time X || id: time
                      The coefficient on time gives you the overall mean change, but each individual is allowed their own slope. (Note for others: xtmixed is the old syntax for Stata 12 and earlier; converting this to mixed should work fine for 13 and 14.) If the outcome were binary, the model still works; the slope on time now represents the change in the log odds of developing the outcome at each time point. So yes, you're on the right track.

                      An interaction between time and, for example, the medication, means that you are testing if the slope of time differs by type of medication. Generally, I was advised to first fit the main effects model with no interaction terms, then go test if you have interactions.

                      It sounds like one of your research questions is whether CKD levels evolve differently by drug group; if so, you should go test that, which is what the code above would do. The thing is, now we are getting into territory where subject-matter knowledge dominates, and I have no specific familiarity with this subject, so I shall just say that you need to make sure your conceptual model is clear (i.e. what you are testing and why), and I'll say no more on that subject.

                      As to the binary outcome, it still sounds like all patients have all 1s or all 0s, correct? If so, and if you are in fact trying to test whether changing CKD levels influence that outcome, then I fear that's impossible with your data. I guess you could keep just the first observation at time = 0 (or whatever the lowest time is), but that's conceptually problematic. Is this a coding problem, or is that actually how patients were measured? What, exactly, is the outcome?
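
                      If you did go the one-observation-per-patient route (conceptually problematic, as noted), the mechanics would be something like:

                      Code:
                      bysort id (time): keep if _n == 1
                      logit outcome_30 CKD_L i.csa_tac i.diabetes i.location
                      This discards the longitudinal information, so it can only answer a baseline question.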

