
  • Adjusting Covariates when using Marginal Effects at Representative Values

    Dear all,
    I'm employing marginal effects in a multivariable regression to look at how changes in a continuous IV (log_avl) affect a continuous DV (Ischemia_Time_Min), as modulated by a third, binary IV (it_type3). I'm having an issue interpreting how the other IVs are handled in the model. The code is below.

    regress Ischemia_Time_Min c.log_avl i.it_type3 c.eGFR_pct2 i.agecat male race i.bmicat i.cci_cat Auto_CKD_Preop i.renal i.clavien_cat
    margins it_type3, at(log_avl=(-2(1)6))
    marginsplot, recastci(rarea)

    [Attached image: Ischemia Time as function of EVL and type.jpg]


    In this model, my interpretation for calculating AAPs in the context of the other covariates is as follows: it is predicting Y (Ischemia_Time_Min) for each value of log_avl as if everyone had an it_type3 of 0 or 1, leaving all other covariates as is. If I instead use the -atmeans- option, I get a similar graph. Here, however, it is setting the population to be exactly average in all respects except for it_type3 (0 or 1) at each value of x (log_avl). Leaving the differences between AAPs and APMs aside for a second, what's clear when I observe the output from -atmeans- is that all other IVs are being set to the same value, irrespective of the value of x (log_avl).

    I am concerned that this might not be what I want, because it is not predicting different values of each IV at each value of x (log_avl). In other words, I know the data well enough to know, for example, that as x (log_avl) changes, there are changes in the IV eGFR_pct2. Therefore, in the graph above, I don't want eGFR_pct2 to be exactly the same value at each point of x (log_avl); rather, I want its predicted value to change as a function of x (log_avl). Bottom line, I want to be able to say that at an x-axis value of 2, for example, I'm predicting an ischemia time of __ and __ for an it_type3 of 0 or 1, respectively, while predicting eGFR_pct2 (and all other IVs) at that given x value and setting them to be equal across it_type3.

    Is there a way to do that? Or am I incorrect in my interpretation of exactly what's being done with the other IVs in this model?

    Appreciate any help.

    Best,
    Julien



  • #2
    In this model, my interpretation for calculating AAPs in the context of the other covariates is as follows: it is predicting Y (Ischemia_Time_Min) for each value of log_avl as if everyone had an it_type3 of 0 or 1, leaving all other covariates as is. If I instead use the -atmeans- option, I get a similar graph. Here, however, it is setting the population to be exactly average in all respects except for it_type3 (0 or 1) at each value of x (log_avl). Leaving the differences between AAPs and APMs aside for a second, what's clear when I observe the output from -atmeans- is that all other IVs are being set to the same value, irrespective of the value of x (log_avl).
    That's exactly correct!

    As for whether you want to use -atmeans- or not, neither approach is right or wrong. It's a matter of which is suited to your research goals. When you use -atmeans-, you are predicting outcomes for a person who is average in all respects other than log_avl and it_type3. Variation in outcome attributable to anything else is completely suppressed. Note also that because you have some categorical predictors in your model, this hypothetical person who is average in all other respects does not exist: you can't be fractionally male or belong in different proportions to different age categories. That doesn't mean it can't be useful to do this: it gives you a sense of the pure effects of log_avl and it_type3 when all other sources of variation are entirely eliminated.

    If you don't use -atmeans-, then you are getting predictions that are adjusted for these other sources of variation, rather than predictions based on suppressing that variation. These predictions give you expected outcomes for a population that has the same distribution of these other sources of variation as the distribution in your data set. They are not predictions for any individual in that population, nor for a hypothetical totally average individual in that population.

    Since your model is linear and contains no interaction terms, the expected values come out the same either way, but the uncertainty around them is different. The decision as to which is more suitable depends on your purposes. But your understanding of these approaches seems perfectly accurate, so I think the only problem you face is making up your mind what you want to predict.
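
    For concreteness, a minimal sketch of the two variants described above, reusing the model from #1 (illustrative only; the -atmeans- option on the last line is the only difference):

    regress Ischemia_Time_Min c.log_avl i.it_type3 c.eGFR_pct2 i.agecat male race i.bmicat i.cci_cat Auto_CKD_Preop i.renal i.clavien_cat

    * average adjusted predictions: other covariates left at their observed values
    margins it_type3, at(log_avl=(-2(1)6))

    * adjusted predictions at the means: other covariates fixed at their sample means
    margins it_type3, at(log_avl=(-2(1)6)) atmeans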



    • #3
      Thanks Clyde!

      That helps me feel fairly confident about the distinction between AAPs and APMs. I guess what I'm wondering is as follows:

      In the marginsplot in #1, I'm looking at MPRs. However, with the margins command, it is calculating an APM or AAP by suppressing or adjusting, respectively, for all covariates, irrespective of the representative value of x (log_avl). Looking at the output with APMs, this becomes clear (first 2 values of log_avl shown for brevity):

      1._at : log_avl = -2
      0.it_type3 = .1750369 (mean)
      1.it_type3 = .8249631 (mean)
      eGFR_pct2 = 81.99006 (mean)
      1.agecat = .204579 (mean)
      2.agecat = .4320532 (mean)
      3.agecat = .3633678 (mean)
      male = .6144756 (mean)
      race = .8677991 (mean)
      1.bmicat = .1787297 (mean)
      2.bmicat = .3552437 (mean)
      3.bmicat = .2518464 (mean)
      4.bmicat = .2141802 (mean)
      1.cci_cat = .4963072 (mean)
      2.cci_cat = .3530281 (mean)
      3.cci_cat = .1506647 (mean)
      Auto_CKD_P~p = .161743 (mean)
      1.renal = .3397341 (mean)
      2.renal = .5051699 (mean)
      3.renal = .155096 (mean)
      0.clavien_~t = .7710487 (mean)
      1.clavien_~t = .16839 (mean)
      2.clavien_~t = .0605613 (mean)

      2._at : log_avl = -1
      0.it_type3 = .1750369 (mean)
      1.it_type3 = .8249631 (mean)
      eGFR_pct2 = 81.99006 (mean)
      1.agecat = .204579 (mean)
      2.agecat = .4320532 (mean)
      3.agecat = .3633678 (mean)
      male = .6144756 (mean)
      race = .8677991 (mean)
      1.bmicat = .1787297 (mean)
      2.bmicat = .3552437 (mean)
      3.bmicat = .2518464 (mean)
      4.bmicat = .2141802 (mean)
      1.cci_cat = .4963072 (mean)
      2.cci_cat = .3530281 (mean)
      3.cci_cat = .1506647 (mean)
      Auto_CKD_P~p = .161743 (mean)
      1.renal = .3397341 (mean)
      2.renal = .5051699 (mean)
      3.renal = .155096 (mean)
      0.clavien_~t = .7710487 (mean)
      1.clavien_~t = .16839 (mean)
      2.clavien_~t = .0605613 (mean)


      However, in reality, I'd expect these average values to change with each value of x (log_avl). For example, knowing the data, I'd expect that eGFR_pct2 will decrease as x (log_avl) increases. Is there a way to use margins to calculate a mean that changes with each value of x (log_avl)? I want to be able to say that at each value of x (log_avl), I'm getting a predicted Y (Ischemia_Time_Min) for a given it_type3, and that the model predicts and adjusts or suppresses for all other covariates at that given value of x (log_avl).

      Thanks again

      Julien



      • #4
        Dear statalist,
        Any thoughts as to my response in #3? Am I thinking about this incorrectly by expecting that I will need to adjust/suppress different values of each covariate at each corresponding value of x (log_avl) in the MPR model? Thanks

        Julien



        • #5
          Well, if you want an individual predicted value for every observation in the data set using its observed values on all variables, you should use -predict-, not -margins-. But I'm not sure that's what you mean.

          Maybe you want -margins, over(log_avl)-? That would give you, for each value of log_avl, an average predicted outcome among all observations having that value of log_avl with all other variables held at their observed values. That sounds kind of like what you're describing. I've never seen it used with a continuous variable, but there's no prohibition against it. It's hard for me to see how it would be useful, but that's up to you.
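
          A minimal sketch of the two possibilities mentioned above, assuming the model from #1 is still the active estimation (illustrative only):

          * one predicted value per observation, using that observation's own covariate values
          predict yhat_ind, xb

          * average prediction among the observations sharing each value of log_avl,
          * with all other variables held at their observed values
          * (note: -over()- expects a nonnegative, integer-valued variable)
          margins, over(log_avl)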



          • #6
            Hmm, that may be it, Clyde. Thanks again. However, log_avl has negative and non-integer values so it doesn't work within the over command.

            Instead of log_avl, let's use the interval independent variable Auto_RENAL_score, which goes from 4 to 12. What I think I understand from what you're saying and from what I've read is that if I use over(Auto_RENAL_score), then at each respective value of Auto_RENAL_score I'll be getting a predicted value of y (Ischemia_Time_Min) that either adjusts for or suppresses the other covariates at that value, depending on whether I use AAPs or APMs, respectively. Is that correct? If I'm using the over command, should I then use vce(robust) in the original regression and vce(unconditional) in the margins command, to treat each group like a subpopulation?

            If I use APRs instead of the over command, however, then it seems as though the margins command will adjust or suppress for covariates independently of the respective values of Auto_RENAL_score? Hence why the output in #3 gives me the same value for each covariate?


            If all the above is accurate, what I think I want to use within my study is the over command, because it accurately reflects the predicted values at each level of the key independent variable. However, if the covariates are being adjusted for regardless of the method used, does it really matter?? This is where I start to confuse myself.


            Let's compare using the over command (a) to using MPRs (b):

            regress Ischemia_Time_Min c.log_avl c.eGFR_pct2 c.Auto_RENAL_score c.Preop_GFR i.it_type3 c.EBL_OR i.agecat male race i.bmicat i.cci_cat, vce(robust)

            /* a */
            margins, over(Auto_RENAL_score) at(it_type3=(0 1)) vce(unconditional)
            marginsplot, x(Auto_RENAL_score) recastci(rarea) name(a)

            /* b */
            margins it_type3, at(Auto_RENAL_score=(4(1)12))
            marginsplot, x(Auto_RENAL_score) recastci(rarea) name(b)

            graph combine a b, ycommon

            [Attached image: Margins wam_cold over_at.jpg]



            It's interesting to note that the graph in (a) is not a straight line, compared to (b), which is giving me APRs. I'm still scratching my head as to whether (a) or (b) answers the question I have.

            Julien



            • #7
              First, at the risk of being pedantic, -over()- is not a command. It is an option within a command. It's hard to think clearly even when using careful language; it's impossible when using loose language.

              The graphs you made nicely illustrate what using -over()- does. In the second graph, all variables in the data set are held at their observed values except for Auto_RENAL_score and it_type3. Those two variables are set to the combinations of values you specified in your -margins- command, and predicted values are computed for every observation in the data set. Those predicted values are then averaged over the entire data set. Because your model is linear, and because all of the variables other than Auto_RENAL_score and it_type3 are the same in all of the calculations, the linear relationship imposed by the regression model shines through clearly: Auto_RENAL_score and it_type3 are the only sources of variation across the graph.

              In the first graph, where you used the -over()- option, what happens is very different. For each value of Auto_RENAL_score -margins- selects for calculation only those observations for which Auto_RENAL_score takes on that value: the rest of the data are excluded from the calculation. Within that restricted data set, the other variables are left at their observed values, and the value of it_type3 is set first to 0 and then to 1, Auto_RENAL_score being held at the value in question at the moment. Predicted outcomes are calculated, and then averaged within the restricted data set. -margins- then moves on to the next value of Auto_RENAL_score and does this again. Thus, with the -over()- option specified, the other variables are adjusted to a different distribution at each value of Auto_RENAL_score, specifically, they are adjusted to the distribution of those variables conditional on the value of Auto_RENAL_score. This is why the graph no longer follows a straight line: the adjustment is done differently at each level of Auto_RENAL_score and that variation in the adjustment is non-linear and is superimposed on the linear variation from it_type3 and Auto_RENAL_score itself. And this, I believe, corresponds to what you asked for in #4.
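
              If it helps to see the mechanics, here is a rough by-hand approximation of what -over()- is doing, using an -if- qualifier to restrict the averaging to one level of Auto_RENAL_score at a time (a conceptual sketch only, run after fitting the regression from #6; the standard errors are not necessarily handled identically to the -over()- version):

              levelsof Auto_RENAL_score, local(levels)
              foreach l of local levels {
                  display as text "Auto_RENAL_score = `l'"
                  * average the predictions over only the observations with this value of
                  * Auto_RENAL_score, other covariates left at their observed values
                  margins if Auto_RENAL_score == `l', at(it_type3=(0 1))
              }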

              As for which approach to the data is right for your purpose, that depends on the conclusions you are trying to reach and the uses to which you wish to put your model. In principle, either could be correct in different circumstances, though I have to say, I have never encountered in real life a situation in which (a) is appropriate.

              As for the choice between -vce(unconditional)- and -vce(delta)-, this, again, depends on your context and how you view your data. If this is an experiment where all of the values of the covariates are controlled and chosen, then -vce(delta)- would be the necessary choice. If this is purely an observational study and you are trying to generalize to a larger population from which this is a sample, then -vce(unconditional)- is the right choice. It really boils down to this: -vce(delta)- calculates standard errors taking the observed values of the data as a given, whereas -vce(unconditional)- calculates them assuming that the observed values of the data are in fact a random sample.
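
              For concreteness, a sketch of the two choices after the vce(robust) regression in #6 (vce(robust) or vce(cluster) in the estimation command is what makes -vce(unconditional)- available in -margins-; the point estimates are the same either way, only the standard errors differ):

              * standard errors taking the observed covariate values as given (the default)
              margins it_type3, at(Auto_RENAL_score=(4(1)12)) vce(delta)

              * standard errors treating the data as a random sample from a larger population
              margins it_type3, at(Auto_RENAL_score=(4(1)12)) vce(unconditional)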



              • #8
                Much appreciated, Clyde, thank you. Really interesting how the graph shows how the -over()- option introduces non-linear variation on top of the linear variation of it_type3 and Auto_RENAL_score.

                Here's the scenario. It's an observational study in which y (Ischemia_Time_Min) is a surgical outcome that is influenced by multiple IVs. One of those variables is Auto_RENAL_score, which has a specific value for each individual in the study. In the study population at large, I expect that as Auto_RENAL_score increases, Ischemia_Time_Min will increase as well. What I'm interested in determining is, at each level of Auto_RENAL_score, how it_type3 affected the prediction of Ischemia_Time_Min, while adjusting for all other covariates.
                In other words, for a given individual with a given Auto_RENAL_score, what was the effect of the conditioning variable it_type3 on predicted Ischemia_Time_Min, while controlling for all other covariates? It seems like (a) might be appropriate, because the values of the covariates are in fact conditional on the value of Auto_RENAL_score?

                Best,
                Julien



                • #9
                  Well, I don't think what you say in #8 settles the question. After all, in almost any observational study, the covariates will differ somewhat across different values of a selected variable. The question then is how much of that variation in the covariates is signal and how much is noise. If the variation in the covariates is largely a matter of sampling error, then using -over()- is overfitting (no pun intended) your model to noise in the data. On the other hand, if this variation is systematic and predictable, then using approach a) might be reasonable, particularly if the value of Auto_RENAL_Score would be known prospectively to anybody trying to use your model for prediction purposes.

                  If you decide to go with approach a), I would urge you to be careful about your terminology. Referring to these predictions as "adjusted" for other covariates will lead to misunderstanding because most people will interpret that as meaning you used option b). I think you need to say that these are Auto_RENAL_Score-specific predictions averaged over Auto_RENAL_Score-specific associated covariate distributions. That's long and cumbersome, but it's accurate and will not be misunderstood.

                  The other linguistic caution I would urge on you is to be very clear that what you are showing in graph a) is not the direct effect of Auto_RENAL_Score on ischemia time, but is the joint effect of Auto_RENAL_Score and associated changes in other variables that covary with it, and in particular the findings are not adjusted for those other covariates in the usual sense of the word. Accordingly, it would be an error to compare the different points on graph a) with each other, whereas those on graph b) can be compared with each other.

                  To make my point a little clearer, here's something analogous to what you're doing in a). Suppose I want to study the relationship between systolic blood pressure (SBP) and renal function (say as measured by eGFR). Suppose also that I am doing this study globally, so there are salient differences among nations, such as differing age distributions, differing access to medical care, different dietary patterns, different income distributions, etc. The usual approach to this would be to adjust the analysis for all of those additional variables, standardizing them to some common distribution (which, if done by regression and -margins-, would be the pooled distribution across the countries in the study). You would then calculate and graph predicted eGFR vs SBP at selected interesting values of SBP. You would describe your results as showing the relationship between renal function and SBP adjusted for those other things. And you would be able to compare the results from one country with those of another and say that you are looking at the "pure" effect of SBP, disentangled from the effects of these other variables. This is approach b).

                  What approach a) would do is, within each country, adjust the results separately for age, medical care access, dietary patterns, and income distributions. The results would be separate relationships in each country, and it would not be meaningful to compare one country's results with another, at least not for the purposes of saying something about the effect of SBP on eGFR, because some appreciable part of the difference would be attributable to the other variables.
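
                  A purely hypothetical sketch of the analogy, with made-up variable names (egfr, sbp, country, age, income are illustrative, not from the thread; country is assumed to be an encoded numeric variable):

                  regress egfr c.sbp i.country c.age c.income, vce(robust)

                  * approach b): other covariates standardized to the pooled distribution across countries
                  margins country, at(sbp=(100(20)180))

                  * approach a): other covariates averaged separately within each country's own distribution
                  margins, over(country) at(sbp=(100(20)180)) vce(unconditional)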

                  Again, if this suits your purposes and if it doesn't constitute overfitting the data, it could be the correct thing to do. But, as I said earlier, I've never (before) encountered a real world situation where this was what was needed.



                  • #10
                    Thanks for the detailed explanation, and for your example. This discussion helped disentangle a lot of the subtleties in how these commands and options alter the results and interpretations. I think ultimately what I need is in (b). Much appreciated!

                    Julien



                    • #11
                      One additional question that gets a little more into basic statistics, but I wanted to be sure held true within the framework we discussed above:

                      Is it acceptable to use a covariate in the regression model in #6 that temporally comes after the given outcome? In other words, is it OK to use an "outcome" as a "predictor"? I would think so, given that I'm implying associations only, and not causality. To explain what I mean: the covariate eGFR_pct2 is a calculation of GFR preservation that is determined 72 hours after the above surgery has happened, and which is in fact predicted to some extent by the Y outcome in the model above, Ischemia_Time_Min.

                      eGFR_pct2 is of critical importance because I'd like to be able to state that at each value of Auto_Renal_score, there was a given predicted difference in Ischemia_Time_Min between ischemia_type3==0 and ischemia_type3==1, and that eGFR_pct2 was no different between the two conditions of ischemia_type3.

                      Again, many thanks.

                      Julien



                      • #12
                        Well, as long as you make it very clear that you are not looking at causality, it is not out of the question to include as a "predictor" something that follows the "outcome" in time. But I don't quite get from your explanation why you want to do this. To demonstrate "that eGFR_pct2 was no different between the two conditions of ischemia_type3" to me suggests that you want to do a t-test of eGFR_pct2 by ischemia_type3, or perhaps a regression that enables you to adjust for effects of other variables.
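
                        A hedged sketch of the two checks suggested above, assuming ischemia_type3 refers to the it_type3 variable from the earlier models and using an illustrative subset of the earlier covariates:

                        * unadjusted comparison of eGFR_pct2 between the two ischemia types
                        ttest eGFR_pct2, by(it_type3)

                        * or a regression that adjusts that comparison for other covariates
                        regress eGFR_pct2 i.it_type3 i.agecat male race i.bmicat i.cci_cat, vce(robust)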

                        Perhaps what you meant is that you want to conclude from your original regression that the relationship between ischemia time and ischemia_type3 still holds after adjusting for possible differences in the eGFR_pct2 outcome in the two procedures. Or, more precisely, what you really want to say is that the relationship holds after adjusting for differences in the extent to which the two types of procedures preserve renal function, and you have no way to measure that before the procedure ends, but must use a measurement obtained later as a proxy for that. I think it is OK to do this, so long as you make it very clear to your audience that this is what you are doing.

