Mediation analysis panel data

Frits de Wrede

Join Date: Apr 2022

Posts: 5
#1

12 Apr 2022, 11:29

Hello,

For my research I need to conduct a mediation analysis on panel data. My data setup is as follows: it is an (unbalanced) panel dataset covering about 50 countries and 14 years. I have one dependent variable (dv) and one independent variable (iv), and I want to estimate how much of the iv's effect runs through two different mediator variables. In addition, I include 5 control variables.

I have seen loads of similar questions on Statalist and other forums. Unfortunately, the majority remain unanswered and the others do not arrive at a clear solution. So I am starting to doubt whether this is possible at all (at least in Stata).

So my question in short: is anyone aware of a Stata program or command that is capable of (multiple) mediation analysis and suitable for panel data? Additionally, I want to add both country and year fixed effects to this model.

Kind regards,

Frits
Tags: fixed effects, mediation, panel data
Maxence Morlet

Join Date: Mar 2021

Posts: 650
#2

12 Apr 2022, 11:55

https://ideas.repec.org/c/boc/bocode/s457294.html

https://www.stata.com/symposiums/bio...1_Bellavia.pdf
Frits de Wrede

Join Date: Apr 2022

Posts: 5
#3

12 Apr 2022, 12:16

Hello Maxence,

Thank you for your reply and references. Even though both references show mediation analysis for Stata, neither of them mentions panel data. The main issue for me is finding a mediation analysis method that is suitable for panel data. Am I right that following the approaches in these references would not be appropriate because of my panel data structure?

Kind regards,

Frits
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

12 Apr 2022, 15:05

I have some limited experience with modeling mediation in longitudinal data. At the risk of being nihilistic, my experience urges me to caution you that this is an extremely complicated problem and one that you should avoid altogether unless you can very precisely pin down your exact mechanistic hypothesis in advance.

The problem arises from the fact that in longitudinal data, every explanatory variable has two separate effects on the outcome: a within-panel effect and a between-panels effect. The most commonly used models do not deal with this. The usual fixed-effects models simply ignore the existence of between-panel effects and estimate purely within-panel effects. The usual random-effects models simply conflate the two and estimate a weighted average of them. (If you like, you can think of this as the random-effects models assuming that the two effects are necessarily the same and any observed difference is chance rather than systematic.) Now, there are circumstances where these approaches make sense, but I have seen both techniques used in situations where their appropriateness is not clear, or even implausible--indeed, I have made that error myself on more than one occasion.

This problem becomes thornier still when mediation is considered. The within-panel effects of a variable can mediate both the within- and between-panel relationships of another explanatory variable and an outcome. And they may do that to different degrees, and even in opposite directions. The same is true of the between-panel effects of each candidate mediator. If you then throw in additional mediators, the number of potential causal paths among these variables grows even larger. In fact, it grows combinatorially in the number of mediators. Even if you have a data set large enough to enable you to reasonably estimate coefficients along all these paths, just reading and understanding the output is a problem that scales poorly with the number of variables involved.

So if you are going to go down this path at all, I strongly urge you to draw a directed acyclic graph of the relationships among these variables (path diagram) that includes both within-panel and between-panel relationships and try to severely pare down which of the mediating pathways need to be explored in order to accomplish your research goals. Otherwise you are likely to find yourself on a rickety raft in a turbulent sea of output. I've been there, and it isn't fun!

Sorry to be discouraging, and sorry for not providing any positive recommendations on coding, but really you shouldn't even think about programs and commands until you first designate a workable model.

Last edited by Clyde Schechter; 12 Apr 2022, 15:10.
Frits de Wrede

Join Date: Apr 2022

Posts: 5
#5

12 Apr 2022, 17:05

Hello Clyde,

Thank you for your honest advice. I was already afraid this might be the case, as it explains why it is so hard to find other people who have done it while many are asking. I have one follow-up question regarding the validity of using another method (gsem); I think a short summary of my research approach will help to clarify.

The figure below summarizes my approach. In addition, 5 control variables that explain variation in subjective well-being are added to the model. My panel data contain 48 countries and 14 years, for a total of 521 observations. I am interested in the effect of structural transformation on subjective well-being. Besides that, I want to examine whether this effect is mediated through its effect on economic development and inequality, and most importantly the rough sign and significance of these indirect effects. This is for a master's thesis in economic development, and the statistical skills required for panel data mediation seem to go beyond that level.

I already estimated this model using gsem, inserting country and year fixed effects as additional 'controls', but this does not account for the panel structure of the data. Below is the code I used for my estimation; the country and year fixed effects are the last two equations, (i.countryid -> Lifeladder, ) and (i.year -> Lifeladder, ).

* Run from a do-file: /// continues the command on the next line
gsem (EMP -> lgdp, ) (EMP -> ten, ) (EMP -> Lifeladder, ) ///
    (lgdp -> Lifeladder, ) (ten -> Lifeladder, ) ///
    (support -> lgdp, ) (support -> ten, ) (support -> Lifeladder, ) ///
    (health -> lgdp, ) (health -> ten, ) (health -> Lifeladder, ) ///
    (freedom -> lgdp, ) (freedom -> ten, ) (freedom -> Lifeladder, ) ///
    (generosity -> lgdp, ) (generosity -> ten, ) (generosity -> Lifeladder, ) ///
    (perception -> lgdp, ) (perception -> ten, ) (perception -> Lifeladder, ) ///
    (i.countryid -> Lifeladder, ) (i.year -> Lifeladder, ), ///
    vce(robust) startvalues(fixedonly) nocapslatent

I was wondering whether these regression results can be used in any meaningful way to discuss the potential presence of a mediation effect in this paper. Could the unsuitability of gsem for panel data be mentioned as a (heavy) limitation, or is it so serious that the results would be too unreliable to be worth mentioning? I am not interested in the exact size of the coefficients; rather, I want to explore whether the model can confirm the presence and sign of the indirect BC and DE effects.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

12 Apr 2022, 17:29

Well, I think it is problematic. When you use a two-way fixed effects model, the effects you are estimating are a weighted average of the within- and between-panel effects. See https://journals.plos.org/plosone/ar...l.pone.0231349. This is fine if you have external evidence that within- and between- effects are one and the same thing and any observed differences are noise. But do you have any such evidence? If not, you are estimating a somewhat complicated set of estimands and even assuming you manage to figure out for yourself what it all means, I think it will defy explanation to others.

Now, one thing you can do is to create new variables from the ones shown in your diagram. One set of the new variables comes from de-meaning them all around the panel mean (these variables will represent the within-panel differences), and the other set consists of the panel means of the variables themselves (representing between-panel differences). And since you seem to want to deal with time fixed effects, you also have to make a set of variables that is de-meaned within years and another set that consists of the means across years. You can then make a model out of these variables and run -gsem- (without the i.countryid and i.year effects), but you will find it is very unwieldy unless you can specify limited roles of the within- and between-effects along the many different possible paths. (And neither here nor in #4 did I stray into the further complications that arise if we consider the possibility of interactions or lagged effects.)
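As a rough illustration of the variable construction just described, the sketch below builds the between (panel mean) and within (de-meaned) versions of two variables, plus their year analogues. The data are simulated stand-ins, and the suffixes _b, _w, _yb and _yw are naming choices made up for this example; nothing here comes from the actual dataset in #5.

```stata
* Simulated stand-in for the panel in #5 (48 countries x 14 years); all values are made up
clear
set seed 12345
set obs 48
generate countryid = _n
expand 14
bysort countryid: generate year = 2005 + _n
generate double EMP  = rnormal()
generate double lgdp = 0.5*EMP + rnormal()

* Between-country component: the country mean, constant within a country
* Within-country component: deviation of each observation from its country mean
foreach v of varlist EMP lgdp {
    bysort countryid: egen double `v'_b = mean(`v')
    generate double `v'_w = `v' - `v'_b
}

* Year analogues: the mean across countries in each year, and deviations from it
foreach v of varlist EMP lgdp {
    bysort year: egen double `v'_yb = mean(`v')
    generate double `v'_yw = `v' - `v'_yb
}

* Sanity check: the within component averages to zero inside every country
bysort countryid: egen double chk = mean(EMP_w)
assert abs(chk) < 1e-8
```

The same loop would need to cover every variable in the diagram, which is where the number of candidate paths starts to multiply.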

I don't want to tell you it can't be done. It can. But I think you will spin your wheels working off a diagram such as the one you show in #5. You need a more ramified diagram that has a pair of arrows (one within and the other between) where each of A, B, C, D, and E are, and then carefully specify, based on your best understanding of the real-world data generating process, which of the 48 pathways through that model can be excluded from consideration. Or, if you have enough data, you can run it with all of the pathways and then spend a lot of nerve-wracking time trying to figure out what the results are telling you. Either way, it seems very daunting to me. Frankly, it seems beyond the level of a doctoral dissertation, let alone a master's thesis.

Added: Now, take the following with a grain of salt, because every institution has its own expectations. But if I were your master's thesis advisor, I would recommend not using panel data here. I would recommend instead picking one early cross-section for the Structural Transformation variable, a somewhat later cross-section for the economic development and inequality variables, and yet a later cross-section for the subjective well-being outcome variable, and combining those into one observation for each of your current panels. (This assumes you have at least some sense of the time scales over which these effects act.) Then the time dimension is flattened out and everything simplifies to a standard mediation analysis (which, in my view, is already difficult enough!)
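A minimal sketch of how such a flattened dataset might be assembled is shown here. It assumes a two-year gap between the iv and the outcome and a one-year gap for the mediators; those lag lengths, the outcome year 2009, and the names EMP_lag2, lgdp_lag1 and ten_lag1 are placeholders for illustration, and the data are simulated.

```stata
* Simulated stand-in panel; variable names follow #5, values are made up
clear
set seed 2022
set obs 48
generate countryid = _n
expand 14
bysort countryid: generate year = 2005 + _n
generate double EMP        = rnormal()
generate double lgdp       = 0.5*EMP + rnormal()
generate double ten        = -0.3*EMP + rnormal()
generate double Lifeladder = 0.4*lgdp - 0.2*ten + rnormal()

* Declare the panel so that the lag operators (L1., L2.) respect country and year
xtset countryid year

* Assumed timing: iv two years before the outcome, mediators one year before
generate double EMP_lag2  = L2.EMP
generate double lgdp_lag1 = L1.lgdp
generate double ten_lag1  = L1.ten

* Keep one row per country: the outcome year with its lagged explanatory variables
keep if year == 2009
keep countryid Lifeladder EMP_lag2 lgdp_lag1 ten_lag1
isid countryid
```

In an unbalanced panel, a country with a missing year would get a missing lagged value and drop out of the estimation sample, so it is worth counting how many countries survive for each choice of years.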

Last edited by Clyde Schechter; 12 Apr 2022, 17:35.
Frits de Wrede

Join Date: Apr 2022

Posts: 5
#7

13 Apr 2022, 05:07

Thank you very much for your clarifications and recommendation. This panel data approach indeed seems too difficult to pursue. Below I have listed a follow-up question regarding your proposed alternative, after which I outline my revised methodological approach based on this.

The alternative you propose sounds more approachable. If I understand this correctly, I would run a cross-sectional regression on a restructured dataset containing e.g. the iv for year 2007, the mv for year 2008, and the dv for year 2009, and perhaps play around with different combinations of relative time differences and the initial year. Is it correct that doing so would allow me to use the same gsem approach and code I mentioned in #5? Below I have visualized this equation using the Stata gsem model builder (with controls affecting both mediators and the dv). Would this cross-sectional approach then allow me to include i.countryid and i.year, the last two equations in the code below (if country and year fixed effects are present at all in the cross-section data)?

Taking this approach, I will first run a general fixed effects regression suitable for panel data and, if this regression shows that the independent variable and potential mediator(s) are significant, further examine their relationship using a cross-sectional mediation approach. I would then mention that the mediation analysis is performed on a cross-section because panel data mediation is beyond the scope of the paper, and is perhaps interesting as a follow-up.

* Run from a do-file: /// continues the command on the next line
gsem (EMP -> lgdp, ) (EMP -> ten, ) (EMP -> Lifeladder, ) ///
    (lgdp -> Lifeladder, ) (ten -> Lifeladder, ) ///
    (support -> lgdp, ) (support -> ten, ) (support -> Lifeladder, ) ///
    (health -> lgdp, ) (health -> ten, ) (health -> Lifeladder, ) ///
    (freedom -> lgdp, ) (freedom -> ten, ) (freedom -> Lifeladder, ) ///
    (generosity -> lgdp, ) (generosity -> ten, ) (generosity -> Lifeladder, ) ///
    (perception -> lgdp, ) (perception -> ten, ) (perception -> Lifeladder, ) ///
    (i.countryid -> Lifeladder, ) (i.year -> Lifeladder, ), ///
    vce(robust) startvalues(fixedonly) nocapslatent

Joseph L. Staats

Join Date: Aug 2015

Posts: 28
#8

13 Apr 2022, 09:58

Frits,

I don't know of a way to do what you want completely in Stata, but I believe you can accomplish everything in Stata except calculating the confidence interval of the indirect effect of your independent variable through its mediation pathways. My suggested approach is to run fixed effects panel regressions (e.g., xtreg, fe or xtscc, fe) with control variables on each stage of your two separate pathways (you mentioned two parallel mediators, I believe). If you have only one sequential mediator, that means you will get a coefficient and standard error for the effect of the independent variable on the mediator and for the effect of the mediator on the dependent variable. Your indirect effect, using the "product of the coefficients" method, is calculated by multiplying the two coefficients.

Now your task is to calculate the confidence interval to determine whether the indirect effect is statistically significant. I suppose with some effort you could find a way to do this in Stata, but Kris Preacher has a website that shows you how to do this using the Monte Carlo Method for Assessing Mediation with a simple R program: http://quantpsy.org/medmc/medmc.htm This program only allows one sequential mediator, but if you are interested in two or more mediators in sequence, I have modified (with past help from Kris) the R program to accommodate as many as three mediators in sequence, which I show below:

Code:

##### Below is the code for three sequential mediators#####

################################################
# This code can be edited in this window and #
# submitted to Rweb, or for faster performance #
# and a nicer looking histogram, submit #
# directly to R. #
################################################
require(MASS)
#####Below are the coefficients for each stage of the mediation pathway#####
a=2.036
b=.006
c=1.661
d=.033
rep=50000
conf=99
pest=c(a,b,c,d)
#####Below are the standard errors squared for each stage of the mediation pathway#####
acov<-matrix(c(.287,0,0,0,0,.000001,0,0,0,0,.293,0,0,0,0 ,.00008),
nrow=4)
mcmc <- mvrnorm(rep,pest,acov,empirical=FALSE)
abcd <- mcmc[,1]*mcmc[,2]*mcmc[,3]*mcmc[,4]
low=(1-conf/100)/2
upp=((1-conf/100)/2)+(conf/100)
LL=quantile(abcd,low)
UL=quantile(abcd,upp)
LL4=format(LL,digits=4)
UL4=format(UL,digits=4)
################################################
# The number of columns in the histogram can #
# be changed by replacing 'FD' below with #
# an integer value. #
################################################
hist(abcd,breaks='fd',col='skyblue',xlab=paste(conf,'% Confidence Interval ','LL',LL4,' UL',UL4),
main='Distribution of Indirect Effect')

#####Below is the code for two sequential mediators#####

################################################
# This code can be edited in this window and #
# submitted to Rweb, or for faster performance #
# and a nicer looking histogram, submit #
# directly to R. #
################################################
require(MASS)
b=.006
c=1.661
d=.033
rep=50000
conf=99
pest=c(b,c,d)
acov<-matrix(c(.000001,0,0,0,.293,0,0,0,.00008),
nrow=3)
mcmc <- mvrnorm(rep,pest,acov,empirical=FALSE)
bcd <- mcmc[,1]*mcmc[,2]*mcmc[,3]
low=(1-conf/100)/2
upp=((1-conf/100)/2)+(conf/100)
LL=quantile(bcd,low)
UL=quantile(bcd,upp)
LL4=format(LL,digits=4)
UL4=format(UL,digits=4)
################################################
# The number of columns in the histogram can #
# be changed by replacing 'FD' below with #
# an integer value. #
################################################
hist(bcd,breaks='fd',col='skyblue',xlab=paste(conf,'% Confidence Interval ','LL',LL4,' UL',UL4),
main='Distribution of Indirect Effect')

##### Below is the code for a single mediator#####

################################################
# This code can be edited in this window and #
# submitted to Rweb, or for faster performance #
# and a nicer looking histogram, submit #
# directly to R. #
################################################
require(MASS)
c=1.661
d=.033
rep=50000
conf=99
pest=c(c,d)
acov<-matrix(c(.293,0,0,.00008),
nrow=2)
mcmc <- mvrnorm(rep,pest,acov,empirical=FALSE)
cd <- mcmc[,1]*mcmc[,2]
low=(1-conf/100)/2
upp=((1-conf/100)/2)+(conf/100)
LL=quantile(cd,low)
UL=quantile(cd,upp)
LL4=format(LL,digits=4)
UL4=format(UL,digits=4)
################################################
# The number of columns in the histogram can #
# be changed by replacing 'FD' below with #
# an integer value. #
################################################
hist(cd,breaks='fd',col='skyblue',xlab=paste(conf,'% Confidence Interval ','LL',LL4,' UL',UL4),
main='Distribution of Indirect Effect')

I hope this is helpful.


Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#9

13 Apr 2022, 10:49

Re #7. You've got the gist of it. The only thing I disagree with concerns the inclusion of country and year effects in the model. In this single cross-section model there is only one observation per country, and that observation, though it incorporates data from 2007 (iv), 2008 (mv), and 2009 (dv), is best thought of as being a single point in time (2009) with lagged explanatory variables. So country effects would just be collinear with the constant term and get omitted, and year effects are non-existent. Indeed, the whole point of this approach is to eliminate the time-series aspects of the problem and reduce it to a simple mediation model. The code would be similar to that in #5, but without the (i.countryid -> Lifeladder, ) (i.year -> Lifeladder, ) equations.

One aside: if you go this route, because all of the equations are linear, you can do this in -sem- rather than -gsem-. The (slight) advantage of that is that -sem- has a post-estimation command -estat teffects- which will calculate the total, direct, and indirect path coefficients for you and spare you the trouble of coding the -nlcom- commands that you would need to use after -gsem-.
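For what it is worth, a sketch of the -sem- route on simulated cross-sectional data might look like the following. Only one control (support) is included to keep it short, the coefficients used to simulate the data are arbitrary, and nocapslatent is carried over from #5 because EMP and Lifeladder begin with capital letters. vce(robust) from #5 is left out here to keep the sketch minimal.

```stata
* Simulated cross-section with one row per country; values are made up
clear
set seed 2022
set obs 48
generate double EMP        = rnormal()
generate double support    = rnormal()
generate double lgdp       = 0.5*EMP + 0.3*support + rnormal()
generate double ten        = -0.3*EMP + 0.2*support + rnormal()
generate double Lifeladder = 0.2*EMP + 0.4*lgdp - 0.2*ten + 0.3*support + rnormal()

* Two mediator equations and one outcome equation, all linear
sem (lgdp <- EMP support) ///
    (ten <- EMP support) ///
    (Lifeladder <- EMP lgdp ten support), nocapslatent

* Direct, indirect and total effects; the indirect effect of EMP on Lifeladder
* reported here is summed over both mediators
estat teffects

* Mediator-specific indirect effects as products of path coefficients
nlcom (via_lgdp: _b[lgdp:EMP]*_b[Lifeladder:lgdp]) ///
      (via_ten:  _b[ten:EMP]*_b[Lifeladder:ten])
```

So -estat teffects- covers the overall decomposition, while a short -nlcom- is still handy if the two mediators need to be reported separately.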

I want to repeat here my earlier caveat: this approach is what I would recommend you do if I were your thesis advisor. I am not your thesis advisor in reality, so you should not move forward with this until you clear it with him or her.
Frits de Wrede

Join Date: Apr 2022

Posts: 5
#10

14 Apr 2022, 06:14

Re #8: Joseph thank you for your suggestion. Unfortunately I do not have any experience with R, so I will pursue the (cross-section) solution proposed by Clyde.

Re #9: Thank you for the clarification. I have discussed and agreed on this alternative with my supervisor so that is what I will be focusing on next. Thank you for all the feedback and suggestions!
Joseph L. Staats

Join Date: Aug 2015

Posts: 28
#11

14 Apr 2022, 10:59

Frits,

Just to clarify my suggestion. You don't have to know R if you only have one mediator in each pathway. That is because Kris Preacher at his website has a calculator that does everything for you. His calculator uses R, but you don't have to do so separately unless you want to. All you have to do is type in the coefficients and the squared standard errors that you obtained when doing fixed effects regressions of the effects of your independent variable on the mediator and the mediator on the dependent variable. Doing this might be a good way to check the results you get when doing what Clyde recommends or something else suggested by your supervisor. But, your supervisor is the one to ask about this.
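For readers who would rather stay in Stata, the single-mediator Monte Carlo calculation in #8 can be approximated with a few lines. This is a sketch, not a port of Kris Preacher's program: it reuses the example coefficients and squared standard errors from the single-mediator code in #8, treats the two estimates as independent normal draws (as the diagonal covariance matrix there does), and the scalar and variable names are made up for illustration.

```stata
* Monte Carlo interval for an indirect effect (product of two coefficients)
clear
set seed 12345
set obs 50000

* Coefficients and squared standard errors copied from the single-mediator example in #8
scalar coef_c = 1.661
scalar coef_d = 0.033
scalar var_c  = 0.293
scalar var_d  = 0.00008

* Independent normal draws for each path, then their product (the indirect effect)
generate double draw_c   = rnormal(scalar(coef_c), sqrt(scalar(var_c)))
generate double draw_d   = rnormal(scalar(coef_d), sqrt(scalar(var_d)))
generate double indirect = draw_c * draw_d

* Percentile limits for a 99% interval, matching conf=99 in #8
_pctile indirect, percentiles(0.5 99.5)
display "99% CI for the indirect effect: " r(r1) " to " r(r2)
```

If the interval excludes zero, the indirect effect would be judged significant at that level under these assumptions.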
Antonia Diaz-Valdes

Join Date: Feb 2018

Posts: 8
#12

12 Dec 2022, 06:25

Hello Frits, I am attempting to fit a similar model (mediation analysis with panel data, using instrumental variables). I would use a mix of random and fixed effects: some fixed effects are needed, and I know some time-invariant variables in my model are key (I have individuals nested within time). Thus, for time-varying variables I will use the fixed-effects transformation and for time-invariant variables I will use the random-effects transformation. I am still figuring out how to do it, but I am working on it.

I found some interesting videos on YouTube about the use of instrumental variables and Hausman-Taylor models. I also looked at the ivmediate command, which is used to test causal mediation with instrumental variables (but I have not seen any application of the command with panel data).

I also wanted to let you know that after running your model, for example a fixed effects model (if the assumptions are met), you can use the nlcom command to calculate the SE, 95% CI, p-value, and so on for the indirect and total effects, using the product method for mediation.
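One way this could look, as a sketch only: nlcom needs both path coefficients in the same set of estimation results, so the example below fits the mediator and outcome equations jointly with -gsem- and enters country dummies as the fixed effects. The data are simulated, only one mediator and no controls are included, and the default standard errors used here ignore clustering within countries, so treat the setup as an assumption-laden illustration and not a recommended specification.

```stata
* Simulated panel with a country-level component u; all values are made up
clear
set seed 2022
set obs 48
generate countryid = _n
generate double u = rnormal()
expand 14
bysort countryid: generate year = 2005 + _n
generate double EMP        = 0.5*u + rnormal()
generate double lgdp       = 0.5*EMP + u + rnormal()
generate double Lifeladder = 0.2*EMP + 0.4*lgdp + u + rnormal()

* Mediator and outcome equations fitted jointly, each with country dummies
gsem (lgdp <- EMP i.countryid) ///
     (Lifeladder <- EMP lgdp i.countryid), nocapslatent

* Product-of-coefficients indirect effect and the implied total effect
nlcom (indirect: _b[lgdp:EMP]*_b[Lifeladder:lgdp]) ///
      (total:    _b[Lifeladder:EMP] + _b[lgdp:EMP]*_b[Lifeladder:lgdp])
```

The concerns raised in #4 and #6 about what fixed effects estimates represent still apply to the numbers this produces.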

I hope this helps.