
  • A/B/C pre-post test with covariates

    I have data from a quasi-experiment involving two different interventions and one control. Intervention A was given to group 1, intervention B was given to group 2, and the third group was the control, C. The study used a tool that classifies participants as either at risk of developmental delay or not (1/0) pre-intervention; the intervention is then given for six months, after which there is a post-test using the same tool that classifies them as either at risk or not. The same test is repeated at 12 months and at 18 months. Demographic factors were also measured at baseline.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(id age sex education risk0 risk1 risk2 risk3)
     1 3 1 1 1 0 0 1
     2 4 0 2 0 1 1 0
     3 3 0 2 1 0 1 1
     4 5 0 3 1 1 1 1
     5 5 1 2 1 1 0 0
     6 5 1 2 0 1 0 0
     7 3 0 3 0 0 0 0
     8 4 0 2 0 1 0 1
     9 4 0 1 0 0 1 1
    10 3 1 2 0 0 0 0
    11 4 1 2 1 1 1 1
    12 3 0 1 1 1 0 0
    13 3 1 3 1 1 1 1
    14 3 0 3 0 0 1 1
    15 4 0 1 0 0 1 1
    16 4 1 2 1 0 1 1
    17 3 0 3 0 0 1 0
    18 3 1 3 1 1 1 0
    19 3 1 3 1 0 0 0
    end

    I would like to do the following:
    1. I would like to know whether intervention A is better than C, whether B is better than C, and whether A is better than B (or otherwise) at 6 months (risk1).
    2. I would like to do an A/B/C test, assuming there is such a thing, at 6 months (risk1).
    3. How do I do a pre-post test for the above at 12 and 18 months, taking care of the missing values due to loss to follow-up?
    4. How can I include the covariates in the next level of analysis, where I include them in the analysis to explain the risk outcome at those time points?
    Also, if you can, please include some guidance on interpretation.
    Thank you

  • #2
    Unfortunately, your data example does not enable me to fully show you how one might code this because it does not include any variable that specifies which treatment group each subject was in, or if it does, it has a very uninformative name. What I will say is that whatever you ultimately do to analyze this data, you will have to reshape the data to long:

    Code:
    reshape long risk, i(id) j(time)
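    Applied to the posted example, and assuming a treatment variable (hypothetical here, coded 0 = A, 1 = B, 2 = C) has been added, that might look like:

    Code:
    * treatment is hypothetical; it is not in the posted -dataex- example
    reshape long risk, i(id) j(time)  // time takes values 0, 1, 2, 3
    list id treatment time risk in 1/8, sepby(id)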
    I would like to know whether intervention A is better than C, whether B is better than C, and whether A is better than B (or otherwise) at 6 months (risk1).

    I would like to do an A/B/C test, assuming there is such a thing, at 6 months (risk1).
    It is hard for me to say with confidence, not knowing more about the design and how the interventions are imagined to work, but unless there is something really distinctive about the 6 month time period, I would recommend against analyzing the results at any one time period in isolation. Rather I would do a full analysis of the data at all times, and then, from those analytic results, if you have a special interest in outcomes at 6 months, you can "drill down" to those. So let's assume you have a variable called treatment, coded 0 for A, 1 for B, and 2 for C. Then the general outlines of the start of the analysis would be:

    Code:
    mixed risk i.treatment##i.time covariates || id:
    margins treatment#time // PREDICTED OUTCOMES EACH GROUP AT EACH TIME
    marginsplot
    margins time, dydx(treatment)  // TREATMENT GROUP DIFFERENCES AT EACH TIME
    marginsplot
    If you are not familiar with the -margins- command, I suggest you start with Richard Williams' excellent Stata Journal article at http://www.stata-journal.com/sjpdf.h...iclenum=st0260. It's very clearly written and it covers the basics, including these usages. Then later you can read the -margins- chapter in the Stata PDF documentation that comes with your installation to learn about other more advanced applications that may be of interest to you in other situations.
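    If the 6-month comparisons remain of special interest, one way to drill down from that model, as a sketch, is pairwise contrasts among the treatment groups at time 1:

    Code:
    * A vs B, A vs C, and B vs C at the 6-month visit (time = 1)
    margins treatment, at(time=1) pwcompare(effects)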

    How do I do a pre-post test for the above at 12 and 18 months, taking care of the missing values due to loss to follow-up?
    Dealing with missing data due to attrition is a real problem. You need to first develop an understanding of the process that generates the attrition. If you can credibly claim that the attrition occurs totally at random, and is unrelated to both the treatment itself and to the outcomes being experienced, then the missingness is ignorable. But that is unlikely to be the case.

    It is also possible that, although the missingness is not completely random, it is random conditional on other variables in your data. In that case multiple imputation would be appropriate (a sketch follows below). With repeated-measures studies like this, sometimes conditioning on the preceding non-missing values satisfies this, but, again, it depends on whether it is credible to believe that.

    In long-term treatment trials like this, the usual situation is that the data are missing not at random, and you are left with doing a robustness analysis where you try various best-, worst-, and reasonable-case scenarios to see how extreme the data would need to be in order to materially influence your conclusions. The treatment of missing data is, I think, a reasonable topic for a one-semester course in statistics and cannot adequately be dealt with in a forum post. There are many resources available online that deal with this topic, and after you have completed your analysis of the non-missing data, you should search for one that is at your level of technical complexity.
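    If the conditionally-random assumption is defensible, a minimal sketch of what multiple imputation might look like here, done in the wide layout before reshaping, is below. The treatment variable is hypothetical (not in the posted example), and the auxiliary variables, number of imputations, and seed are all illustrative:

    Code:
    mi set wide
    mi register imputed risk1 risk2 risk3
    mi impute chained (logit) risk1 risk2 risk3 = risk0 age i.sex i.education i.treatment, add(20) rseed(1234)
    mi reshape long risk, i(id) j(time)  // risk0-risk3 become risk, with time = 0..3
    mi estimate: mixed risk i.treatment##i.time || id:  // treatment is hypothetical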

    How can I include the covariates in the next level of analysis, where I include them in the analysis to explain the risk outcome at those time points?
    Just list them in the variables list of the regression command, as shown in the code above. Before you do that, you would be wise to do some graphical exploration of the relationships between these covariates and your outcomes. Some of them might be sufficiently non-linear that some kind of transformation or more complicated representation of them is needed.
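    For a genuinely continuous covariate, a simple sketch of that kind of exploration is a lowess smooth of the binary outcome, which traces the proportion at risk across the covariate's range (age stands in here for any continuous baseline covariate):

    Code:
    * smoothed proportion at risk as a function of age
    lowess risk age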

    • #3
      Thanks Clyde, this was very helpful.

      • #4
        Thank you Clyde, your response above was really interesting.

        I'm using Stata 15 and I have data where there is one set of schools, all of which received a treatment (there is no control group), and a test of whether the children displayed a behaviour or not before and after the treatment. I want to see whether the treatment has an effect on behaviour. The thing is that this is a repeated cross-section from the same grade of children: we have no individual identifiers; it's just a cohort of individuals from the grade before the treatment, with behaviour or not, gender, and school, and the same after.

        I came across your use of the multilevel model here, and want to make sure I understand it correctly, and whether I can alter it by grouping by school rather than by individual. (My understanding is that you are generating a predicted probability of being at risk or not at each time, but you follow the individual over time. Can I do the same, but generate a predicted probability of having the behaviour or not at the school level over time?)

        Here is a sample of the data (for 2 schools, but we have more, all of different sample sizes):

        Code:
        clear
        input byte(school time behaviour gender)
        1    0    0    0
        1    0    0    0
        1    0    0    0
        1    0    0    0
        1    0    1    0
        1    0    0    0
        1    0    0    0
        1    0    0    1
        1    0    0    0
        1    0    0    0
        1    0    0    1
        1    1    0    1
        1    1    0    1
        1    1    0    0
        1    1    1    0
        1    1    1    1
        1    1    0    0
        1    1    0    1
        1    1    1    0
        1    1    1    0
        2    0    0    1
        2    0    1    0
        2    0    0    0
        2    0    0    0
        2    0    0    0
        2    0    0    1
        2    0    0    0
        2    0    0    1
        2    0    0    0
        2    0    0    0
        2    0    1    0
        2    0    0    0
        2    0    0    1
        2    0    0    1
        2    0    0    1
        2    0    0    1
        2    0    0    0
        2    0    0    0
        2    1    0    0
        2    1    0    1
        2    1    0    0
        2    1    0    1
        2    1    0    0
        2    1    0    1
        2    1    0    1
        2    1    0    1
        2    1    0    0
        2    1    0    1
        2    1    1    0
        2    1    0    1
        end
        What I have done is:
        Code:
        mixed behaviour i.time##i.gender || school:,  var mle
        margins gender, at(time=(0 1))
        My interpretation is that at time 0, females have a predicted prob of having the behaviour of 0.15, and males 0.
        At time 1, females have a predicted prob of having the behaviour of 0.4, and males 0.09.
        Is this an appropriate interpretation? With my full sample, can I add -gender- after -school:- so that I can look at variations in the effect of gender across schools? Can I add -time- after -school:-, so that I allow change over time to vary among schools? I'm just not entirely sure what's happening behind the code.

        Thank you so much,
        Shannon

        • #5
          Yes, you are interpreting the output correctly.

          I'm not sure I understand your description of how this corpus of data was put together. It seems that the same children are represented twice in the data: once before the intervention and once after. But you do not have the ability to identify which observations are attributable to which child. In that case, the assumption of independent observations is violated, so you should assume that the standard errors, z-statistics, p-values, and confidence intervals you get are incorrect. Without the ability to track individuals through the data, there is no truly good solution to that problem.

          Yes, you can add a school-level random slope on gender, or on time. But unless you have a specific interest in the extent to which these effects actually vary among the schools, or some intent to rank the schools on these effects, there is little reason to do this.
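          For concreteness, a sketch of those variants of the model in #4 (the covariance() suboption is optional but lets the intercept and slope correlate):

          Code:
          * random slope on time at the school level
          mixed behaviour i.time##i.gender || school: time, cov(unstructured) var mle
          * or a random slope on gender
          mixed behaviour i.time##i.gender || school: gender, cov(unstructured) var mle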

          Finally, I will just reinforce your awareness that this kind of uncontrolled pre-post design is very weak for identifying causal effects.

          • #6
            Thank you Clyde! Yes, I am aware of the weakness of the design for identifying causal effects. As an analyst, I very rarely get to contribute to the design of data collection; I am asked to use the data that has been collected to try to answer research questions. In this case, the designer of the research is also very aware of the shortcomings of the data collection process, but as these are school children, there were requirements to satisfy an ethics committee, every school administration, and the parents of the children (because positive assent was part of the ethics stipulations). As a result, the researchers were not allowed to retain any individual identifier for the students who took part.

            You are correct and incorrect about the same children being represented twice in the data. Many of the children will be the same, if their parents consented twice and they themselves chose to take part twice. The response rates and proportions of different demographics differ in the same grade cohort from the same school over the two time periods, to differing extents among the schools. That means that many of the children will be different, but they will belong to the same grade at the same school. That means they were exposed to the same "treatment", and the behaviour is a social behaviour, which the experts in the field see as being strongly influenced by cohort.

            I could analyse the proportion who demonstrate the behaviour at the aggregate level. However, that would not really allow me to observe any gender differences: I could use the proportion of male students in the first time period as an independent variable, to see whether that proportion at each school affects the difference over time in each school. But using the -mixed- command lets me keep all of the respondents in the analysis and see effects of gender at the individual level rather than at the aggregate school level, which seemed like a more effective way to answer the research questions. Do you have an opinion on whether that might be preferable, and using what method of analysis?

            Essentially, what the researchers want to know is whether there is a difference in prevalence of that behaviour (self-reported) before and after exposure to the "treatment", and whether that difference appears to be affected by gender or possibly age (which I think I will approximate with year level, because of the importance of the cohort as opposed to maturity as such). They are not looking to make claims about causality: they will be unsurprised to find that there is no difference, for various reasons including a reporting effect from increased understanding of the behaviour.

            We do have a specific interest in the extent to which the effects vary among the schools, simply as a source of variation: maybe reporting differences over time as a range of predicted probabilities will be clearer than reporting any kind of confidence interval (since, as you point out, there is no way to correct those)? However, maybe allowing a school-level random slope is unnecessary, given that I have already grouped at that level through the command anyway? And differences in the effect of gender among the schools may be interesting as well, even if it is just to report that there isn't any, or the range it takes; but again, I'm unsure whether that is doubling the analysis unnecessarily.

            Any thoughts greatly appreciated!

            Shannon

            • #7
              Well, I see you've given it a lot of thought. And I can certainly empathize with being in the position of being handed data from a study whose design I had not been able to influence. It sounds like you are making the best of the situation you have. I agree that the individual analysis, notwithstanding its limitations, is better than aggregating up and looking at proportions.

              If you are interested in comparing schools, then, yes, you should keep a random slope for treatment at the school level: that is the way you will be able to distinguish the effectiveness in different schools.

              I do have one question, though. How many different schools are involved here? If there are only a handful, it may not make sense to use a random-effects model. It might be better to just include school and school#treatment interaction terms in a one-level model. There isn't a hard and fast minimum number, but I think everyone would agree that if there are only five or fewer, the use of a random-effects model is questionable: you are simply not sampling school-space in anything like an adequate way.

              • #8
                Hello Clyde

                I have 14 schools, and 5212 observations across 2 time periods. The intra-class correlation suggests there is some between group variance, although it is very small. I think it makes sense theoretically to group by schools, because the children are within the school environment and it's a social behaviour. Do you think there are enough schools to use the two-level model? I really don't use multi-level models much, so I'm paddling hard to understand what's happening underneath.

                I've done some modeling and found that the AIC and BIC favour the model that does not allow the slope to vary by school. The information criteria also suggest not using interaction terms between gender, race, and age. I'm not sure whether to use them anyway, because they make sense theoretically. Nothing is statistically significant, which we expect, and we know this might be because the standard errors can't be relied on in this design. All contributions welcome! Thanks for your help so far, Clyde.

                Shannon

                • #9
                  Well, others may disagree, but my practice and firm belief is to always rely on theory to decide what to include whenever possible, and to rely on test statistics only when there is no reasonable theory.

                  How small is the ICC here? It's important to realize that what seem like small ICCs can still make a meaningful difference in the modeling. So while it seems tempting, for example, to shrug off an ICC of 0.01, it can nevertheless be important. What I actually find more helpful to look at than the ICC is the variance or standard deviation at the group level. If that standard deviation is large enough that a 1 sd difference between groups is a meaningful difference in the outcome variable, then I'm generally inclined to retain the random intercept at that level.
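                  As a sketch of how to look at that after fitting the two-level model from #4:

                  Code:
                  mixed behaviour i.time##i.gender || school:, var mle
                  estat icc   // intraclass correlation at the school level
                  * the random-effects table reports var(_cons) for school;
                  * its square root is the school-level SD discussed above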

                  That said, with 14 schools, we are not really sampling school-space very well, and the standard deviation or variance estimate at that level could be quite inaccurate. I have to say that I would be inclined to re-do the analysis without the school level but including i.school among the bottom level variables, and, if theoretically justified, i.school#i.treatment as a substitute for the original random slope on treatment at the school level.
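                  A sketch of that one-level alternative, staying with the linear model used earlier (in this pre-post design, time plays the role of treatment):

                  Code:
                  * i.school replaces the school-level random intercept;
                  * i.school#i.time replaces the random slope on treatment (= time here)
                  regress behaviour i.time##i.gender i.school i.school#i.time, vce(robust)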

                  • #10
                    Hm, an ICC of .009 or so, so less than 1% of the variation in the outcome is explained by group. The variance at the group level is .0009. That means (I think) that a change of one SD at the group level is a change of .03 in the outcome (which ranges between 0 and 1, and is predicted at .10 in time 1 and .13 in time 2... so not trivial?). That was using this code, adding in age in the first time period and whether the respondent was white, because they're theoretically important (and, as it turns out, age is much more important than anything else):
                    Code:
                    mixed behaviour i.time##i.agefirsttime i.gender i.white || school: ,  var mle
                    If I remove the multilevel structure and just allow the intercept to vary by school (to control for school differences), I get quite similar predicted probabilities and a similar pattern of statistical significance for the coefficients, but the coefficients themselves are really different:
                    Code:
                    glm behaviour i.time##i.agefirsttime i.gender i.school  i.white, family(binomial) link(logit) vce(robust) ml
                    If I allow the slope of schools to vary as well, it all gets a bit crazy, but the predicted probabilities don't move much in the end.

                    In the spirit of parsimony, I'm leaning towards the glm code from above (or its binreg equivalent, depending on what I want to report: it's coming out exactly the same) without school slope changes. I would just as happily take advice to stick with the multilevel! Thanks Clyde.
