  • multilevel longitudinal panel data with a binary outcome

    Dear Stata experts,

    I need your help with multilevel longitudinal panel data with a binary outcome. In our study, 20 000 students filled out one questionnaire per year on their screen time in 2018, 2019, 2021, and 2022 (4 years). The student data are nested within 100 schools and 5 regions. We want to know whether the pandemic is associated with an increase in screen time. The outcome is binary (0 = uses screens less than two hours per day, 1 = uses screens two or more hours per day). The pandemic began once, at the same time for everyone, so 2021 and 2022 are the pandemic years. I’m thinking about using the command below, but I think it would only let us look at how screen time evolves over time. It wouldn't include a pandemic variable, although maybe that's not a big deal since the pandemic coincides with 2021 and 2022.

    . melogit screen_dicho time0 i.sex urban || IdSchool: || IdParticipant: t0, cov(uns)


    An alternative would be to run a fixed-effects model to estimate changes in the adolescents’ screen time before and during the pandemic. However, I’m wondering how to handle the nesting of students within schools in such a fixed-effects model. Would this command make sense?

    . reghdfe screen_dicho time0, a(IdParticipant IdSchool) vce(cluster IdParticipant IdSchool)

    Thanks,
    Anne

  • #2
    Anne:
    welcome to this forum.
    I would go with your first code.
    The only issue is the seemingly poor specification (i.e., too few predictors) of the fixed part of your regression equation.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Dear Carlo, Thanks so much for your response. Yes, I will add more predictors.

      I'm wondering if it's better to make the year variable a continuous variable, a categorical variable or a dummy variable. Any thoughts?

      Also, for my own understanding, is there anything wrong with the second model that I suggested with the code reghdfe and vce cluster?

      Thank you,
      Anne


      Last edited by Anne Tremblay; 14 Aug 2022, 07:23.


      • #4
        Anne:
        1) You may want to add both the linear and the squared terms for year as a continuous variable to investigate possible turning points. Otherwise, you may want to enter year as an n-level categorical variable (1 year = 1 level). Finally, a dummy variable, that is, a two-level categorical variable (a term that is preferable to "dummy"), does not seem to be the way to go unless your -timevar- covers only two years.
        2) Your -reghdfe- code reminds me of a linear probability model: is this the way you want to go?
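        As a rough sketch of the two codings in 1), assuming that time0 counts years since 2018 and that a calendar-year variable named year exists (both are assumptions, not confirmed in the thread), the fixed part of the first model could be written either way:

        ```stata
        * Option A: linear and squared time via factor-variable notation
        * (assumes time0 = year - 2018, so the 2020 gap is respected)
        melogit screen_dicho c.time0##c.time0 i.sex urban || IdSchool: || IdParticipant:

        * Option B: one level per survey year (2018 is the base level)
        melogit screen_dicho i.year i.sex urban || IdSchool: || IdParticipant:

        * Joint test of the year indicators after Option B
        testparm i.year
        ```

        With only four waves, Option B spends three parameters on time and Option A two, so the categorical coding costs little and imposes no shape on the trend.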
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          Do students change schools over the 4-year period? If they do not, the school fixed effects are collinear with the student fixed effects (and thus already accounted for). The same point applies to region (do students change region over the period?). A variable like gender is almost never time-varying, so drop it from your list. With a binary dependent variable, look at -xtlogit- with the -fe- option, which estimates a conditional fixed-effects model. For a pandemic variable, create an indicator (pandemic = 1 if year >= 2021 and 0 otherwise).

          Code:
          xtset IdParticipant year
          xtlogit screen_dicho pandemic urban i.region i.IdSchool other_controls, fe
          Now, the multilevel (random effects) model relies on very strong assumptions, which almost never hold if you did not take steps to select a random sample prior to collecting the data. You can use the Hausman test to check whether random effects is justified; otherwise, stick to fixed effects if you are interested in any form of causal analysis.

          Code:
          help hausman
          Last edited by Andrew Musau; 14 Aug 2022, 16:08.


          • #6
            A couple of thoughts after coming late to the party.

            In the -xtlogit- command shown in #5, I would imagine that any school is located in the same region at all times. It follows that the region effects will be collinear with the school effects and will be omitted anyway by Stata. Similarly, wouldn't the urban variable also be a non-varying attribute of a school? If so, it, too, will be automatically dropped.
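
            Whether these variables really are constant can be checked directly before fitting anything. A minimal sketch, assuming the variable names used earlier in the thread (IdParticipant, IdSchool, region, urban); the flag names are made up for illustration:

            ```stata
            * Flag students who appear in more than one school
            bysort IdParticipant (IdSchool): gen byte moved_school = IdSchool[1] != IdSchool[_N]

            * Flag schools whose region or urban status ever changes
            * (a missing value in some waves also counts as a change)
            bysort IdSchool (region): gen byte region_varies = region[1] != region[_N]
            bysort IdSchool (urban): gen byte urban_varies = urban[1] != urban[_N]

            * Count students (not observations) who changed school
            egen byte tag_student = tag(IdParticipant)
            tabulate moved_school if tag_student
            summarize region_varies urban_varies
            ```

            If moved_school is 0 for everyone, the school indicators are absorbed by the student fixed effects; if the two school-level flags are always 0, region and urban add nothing once school is in the model.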

            Since the originally posed question is about the effects of the pandemic, I suggest creating a pandemic variable:
            Code:
            gen byte pandemic = inlist(year, 2021, 2022)
            and using that in the model.

            So I would do this as:
            Code:
            xtset IdParticipant year
            xtlogit screen_dicho i.pandemic i.IdSchool, fe
            Concerning fixed vs random effects, the economics and finance worlds abhor random effects models. While the concerns raised about them are legitimate to an extent, I view their stance as an over-reaction, and if you are not in those disciplines (or even if you are) I would urge you not to absorb their almost zero-tolerance attitude towards random effects models. And I think that a Hausman test, or, for that matter, any statistical significance test, is a terrible way to choose a model. Without going further down this trail, however, let me point out that in this situation you should definitely use a fixed effects model for a completely different reason. You are interested in estimating the effect of the pandemic on individual students' use of screen time. Well, that is a within-student effect. Fixed-effects models estimate within-entity (student, firm, whatever the unit of observation is) effects; random effects models measure a mixture of within- and between-person effects. You definitely need a within-person effect estimator here: hence you must use -fe-, else your results would be estimates of the wrong thing.

            Finally, let me ask why you are using a dichotomous outcome with a cutoff at 2 hours. Is there something special about the difference between one hour fifty-nine minutes of screen time and two hours one minute? Why aren't you just modeling the continuous-valued amount of screen time reported? Making a dichotomy out of a continuous-valued variable is almost always a bad idea: it discards data and creates noise. It is really only justified when the cutoff defines something that is, in the real world, associated with abrupt, discontinuous consequences.
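
            If hours of screen time were recorded directly, the continuous version of the model is a short step from the -xtlogit- command above. This is a sketch only: screen_hours is a hypothetical variable name, and the thread does not say whether such a variable exists in these data.

            ```stata
            * pandemic is the indicator created earlier in this post
            xtset IdParticipant year

            * Within-student linear model; standard errors clustered on student
            xtreg screen_hours i.pandemic, fe vce(cluster IdParticipant)
            ```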


            • #7
              Dear Stata experts,

              Thanks a lot for your feedback! I have a follow-up question and comment.

              In response to Clyde: we want to use a binary outcome because the government’s recommendation for screen time is that children watch two hours or less of screen time. Thus, the 2-hour cut-off is meaningful to know how many children are meeting the recommendation.

              We realized that we have a dynamic panel where screen time in the past is likely to predict screen time in the future. Thus, we are considering structural equation modelling, but we can’t find the appropriate Stata code for a binary outcome. Can you help me find Stata code that would handle all of these elements at once: SEM, fixed effects, binary outcome, dynamic panel? We have been searching and we can’t find it for a binary outcome. Alternatively, we could have a fixed effects model with an autoregressive term and a binary outcome. Any advice for the code?

              Thanks a lot,


              • #8
                Thanks for explaining the dichotomy. I suggest you name the variable "govt_guideline_adherent" or something like that, so that reviewers won't ask the same question I did when you submit the results for publication.

                You can do structural equation modeling with logistic equations (and many other types of generalized linear models) using the -gsem- command; however, it will not do conditional logistic models. You would have to do unconditional logistic regression with fixed effects represented by identifier variables (in factor-variable notation). If your sample is not large, this leaves you vulnerable to the incidental parameters problem. But if your sample is large, it's not an issue. And -gsem- will let you include lagged variables as well. As for how you would code your model, I think you need to work out your model on paper first to see what you want to model as affecting what. Once you do that, coding this for -gsem- will not be all that different from the way you would code it for simple single-equation models. Figuring out what you want to model is the hard part.

                Finally, let me point out that autoregressive terms are not the same thing as dynamic modeling with lagged terms. An autoregressive term refers to time-dependent correlation in the error terms of the model. This is quite different from including lagged predictors, and the results of those approaches will not resemble each other. Basically in a dynamic model you are accounting for lingering effects over time through the effects of observed variables by using their lagged values. In an autoregressive model you are accounting for them in terms of unmodeled sources of variation.
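
                One practical wrinkle when building lagged terms for these data: the waves are 2018, 2019, 2021 and 2022, so with year as the time variable L.screen_dicho is missing in 2021 (there is no 2020 record). A hedged sketch of one way around this, using a wave counter. The names wave and lag_screen are invented for illustration, pandemic is the indicator from #6, and treating the 2019 to 2021 step as a single lag is itself a modelling assumption:

                ```stata
                * Wave counter 1-4 for 2018, 2019, 2021, 2022
                egen wave = group(year)
                xtset IdParticipant wave

                * Previous wave's outcome (missing in each student's first wave)
                gen byte lag_screen = L.screen_dicho

                * Unconditional logit with a lagged outcome, as described above.
                * Student fixed effects would enter as i.IdParticipant, which may be
                * impractical with 20 000 students; they are omitted from this sketch.
                gsem (screen_dicho <- i.pandemic i.lag_screen, logit)
                ```

                Because the first wave has no lag, the estimation sample shrinks to 2019, 2021 and 2022, leaving a single pre-pandemic wave per student.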
