
  • Question about "Non linear correlated random effects models and unbalanced panels"

    Dear All,

    I hope you are doing well. I have an unbalanced panel dataset with a large N (>1600) and small T (15 years). My main outcome is fractional, ranging between 0 and 1, with rare endpoints, though 1 is more likely to occur.

    Following the recent paper by Bates, Papke, and Wooldridge (2024), I estimated several models:

    1. OLS:
    Code:
     reg y x1 x2 i.year, cluster(id)
    2. FE:
    Code:
     reg y x1 x2 $avg_controls i.year, cluster(id)
    3. GLM (Probit):
    Code:
     glm y x1 x2 i.year, family(bino) link(probit) cluster(id)
    4. CRE (Probit):
    Code:
     glm y x1 x2 $avg_controls i.year, family(bino) link(probit) cluster(id)
    5. CRE + Unbalancedness Adjustment:
    Code:
     glm y x1 x2 $avg_controls i.year i.number_years, family(bino) link(probit) cluster(id)
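    The post does not show how `$avg_controls` or `number_years` were built. The following is a minimal sketch of one plausible construction on simulated data. The names `complete`, `x1_bar`, `x2_bar`, and the data-generating process are assumptions for illustration only, not the poster's actual code. The time averages are taken over each unit's complete cases, and `number_years` counts the usable years per unit.

    ```stata
    * Simulated unbalanced panel (hypothetical data, for illustration only)
    clear
    set seed 12345
    set obs 300
    gen id = _n
    gen a = rnormal()                         // unobserved unit effect
    expand 6
    bysort id: gen year = 2000 + _n
    gen x1 = 0.5*a + rnormal()
    gen x2 = rnormal()
    gen y  = normal(0.4*x1 - 0.2*x2 + a + rnormal(0, 0.5))   // fractional outcome
    drop if runiform() < 0.25                 // drop rows at random to unbalance

    * Complete-case indicator and number of usable years per unit
    gen byte complete = !missing(y, x1, x2)
    bysort id: egen number_years = total(complete)

    * Time averages computed over complete cases only
    foreach v of varlist x1 x2 {
        bysort id: egen `v'_bar = mean(cond(complete, `v', .))
    }
    global avg_controls x1_bar x2_bar

    * Model 5 from the post, written with vce(cluster id)
    glm y x1 x2 $avg_controls i.year i.number_years, family(binomial) link(probit) vce(cluster id)
    ```

    If the averages were instead computed over all available rows, including rows later dropped for missing values, they would no longer match the estimation sample, so it may be worth checking which version was used.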
    My question concerns the interpretation of the time-averaged covariates included to relax the exogeneity assumption in the CRE model. In my preferred specification (Model 5), two variables show signs contrary to the literature and expected associations, whereas their time-averaged counterparts have the expected signs.

    I would like to clarify:

    Is there a direct relationship between a variable and its time-averaged version in terms of interpretation?
    Do time-averaged variables have a meaningful interpretation beyond their role in adjusting for unobserved heterogeneity?
    From my understanding, time-averaged covariates capture long-run (between-unit) effects, while their non-averaged counterparts represent short-run (within-unit) effects. Might the difference in signs imply differing short- and long-term associations?

    I would appreciate your insights on whether this interpretation is correct or if there is something I may be overlooking.

    Thank you very much for your guidance.

    Kind regards,

  • #2
    From my understanding, time-averaged covariates capture long-run (between-unit) effects, while their non-averaged counterparts represent short-run (within-unit) effects.
    That's partially correct. Where it goes wrong is in identifying between-unit with long run and within-unit with short run. The statement is correct if you remove "long-run" and "short-run" and drop the parentheses around "between-unit" and "within-unit". The between-unit effects reflect variation among the units that exists independently of the passage of time, whereas the within-unit effects reflect variation in the same unit over time. And, yes, those can be different, even in sign.

    To see an extremely clear, though artificial, demonstration of how this can work:
    Code:
    clear
    set obs 5
    gen panel_id = _n
    expand 2
    
    set seed 1234
    by panel_id , sort: gen y = 4*panel_id - _n + 3 + rnormal(0, 0.5)
    by panel_id: gen x = panel_id + _n
    
    xtset panel_id
    
    xtreg y x, fe
    regress y x
    
    //    GRAPH THE DATA TO SHOW WHAT'S HAPPENING
    separate y, by(panel_id)
    
    graph twoway connect y? x || lfit y x
    On a more practical level, but without specific data, consider a variable denoting married vs unmarried. There are numerous outcomes with differences between people who end up getting married and those who don't, differences that long predate their marriages. But the differences in these same outcomes may be very different from the changes that they undergo after getting married. The former are between-person differences, and the latter are within-person changes.
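    To connect the toy example to the time-averaged covariates asked about in #1, here is a hedged sketch that regenerates the same artificial data and adds the panel mean of x as a regressor (a Mundlak-type regression). The variable name `x_bar` is an assumption introduced here. In a linear model on a balanced panel, the coefficient on `x` in the pooled regression with `x_bar` reproduces the within (FE) estimate, and the coefficient on `x_bar` equals the between slope minus the within slope, not the between slope itself.

    ```stata
    * Same artificial data as in the demonstration above
    clear
    set obs 5
    gen panel_id = _n
    expand 2
    set seed 1234
    by panel_id, sort: gen y = 4*panel_id - _n + 3 + rnormal(0, 0.5)
    by panel_id: gen x = panel_id + _n

    * Panel mean of x (the "time-averaged" covariate; name is hypothetical)
    bysort panel_id: egen x_bar = mean(x)

    xtset panel_id
    xtreg y x, fe          // within estimate (negative by construction)
    xtreg y x, be          // between estimate (positive by construction)
    regress y x x_bar      // coefficient on x matches the FE estimate
    test x_bar             // tests whether within and between slopes are equal
    ```

    This exact algebra is specific to the linear, balanced case. The probit CRE models in #1 need not reproduce it exactly, but the same logic motivates including the time averages.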


    • #3
      Thank you very much. Very useful. So, would you suggest to interpret both results in a research paper? The sign of the within unit effect is really unexpected (it is negative, while vast studies show positive), but the between effect (the coefficients on the time-averaged variables) is positive. I don't see people interpreting the between effect. In the meantime, the within effect holds the expected sign in the OLS, FE, and simple GLM models.


      • #4
        In the meantime, the within effect holds the expected sign in the OLS, FE, and simple GLM models
        It is difficult for me to appraise this because you have not explicitly set out exactly what OLS, FE and simple GLM models you have fit here. But be cautious because, unless you have done some manipulations on the variables themselves, FE always and only estimates within-effects, whereas OLS and simple GLM models estimate a single effect under the constraint that the within and between effects are equal. So I don't know what you are referring to when you speak of a within effect in an OLS or simple GLM model.

        So, would you suggest to interpret both results in a research paper?
        That depends on what your research question is. You should present the results that are relevant to the question under investigation. That might be the within-effect, it might be the between-effect, or it might be both.

        the sign of the within unit effect is really unexpected (it is negative, while vast studies show positive)
        If the existing body of evidence on this effect is "vast" in the sense that it has been replicated in many studies that have strong designs, then I would not get too excited about a contrary finding. Yes, I would still publish it. But only after very carefully checking my own work for errors. Errors may arise in the way the data was gathered, the way the data was cleaned and managed before analysis, in the analysis itself, or in the interpretation of the analytic results. And, of course, in addition to errors in the ordinary sense of the word, there is always the possibility of the randomness in the luck of the draw of the sample (aka sampling error). Of course, you should always check your work, but I would be doubly cautious if your result really does contradict a substantial corpus of high quality work. On the other hand, as I have often said in this Forum, the peer-reviewed literature is bristling with junk research, and it may be that the "vast" studies your results contradict are part of that.

        Either way, you would learn a great deal from doing a deeply critical review of that body of literature and a deeply critical review of your own study.


        • #5
          Thank you a lot, Clyde! Regarding the literature, they didn't apply the same method (correlated random effects) as I did. I obtained the same conclusion when using OLS, FE, or GLM without time-averaged controls. Things changed when I added the time-averaged variables to the model and adjusted for unbalancedness by including dummies for the number of years each unit is present in the data. If I may ask, how would you interpret the between effect then? For the within effect, I would say a 1 percentage point increase (assuming the variable is in percent) in the independent variable there will be "coefficient" increase/decrease in the outcome. What would be the language for between effect?

          Many thanks in advance!


          • #6
            for the within effect, I would say a 1 percentage point increase (assuming the variable is in percent) in the independent variable there will be "coefficient" increase/decrease in the outcome. What would be the language for between effect?
            I would say it as: for panels having a 1 percentage point difference in time-averaged values of x, there is a "coefficient" difference in the expected value of y.
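            One caveat on scale: Models 3 to 5 in #1 use a probit link, so their raw coefficients are on the probit index scale rather than the scale of y. A minimal sketch of obtaining average partial effects with `margins` after a CRE fractional probit follows. The data are simulated and the names `a`, `x`, and `x_bar` are assumptions for illustration; this is not the poster's dataset or necessarily the preferred reporting choice.

            ```stata
            * Simulated balanced panel with a fractional outcome (hypothetical data)
            clear
            set seed 2024
            set obs 300
            gen id = _n
            gen a = rnormal()                      // unobserved unit effect
            expand 6
            bysort id: gen year = 2000 + _n
            gen x = 0.5*a + rnormal()
            gen y = normal(-0.3*x + 0.8*a + rnormal(0, 0.3))
            bysort id: egen x_bar = mean(x)        // time-averaged covariate

            * CRE fractional probit, then average partial effects on the scale of y
            glm y x x_bar i.year, family(binomial) link(probit) vce(cluster id)
            margins, dydx(x x_bar)
            ```

            The `margins` output, rather than the coefficient table, is what corresponds to a statement of the form "a one-unit difference in x is associated with a difference of so much in the expected value of y".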


            • #7
              Great, many thanks. So suppose the time-averaged variable is the percentage of white population, with a positive coefficient. Would the between interpretation be: "For panels that experience a 1% increase in percentage of white population, the outcome is x percentage points higher compared to panels that did not see such an increment"? And suppose the non-time-averaged version of the same variable has a negative coefficient. We would say: "A 1% increase in percentage of white population leads to an x percentage point reduction in the outcome within panels." If this is correct, it appears really hard for me to reconcile these findings.


              • #8
                For panels that experience a 1% increase in percentage of white population, the outcome is x percentage points higher compared to panels that did not see such an increment?
                NO! Wrong!!!
                For the time-averaged variables it is about a 1 percentage point difference in percentage of white population between two different panels (countries? provinces? whatever). It is specifically not about any increase or decrease.

                So what you are looking at is that panels with a higher time-averaged percentage of white population have higher levels of the outcome variable, whereas those panels that experience an increase in percentage of white population during the study also see their levels of the outcome variable decrease during the study period. There is absolutely nothing inconsistent about those findings. (FWIW this is exactly the situation that I presented in a highly oversimplified way in the code I showed in #2, just with the signs reversed.)


                • #9
                  My eyes have just been opened. Thank you so much! My results actually make sense now. Overall, units with a higher percentage of white population tend to have a higher outcome, by an amount given by the coefficient on the time-averaged variable. However, a sudden increase in this percentage might lead to a decrease in the outcome (which aligns with my initial vague intuition about short-run effects).


                  • #10
                    @Clyde Schechter Dear Clyde,

                    I feel that there could be reverse causality between three of my independent variables and my outcome. So, I was thinking about using the second lag of each independent variable as an IV. I wanted to apply the control function approach detailed in Papke and Wooldridge (2008) and Wooldridge (2015). It is a bit over my head, though. I don't understand how Stata will handle the second lag for the first observations (the earliest years in the data) of the endogenous variables, since no lag exists for them.


                    • #11
                      I'm sorry, I can't help you with this. I know very little about instrumental variables. They are little used in my line of work, epidemiology, and not at all in my niche, and control function models are not used at all. This current question is connected to the original topic of the thread, but is fairly far afield from the original thrust. I suggest you repost this question as a New Topic. Give the new thread a title that says it's about IV and control function models. That way it will draw the attention of other Forum members (including possibly Jeff Wooldridge) who are knowledgeable in this area.
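
                      On the purely mechanical part of the question in #10 (not the IV or control function modeling itself), a minimal sketch on made-up data shows what the lag operator does at the start of each panel. The variable names and data are hypothetical. With the data `xtset`, `L2.x` is missing for the first two years of every unit (and wherever the year two periods earlier is not observed), and estimation commands drop observations with missing values.

                      ```stata
                      * Tiny made-up panel: 3 units, 5 years each
                      clear
                      set seed 42
                      set obs 3
                      gen id = _n
                      expand 5
                      bysort id: gen year = 2000 + _n
                      gen x = runiform()

                      xtset id year
                      gen x_lag2 = L2.x                 // second lag within each unit

                      * The two earliest years of each unit have no second lag
                      list id year x x_lag2 if id == 1, noobs
                      count if missing(x_lag2)          // 2 per unit, so 6 here
                      ```

                      In practice this means the estimation sample loses each unit's first two observed years once a second lag enters the model, which also changes the number of years each unit contributes.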
