  • What is the correct interpretation of "rho" in xtreg, fe?

    I am confused about the interpretation of the "rho" statistic reported by xtreg. It is supposed to be the "fraction of variance due to u_i", where u_i is the individual effect.

    Consider the following example with a panel data set: ID is the cross-section identifier, YEAR is the time series identifier and Y is any numeric variable. Let's compare the "rho" from the following fixed effects regression and the R-squared of the subsequent OLS regression with cross-section dummy variables:

    Code:
    xtset ID YEAR  // declare the panel structure
    xtreg Y, fe    // fixed-effects regression with no regressors other than the constant
    
    reg Y i.ID     // pooled regression with dummy variables for each level of ID
    My intuition (which is clearly wrong) says that these two numbers should be the same. In the FE regression there are no other regressors (only the constant), so all the variation in Y must be explained by the individual effect (u_i) and the idiosyncratic error term; rho should therefore be the share of the variance of Y explained by the individual effects. In the second regression, the only regressors are the dummy variables for the levels of ID, so the R-squared should give the fraction of the variance of Y explained by those dummies. But the two numbers are not the same, so there must be a big mistake somewhere in my intuition. Could you tell me where?

    Thanks a lot.
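
    Concretely, this is the comparison I have in mind, pulled from the stored results (a sketch; it assumes the data have already been xtset as above):

    Code:
    quietly xtreg Y, fe
    di e(rho), e(sigma_u)^2, e(sigma_e)^2   // rho and the two variance components
    quietly reg Y i.ID
    di e(r2)                                // R-squared of the dummy-variable regression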

  • #2
    Your intuition is correct, but as usual the devil is in the details. Although we often refer to R^2 as a proportion of "variance" explained, it is actually a ratio of sums of squares, and that is what reg reports. In contrast, xtreg estimates variances and takes the ratio of the between-groups variance to the total.

    As a simple example, consider the data 1,2,3,4,5,6,7,8, with the first 4 observations in group 1 and the next 4 in group 2. The between and within sums of squares are 32 and 10, for an R^2 of 32/42 = 0.76, as reg confirms:
    Code:
    . clear
    
    . set obs 8
    number of observations (_N) was 0, now 8
    
    . gen y = _n
    
    . gen id = 1 + (_n > 4)
    
    . quietly reg y i.id
    
    . di e(mss), e(rss), e(mss)/(e(mss) + e(rss))
    32 10 .76190476


    The group means are 2.5 and 6.5, so the between-groups variance (the sample variance of the group means) is 8; the within-groups variance is the within sum of squares divided by its degrees of freedom, 10/(8 - 2) = 1.67, giving rho = 8/(8 + 10/6) = 0.83, as xtreg confirms:

    Code:
    . quietly xtreg y, i(id) fe
    
    . di e(sigma_u)^2, e(sigma_e)^2, e(sigma_u)^2/(e(sigma_u)^2 + e(sigma_e)^2)
    8 1.6666667 .82758621
    Things get a little bit more involved with random-effect models depending on whether you use ML or GLS, but the general idea is the same.
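
    For instance, with the toy data above you can fit the random-effects model by GLS or by ML and form the same ratio from the stored results; the two estimators will generally give slightly different variance components (a sketch, not run here):

    Code:
    xtset id
    quietly xtreg y, re     // GLS random effects
    di e(sigma_u)^2, e(sigma_e)^2, e(rho)
    quietly xtreg y, mle    // ML random effects
    di e(sigma_u)^2, e(sigma_e)^2, e(rho)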



    • #3
      Thank you very much, German! This completely clears up my confusion at the technical level. But let me ask a more conceptual follow-up question.
      If I have a variable with a panel structure (i.e. there is variation across individuals and over time), is there a standard way to measure how much of the variation comes from the cross-section dimension versus the time-series dimension? Intuitively, we all know that some variables are very persistent within a panel unit but vary enormously across units, while others move a lot over time but are, on average, similar across panel units. Is there a way to put a number on this intuitive idea, or is the question ill-defined?
      I was thinking of doing the two things in my original question (hoping that they give the same answer), but in my actual application they give very different answers.
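
      The closest ready-made thing I know of is xtsum, which splits the overall standard deviation into between and within components; this is the kind of check I mean (a sketch, with x standing in for any panel variable):

      Code:
      * requires the data to be xtset; x is a placeholder variable name
      xtsum x    // reports overall, between, and within standard deviations for x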



      • #4
        I think that conceptually these measures are getting at the same thing, even if they do not give exactly the same result. I am surprised that in your application the answers are very different, as I would expect both to flag the same variables as varying more along one dimension or the other. At any rate, I believe the question is not ill-defined.

        My own preference is for answers expressed in terms of variance estimates, and because I work with datasets comprising many individuals observed just a few times each, I tend to rely on estimates based on random-effects models, including variance-components and multilevel models. But other approaches may be sensible in different circumstances.
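
        As one concrete sketch of what I mean, a constant-only multilevel model returns the two variance components directly, and estat icc converts them into the share of variance at the individual level:

        Code:
        mixed y || id:    // random intercept for each individual, no covariates
        estat icc         // intraclass correlation: var(u_i)/(var(u_i) + var(e_it))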



        • #5
          German, thanks again. I was also surprised that I got different rankings from the two approaches, so I did some further exploration, and there is something weird going on that I cannot figure out: the rho statistic reported by xtreg seems to depend on the scaling of the variable. Adapting your example to my data, I get the following:

          Code:
          *** show panel structure
          . xtset
                 panel variable:  id (strongly balanced)
                  time variable:  year, 1995 to 2015
                          delta:  1 unit
          
          *** Show that y is a very normal variable with a range 0-100
          . codebook y
          
          ------------------------------------------------------------------------------------------
          y                                                                                     (unlabeled)
          ------------------------------------------------------------------------------------------
          
                            type:  numeric (double)
          
                           range:  [12.5,100]                   units:  .1
                   unique values:  8                        missing .:  1,416/3,150
          
                      tabulation:  Freq.  Value
                                      26  12.5
                                     174  25
                                     361  37.5
                                     255  50
                                     176  62.5
                                     189  75
                                     145  87.5
                                     408  100
                                   1,416  .
          
          *** Generate re-scaled versions of y
          . gen y_10=y/10
          (1,416 missing values generated)
          
          . gen y_100=y/100
          (1,416 missing values generated)
          
          . *** Simple R-squared using only individual effects => results identical
          . quietly reg y i.id
          . di e(mss), e(rss), e(mss)/(e(mss) + e(rss))
          1059522.8 208638.64 .83547943
          
          . quietly reg y_10 i.id
          . di e(mss), e(rss), e(mss)/(e(mss) + e(rss))
          10595.228 2086.3864 .83547943
          
          . quietly reg y_100 i.id
          . di e(mss), e(rss), e(mss)/(e(mss) + e(rss))
          105.95228 20.863864 .83547943
          
          
          . *** Now "rho" from xtreg fixed-effects regression => result depends on scaling
          . quietly xtreg y, i(id) fe
          . di e(sigma_u)^2, e(sigma_e)^2, e(sigma_u)^2/(e(sigma_u)^2 + e(sigma_e)^2)
          634.34096 126.90915 .8332885
          
          . quietly xtreg y_10, i(id) fe
          . di e(sigma_u)^2, e(sigma_e)^2, e(sigma_u)^2/(e(sigma_u)^2 + e(sigma_e)^2)
          6.3434096 1.2690915 .8332885
          
          . quietly xtreg y_100, i(id) fe
          . di e(sigma_u)^2, e(sigma_e)^2, e(sigma_u)^2/(e(sigma_u)^2 + e(sigma_e)^2)
          .0634341 .07713877 .45125419
          As you can see, when I divide my variable by 100, rho drops sharply because the within-group variance estimate changes a lot. At first I thought it was a numerical precision issue, but all the variables are stored as doubles and have unremarkable ranges, so I don't see how that could be the problem.

          Code:
          . sum y y_10 y_100
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     y |      1,734    63.22088    27.05129       12.5        100
                  y_10 |      1,734    6.322088    2.705129       1.25         10
                 y_100 |      1,734    .6322088    .2705129       .125          1
          Does anyone have any idea what is going on here?



          • #6
            Very interesting, because rho should not depend on scaling. I can confirm your results using simulated data and fitting the model to outcomes scaled down by powers of ten:

            Code:
            clear
            set seed 321
            set obs 20
            gen a = rnormal(0,1)
            gen id = _n
            expand 20
            gen e = rnormal(0,1)
            gen double y = 1 + a + e
            gen double ys = y
            forvalues i=0/4 {    
                quietly xtreg ys, i(id) fe
                di e(rho), e(sigma_u), e(sigma_e)
                quietly replace ys=ys/10
            }
            Here's what I got:

            Code:
            .49442302 1.0176661 1.0290812
            .33084542 .10176661 .14472914
            .33084542 .01017666 .01447291
            .33084542 .00101767 .00144729
            .49442302 .00010177 .00010291
            Different seeds lead to different results, of course, but even different runs of the same script are not consistent. In all the cases I have tried, sigma_u seems OK and sigma_e is off, which affects rho. This does not seem to happen with gls or re, just fe. I replicated this with Stata 15.1 on Windows 10 and 14.2 on Mac OS X, but it does not happen in good old 11.2 on the same Mac.

            I would contact Stata technical support.
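
            In the meantime, one possible workaround for this constant-only case (just a sketch, based on the observation above that sigma_u looks fine) is to recompute the within variance from the dummy-variable regression, which equals xtreg's sigma_e when things work correctly:

            Code:
            quietly xtreg ys, i(id) fe
            scalar s2u = e(sigma_u)^2       // between variance; appears unaffected in the runs above
            quietly reg ys i.id
            scalar s2e = e(rss)/e(df_r)     // within variance from the dummy-variable (LSDV) fit
            di s2u/(s2u + s2e)              // rho computed without the affected e(sigma_e)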
            Last edited by German Rodriguez; 18 Dec 2017, 12:26.



            • #7
              Dear Andras and German,

              You are right to point out that there is a bug here, and we will fix it in a future update. The bug is specific to the constant-only model and is a knife-edge case in the computation of the residual sum of squares. I call it a knife-edge case because it is triggered at a point in the code where we evaluate a quantity that should be exactly zero; in the cases above, that quantity sometimes comes out slightly greater than zero and sometimes slightly less than zero (in both cases the values are tiny), and the two signs trigger two different computations.
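
              As a generic illustration of the kind of knife-edge involved (not the actual computation inside xtreg), a quantity that should be exactly zero can land on either side of zero in double precision depending on how it is evaluated:

              Code:
              di 0.1 + 0.2 - 0.3    // roughly  5.6e-17 in IEEE double precision
              di 0.3 - 0.2 - 0.1    // roughly -2.8e-17: the same theoretical zero, opposite sign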
              Last edited by Enrique Pinzon (StataCorp); 20 Dec 2017, 16:09.
