
  • Xtreg and reg

    Hi, all.

    I am analyzing an unbalanced individual-level panel, and I get different coefficient estimates from reg and xtreg once I include more than one covariate. With reg, I pool the observations across years.

    For example,

    I get similar estimates of the coefficient on "age_greater_than_60" from "reg work_or_not age_greater_than_60 i.province, robust cluster(ID)" and "xtreg work_or_not age_greater_than_60, fe i(numeric_id) robust". However, if I also include "age_difference_from_60", the coefficient on the dummy variable "age_greater_than_60" differs between the two commands, even in sign. (Both work_or_not and age_greater_than_60 are dummy variables; age_difference_from_60 is a discrete variable.)

    Could anybody help explain it?

    Thanks!

  • #2
    In your first -reg- model, you have indicators for province included. In the -xtreg- model, province disappears and is, in effect, replaced by a different variable called numeric_id. Your third regression includes yet another new variable, age_difference_from_60. In regression models, the coefficients, and even their meanings, are always conditional on all of the other variables in the analysis. There is no reason to expect that the coefficient of a given variable will remain the same as, or have the same sign as, or even bear any remote resemblance to its value in one model when the variable is used in a different model, unless that variable is independent of the variables that are being added or removed between the two models. I think it is obvious that age_greater_than_60, whatever it is, would not be independent of age_difference_from_60, assuming that age refers to the same thing in both variables. Its relationship to the province indicators or to whatever numeric_id represents is less obvious, but suffice it to say that variables being independent of each other is pretty rare in real-world problems.

    So, really, there is nothing to explain here.
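
    To make the dependence concrete, here is a minimal simulated sketch. The data are hypothetical, not the poster's: the sample size, age range, and probabilities are all assumptions chosen for illustration. The probability of working declines smoothly with age, with no jump at 60. The dummy alone picks up that decline; once age_difference_from_60 is added, the dummy's coefficient is expected to move toward zero, and in any one sample its sign can go either way.

    ```stata
    * Hypothetical simulated data: every number here is an assumption
    clear
    set seed 12345
    set obs 1000
    generate age = 50 + floor(21*runiform())             // integer ages 50 to 70
    generate age_difference_from_60 = age - 60
    generate age_greater_than_60 = age > 60
    * P(work) falls linearly with age; there is no discontinuity at 60
    generate work_or_not = runiform() < (0.9 - 0.02*(age - 50))

    * Dummy alone, then dummy plus the correlated covariate
    regress work_or_not age_greater_than_60, vce(robust)
    regress work_or_not age_greater_than_60 age_difference_from_60, vce(robust)
    ```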



    • #3
      Replying to #2, originally posted by Clyde Schechter:
      Thanks for your reply, Clyde. I should have been clearer when asking. ID is the same as numeric_id. Because xtreg includes individual fixed effects, I tried to control for fixed effects to some extent in reg by including province. I asked because I wanted to obtain the same or similar estimates across different methods as a robustness check. But the very different results for the same variable "age_greater_than_60" from "reg work_or_not age_greater_than_60 age_difference_from_60 i.province, robust cluster(ID)" and "xtreg work_or_not age_greater_than_60 age_difference_from_60, fe i(numeric_id) robust" confused me.
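
      One way to see why the two commands need not agree: province indicators absorb only differences between provinces, whereas -xtreg, fe- absorbs every time-invariant difference between individuals. A closer pooled counterpart, sketched here under the assumption that numeric_id identifies the same individuals as ID, absorbs the individual effects directly. The point estimates from these two commands should coincide; the clustered standard errors can differ slightly because the two commands apply different degrees-of-freedom adjustments.

      ```stata
      * Individual fixed effects via the within estimator
      xtreg work_or_not age_greater_than_60 age_difference_from_60, fe i(numeric_id) vce(cluster numeric_id)

      * The same individual fixed effects, absorbed in a pooled regression
      areg work_or_not age_greater_than_60 age_difference_from_60, absorb(numeric_id) vce(cluster numeric_id)
      ```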



      • #4
        Hi Rio,

        I think it's fairly clear from your original question what you're trying to do here. Seems like if you want to control for the cross-sectional fixed effects you should already be doing that (at least insofar as you want something equivalent to the fixed effects model) when you cluster by the cross-sectional units given by the id variable. Including the province term may very well explain the difference.

        You might also want to double-check that ID (a string?) and numeric_id are indeed equivalent, since there are many ways to convert a string to a number, and they may not all be equivalent for the purposes of your model. Best practice for a test like this is to keep as much the same as possible. Why not just pass numeric_id into reg instead of ID? That way you take the human element out of the equation and guarantee that both commands use exactly the same variable at runtime. You don't want to make some small change to the numeric version, forget about it, and then wonder later why the models aren't equivalent.
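
        The check described above might look like the following sketch, which assumes ID and numeric_id are both present in the dataset in memory. The two -assert- lines stop with an error if the mapping between the identifiers is not one-to-one. Note that the cluster option changes only the standard errors, so the last two commands can still give different coefficients.

        ```stata
        * Each ID should map to exactly one numeric_id ...
        bysort ID (numeric_id): assert numeric_id[1] == numeric_id[_N]
        * ... and each numeric_id to exactly one ID
        bysort numeric_id (ID): assert ID[1] == ID[_N]

        * Use the same identifier in both commands
        regress work_or_not age_greater_than_60 age_difference_from_60 i.province, vce(cluster numeric_id)
        xtreg work_or_not age_greater_than_60 age_difference_from_60, fe i(numeric_id) vce(cluster numeric_id)
        ```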



        • #5
          Replying to #4, originally posted by Daniel Schaefer:
          Thanks for your suggestion, Daniel. I tried it, but nothing changed. In the second pair of regressions, which include the extra covariate ("reg work_or_not age_greater_than_60 age_difference_from_60 i.province, robust cluster(ID)" and "xtreg work_or_not age_greater_than_60 age_difference_from_60, fe i(numeric_id) robust"), the coefficients of age_greater_than_60 have different signs, while the coefficients of age_difference_from_60 are similar. I think part of the reason is what Clyde suggested: the dependence between the two variables. But I am not sure whether some other factor could also lead to different estimates, because in robustness checks it is very common to use different models or methods to verify a conclusion.
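
          One further factor worth checking, offered as a sketch rather than a diagnosis: the fixed-effects estimator uses only within-person variation, so age_greater_than_60 is identified solely by individuals observed on both sides of 60. Assuming numeric_id is the panel identifier, the within/between split can be inspected as follows. The names crosses_60 and one_per_id are hypothetical helper variables introduced only for this illustration.

          ```stata
          xtset numeric_id
          * Overall, between, and within standard deviations of each regressor
          xtsum age_greater_than_60 age_difference_from_60

          * Flag individuals whose dummy takes more than one value during the panel
          bysort numeric_id (age_greater_than_60): generate byte crosses_60 = age_greater_than_60[1] != age_greater_than_60[_N]
          * Tag one observation per individual so each person is counted once
          egen byte one_per_id = tag(numeric_id)
          count if crosses_60 & one_per_id
          ```

          If few individuals cross 60 while observed, the fixed-effects coefficient on the dummy rests on a small subsample and can differ sharply from the pooled estimate.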
