  • Very high coefficients in linear regression with dichotomous regressor

    Hi,

    I am looking for some input regarding a very high coefficient in a linear regression model.

    I have a DV that is continuous with values ranging from -infinity to +infinity. I have a regressor that is dichotomous with either 0 or 1. However, the number of 1's is very small compared to that of the 0's. Specifically, out of 158,374 observations, the regressor has 419 1's and 157,955 0's.

    Running a linear regression (with additional regressors and year dummies), I obtain a very high coefficient for the regressor (4096; p<0.001). The R-squared for the model is 0.56.

    Can anyone shed some light on the high coefficient? Is that normal, or is something wrong with the model?

    Thanks for your time,




  • #2
    Well, the coefficient of 4096 means that the difference between the expected value of DV when regressor = 1 and the expected value of DV when regressor = 0 is 4,096, after adjustment for any other covariates in your model. Since you don't say anything about what these variables are or what relationships you expect to find between them, it isn't possible to comment on whether this number is too small, too big, or just about right. It could be any of those. I wouldn't take the p-value very seriously; given that your sample is huge, the difference would have to be extremely tiny in order not to be statistically significant at conventional levels. The imbalance between the cases where regressor = 0 and regressor = 1 is not really important. What does matter is whether the smaller of the two groups (regressor = 1 in this case) is large enough to give a meaningful estimate of results for that group, and at 419 it probably is, unless the distribution of DV is extremely skewed or bizarre in some way.

    Anyway, I think that to really answer the question in your mind you will have to say a lot more about what the content and context are. A regression coefficient can, in principle, take on any value, and nothing is automatically too big or too small just based on its value.
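
    To make the interpretation above concrete, here is a minimal sketch in Python with NumPy on simulated data. The group sizes are the ones quoted in the question; the intercept, the noise level, and the "true" gap of 4096 are assumptions chosen purely for illustration, and the model has no other covariates. It shows that the OLS coefficient on a 0/1 regressor is the difference in group means, and that its standard error is governed mainly by the size of the smaller group rather than by the imbalance as such.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): only the group sizes come from the thread.
n0, n1 = 157_955, 419
d = np.concatenate([np.zeros(n0), np.ones(n1)])  # dichotomous regressor
true_gap = 4096.0                                # assumed gap, for illustration only
y = 10.0 + true_gap * d + rng.normal(0.0, 300.0, size=n0 + n1)

# OLS of y on a constant and the dummy.
X = np.column_stack([np.ones_like(d), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# With no other covariates, the slope equals the difference in group means.
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# Conventional OLS standard error of the slope.
# Here inv(X'X)[1, 1] = 1/n1 + 1/n0, so the smaller group dominates.
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - 2)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

print(f"slope = {beta[1]:.2f}, diff in means = {diff_in_means:.2f}, se = {se_slope:.2f}")
```

    With additional covariates in the model the coefficient becomes an adjusted difference and no longer coincides exactly with the raw difference in means.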



    • #3
      Whether 4000 is large or small depends on how your dependent variable is measured: if it is annual income measured in cents then that sounds OK; if it is hourly income in euros then I would be worried.
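
      The point about units can be sketched with a small Python/NumPy simulation. All numbers below are made up for illustration. Re-expressing the same outcome in cents instead of euros multiplies the coefficient by 100 without changing anything substantive about the fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a 0/1 regressor and an outcome measured in euros.
d = rng.integers(0, 2, size=1_000).astype(float)
y_euros = 50.0 + 40.0 * d + rng.normal(0.0, 5.0, size=1_000)
y_cents = 100.0 * y_euros  # same outcome, different unit

X = np.column_stack([np.ones_like(d), d])
b_euros = np.linalg.lstsq(X, y_euros, rcond=None)[0]
b_cents = np.linalg.lstsq(X, y_cents, rcond=None)[0]

# The coefficient scales with the unit of the dependent variable.
ratio = b_cents[1] / b_euros[1]
print(f"coefficient in euros = {b_euros[1]:.3f}, in cents = {b_cents[1]:.3f}, ratio = {ratio:.3f}")
```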
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        Thanks Maarten, Clyde!

        The DV is annual income measured in euros, while the regressor indicates whether a person lives in rural or urban housing. In my data, most of the individuals live in an urban area (0's) while the others (1's) live in a rural area.

        For the DV: min=0, max=14707, mean=18.17, sd=293.73

        Hope those details are good enough to shed some more light on the issue.

        Thanks again,






        • #5
          I’d say that living in a rural area increases the predicted mean annual income by roughly 4,000 euros. That said, income tends to be highly skewed, as seems to be the case in your data. Besides, you have people with zero income. I believe you should take this into consideration when specifying your model. The literature offers several strategies for tackling this issue.
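
          As a rough way to see how skewed the outcome must be, the sketch below does plain arithmetic in Python on the figures quoted in #1 and #4 (coefficient, sample size, mean, sd, max). Nothing is estimated and no data are simulated. The only extra ingredient is Markov's inequality, which applies because the reported minimum is 0: for a non-negative variable, the share of observations at or above a value c cannot exceed mean / c.

```python
# Figures quoted in the thread (#1 and #4); nothing here is estimated.
coef = 4096.0
n_obs = 158_374
dv_mean, dv_sd, dv_max = 18.17, 293.73, 14_707.0

# Size of the coefficient relative to the DV's summary statistics.
coef_in_sd = coef / dv_sd      # coefficient in units of the DV's standard deviation
coef_vs_mean = coef / dv_mean  # coefficient as a multiple of the DV's mean
coef_vs_max = coef / dv_max    # coefficient as a fraction of the DV's maximum

# Markov's inequality for a non-negative variable: share(Y >= c) <= mean / c.
max_share_above_coef = dv_mean / coef
max_count_above_coef = max_share_above_coef * n_obs

print(f"coefficient / sd   = {coef_in_sd:.2f}")
print(f"coefficient / mean = {coef_vs_mean:.1f}")
print(f"coefficient / max  = {coef_vs_max:.3f}")
print(f"upper bound on share with DV >= coefficient = {max_share_above_coef:.4%}")
print(f"upper bound on count with DV >= coefficient = {max_count_above_coef:.0f}")
```

          A standard deviation many times larger than the mean, with a minimum of 0, is what one would expect from a heavily right-skewed variable with most values at or near zero. This is only a check on the reported numbers, not a diagnosis of the model.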
          Last edited by Marcos Almeida; 18 Dec 2017, 12:30.
          Best regards,

          Marcos
