Postestimation of standardized Pearson residuals after logit command -- unexpected results

    Dear Statalist members,

    Recently I have been working through the book "Regression Models for Categorical Dependent Variables Using Stata, Third Edition" by J. Scott Long and Jeremy Freese. My main goal was to grasp the basics of the excellent user-written postestimation package SPost, but for consistency I followed the whole book and reproduced its results, and I got some (in my opinion) unexpected results when estimating Pearson residuals after the logit command on a toy dataset.

    Please consider the following test dataset, https://www.dropbox.com/s/lge31gasou..._test.dta?dl=0 :
    Code:
         +------------------+
         | id      x      y |
         |------------------|
      1. |  1    240      0 |
      2. |  2    246      0 |
      3. |  3    247      0 |
      4. |  4    248      0 |
      5. |  5    248      0 |
         |------------------|
      6. |  6    250      0 |
      7. |  7    251      0 |
      8. |  8    252      0 |
      9. |  9    253      0 |
     10. | 10    253      0 |
         |------------------|
     11. | 11    264      0 |
     12. | 12    270      0 |
     13. | 13    256      1 |
     14. | 14    256      1 |
     15. | 15    257      1 |
         |------------------|
     16. | 16    257      1 |
     17. | 17    260      1 |
     18. | 18    265      1 |
     19. | 19    270      1 |
     20. | 20    273      1 |
         +------------------+
    y is the response, x is the predictor, and id is the observation ID.
    We fit a logistic regression with:

    Code:
    logit y x
    Then we obtain predicted probabilities and standardized Pearson residuals:

    Code:
    predict probs
    predict pearson_r, rstandard
    Then we can list the results and plot the residuals against the observation id.
    Code:
    list
    graph twoway scatter pearson_r id, mcolor(black) yline(0) mlabel(id) scale(0.9)
    Code:
         +-------------------------------------+
         | id     x   y      probs   pearson_r |
         |-------------------------------------|
      1. |  1   240   0   .0383879   -.2077717 |
      2. |  2   246   0   .1006583   -.3508434 |
      3. |  3   247   0   .1173147   -.3822852 |
      4. |  4   248   0   .1363096   -.6200286 |
      5. |  5   248   0   .1363096   -.6200286 |
         |-------------------------------------|
      6. |  6   250   0   .1820321   -.4931124 |
      7. |  7   251   0   .2090242   -.5364172 |
      8. |  8   252   0     .23885   -.5834865 |
      9. |  9   253   0   .2714711   -.9366656 |
     10. | 10   253   0   .2714711   -.9366656 |
         |-------------------------------------|
     11. | 11   264   0   .7115408   -1.693731 |
     12. | 12   270   0   .8736712   -1.946981 |
     13. | 13   256   1   .3842132    1.937094 |
     14. | 14   256   1   .3842132    1.937094 |
     15. | 15   257   1   .4255867    1.783529 |
         |-------------------------------------|
     16. | 16   257   1   .4255867    1.783529 |
     17. | 17   260   1   .5536892    .9447437 |
     18. | 18   265   1   .7454901    .6335568 |
     19. | 19   270   1   .8736712   -1.946981 |
     20. | 20   273   1   .9205091    .3193257 |
         +-------------------------------------+
    Please pay attention to observations 12 and 19. They have the same integer value of x = 270, but the actual binary response y differs: 0 in the first case and 1 in the second.
    But how can observation 19 have a residual of -1.947, exactly the same as observation 12, when its response y is completely different? I would expect observation 19 to have a residual somewhere in the range 0.31 to 0.63 because, in layman's reasoning, it "conforms" with the model, which suggests that higher values of x increase the probability of y = 1.
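    As a side note, the plain observation-level Pearson residual is just (y - p) / sqrt(p * (1 - p)), so it can be recomputed by hand from the predicted probabilities. Here is a minimal sketch of that calculation (pearson_manual is just a scratch variable name of mine, not from the book):
    Code:
    * sketch: observation-level Pearson residual, computed by hand from probs
    gen pearson_manual = (y - probs) / sqrt(probs * (1 - probs))
    * compare with the rstandard values for the two observations in question
    list id x y probs pearson_r pearson_manual if inlist(id, 12, 19)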

    Now, if we change the data very slightly, making x in observation 19 non-integer by adding an arbitrary 0.01:

    Code:
    replace x = 270.01 in 19
    Then we refit the model, and also calculate the difference between the old and new residual estimates:
    Code:
    logit y x
    predict probs_new
    predict pearson_r_new, rstandard
    
    gen residual_diff = pearson_r - pearson_r_new
    
    list
    Code:
         +---------------------------------------------------------------------------+
         | id        x   y      probs   pearson_r   probs_~w   pearson~w   residua~f |
         |---------------------------------------------------------------------------|
      1. |  1      240   0   .0383879   -.2077717   .0383909   -.2077801    8.40e-06 |
      2. |  2      246   0   .1006583   -.3508434   .1006611    -.350848    4.62e-06 |
      3. |  3      247   0   .1173147   -.3822852    .117317   -.3822886    3.34e-06 |
      4. |  4      248   0   .1363096   -.6200286   .1363111   -.6200297    1.19e-06 |
      5. |  5      248   0   .1363096   -.6200286   .1363111   -.6200297    1.19e-06 |
         |---------------------------------------------------------------------------|
      6. |  6      250   0   .1820321   -.4931124   .1820315   -.4931106   -1.82e-06 |
      7. |  7      251   0   .2090242   -.5364172   .2090222   -.5364131   -4.11e-06 |
      8. |  8      252   0     .23885   -.5834865   .2388462   -.5834798   -6.68e-06 |
      9. |  9      253   0   .2714711   -.9366656   .2714653   -.9366506    -.000015 |
     10. | 10      253   0   .2714711   -.9366656   .2714653   -.9366506    -.000015 |
         |---------------------------------------------------------------------------|
     11. | 11      264   0   .7115408   -1.693731   .7115155   -1.693629   -.0001025 |
     12. | 12      270   0   .8736712   -1.946981   .8736519   -2.879869    .9328884 |
     13. | 13      256   1   .3842132    1.937094   .3842002     1.93715   -.0000554 |
     14. | 14      256   1   .3842132    1.937094   .3842002     1.93715   -.0000554 |
     15. | 15      257   1   .4255867    1.783529   .4255711    1.783588   -.0000594 |
         |---------------------------------------------------------------------------|
     16. | 16      257   1   .4255867    1.783529   .4255711    1.783588   -.0000594 |
     17. | 17      260   1   .5536892    .9447437   .5536671    .9447865   -.0000428 |
     18. | 18      265   1   .7454901    .6335568    .745465    .6335998   -.0000429 |
     19. | 19   270.01   1   .8736712   -1.946981   .8738416    .4161257   -2.363107 |
     20. | 20      273   1   .9205091    .3193257   .9204944    .3193603   -.0000346 |
         +---------------------------------------------------------------------------+

    Things change dramatically, and this conforms with common sense: observations 12 and 19 now have completely different estimates of the Pearson residuals.
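    One detail that may or may not be relevant: after the replace, observations 12 and 19 no longer share the same value of x. A quick sketch to list which observations still share an x value (shared_x is just a scratch variable name):
    Code:
    * sketch: flag observations that share their value of x with another one
    duplicates tag x, gen(shared_x)
    list id x y shared_x if shared_x > 0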
    To be more confident that the result is indeed unexpected, I refitted the model in R on the original data, with the integer value x = 270 in both observations 12 and 19:


    Code:
    model <- glm(y ~ x, family = binomial(link = 'logit'), data = logit_command_test)
    
    summary(model)
    logit_command_test$residuals_r <- resid(model, type="pearson")
    logit_command_test$predictions_r <- model$fitted.values
    
    print(logit_command_test)

    The estimates from those commands, despite some slight differences, also agree with common-sense reasoning.
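    In case it helps the comparison, Stata's glm command can fit the same model, and as far as I understand its Pearson residuals are computed per observation, like R's resid(); a possible cross-check on the original data (pearson_glm is just a scratch variable name of mine) would be:
    Code:
    * sketch: same model via glm, run on the original data (before the replace);
    * to my understanding, glm's pearson residuals are per-observation, as in R
    glm y x, family(binomial) link(logit)
    predict pearson_glm, pearson
    list id x y pearson_glm if inlist(id, 12, 19)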
    To wrap things up, I would like to know what exactly is wrong with my Stata commands that gives me such an unexpected (to me) result.
    I'm using Stata 15 MP with the latest updates.

    Thank you for any feedback or suggestions in advance!
    Last edited by Stanislav Martsen; 07 Dec 2017, 07:16.