  • How does Stata handle ties in Spearman correlation? Discrepancy to hand-calculation

    Hi,

    I'm correlating rater accuracy with rater experience (measured as the number of examinations performed annually). These are not normally distributed, but are likely monotonically related (the more experienced, the more accurate); at least that is the hypothesis I want to test. Thus, I want to use Spearman's correlation.

    This is my table
    Rater     Accuracy  # of cases  Accuracy rank  Case rank  d      d^2
    Rater 1   79.3%     40          10.5           11         -0.5   0.25
    Rater 2   82.8%     40          13             11         2      4
    Rater 3   70.7%     15          4              1.5        2.5    6.25
    Rater 4   84.5%     23          14.5           7          7.5    56.25
    Rater 5   84.5%     50          14.5           13         1.5    2.25
    Rater 6   75.9%     40          8.5            11         -2.5   6.25
    Rater 7   74.1%     60          6.5            14         -7.5   56.25
    Rater 8   65.5%     20          3              4.5        -1.5   2.25
    Rater 9   75.9%     80          8.5            15         -6.5   42.25
    Rater 10  79.3%     25          10.5           8          2.5    6.25
    Rater 11  81.0%     36          12             9          3      9
    Rater 12  72.4%     20          5              4.5        0.5    0.25
    Rater 13  60.3%     20          2              4.5        -2.5   6.25
    Rater 14  58.6%     20          1              4.5        -3.5   12.25
    Rater 15  74.1%     15          6.5            1.5        5      25

    The sum of d^2 is 235. There are a few ties. For example, raters 8, 12, 13 and 14 all report assessing 20 cases annually; they occupy ranks 3, 4, 5 and 6 and therefore each get the average rank of 4.5.

    Plugging it into the rho formula

    rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))

    using an Excel sheet gives rho = 0.5803.

    With Stata:

    Code:
    spearman accuracy numberofcases
    Number of obs = 15
    Spearman's rho = 0.5731

    Now, I think it has to do with how Stata handles ties. "A Gentle Introduction to Stata", chapter 8, p. 180, says that Stata uses average ranks when there are ties, but I have not been able to confirm this from the Stata help files. And I don't understand why I get a discrepant result, since I averaged the ranks across ties.

    When instead using the rho formula for ties,

    rho = sum((xi-x(bar))*(yi-y(bar))) / sqrt(sum((xi-x(bar))^2) * sum((yi-y(bar))^2))

    I can extend my table with xi-x(bar), yi-y(bar), and those squared:
    Rater     Accuracy    Number of cases  Rank of accuracy  Rank of cases  xi-x(bar)   yi-y(bar)  (xi-x(bar))*(yi-y(bar))  (xi-x(bar))^2  (yi-y(bar))^2
    Rater 3   0.70689655  15               4                 1.5            -0.0390805  -18.6      0.72689655               0.00152728     345.96
    Rater 15  0.74137931  15               6.5               1.5            -0.0045977  -18.6      0.08551724               2.1139E-05     345.96
    Rater 8   0.65517241  20               3                 4.5            -0.0908046  -13.6      1.23494253               0.00824547     184.96
    Rater 12  0.72413793  20               5                 4.5            -0.0218391  -13.6      0.29701149               0.00047695     184.96
    Rater 13  0.60344828  20               2                 4.5            -0.1425287  -13.6      1.9383908                0.02031444     184.96
    Rater 14  0.5862069   20               1                 4.5            -0.1597701  -13.6      2.17287356               0.02552649     184.96
    Rater 4   0.84482759  23               14.5              7              0.09885057  -10.6      -1.0478161               0.00977144     112.36
    Rater 10  0.79310345  25               10.5              8              0.04712644  -8.6       -0.4052874               0.0022209      73.96
    Rater 11  0.81034483  36               12                9              0.06436782  2.4        0.15448276               0.00414322     5.76
    Rater 1   0.79310345  40               10.5              11             0.04712644  6.4        0.3016092                0.0022209      40.96
    Rater 2   0.82758621  40               13                11             0.0816092   6.4        0.52229885               0.00666006     40.96
    Rater 6   0.75862069  40               8.5               11             0.01264368  6.4        0.08091954               0.00015986     40.96
    Rater 5   0.84482759  50               14.5              13             0.09885057  16.4       1.62114943               0.00977144     268.96
    Rater 7   0.74137931  60               6.5               14             -0.0045977  26.4       -0.1213793               2.1139E-05     696.96
    Rater 9   0.75862069  80               8.5               15             0.01264368  46.4       0.58666667               0.00015986     2152.96

    The sum of (xi-x(bar))*(yi-y(bar)), which goes in the numerator, is 8.148275586. The sum of (xi-x(bar))^2 is 0.09124059 and the sum of (yi-y(bar))^2 is 4865.6. The product of those two sums is 443.940198, and its square root is 21.0698884, which goes in the denominator. That gives a rho of 8.148275586/21.0698884 = 0.3867261, which leaves me even more confused.

    Could this be related to the (preprint) findings by Hodges and colleagues that standard non-parametric tests in Stata, SAS, SPSS and R can give different results (https://psyarxiv.com/zem2w/download)?

    I know I could use Kendall's tau when there are ties, but that is beside the point.

    The table with my calculations can be found at https://docs.google.com/spreadsheets...it?usp=sharing

    BR,
    Rasmus Green






  • #2
    The formula you cite for Spearman's rho applies only when there are no ties. The general formula for Spearman's rho is the same as for Pearson's correlation, but applied to the ranks of the data instead of the original variables. Tied values are indeed given the average of their ranks. Consider this toy example:

    Code:
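    * Toy data with ties in both variables
    clear
    input byte(x y)
    2 3
    4 5
    6 7
    4 5
    3 3
    9 6
    1 1
    0 1
    3 0
    7 3
    6 5
    end

    * egen's rank() assigns the average rank to tied values by default
    egen xrank = rank(x)
    egen yrank = rank(y)
    * Spearman's rho on the data versus Pearson's r on the average ranks
    spearman x y, matrix
    corr xrank yrank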
    and the outcome is exactly the same, as expected.

    Code:
                 |        x        y
    -------------+------------------
               x |   1.0000
               y |   0.7384   1.0000
    
    
                 |    xrank    yrank
    -------------+------------------
           xrank |   1.0000
           yrank |   0.7384   1.0000
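
    The same check can be sketched for the data in #1. This is only a sketch: the values are retyped from the tables in #1 (raters 1 to 15, in order), the variable names accuracy and numberofcases are taken from the spearman command shown there, and accrank and caserank are names made up for this illustration.

    ```stata
    * Data retyped from the tables in #1 (an assumption: raters 1-15, in order)
    clear
    input double accuracy int numberofcases
    0.79310345 40
    0.82758621 40
    0.70689655 15
    0.84482759 23
    0.84482759 50
    0.75862069 40
    0.74137931 60
    0.65517241 20
    0.75862069 80
    0.79310345 25
    0.81034483 36
    0.72413793 20
    0.60344828 20
    0.5862069 20
    0.74137931 15
    end

    * Average ranks for ties (the default for egen's rank())
    egen accrank = rank(accuracy)
    egen caserank = rank(numberofcases)

    * Spearman's rho as Stata computes it
    spearman accuracy numberofcases
    * Pearson's r on the average ranks
    correlate accrank caserank
    * Pearson's r on the raw values, for comparison
    correlate accuracy numberofcases
    ```

    If the data are entered as above, the first two commands should both give 0.5731. Worked by hand on the ranks in #1, the sum of cross products of the rank deviations is 157.75 and the sums of squares are 278 and 272.5, so rho = 157.75/sqrt(278*272.5) = 0.5731. The last command should give roughly the 0.3867 from the second hand calculation in #1, because that calculation took the deviations from the raw accuracy and case counts rather than from their ranks.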
