How does Stata handle ties in Spearman correlation? Discrepancy to hand-calculation

Rasmus Green

Join Date: Feb 2019
Posts: 0

10 Apr 2020, 10:37

Hi,

I'm correlating rater accuracy with rater experience (measured as the number of examinations performed annually). These are not normally distributed but are likely monotonically related (the more experienced, the more accurate); at least that is the hypothesis I want to test. Thus, I want to use Spearman's correlation.

This is my table:

Rater	Accuracy	# of cases	Accuracy rank	Case rank	d	d²
Rater 1	79.3%	40	10.5	11	-0.5	0.25
Rater 2	82.8%	40	13	11	2	4
Rater 3	70.7%	15	4	1.5	2.5	6.25
Rater 4	84.5%	23	14.5	7	7.5	56.25
Rater 5	84.5%	50	14.5	13	1.5	2.25
Rater 6	75.9%	40	8.5	11	-2.5	6.25
Rater 7	74.1%	60	6.5	14	-7.5	56.25
Rater 8	65.5%	20	3	4.5	-1.5	2.25
Rater 9	75.9%	80	8.5	15	-6.5	42.25
Rater 10	79.3%	25	10.5	8	2.5	6.25
Rater 11	81.0%	36	12	9	3	9
Rater 12	72.4%	20	5	4.5	0.5	0.25
Rater 13	60.3%	20	2	4.5	-2.5	6.25
Rater 14	58.6%	20	1	4.5	-3.5	12.25
Rater 15	74.1%	15	6.5	1.5	5	25

The sum of d^2 is 235. There are a few ties; for example, raters 8, 12, 13 and 14 all report assessing 20 cases annually, so they occupy ranks three to six and each receives the average rank of 4.5.

Plugging it into the rho formula

rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))

using an Excel sheet gives rho = 0.5803

With Stata:

Code:

spearman accuracy numberofcases

Number of obs = 15
Spearman's rho = 0.5731

Now, I think it has to do with how Stata handles ties. "A Gentle Introduction to Stata", chapter 8, p. 180, says that Stata uses averages when there are ties, but I have not been able to confirm this from the Stata help files. And if that is the case, I don't understand why I get a discrepant result, since I also averaged ranks across ties.

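When instead using the rho formula for ties,

rho = sum((xi-x(bar))*(yi-y(bar))) / sqrt( sum((xi-x(bar))^2) * sum((yi-y(bar))^2) )

I can extend my table with xi-x(bar), yi-y(bar), their products, and their squares: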

Rater	Accuracy	Number of cases	Rank of accuracy	Rank of cases	xi-x(bar)	yi-y(bar)	(xi-x(bar))*(yi-y(bar))	(xi-x(bar))^2	(yi-y(bar))^2
Rater 3	0.70689655	15	4	1.5	-0.0390805	-18.6	0.72689655	0.00152728	345.96
Rater 15	0.74137931	15	6.5	1.5	-0.0045977	-18.6	0.08551724	2.1139E-05	345.96
Rater 8	0.65517241	20	3	4.5	-0.0908046	-13.6	1.23494253	0.00824547	184.96
Rater 12	0.72413793	20	5	4.5	-0.0218391	-13.6	0.29701149	0.00047695	184.96
Rater 13	0.60344828	20	2	4.5	-0.1425287	-13.6	1.9383908	0.02031444	184.96
Rater 14	0.5862069	20	1	4.5	-0.1597701	-13.6	2.17287356	0.02552649	184.96
Rater 4	0.84482759	23	14.5	7	0.09885057	-10.6	-1.0478161	0.00977144	112.36
Rater 10	0.79310345	25	10.5	8	0.04712644	-8.6	-0.4052874	0.0022209	73.96
Rater 11	0.81034483	36	12	9	0.06436782	2.4	0.15448276	0.00414322	5.76
Rater 1	0.79310345	40	10.5	11	0.04712644	6.4	0.3016092	0.0022209	40.96
Rater 2	0.82758621	40	13	11	0.0816092	6.4	0.52229885	0.00666006	40.96
Rater 6	0.75862069	40	8.5	11	0.01264368	6.4	0.08091954	0.00015986	40.96
Rater 5	0.84482759	50	14.5	13	0.09885057	16.4	1.62114943	0.00977144	268.96
Rater 7	0.74137931	60	6.5	14	-0.0045977	26.4	-0.1213793	2.1139E-05	696.96
Rater 9	0.75862069	80	8.5	15	0.01264368	46.4	0.58666667	0.00015986	2152.96

The sum of (xi-x(bar))*(yi-y(bar)), which goes in the numerator, is 8.148275586. The sum of (xi-x(bar))^2 is 0.09124059 and the sum of (yi-y(bar))^2 is 4865.6. The product of those two sums is 443.940198, and its square root is 21.0698884, which goes in the denominator. This gives a rho of 8.148275586/21.0698884 = 0.3867261, which leaves me even more confused.

Could this be related to the (preprint) findings by Hodges and colleagues that standard non-parametric tests in Stata, SAS, SPSS and R can give different results (https://psyarxiv.com/zem2w/download)?

I know I could use Kendall's tau when there are ties, but that is beside the point.

The table with my calculations can be found at https://docs.google.com/spreadsheets...it?usp=sharing

BR,
Rasmus Green

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2338
#2

10 Apr 2020, 11:22

The formula you cite for Spearman's rho is valid only when there are no ties. The general formula for Spearman's rho is the same as for Pearson's correlation, but applied to the ranks of the data instead of the original variables. The ranks are indeed averaged over ties. Consider this toy example:

Code:

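* Toy data with ties in both variables
input byte(x y)
2 3
4 5
6 7
4 5
3 3
9 6
1 1
0 1
3 0
7 3
6 5
end
* Ranks, averaged over ties (the default of rank())
egen xrank = rank(x)
egen yrank = rank(y)
* Spearman's rho on the raw data versus Pearson's r on the ranks
spearman x y, matrix
corr xrank yrank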

and the outcome is exactly the same, as expected.

Code:

             |        x        y
-------------+------------------
           x |   1.0000
           y |   0.7384   1.0000

             |    xrank    yrank
-------------+------------------
       xrank |   1.0000
       yrank |   0.7384   1.0000
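
The same check can be sketched for the data in the original post. This is only an illustration under stated assumptions: the variable names accuracy and numberofcases are taken from the spearman call in the question, accuracy is typed in as the rounded percentages from the first table (the ties are preserved), and acc_rank, case_rank and d2 are names introduced here, not from the original post.

```stata
* Data typed in from the first table of the original post
* (accuracy in percent, rounded to one decimal)
clear
input accuracy numberofcases
79.3 40
82.8 40
70.7 15
84.5 23
84.5 50
75.9 40
74.1 60
65.5 20
75.9 80
79.3 25
81.0 36
72.4 20
60.3 20
58.6 20
74.1 15
end

* Average ranks across ties (the default of egen's rank())
egen acc_rank  = rank(accuracy)
egen case_rank = rank(numberofcases)

* Spearman's rho as reported by Stata ...
spearman accuracy numberofcases

* ... compared with Pearson's correlation of the two rank variables
correlate acc_rank case_rank

* The shortcut 1 - 6*sum(d^2)/(n*(n^2-1)), which assumes no ties
generate d2 = (acc_rank - case_rank)^2
quietly summarize d2
display %6.4f 1 - 6*r(sum)/(r(N)*(r(N)^2 - 1))

* Pearson's correlation of the raw values, for comparison with
* the second hand calculation in the original post
correlate accuracy numberofcases
```

Worked by hand on the ranks in the first table, the ties reduce the sums of squared deviations of the ranks from 280 each (the no-ties value for n = 15) to 278 for accuracy and 272.5 for number of cases, and the sum of cross-products is 157.75. That gives rho = 157.75 / sqrt(278 * 272.5), which is approximately 0.5731 and matches the spearman output in the question. The shortcut formula implicitly uses 280 for both sums of squares, which is why it returns about 0.580 instead. The second hand calculation in the question takes deviations of the raw accuracy and case values (for example, -18.6 = 15 - 33.6) rather than of their ranks, so its 0.3867 is a Pearson correlation of the raw data.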