  • Fleiss Kappa

    Hello,

    I'm trying to calculate the Fleiss kappa and keep receiving errors. I have one subject, eight categories, and 10 raters. I organized my data in the following way:
    Category rater 1 rater 2 rater 3 rater 4 rater 5 rater 6 rater 7 rater 8 rater 9 rater 10
    1 # # # # # # # # # #
    2 # # # # # # # # # #
    3 # # # # # # # # # #
    4 # # # # # # # # # #
    5 # # # # # # # # # #
    6 # # # # # # # # # #
    7 # # # # # # # # # #
    8 # # # # # # # # # #
    I ran the following command: kap rater1 rater2 rater3 rater4 rater5 rater6 rater7 rater8 rater9 rater10
    My output is the following:

              Outcome |    Kappa        Z    Prob>Z
    -----------------+-------------------------------
                    1 |  -0.0127    -0.24    0.5949
                    2 |   0.2982     5.66    0.0000
                    3 |   0.1228     2.33    0.0099
                    4 |  -0.0403    -0.76    0.7776
                    5 |   0.2059     3.91    0.0000
    -----------------+-------------------------------
             combined |   0.1127     2.90    0.0019

    This doesn't seem correct, so I was wondering if someone can advise me on how I should reorganize the data, or which command I should try instead.

    Thank you in advance for your help!!

  • #2
    The older documentation of kap was much more explicit on the data requirements:

    kap assumes that each observation is a subject. varname1 contains the ratings by the first rater, varname2 by the second rater, and so on
    As a side note: While it is technically possible to compute the kappa statistic for one subject, the result might not be meaningful and is hard to interpret. Suppose you ask 10 raters to classify the color of an orange, and all 10 raters classify the color of that orange as orange. Given that the raters rated only one orange, how do you know they would classify another orange as orange, too? How do you know they would not classify an apple's color as orange, too? You simply do not have sufficient data to demonstrate the raters' ability to differentiate between subjects.
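
    To illustrate the documented layout, here is a minimal sketch with invented data: three subjects, three raters, and made-up ratings (the variable names are my own, chosen to mirror those in #1).

    * minimal sketch: one observation per subject, one variable per rater
    * (subjects and ratings below are invented for illustration only)
    clear
    input subject rater1 rater2 rater3
    1 2 2 3
    2 1 1 1
    3 4 4 2
    end
    kap rater1 rater2 rater3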

    Best
    Daniel



    • #3
      Thank you for your reply, Daniel! I am trying to validate a model, which is why I have only one "subject", and I want to show that there is high inter-rater reliability among the raters. Is there something else you would recommend I do instead? Or could I treat each category as a "subject" and reorganize the data accordingly?

      Thank you so much for your help!

      best,
      Katerina



      • #4
        I do not quite know what "validate a model" means and why this implies that there is only one subject to rate. To give more specific advice, I would need more information on the contents of your study.

        Anyway, I do not believe that the concept of inter-rater reliability is useful for one subject. I have given a silly example of a fundamental conceptual concern in my previous post. There are also technical concerns.

        In your setup, there are only two possible values for Fleiss K. The observed agreement will vary between 0.0444 and 1.* If all raters agree, i.e., choose the same category, the expected/chance agreement, as defined by Fleiss (1971), is 1, too.** Fleiss K is then calculated as (1-1)/(1-1), which results in division by 0 and, therefore, a missing value. For any other observed agreement, Fleiss K will be -0.1111.*** These results seem to suggest that Fleiss K for one subject is independent of the observed agreement. If this is true, then you cannot learn anything about reliability from that statistic.
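
        As a numerical sketch of the second case, consider an invented split in which 9 raters pick one category and 1 rater picks another (any split short of full agreement gives the same result):

        * hand computation for one subject and 10 raters, assuming an invented
        * 9/1 split across two categories
        local p_o = (9*8 + 1*0)/(10*9)       // observed agreement = 0.8
        local p_e = (9/10)^2 + (1/10)^2      // chance agreement (Fleiss 1971) = 0.82
        display (`p_o' - `p_e')/(1 - `p_e')  // displays -.11111111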

        There are other agreement coefficients in the literature that appear to be less affected by the technical problem described above (see kappaetc, SSC). The conceptual concerns remain valid.
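
        For completeness, a minimal sketch of how kappaetc could be called once the data are organized with one observation per subject (variable names follow #1; consult the help file for the exact syntax and options):

        * sketch, assuming one observation per subject and rating variables
        * named rater1-rater10 stored next to each other in the dataset
        ssc install kappaetc
        kappaetc rater1-rater10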

        Best
        Daniel


        * I have calculated the lower bound as follows: There are 10 raters but only 8 rating categories. It is, therefore, not possible for all raters to choose a different category. At least two pairs of raters must choose the same category, resulting in 2 observed agreements. The total number of rater pairs is 10*(10-1)/2 = 45, and 2/45 = 0.0444.

        ** This is calculated as follows: There are 10 raters and 8 categories. If all 10 raters choose one category, that category has a probability of being chosen of 10/10 = 1, while each of the remaining 7 categories has a probability of being chosen of 0/10 = 0. Summing the squared probabilities, 1^2 + 7*(0/10)^2, results in 1 again.

        *** To be honest, I did not fully get this, either. It turns out that the expected/chance agreement is p_e = (1 - 1/10)*p_o + (1/10), where p_o is the observed agreement (see Gwet's [2014] derivation of Krippendorff's alpha). For any value 0 <= p_o < 1 that you plug into this formula, the result of (p_o - p_e)/(1 - p_e) will always be -0.1111.
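
        A quick check of that claim: plug an arbitrary observed agreement below 1 into the formula (the 0.0444 here is just the lower bound from footnote *; any other value below 1 yields the same result):

        * plug an arbitrary p_o < 1 into p_e = (1 - 1/10)*p_o + 1/10;
        * the kappa formula then reduces to -1/9 = -0.1111
        local p_o = 0.0444
        local p_e = (1 - 1/10)*`p_o' + 1/10
        display (`p_o' - `p_e')/(1 - `p_e')  // displays -.11111111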



        Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76: 378-382.

        Gwet, K. L. 2014. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. 4th ed. Gaithersburg, MD: Advanced Analytics.

