Each subject represents a rater. I want to know the agreement among the raters for each test. Why am I getting negative values of Fleiss' kappa for each of the 9 tests? The score for each test is between 1 and 9.
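One reasoning step that may help here: Fleiss' kappa has the form (p_o - p_e) / (1 - p_e), so it is negative whenever the observed agreement p_o falls below the agreement p_e expected by chance. A minimal sketch of that arithmetic in Stata (the two proportions are made-up numbers, purely for illustration):

// minimal sketch: kappa turns negative when observed agreement
// falls below chance-expected agreement (illustrative numbers)
local p_o = 0.08   // observed proportion of agreeing rater pairs
local p_e = 0.11   // chance-expected proportion of agreeing pairs
display (`p_o' - `p_e') / (1 - `p_e')   // -.03370787, a negative kappa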
// Step 0: example data
clear
input rad xm1 xm2 xm3 xm4 xm5
1 1 4 3 2 1
2 2 4 2 2 1
3 1 3 3 1 1
end
// here is the dataset
list
// Step 1: get into shape
reshape long xm , i(rad)
// look what this has done
list
// now adjust the names
rename (rad _j xm) (_j xm rad)
// and look again
list
// now reshape back
reshape wide rad , i(xm) j(_j)
// and look at the final result
list
// Step 2: calculate kappa
kap rad1-rad3
ssc install kappaetc
kappaetc rad1-rad3 , wgt(linear)
kap test1-test9
bysort subject , rc0 : kappaetc rater1-rater55
// number of raters
local R 55
// number of categories
local C 9
// ---------------------------------------------------------------------
// how many times can the categories be repeated?
local full_categories = floor(`R'/`C')
// how many categories are left over and cannot be repeated?
local left_categories = mod(`R', `C')
// combinations for full categories
local full = (`C'-`left_categories')*max(0, comb(`full_categories', 2))
// combinations for the left categories
local left = `left_categories'*max(0, comb((`full_categories'+1), 2))
// fraction of total
display (`full'+`left') / comb(`R', 2)
. // fraction of total
. display (`full'+`left') / comb(`R', 2)
.09494949
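That .0949 is the smallest fraction of agreeing rater pairs that 55 raters can produce when forced into 9 categories: even the most even split (one category with 7 raters, eight with 6) yields 141 of the 1485 pairs in agreement. For comparison, here is a small simulation sketch (assumptions: scores drawn uniformly from 1 to 9, an arbitrary seed, and runiformint(), which needs Stata 14 or newer); with uniform random scores the agreeing-pair fraction should land near 1/9 = .111, just above that floor:

// simulation sketch: 55 raters scoring one subject at random (uniform 1-9)
clear
set seed 12345
set obs 55
generate score = runiformint(1, 9)
// agreeing pairs = sum over categories of comb(n_k, 2)
quietly tabulate score, matcell(freq)
local agree 0
forvalues k = 1/`=rowsof(freq)' {
    local agree = `agree' + comb(freq[`k', 1], 2)
}
// fraction of the comb(55, 2) = 1485 rater pairs that agree
display `agree' / comb(55, 2)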
subject   test1   test2   test3   test4   test5
      1       2       3       2       3       2
      2       2       2       2       2       3
      3       2       2       2       2       3
kap test1-test55
There are 55 raters per subject:

         Outcome |   Kappa        Z     Prob>Z
-----------------+-------------------------------
               1 |  0.0583     6.74     0.0000
               2 |  0.0060     0.69     0.2447
               3 |  0.0742     8.57     0.0000
               7 | -0.0020    -0.23     0.5925
-----------------+-------------------------------
        combined |  0.0440     6.65     0.0000