
  • Fleiss kappa or ICC for interrater agreement (multiple readers, dichotomous outcome), and the correct Stata command

    106 units are all assessed by the same 5 readers. Units were judged either positive or negative (dichotomous outcome).

    What is the best applied statistical test to look at interrater agreement?

    1) Is the ICC (two-way random-effects model, single rater, agreement) useful, or does that only apply to continuous or categorical data with >2 possible ratings?
    2) Is Fleiss kappa the statistical test of choice, and if so,
    a) is the correct Stata command kappa pos neg (when data are organised as: column 1, subject id; column 2, number of positive reads (pos); column 3, number of negative reads (neg))?
    b) Does this test allow one reader to serve as the basis/gold standard (the one the others should agree with)?

    I will be very grateful for your input.

  • #2
    106 units are all assessed by the same 5 readers. Units were judged either positive or negative (dichotomous outcome).

    What is the best applied statistical test to look at interrater agreement?
    Perhaps there is not one best statistic, but several approaches.

    1) Is the ICC (two-way random-effects model, single rater, agreement) useful, or does that only apply to continuous or categorical data with >2 possible ratings?
    I think the ICC would, in general, be an option. If your readers are considered a random sample and if interest lies in the population of readers, then the two-way random-effects model is appropriate. If the five readers are the only ones of interest, then use a mixed-effects model. Stata's icc command estimates these models; however, with the data setup that you describe in 2), I do not see how you could estimate an ICC.
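    For illustration only, here is a minimal sketch of what icc expects, assuming the individual ratings were available in long format with one row per subject-reader pair (the variable names rating, subject, and reader are made up)

    Code:
    * two-way random-effects model (readers viewed as a random sample)
    icc rating subject reader
    * two-way mixed-effects model (these five readers are the only ones of interest)
    icc rating subject reader, mixed
    Either call reports both an individual and an average ICC; note that icc is based on ANOVA-type estimators, so the 0/1 ratings would be treated as continuous measurements.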

    2) Is Fleiss kappa the statistical test of choice?
    Fleiss kappa is one of many chance-corrected agreement coefficients. These coefficients are all based on the (average) observed proportion of agreement. Given the design that you describe, i.e., five readers assigning binary ratings, there cannot be fewer than 3 out of 5 agreements for a given subject. That means that agreement has, by design, a lower bound of 0.6. Keeping this in mind, you can compute Fleiss kappa with the syntax that you suggest

    Code:
    kappa pos neg
    I encourage you to download kappaetc (from SSC), which estimates Fleiss kappa and other chance-corrected agreement coefficients. The syntax is nearly identical

    Code:
    kappaetc pos neg , frequency
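    In case kappaetc is not installed yet, it can be obtained from SSC in the usual way

    Code:
    ssc install kappaetc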
    Compare the different coefficients to get a sense of how sensitive your results are to the choice of statistic.

    b) Does this test allow one reader to serve as the basis/gold standard (the one the others should agree with)?
    In general: no. A gold standard requires that the reader who is assumed to provide the gold standard be identified. Fleiss kappa does not require unique readers. Since readers are unique in your case, you could probably still use Fleiss kappa. However, besides complications that arise because you have more than two readers, the reader or gold standard for each subject must still be identified in the data, and this is not possible if you have only recorded the number of positive and negative ratings.

    Best
    Daniel
    Last edited by daniel klein; 18 Jan 2018, 04:58. Reason: wrong delimiters and spelling



    • #3
      Thank you very much for your comments.

      Our readers are considered a random sample and interest lies in the population of readers.
      I do actually have the recordings of each individual reader, and so have the opportunity to set up the data as required for a specific test, e.g., the ICC.
      Also, one reader is considered the expert and hence could be our gold standard. Is the Stata output 'individual', when applying the icc command, not referring to a single reader who is considered the gold standard? Or did I misunderstand the concept of 'single reader'?
      If so, how do I tell Stata which variable (reader) should be considered the gold standard?



      • #4
        If you want to consider readers as a random sample, then you are considering them to be exchangeable. Considering one reader's judgments to set a "gold standard" isn't really consistent with that.

        Anyway, if you want to estimate an intraclass correlation (ICC) coefficient considering both units and judges as random, and with judgments a dichotomous outcome, then you could fit a cross-classified random effects logistic regression model and compute the ICC coefficient from the fitted variances. Something like the following. (Begin at the "Begin here" comment. The stuff before is just to set up an artificial dataset that mimics yours.)

        Code:
        . version 15.1

        . 
        . clear *

        . 
        . set seed `=strreverse("1426289")'

        . 
        . quietly set obs 106

        . generate byte uid = _n

        . generate double v_u = rnormal(0, 2)

        . 
        . // "two-way random effect model"
        . tempfile tmpfil0

        . quietly save `tmpfil0'

        . 
        . drop _all

        . quietly set obs 5

        . generate byte jid = _n

        . generate double j_u = rnormal(0, sqrt(1/10))

        . 
        . cross using `tmpfil0'

        . 
        . generate double xbu = v_u + j_u

        . generate byte outcome = rbinomial(1, invlogit(xbu))

        . 
        . *
        . * Begin here
        . *
        . melogit outcome || _all: R.jid || uid:, intmethod(mvaghermite) intpoints(3) nolog

        Mixed-effects logistic regression               Number of obs     =        500

        -------------------------------------------------------------
                        |     No. of       Observations per Group
         Group Variable |     Groups    Minimum    Average    Maximum
        ----------------+--------------------------------------------
                   _all |          1        500      500.0        500
                    uid |        100          5        5.0          5
        -------------------------------------------------------------

        Integration method: mvaghermite                 Integration pts.  =          3

                                                        Wald chi2(0)      =          .
        Log likelihood = -309.97038                     Prob > chi2       =          .
        ------------------------------------------------------------------------------
             outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               _cons |   .0038208   .2414489     0.02   0.987    -.4694104     .477052
        -------------+----------------------------------------------------------------
        _all>jid     |
           var(_cons)|   .0555947   .0770155                      .0036801     .839856
        -------------+----------------------------------------------------------------
        uid          |
           var(_cons)|   2.819757   .7354933                      1.691173     4.70149
        ------------------------------------------------------------------------------
        LR test vs. logistic model: chi2(2) = 73.20               Prob > chi2 = 0.0000

        Note: LR test is conservative and provided only for reference.

        . display in smcl as text "ICC = " as result %04.2f ///
        >         _b[/var(_cons[uid])] / (_b[/var(_cons[uid])] + _b[/var(_cons[_all>jid])] + _pi^2 / 3)
        ICC = 0.46

        . 
        . exit

        end of do-file

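        If an approximate standard error or confidence interval for this latent-scale ICC were wanted, something along the following lines could be run right after the melogit call. This is only a sketch that assumes the same variance-parameter labels shown in the log above; the delta-method interval is not constrained to lie in [0, 1].

        Code:
        * delta-method sketch for the latent-scale ICC after the melogit fit above
        nlcom (icc: _b[/var(_cons[uid])] / ///
            (_b[/var(_cons[uid])] + _b[/var(_cons[_all>jid])] + _pi^2/3))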

        But keep Daniel's excellent advice in mind: there might not be one best statistic, and you need to carefully consider just what it is that you want to accomplish.



        • #5
          Is the Stata output 'individual', when applying the icc command, not referring to a single reader who is considered the gold standard?
          No, it is not. The term "individual" refers to the ratings that are compared. You would use the "average" ICC if you had teams of readers and wanted to compare averages of ratings. I do not believe that such a situation occurs often, but then again, I do not really do much inter-rater reliability analysis.

          Anyway, comparing to a gold standard is straightforward when only two readers are involved. With more than two readers, I have no clear idea how to do this, technically. For agreement coefficients, one possible way that I can think of is to have the gold standard ratings repeated for each of the other four readers. Here is an example

          Code:
          // example data
          webuse p615b , clear
          
          rename rater1 gold
          rename rater# score#
          
          list
          
          reshape long score , i(subject) j(rater)
          
          sort subject rater
          list in 1/10 , sepby(subject)
          This gives

          Code:
          [...]
          . list
          
               +----------------------------------------------------+
               | subject   gold   score2   score3   score4   score5 |
               |----------------------------------------------------|
            1. |       1      1        2        2        2        2 |
            2. |       2      1        1        3        3        3 |
            3. |       3      3        3        3        3        3 |
            4. |       4      1        1        1        1        3 |
            5. |       5      1        1        1        3        3 |
               |----------------------------------------------------|
            6. |       6      1        2        2        2        2 |
            7. |       7      1        1        1        1        1 |
            8. |       8      2        2        2        2        3 |
            9. |       9      1        3        3        3        3 |
           10. |      10      1        1        1        3        3 |
               +----------------------------------------------------+
          
          .
          . reshape long score , i(subject) j(rater)
          (note: j = 2 3 4 5)
          
          [...]
          
          . list in 1/10 , sepby(subject)
          
               +--------------------------------+
               | subject   rater   gold   score |
               |--------------------------------|
            1. |       1       2      1       2 |
            2. |       1       3      1       2 |
            3. |       1       4      1       2 |
            4. |       1       5      1       2 |
               |--------------------------------|
            5. |       2       2      1       1 |
            6. |       2       3      1       3 |
            7. |       2       4      1       3 |
            8. |       2       5      1       3 |
               |--------------------------------|
            9. |       3       2      3       3 |
           10. |       3       3      3       3 |
               +--------------------------------+
          [...]
          You could then estimate agreement with

          Code:
          . kap gold score
          
                       Expected
          Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
          -----------------------------------------------------------------
            47.50%      31.00%     0.2391     0.0758       3.15      0.0008
          The standard error and p-values will be way too optimistic, because you now have 40 observations when you actually only have 10 subjects. With kappaetc (SSC*), we can "correct" for this using importance weights

          Code:
          . kappaetc gold score [iweight = 1/4]
          
          Interrater agreement                             Number of subjects =      10
                                                          Ratings per subject =       2
                                                  Number of rating categories =       3
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t    P>|t|   [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.4750    0.1665   2.85   0.019     0.0984     0.8516
          Brennan and Prediger |  0.2125    0.2497   0.85   0.417    -0.3523     0.7773
          Cohen/Conger's Kappa |  0.2391    0.1746   1.37   0.204    -0.1559     0.6342
                 Fleiss' Kappa |  0.1153    0.2790   0.41   0.689    -0.5159     0.7465
                     Gwet's AC |  0.2535    0.2485   1.02   0.334    -0.3086     0.8156
          Krippendorff's alpha |  0.1596    0.2790   0.57   0.581    -0.4717     0.7908
          ------------------------------------------------------------------------------
          Something similar might be possible for the ICC; but see Joseph's valid point about exchangeable readers above.

          Best
          Daniel


          * The output that you will get looks different, because I have an updated version of kappaetc that I am going to release soon.
          Last edited by daniel klein; 18 Jan 2018, 07:38. Reason: added reference to Joseph's good point about exchangeable readers



          • #6
            There is a minor bug in Joseph's code that is irrelevant to the point he makes. The line

            Code:
            generate byte uid = _n
            will produce missing values for all observations beyond 100, because the byte storage type holds integers only up to 100. It should be

            Code:
            generate long uid = _n
            where you could also omit the long in this case (the default float stores these values exactly).
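            A quick way to see the difference is a throwaway check along these lines (variable names made up)

            Code:
            clear
            set obs 106
            generate byte uid_byte = _n    // byte holds integers only up to 100: (6 missing values generated)
            generate long uid_long = _n    // long (or the default float) stores 1 to 106 without loss
            count if missing(uid_byte)     // reports 6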

            Best
            Daniel



            • #7
              Some additional thoughts on the usage of a gold standard. Different questions can be asked when a gold standard is given.

              The approach outlined in #5 assesses the extent of agreement between each reader/rater and the gold standard, separately. It does not actually address agreement among the (four) readers in any way. Gwet (2014, Ch. 11) discusses the concept of validity coefficients (as opposed to reliability coefficients), which basically assess the extent of agreement among readers/raters in classifying a given subject into the "true" category, i.e., the gold standard. I have not implemented this concept (mainly because the variance expressions have not been worked out for all coefficients).

              One could also ask for the extent of agreement among the readers/raters for each category of the gold standard. Gwet (2014, Ch. 11) calls this "conditional" reliability analysis. You could get this with

              Code:
              bysort reader1 : kappaetc reader2-reader5
              assuming that reader1 sets the gold standard.

              Only Berit can decide which questions should be answered.

              Best
              Daniel


              Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
              Last edited by daniel klein; 18 Jan 2018, 09:23.



              • #8
                Originally posted by daniel klein View Post
                There is a minor bug in Joseph's code . . .
                Yes. Sorry about that. I ought to have known better, and actually did suspect that I was in the neighborhood of the maximum, but was lulled into believing that the limit must be higher because I didn't see any (6 missing values generated) warning.



                • #9
                  Originally posted by daniel klein View Post
                  Fleiss kappa does not require unique readers.
                  Since the assumption that a new sample of coders is selected each time is not met, I think it's better to look for alternatives when the raters are always the same.
                  Here:
                  https://www.researchgate.net/publica...w_and_Tutorial
                  and in particular in the subsection "Common kappa-like variants for 3 or more coders", methods to calculate IRR with 3+ raters are discussed.
