
  • intra rater reliability, kap

    I have a categorical dataset with two reviewers and need to determine their agreement. Would I use kap var1 var2? Here is a hypothetical dataset similar in format to what I'm looking at. Could you also point me to guidance on how to interpret and properly word the results? Thanks!

    var1 var2
    0 1
    1 0
    0 0
    0 0
    1 1
    1 0
    0 1
    0 0
    1 1
    0 0

  • #2
    Assuming that

    1. you have the same two raters assessing the same items (call them R1 and R2), and,
    2. each item is rated exactly once by each rater, and,
    3. each observation in the above data represents one item, and,
    4. var1 is the rating assigned by R1, and
    5. var2 is the rating assigned by R2,

    then yes, -kap var1 var2- will give you Cohen's kappa as a measure of agreement.

    For the example data you show, the output you get is:
    Code:
    . kap var1 var2
    
                 Expected
    Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      60.00%      52.00%     0.1667     0.3162       0.53      0.2991
    I would report this as: Cohen's kappa for the two raters was 0.1667 (s.e. 0.3162), indicating slight agreement between the raters.

    In a detailed report, I would also mention that the expected agreement was 52%, and the observed agreement was 60%.
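
    For reference, the kappa reported above is just (observed − expected) / (1 − expected) = (0.60 − 0.52) / (1 − 0.52) ≈ 0.1667, matching the output.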

    I probably would not mention the Z statistic or p-value in any report: they're really not useful unless you are in a context where the null hypothesis that the two raters are just independent random number generators assigning ratings that have nothing to do with attributes of the items being rated would be plausible. But in most real-world applications of kappa, that null hypothesis is just a straw man, and reporting the p-value just muddies the waters.
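
    In case it helps to reproduce the output above, here is a minimal sketch that enters the example data from #1 by hand and runs -kap- (nothing beyond what was already shown):
    Code:
    clear
    input byte(var1 var2)
    0 1
    1 0
    0 0
    0 0
    1 1
    1 0
    0 1
    0 0
    1 1
    0 0
    end
    
    kap var1 var2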

    • #3
      Thanks, Clyde! You are amazing!
      Quick follow-up: is there a way to look at this using text data, similar to the above (yes, all of those assumptions are correct)? In other words, I am working on a systematic review where some of the abstracted data will be text rather than numeric values. Thanks again!

      var1 var2
      a a
      b b
      cd cd
      acd abd
      b1 b1

      • #4
        So it's really the same thing, except that you have to create a numeric encoding for your variables, because -kap- will not work with strings.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str3(var1 var2)
        "a"   "a"  
        "b"   "b"  
        "cd"  "cd" 
        "acd" "abd"
        "b1"  "b1" 
        end
        
        encode var1, gen(nvar1) label(nvar)
        encode var2, gen(nvar2) label(nvar)
        
        list, noobs clean
        
        kap nvar1 nvar2
        The key here is that you must specify the same label in both -encode- statements to guarantee that the same strings will receive the same numeric codes in both variables. Otherwise you would create spurious mismatches.
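
        As an optional sanity check (just a sketch, not required), you can confirm that both encoded variables draw on the single nvar mapping and inspect the cross-tabulation:
        Code:
        label list nvar
        tabulate nvar1 nvar2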

        • #5
          Clyde gives excellent advice as usual. I wish to add a few points.

          The title of this thread asks about intra-rater agreement; the query seems to be about inter-rater agreement.

          Clyde lists the requirements for kap in #2 and suggests an interpretation. I tend to agree, but in this example case (and otherwise, too) I would take a closer look at the z-statistic and p-value. If you cannot reject the "straw man" null, then I would not report "slight" agreement. If you are not interested in the chance agreement (however [ill-]defined) that constitutes the null, then why report a chance-corrected agreement coefficient in the first place?

          Problems with the kappa coefficient, such as dependency on the marginal distributions and paradoxical results for high observed agreement, have been pointed out. Therefore, alternative measures might be considered. See kappaetc (SSC) for a couple of such alternative measures, an alternative benchmarking method, and a simple way of testing less "straw man"-like null hypotheses. Here are the results for the example data:
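
          (A minimal setup sketch, in case kappaetc is not already installed; it is a user-written command available from SSC.)
          Code:
          ssc install kappaetc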

          Code:
          . kappaetc var1 var2
          
          Interrater agreement                             Number of subjects =      10
                                                          Ratings per subject =       2
                                                  Number of rating categories =       2
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t    P>|t|   [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   3.67   0.005     0.2306     0.9694
          Brennan and Prediger |  0.2000    0.3266   0.61   0.555    -0.5388     0.9388
          Cohen/Conger's Kappa |  0.1667    0.3322   0.50   0.628    -0.5849     0.9182
              Scott/Fleiss' Pi |  0.1667    0.3322   0.50   0.628    -0.5849     0.9182
                     Gwet's AC |  0.2308    0.3379   0.68   0.512    -0.5336     0.9952
          Krippendorff's Alpha |  0.2083    0.3322   0.63   0.546    -0.5432     0.9599
          ------------------------------------------------------------------------------
          All alternative coefficients show similarly low, though slightly higher, agreement. Compare the coefficients to the benchmark levels suggested by Landis and Koch (1977) while taking their standard errors into account (cf. Gwet 2014).

          Code:
          . kappaetc , benchmark showscale noheader
          ------------------------------------------------------------------------------
                               |                            P cum.     Probabilistic
                               |   Coef.   Std. Err. P in.   >95%   [Benchmark Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   0.11   0.963     0.2000     0.4000
          Brennan and Prediger |  0.2000    0.3266   0.72   1.000          .     0.0000
          Cohen/Conger's Kappa |  0.1667    0.3322   0.69   1.000          .     0.0000
              Scott/Fleiss' Pi |  0.1667    0.3322   0.69   1.000          .     0.0000
                     Gwet's AC |  0.2308    0.3379   0.74   1.000          .     0.0000
          Krippendorff's Alpha |  0.2083    0.3322   0.73   1.000          .     0.0000
          ------------------------------------------------------------------------------
          
               Benchmark scale
          
                     <0.0000      Poor
               0.0000-0.2000      Slight
               0.2000-0.4000      Fair
               0.4000-0.6000      Moderate
               0.6000-0.8000      Substantial
          Note that all chance-corrected coefficients indicate only "poor" agreement. Concerning a more relevant null hypothesis, test, for example, whether agreement is at least 0.2:

          Code:
          . kappaetc , testvalue(< 0.2) noheader
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t     P>t    [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   2.45   0.018     0.2306     0.9694
          Brennan and Prediger |  0.2000    0.3266   0.00   0.500    -0.5388     0.9388
          Cohen/Conger's Kappa |  0.1667    0.3322  -0.10   0.539    -0.5849     0.9182
              Scott/Fleiss' Pi |  0.1667    0.3322  -0.10   0.539    -0.5849     0.9182
                     Gwet's AC |  0.2308    0.3379   0.09   0.465    -0.5336     0.9952
          Krippendorff's Alpha |  0.2083    0.3322   0.03   0.490    -0.5432     0.9599
          ------------------------------------------------------------------------------
           t test Ho: Coef. <=    0.2000   Ha: Coef. >   0.2000
          The null hypothesis cannot be rejected at any conventional level.

          Concerning the follow-up on string values, Sar Reis needs to clarify whether there is any agreement between, e.g., "acd" and "abd". This could be the case if each of the three letters corresponds to a rating/category of its own. Krippendorff and Craggs (2016) discuss such concepts; I am not aware of a Stata implementation.

          Best
          Daniel

          Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
          Krippendorff, K., and Craggs, R. (2016). The Reliability of Multi-Valued Coding of Data. Communication Methods and Measures, 10, 181-198.
          Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

          • #6
            Thank you both! You have been incredibly helpful!
            The examples above were just hypothetical and I see how they are confusing.
            1. I am doing INTER. Thanks for catching that.
            2. I am doing a systematic review. Some data abstracted will be nominal, some will be text. I want to be able to report the reliability between the two...absolute and not average. I had already tried to access the Gwet book, but it is currently under request through interlibrary loan.
            3. The review will have two phases. The first will be to determine whether each study should be retained (Y/N). Then, there will be approximately 40 text/string entries and 30 nominal entries for each of the articles actually retained.

            Thanks again. I appreciate any clarification you can provide.
            Sar

            • #7
              UPDATE...

              The string variables would be words, and the "acd abd" example above was not a good one. The information in the string is NOT coded/does NOT correspond to ratings/categories.

              Would it be appropriate to report both kap and kappaetc?

              Thanks again. I appreciate any clarification you can provide.
              Sar

              • #8
                Originally posted by sar reis View Post
                The information in the string is NOT coded/does NOT correspond to ratings/categories.
                How is agreement defined then? Can you clarify or give a better example?

                Would it be appropriate to report both kap and kappaetc?
                kap estimates Cohen's kappa for two (unique) raters and Scott's Pi/Fleiss' K for more than two (non-unique) raters; kappaetc reports both (and other) coefficients, so kap does not add anything here. Use kap only if, for whatever reason, (a) you do not want to use user-written programs or (b) you want to report the analytic (asymptotic) standard error of kappa.
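
                To make the contrast concrete, a small sketch using the example variables from #2 (purely illustrative):
                Code:
                kap var1 var2        // Cohen's kappa with the analytic (asymptotic) std. err.
                kappaetc var1 var2   // Cohen/Conger's kappa plus the other coefficients shown in #5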

                Best
                Daniel

                • #9
                  Agreement would be defined by whether they match.

                  Rater 1: 3 x 3 randomized controlled trial
                  Rater 2: 3 x 3 randomized controlled trial
                  Coded: 0 0 (match)

                  Rater 1: 2 x 2 randomized controlled trial
                  Rater 2: 12 group randomized trial
                  Coded: 0 1 (don't match)

                  I'm open to other ways of coding. Ideas?

                  So, if I use kappaetc, which of the coefficients do you recommend reporting?

                  Thanks!

                  • #10
                    Agreement would be defined by whether they match.

                    Rater 1: 3 x 3 randomized controlled trial
                    Rater 2: 3 x 3 randomized controlled trial
                    Coded: 0 0 (match)

                    Rater 1: 2 x 2 randomized controlled trial
                    Rater 2: 12 group randomized trial
                    Coded: 0 1 (don't match)

                    I'm open to other ways of coding. Ideas?
                    I am not sure I completely follow this. The "3 x 3 randomized controlled trial" is an example of the contents of the string variables, right? From a technical point of view, it is hard to work with strings because of typos, leading and trailing spaces, and many other things (cf. Herrin and Poen 2008). Clyde gives good advice in #4 to use encode and convert to numeric with pre-defined value labels. However, you seem to prefer coding the examples by hand, which is also fine if feasible. Concerning the proposed coding scheme, you may want to consider whether all disagreements are equally severe: compared to "2 x 2 randomized controlled trial", is "12 group randomized trial" more or less in agreement than, say, "3 x 3 randomized controlled trial"? However, I believe that such questions can only be answered by you or someone who knows more about the content of your study.
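
                    If you do end up matching the raw strings, a minimal cleaning sketch along the lines of Herrin and Poen (2008), using the hypothetical variable names from the earlier posts, could precede Clyde's encode step:
                    Code:
                    * sketch: normalize case and whitespace before encoding (hypothetical variable names)
                    replace var1 = lower(itrim(trim(var1)))
                    replace var2 = lower(itrim(trim(var2)))
                    
                    encode var1, gen(nvar1) label(nvar)
                    encode var2, gen(nvar2) label(nvar)
                    
                    kappaetc nvar1 nvar2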

                    So, if I use kappaetc, which of the coefficients do you recommend reporting?
                    Well, one of the ideas behind kappaetc is to report (or at least look at) more than one coefficient if you do not have strong theoretical and/or methodological reasons to prefer one of them; this is why the command always estimates all coefficients. Ask yourself: are the coefficients all similar? If so, then your (probably somewhat arbitrary) choice of a specific coefficient does not seem to alter the substantive conclusions you draw. Do the coefficients differ substantially? If so, you may want to spend more time investigating the underlying assumptions and calculations to better understand why the coefficients produce different answers; you can then make a more informed choice. From a more practical point of view, you may want to skim the relevant literature in your field: what have others reported?

                    Best
                    Daniel


                    Herrin, J., and Poen, E. (2008). Stata tip 64: Cleaning up user-entered string variables. The Stata Journal, 8(3), 444-445.

                    • #11
                      Originally posted by daniel klein View Post
                      [...] Compare the coefficients to the benchmark levels suggested by Landis and Koch (1977) while taking their standard errors into account (cf. Gwet 2014).
                      Hi Klein, can you explain why the probabilistic benchmark interval limits for some coefficients are missing (or zero) but the cumulative probability is 1? I have output similar to yours above (shown below) and am having some difficulty explaining it.

                      Code:
                      Interrater agreement                             Number of subjects =     349
                                                                 Ratings per subject: min =       1
                                                                                      avg =  1.7249
                                                                                      max =       2
                                                              Number of rating categories =       4
                      ------------------------------------------------------------------------------
                                           |                            P cum.     Probabilistic
                                           |   Coef.  Std. Err.  P in.   >95%   [Benchmark Interval]
                      ---------------------+--------------------------------------------------------
                         Percent Agreement |  0.9644    0.0339   1.00   1.000          .     0.0000
                      Brennan and Prediger |  0.9526    0.0351   1.00   1.000          .     0.0000
                      Cohen/Conger's Kappa |  0.9431    0.0363   1.00   1.000          .     0.0000
                          Scott/Fleiss' Pi |  0.9427    0.0364   1.00   1.000          .     0.0000
                                 Gwet's AC |  0.9552    0.0348   1.00   1.000          .     0.0000
                      Krippendorff's Alpha |  0.9437    0.0184   1.00   0.999     0.8000     1.0000
                      ------------------------------------------------------------------------------
                      Thank you,

                      Yuchen

                      • #12
                        Originally posted by Yuchen Hou View Post
                        Hi Klein, can you explain why the probabilistic benchmark interval limits for some coefficients are missing (or zero) but the cumulative probability is 1?
                        Thanks for bringing this up. I have just posted an answer, including a workaround here.
