
  • intra rater reliability, kap

    I have a categorical dataset with two reviewers and need to determine their agreement. Would I use kap var1 var2? Here is a hypothetical dataset similar in format to what I'm looking at. Could you also point me to guidance on how to interpret and properly word the results? Thanks!

    var1 var2
    0 1
    1 0
    0 0
    0 0
    1 1
    1 0
    0 1
    0 0
    1 1
    0 0

  • #2
    Assuming that

    1. you have the same two raters assessing the same items (call them R1 and R2), and,
    2. each item is rated exactly once by each rater, and,
    3. each observation in the above data represents one item, and,
    4. var1 is the rating assigned by R1, and
    5. var2 is the rating assigned by R2,

    then yes, -kap var1 var2- will give you Cohen's kappa as a measure of agreement.

    For the example data you show, the output you get is:
    Code:
    . kap var1 var2
    
                 Expected
    Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      60.00%      52.00%     0.1667     0.3162       0.53      0.2991
    I would report this as: Cohen's kappa for the two raters was 0.1667 (s.e. 0.3162), indicating slight agreement between the raters.

    In a detailed report, I would also mention that the expected agreement was 52%, and the observed agreement was 60%.
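
    For reference, the kappa reported above is just (observed − expected) / (1 − expected) = (0.60 − 0.52) / (1 − 0.52) ≈ 0.1667, matching the output.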

    I probably would not mention the Z statistic or p-value in any report: they're really not useful unless you are in a context where the null hypothesis that the two raters are just independent random number generators assigning ratings that have nothing to do with attributes of the items being rated would be plausible. But in most real-world applications of kappa, that null hypothesis is just a straw man, and reporting the p-value just muddies the waters.
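
    In case it helps to reproduce the output above, here is a minimal sketch that enters the example data from #1 by hand and runs -kap- (nothing beyond what was already shown):
    Code:
    clear
    input byte(var1 var2)
    0 1
    1 0
    0 0
    0 0
    1 1
    1 0
    0 1
    0 0
    1 1
    0 0
    end
    
    kap var1 var2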

    • #3
      Thanks, Clyde! You are amazing!
      Quick follow-up: is there a way to look at this using text data, similar to the above (yes, all of those assumptions are correct)? In other words, I am working on a systematic review where some of the abstracted data will be text rather than numeric values. Thanks again!

      var1 var2
      a a
      b b
      cd cd
      acd abd
      b1 b1

      • #4
        So it's really the same thing, except that you have to create a numeric encoding for your variables, because -kap- will not work with strings.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str3(var1 var2)
        "a"   "a"  
        "b"   "b"  
        "cd"  "cd" 
        "acd" "abd"
        "b1"  "b1" 
        end
        
        encode var1, gen(nvar1) label(nvar)
        encode var2, gen(nvar2) label(nvar)
        
        list, noobs clean
        
        kap nvar1 nvar2
        The key here is that you must specify the same label in both -encode- statements to guarantee that the same strings will receive the same numeric codes in both variables. Otherwise you would create spurious mismatches.
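
        As an optional sanity check (just a sketch, not required), you can confirm that both encoded variables draw on the single nvar mapping and inspect the cross-tabulation:
        Code:
        label list nvar
        tabulate nvar1 nvar2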

        • #5
          Clyde gives excellent advice as usual. I wish to add a few points.

          The title of this thread asks about intra-rater agreement; the query seems to be about inter-rater agreement.

          Clyde lists the requirements for kap in #2 and suggests an interpretation. I tend to agree, but in this example case (and otherwise, too) I would take a closer look at the z-statistic and p-value. If you cannot reject the "straw man" null, then I would not report "slight" agreement. If you are not interested in the chance agreement (however [ill-]defined) that constitutes the null, then why report a chance-corrected agreement coefficient in the first place?

          Problems with the kappa coefficient, such as dependency on the marginal distributions and paradoxical results for high observed agreement, have been pointed out. Therefore, alternative measures might be considered. See kappaetc (SSC) for a couple of such alternative measures, an alternative benchmarking method, and a simple way of testing less "straw man"-like null hypotheses. Here are the results for the example data:
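
          (A minimal setup sketch, in case kappaetc is not already installed; it is a user-written command available from SSC.)
          Code:
          ssc install kappaetc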

          Code:
          . kappaetc var1 var2
          
          Interrater agreement                             Number of subjects =      10
                                                          Ratings per subject =       2
                                                  Number of rating categories =       2
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t    P>|t|   [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   3.67   0.005     0.2306     0.9694
          Brennan and Prediger |  0.2000    0.3266   0.61   0.555    -0.5388     0.9388
          Cohen/Conger's Kappa |  0.1667    0.3322   0.50   0.628    -0.5849     0.9182
              Scott/Fleiss' Pi |  0.1667    0.3322   0.50   0.628    -0.5849     0.9182
                     Gwet's AC |  0.2308    0.3379   0.68   0.512    -0.5336     0.9952
          Krippendorff's Alpha |  0.2083    0.3322   0.63   0.546    -0.5432     0.9599
          ------------------------------------------------------------------------------
          All alternative coefficients show similarly low, though slightly higher, agreement. Compare the coefficients to the benchmark levels suggested by Landis and Koch (1977) while taking their standard errors into account (cf. Gwet 2014).

          Code:
          . kappaetc , benchmark showscale noheader
          ------------------------------------------------------------------------------
                               |                            P cum.     Probabilistic
                               |   Coef.   Std. Err. P in.   >95%   [Benchmark Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   0.11   0.963     0.2000     0.4000
          Brennan and Prediger |  0.2000    0.3266   0.72   1.000          .     0.0000
          Cohen/Conger's Kappa |  0.1667    0.3322   0.69   1.000          .     0.0000
              Scott/Fleiss' Pi |  0.1667    0.3322   0.69   1.000          .     0.0000
                     Gwet's AC |  0.2308    0.3379   0.74   1.000          .     0.0000
          Krippendorff's Alpha |  0.2083    0.3322   0.73   1.000          .     0.0000
          ------------------------------------------------------------------------------
          
               Benchmark scale
          
                     <0.0000      Poor
               0.0000-0.2000      Slight
               0.2000-0.4000      Fair
               0.4000-0.6000      Moderate
               0.6000-0.8000      Substantial
          Note that all chance-corrected coefficients indicate only "poor" agreement. Concerning a more relevant null hypothesis, test, for example, whether agreement is at least 0.2:

          Code:
          . kappaetc , testvalue(< 0.2) noheader
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t     P>t    [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.6000    0.1633   2.45   0.018     0.2306     0.9694
          Brennan and Prediger |  0.2000    0.3266   0.00   0.500    -0.5388     0.9388
          Cohen/Conger's Kappa |  0.1667    0.3322  -0.10   0.539    -0.5849     0.9182
              Scott/Fleiss' Pi |  0.1667    0.3322  -0.10   0.539    -0.5849     0.9182
                     Gwet's AC |  0.2308    0.3379   0.09   0.465    -0.5336     0.9952
          Krippendorff's Alpha |  0.2083    0.3322   0.03   0.490    -0.5432     0.9599
          ------------------------------------------------------------------------------
           t test Ho: Coef. <=    0.2000   Ha: Coef. >   0.2000
          The null hypothesis cannot be rejected at any conventional level.

          Concerning the follow-up on string values, Sar Reis needs to clarify whether there is any agreement between, e.g., "acd" and "abd". This could be the case if each of the three letters corresponds to a rating/category of its own. Krippendorff and Craggs (2016) discuss such concepts; I am not aware of a Stata implementation.

          Best
          Daniel

          Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
          Krippendorff, K., and Craggs, R. (2016). The Reliability of Multi-Valued Coding of Data. Communication Methods and Measures, 10, 181-198.
          Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

          • #6
            Thank you both! You have been incredibly helpful!
            The examples above were just hypothetical and I see how they are confusing.
            1. I am doing INTER. Thanks for catching that.
            2. I am doing a systematic review. Some data abstracted will be nominal, some will be text. I want to be able to report the reliability between the two...absolute and not average. I had already tried to access the Gwet book, but it is currently under request through interlibrary loan.
            3. The review will have two phases. The first will be to determine whether each study should be retained (Y/N). Then, there will be approximately 40 text/string entries and 30 nominal entries for each of the articles actually retained.

            Thanks again. I appreciate any clarification you can provide.
            Sar

            • #7
              UPDATE...

              The string variables would be words, and the "acd abd" example above was not a good one. The information in the string is NOT coded/does NOT correspond to ratings/categories.

              Would it be appropriate to report both kap and kappaetc?

              Thanks again. I appreciate any clarification you can provide.
              Sar

              • #8
                Originally posted by sar reis View Post
                The information in the string is NOT coded/does NOT correspond to ratings/categories.
                How is agreement defined then? Can you clarify or give a better example?

                Would it be appropriate to report both kap and kappaetc?
                kap estimates Cohen's kappa for two (unique) raters and Scott's Pi/Fleiss' K for more than two (non-unique) raters; kappaetc reports both (and other) coefficients, so kap does not add anything here. Use kap only if, for whatever reason, (a) you do not want to use user-written programs or (b) you want to report the analytic (asymptotic) standard error of kappa.
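
                To make the contrast concrete, a small sketch using the example variables from #2 (purely illustrative):
                Code:
                kap var1 var2        // Cohen's kappa with the analytic (asymptotic) std. err.
                kappaetc var1 var2   // Cohen/Conger's kappa plus the other coefficients shown in #5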

                Best
                Daniel

                • #9
                  Agreement would be defined by whether they match.

                  Rater 1: 3 x 3 randomized controlled trial
                  Rater 2: 3 x 3 randomized controlled trial
                  Coded: 0 0 (match)

                  Rater 1: 2 x 2 randomized controlled trial
                  Rater 2: 12 group randomized trial
                  Coded: 0 1 (don't match)

                  I'm open to other ways of coding. Ideas?

                  So, if I use kappaetc, which of the coefficients do you recommend reporting?

                  Thanks!

                  • #10
                    Agreement would be defined by whether they match.

                    Rater 1: 3 x 3 randomized controlled trial
                    Rater 2: 3 x 3 randomized controlled trial
                    Coded: 0 0 (match)

                    Rater 1: 2 x 2 randomized controlled trial
                    Rater 2: 12 group randomized trial
                    Coded: 0 1 (don't match)

                    I'm open to other ways of coding. Ideas?
                    I am not sure I completely follow this. The "3 x 3 randomized controlled trial" is an example of the contents of the string variables, right? From a technical point of view, it is hard to work with strings because of typos, leading and trailing spaces, and many other things (cf. Herrin and Poen 2008). Clyde gives good advice in #4 to use encode and convert to numeric with pre-defined value labels. However, you seem to prefer coding the examples by hand, which is also fine if feasible. Concerning the proposed coding scheme, you may want to consider whether all disagreements are equally severe: compared to "2 x 2 randomized controlled trial", is "12 group randomized trial" more or less in agreement than, say, "3 x 3 randomized controlled trial"? However, I believe that such questions can only be answered by you or someone who knows more about the content of your study.
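
                    If you do end up matching the raw strings, a minimal cleaning sketch along the lines of Herrin and Poen (2008), using the hypothetical variable names from the earlier posts, could precede Clyde's encode step:
                    Code:
                    * sketch: normalize case and whitespace before encoding (hypothetical variable names)
                    replace var1 = lower(itrim(trim(var1)))
                    replace var2 = lower(itrim(trim(var2)))
                    
                    encode var1, gen(nvar1) label(nvar)
                    encode var2, gen(nvar2) label(nvar)
                    
                    kappaetc nvar1 nvar2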

                    So, if I use kappaetc, which of the coefficients do you recommend reporting?
                    Well, one of the ideas behind kappaetc is to report (or at least look at) more than one coefficient if you do not have strong theoretical and/or methodological reasons to prefer one of them; this is why the command always estimates all coefficients. Ask yourself: are the coefficients all similar? If so, then your (probably somewhat arbitrary) choice of a specific coefficient does not seem to alter the substantive conclusions you draw. Do the coefficients differ substantially? If so, you may want to spend more time investigating the underlying assumptions and calculations to better understand why the coefficients produce different answers; you can then make a more informed choice. From a more practical point of view, you may want to skim the relevant literature in your field: what have others reported?

                    Best
                    Daniel


                    Herrin, J., and Poen, E. (2008). Stata tip 64: Cleaning up user-entered string variables. The Stata Journal, 8(3), 444-445.

                    • #11
                      Originally posted by daniel klein View Post
                      [...] Compare the coefficients to the benchmark levels suggested by Landis and Koch (1977) while taking their standard errors into account (cf. Gwet 2014).
                      Hi Klein, can you explain why the probabilistic benchmark interval limits for some coefficients are missing (or zero) but the cumulative probability is 1? I have output similar to yours above (shown below) and am having some difficulty explaining it.

                      Code:
                      Interrater agreement                             Number of subjects =     349
                                                                 Ratings per subject: min =       1
                                                                                      avg =  1.7249
                                                                                      max =       2
                                                              Number of rating categories =       4
                      ------------------------------------------------------------------------------
                                           |                            P cum.     Probabilistic
                                           |   Coef.  Std. Err.  P in.   >95%   [Benchmark Interval]
                      ---------------------+--------------------------------------------------------
                         Percent Agreement |  0.9644    0.0339   1.00   1.000          .     0.0000
                      Brennan and Prediger |  0.9526    0.0351   1.00   1.000          .     0.0000
                      Cohen/Conger's Kappa |  0.9431    0.0363   1.00   1.000          .     0.0000
                          Scott/Fleiss' Pi |  0.9427    0.0364   1.00   1.000          .     0.0000
                                 Gwet's AC |  0.9552    0.0348   1.00   1.000          .     0.0000
                      Krippendorff's Alpha |  0.9437    0.0184   1.00   0.999     0.8000     1.0000
                      ------------------------------------------------------------------------------
                      Thank you,

                      Yuchen

                      • #12
                        Originally posted by Yuchen Hou View Post
                        Hi Klein, can you explain why the probabilistic benchmark interval limits for some coefficients are missing (or zero) but the cumulative probability is 1?
                        Thanks for bringing this up. I have just posted an answer, including a workaround here.
