
  • Fleiss kappa or ICC for interrater agreement (multiple readers, dichotomous outcome), and the correct Stata command

    106 units are all assessed by the same 5 readers. Units were judged either positive or negative (dichotomous outcome).

    What is the best applied statistical test to look at interrater agreement?

    1) Is the ICC (two-way random-effects model, single rater, agreement) useful, or does that only apply to continuous or categorical data with >2 possible ratings?
    2) Is Fleiss kappa the statistical test of choice, and if so,
    a) is the correct Stata command kappa pos neg (when data are organised as: column 1, subject id; column 2, number of positive reads (pos); column 3, number of negative reads (neg))?
    b) Does this test allow one reader to serve as the basis/gold standard (the one the others should agree with)?

    I will be very grateful for your input.

  • #2
    106 units are all assessed by the same 5 readers. Units were judged either positive or negative (dichotomous outcome).

    What is the best applied statistical test to look at interrater agreement?
    Perhaps there is not one best statistic, but several approaches.

    1) Is the ICC (two-way random-effects model, single rater, agreement) useful, or does that only apply to continuous or categorical data with >2 possible ratings?
    I think the ICC would, in general, be an option. If your readers are considered a random sample and if interest lies in the population of readers, then the two-way random-effects model is appropriate. If the five readers are the only ones of interest, then use a mixed-effects model. Stata's icc command estimates these models; however, with the data setup that you describe in 2), I do not see how you could estimate an ICC.
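    For illustration only, here is a minimal sketch of what icc expects, assuming the individual ratings were available in long format with one row per subject-reader pair (the variable names rating, subject, and reader are made up)

    Code:
    * two-way random-effects model (readers viewed as a random sample)
    icc rating subject reader
    * two-way mixed-effects model (these five readers are the only ones of interest)
    icc rating subject reader, mixed
    Either call reports both an individual and an average ICC; note that icc is based on ANOVA-type estimators, so the 0/1 ratings would be treated as continuous measurements.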

    2) Is Fleiss kappa the statistical test of choice?
    Fleiss kappa is one of many chance-corrected agreement coefficients. These coefficients are all based on the (average) observed proportion of agreement. Given the design that you describe, i.e., five readers assigning binary ratings, there cannot be fewer than 3 out of 5 agreements for a given subject. That means that agreement has, by design, a lower bound of 0.6. Keeping this in mind, you can compute Fleiss kappa with the syntax that you suggest

    Code:
    kappa pos neg
    I encourage you to download kappaetc (from SSC), which estimates Fleiss kappa and other chance-corrected agreement coefficients. The syntax is nearly identical

    Code:
    kappaetc pos neg , frequency
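    In case kappaetc is not installed yet, it can be obtained from SSC in the usual way

    Code:
    ssc install kappaetc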
    Compare the different coefficients to get a sense of how sensitive your results are to the choice of statistic.

    b) Does this test allow one reader to serve as the basis/gold standard (the one the others should agree with)?
    In general: no. A gold standard requires that the reader who is assumed to provide the gold standard be identified. Fleiss kappa does not require unique readers. Since readers are unique in your case, you could probably still use Fleiss kappa. However, besides complications that arise because you have more than two readers, the reader or gold standard for each subject must still be identified in the data, and this is not possible if you have only recorded the number of positive and negative ratings.

    Best
    Daniel
    Last edited by daniel klein; 18 Jan 2018, 04:58. Reason: wrong delimiters and spelling



    • #3
      Thank you very much for your comments.

      Our readers are considered a random sample and interest lies in the population of readers.
      I do actually have the recordings of each individual reader, and so have the opportunity to set up the data as required for a specific test, e.g., the ICC.
      Also, one reader is considered the expert and hence could be our gold standard. Is the Stata output 'individual', when applying the icc command, not referring to a single reader who is considered the gold standard? Or did I misunderstand the concept of 'single reader'?
      If so, how do I tell Stata which variable (reader) should be considered the gold standard?



      • #4
        If you want to consider readers as a random sample, then you are considering them to be exchangeable. Considering one reader's judgments to set a "gold standard" isn't really consistent with that.

        Anyway, if you want to estimate an intraclass correlation (ICC) coefficient considering both units and judges as random, and with judgments a dichotomous outcome, then you could fit a cross-classified random effects logistic regression model and compute the ICC coefficient from the fitted variances. Something like the following. (Begin at the "Begin here" comment. The stuff before is just to set up an artificial dataset that mimics yours.)

        Code:
        . version 15.1

        . 
        . clear *

        . 
        . set seed `=strreverse("1426289")'

        . 
        . quietly set obs 106

        . generate byte uid = _n

        . generate double v_u = rnormal(0, 2)

        . 
        . // "two-way random effect model"
        . tempfile tmpfil0

        . quietly save `tmpfil0'

        . 
        . drop _all

        . quietly set obs 5

        . generate byte jid = _n

        . generate double j_u = rnormal(0, sqrt(1/10))

        . 
        . cross using `tmpfil0'

        . 
        . generate double xbu = v_u + j_u

        . generate byte outcome = rbinomial(1, invlogit(xbu))

        . 
        . *
        . * Begin here
        . *
        . melogit outcome || _all: R.jid || uid:, intmethod(mvaghermite) intpoints(3) nolog

        Mixed-effects logistic regression               Number of obs     =        500

        -------------------------------------------------------------
                        |     No. of       Observations per Group
         Group Variable |     Groups    Minimum    Average    Maximum
        ----------------+--------------------------------------------
                   _all |          1        500      500.0        500
                    uid |        100          5        5.0          5
        -------------------------------------------------------------

        Integration method: mvaghermite                 Integration pts.  =          3

                                                        Wald chi2(0)      =          .
        Log likelihood = -309.97038                     Prob > chi2       =          .
        ------------------------------------------------------------------------------
             outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               _cons |   .0038208   .2414489     0.02   0.987    -.4694104     .477052
        -------------+----------------------------------------------------------------
        _all>jid     |
           var(_cons)|   .0555947   .0770155                      .0036801     .839856
        -------------+----------------------------------------------------------------
        uid          |
           var(_cons)|   2.819757   .7354933                      1.691173     4.70149
        ------------------------------------------------------------------------------
        LR test vs. logistic model: chi2(2) = 73.20               Prob > chi2 = 0.0000

        Note: LR test is conservative and provided only for reference.

        . display in smcl as text "ICC = " as result %04.2f ///
        >         _b[/var(_cons[uid])] / (_b[/var(_cons[uid])] + _b[/var(_cons[_all>jid])] + _pi^2 / 3)
        ICC = 0.46

        . 
        . exit

        end of do-file

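        If an approximate standard error or confidence interval for this latent-scale ICC were wanted, something along the following lines could be run right after the melogit call. This is only a sketch that assumes the same variance-parameter labels shown in the log above; the delta-method interval is not constrained to lie in [0, 1].

        Code:
        * delta-method sketch for the latent-scale ICC after the melogit fit above
        nlcom (icc: _b[/var(_cons[uid])] / ///
            (_b[/var(_cons[uid])] + _b[/var(_cons[_all>jid])] + _pi^2/3))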

        But keep Daniel's excellent advice in mind: there might not be one best statistic, and you need to carefully consider just what it is that you want to accomplish.



        • #5
          Is the Stata output 'individual', when applying the icc command, not referring to a single reader who is considered the gold standard?
          No, it is not. The term "individual" refers to the ratings that are compared. You would use the "average" ICC if you had teams of readers and wanted to compare averages of ratings. I do not believe that such a situation occurs often, but then again, I do not really do much inter-rater reliability analysis.

          Anyway, comparing to a gold standard is straightforward when only two readers are involved. With more than two readers, I have no clear idea how to do this, technically. For agreement coefficients, one possible way that I can think of is to have the gold standard ratings repeated for each of the other four readers. Here is an example

          Code:
          // example data
          webuse p615b , clear
          
          rename rater1 gold
          rename rater# score#
          
          list
          
          reshape long score , i(subject) j(rater)
          
          sort subject rater
          list in 1/10 , sepby(subject)
          This gives

          Code:
          [...]
          . list
          
               +----------------------------------------------------+
               | subject   gold   score2   score3   score4   score5 |
               |----------------------------------------------------|
            1. |       1      1        2        2        2        2 |
            2. |       2      1        1        3        3        3 |
            3. |       3      3        3        3        3        3 |
            4. |       4      1        1        1        1        3 |
            5. |       5      1        1        1        3        3 |
               |----------------------------------------------------|
            6. |       6      1        2        2        2        2 |
            7. |       7      1        1        1        1        1 |
            8. |       8      2        2        2        2        3 |
            9. |       9      1        3        3        3        3 |
           10. |      10      1        1        1        3        3 |
               +----------------------------------------------------+
          
          .
          . reshape long score , i(subject) j(rater)
          (note: j = 2 3 4 5)
          
          [...]
          
          . list in 1/10 , sepby(subject)
          
               +--------------------------------+
               | subject   rater   gold   score |
               |--------------------------------|
            1. |       1       2      1       2 |
            2. |       1       3      1       2 |
            3. |       1       4      1       2 |
            4. |       1       5      1       2 |
               |--------------------------------|
            5. |       2       2      1       1 |
            6. |       2       3      1       3 |
            7. |       2       4      1       3 |
            8. |       2       5      1       3 |
               |--------------------------------|
            9. |       3       2      3       3 |
           10. |       3       3      3       3 |
               +--------------------------------+
          [...]
          You could then estimate agreement with

          Code:
          . kap gold score
          
                       Expected
          Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
          -----------------------------------------------------------------
            47.50%      31.00%     0.2391     0.0758       3.15      0.0008
          The standard error and p-values will be way too optimistic, because you now have 40 observations when you actually only have 10 subjects. With kappaetc (SSC*), we can "correct" for this using importance weights

          Code:
          . kappaetc gold score [iweight = 1/4]
          
          Interrater agreement                             Number of subjects =      10
                                                          Ratings per subject =       2
                                                  Number of rating categories =       3
          ------------------------------------------------------------------------------
                               |   Coef.   Std. Err.   t    P>|t|   [95% Conf. Interval]
          ---------------------+--------------------------------------------------------
             Percent Agreement |  0.4750    0.1665   2.85   0.019     0.0984     0.8516
          Brennan and Prediger |  0.2125    0.2497   0.85   0.417    -0.3523     0.7773
          Cohen/Conger's Kappa |  0.2391    0.1746   1.37   0.204    -0.1559     0.6342
                 Fleiss' Kappa |  0.1153    0.2790   0.41   0.689    -0.5159     0.7465
                     Gwet's AC |  0.2535    0.2485   1.02   0.334    -0.3086     0.8156
          Krippendorff's alpha |  0.1596    0.2790   0.57   0.581    -0.4717     0.7908
          ------------------------------------------------------------------------------
          Something similar might be possible for the ICC; but see Joseph's valid point about exchangeable readers above.

          Best
          Daniel


          * The output that you will get looks different, because I have an updated version of kappaetc that I am going to release soon.
          Last edited by daniel klein; 18 Jan 2018, 07:38. Reason: added reference to Joseph's good point about exchangeable readers



          • #6
            There is a minor bug in Joseph's code that is irrelevant to the point he makes. The line

            Code:
            generate byte uid = _n
            will produce missing values for all observations beyond 100, because the byte storage type holds integers only up to 100. It should be

            Code:
            generate long uid = _n
            where you could also omit the long in this case (the default float stores these values exactly).
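            A quick way to see the difference is a throwaway check along these lines (variable names made up)

            Code:
            clear
            set obs 106
            generate byte uid_byte = _n    // byte holds integers only up to 100: (6 missing values generated)
            generate long uid_long = _n    // long (or the default float) stores 1 to 106 without loss
            count if missing(uid_byte)     // reports 6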

            Best
            Daniel



            • #7
              Some additional thoughts on the usage of a gold standard. Different questions can be asked when a gold standard is given.

              The approach outlined in #5 assesses the extent of agreement between each reader/rater and the gold standard, separately. It does not actually address agreement among the (four) readers in any way. Gwet (2014, Ch. 11) discusses the concept of validity coefficients (as opposed to reliability coefficients), which basically assess the extent of agreement among readers/raters in classifying a given subject into the "true" category, i.e., the gold standard. I have not implemented this concept (mainly because the variance expressions have not been worked out for all coefficients).

              One could also ask for the extent of agreement among the readers/raters for each category of the gold standard. Gwet (2014, Ch. 11) calls this "conditional" reliability analysis. You could get this with

              Code:
              bysort reader1 : kappaetc reader2-reader5
              assuming that reader1 sets the gold standard.

              Only Berit can decide which questions should be answered.

              Best
              Daniel


              Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
              Last edited by daniel klein; 18 Jan 2018, 09:23.



              • #8
                Originally posted by daniel klein View Post
                There is a minor bug in Joseph's code . . .
                Yes. Sorry about that. I ought to have known better, and actually did suspect that I was in the neighborhood of the maximum, but was lulled into believing that the limit must be higher because I didn't see any (6 missing values generated) warning.



                • #9
                  Originally posted by daniel klein View Post
                  Fleiss kappa does not require unique readers.
                  Since the assumption that a new sample of coders is selected each time is not met, I think it's better to look for alternatives when the raters are always the same.
                  Here:
                  https://www.researchgate.net/publica...w_and_Tutorial
                  and in particular in the subsection "Common kappa-like variants for 3 or more coders", methods to calculate IRR with 3+ raters are discussed.
