  • Interrater agreement and Fleiss' kappa

    Hi everybody, I'm Francesco Bianchi, an Italian medical resident at the preventive medicine school of Bari. I use Stata/SE 14 and I have an issue with a research topic (sorry for my English skills!).

    15 expert physicians have evaluated 20 colonoscopy videos of patients affected by a chronic disease. For every video they assigned two independent scores to describe the severity of the pathology: score A (from 0 to 4, a categorical variable) and score B, a newly developed score (from 0 to 50, a discrete variable). The objective of the study is to evaluate the interrater agreement among physicians for scores A and B.

    To evaluate the agreement on scores A and B (for each video), I'm going to use Fleiss' kappa (I have already downloaded the kappaetc command), with a 95% CI and p-value.
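
    For reference, here is a minimal sketch of what such a call might look like, assuming the data are arranged with one observation per video and one variable per physician; the variable names rater1-rater15 are placeholders, not taken from the thread.

    Code:
    * Hedged sketch only: install kappaetc from SSC and run it on a wide layout.
    ssc install kappaetc

    * One observation per video, one variable per physician's rating.
    * kappaetc reports several chance-corrected agreement coefficients,
    * including a Fleiss'-type kappa, each with standard error and 95% CI.
    kappaetc rater1-rater15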

    My doubts are:
    1) Is it correct to evaluate Fleiss' kappa for both scores?
    2) If point 1) is correct, I will obtain 20 kappas for each score (one per video); how can I obtain a final, global kappa (for each score), with a 95% CI and p-value?

    I specify that the two scores are independent, so I expect to end up with two uncorrelated kappas.
    I hope I was clear.

    Thank you in advance,
    Francesco

  • #2
    I am not sure I completely follow your layout or what you are trying to estimate. You say

    15 expert physicians have evaluated 20 videos [...]. For every video they assigned two independent scores [...]: score A (from 0 to 4, a categorical variable) and score B, a newly developed score (from 0 to 50, a discrete variable).
    Usually, one would define inter-rater agreement as agreement among the physicians using scoring method A and/or agreement among physicians using scoring method B. This would result in two kappa (or some alternative agreement) coefficients. One would then be in a position to say whether agreement among physicians is higher/lower when using method A than when using method B.

    Since you plan to come up with 20 kappa coefficients you seem to have something else in mind. Perhaps you want to estimate agreement between the two methods, A and B. If so, how do you define agreement between the two scoring methods? Could you clarify?

    EDIT:

    On second thought, you say that you

    obtain 20 kappas for each score
    (emphasis mine), i.e., a total of 40 kappa values. If this is where you (think) you are going, you are probably misunderstanding something. When you estimate one kappa value per video (I am not even entirely sure the current version of kappaetc [SSC, by the way] allows you to do this), you cannot estimate a standard error, and you cannot estimate a confidence interval either.

    Best
    Daniel
    Last edited by daniel klein; 06 Jun 2018, 14:21.



    • #3
      I don't want to estimate the agreement between the two methods. I need two kappas, one for each score.

      An example (values are not real):

      SCORE A
      video 1 kappa = 0.7; 95%CI 0.6 - 0.8; p = 0.000
      video 2 kappa = 0.8; 95%CI 0.4 - 0.9; p = 0.010
      video 3 kappa = 0.6; 95%CI 0.4 - 0.7; p = 0.067
      .
      .
      .
      .
      video 20 kappa = 0.7; 95%CI 0.4 - 0.9; p = 0.003
      From all these values I need a final kappa value for score A, with a 95% CI and p-value (if possible).


      SCORE B
      video 1 kappa = 0.4; 95%CI 0.2 - 0.5; p = 0.000
      video 2 kappa = 0.3; 95%CI 0.1 - 0.4; p = 0.010
      video 3 kappa = 0.5; 95%CI 0.4 - 0.7; p = 0.067
      .
      .
      .
      .
      video 20 kappa = 0.6; 95%CI 0.4 - 0.9; p = 0.003
      From all these values I need a final kappa value for score B, with a 95% CI and p-value (if possible).

      Could the mean of the per-video kappas be a way to obtain what I'm looking for? (Honestly, I don't think so...)



      • #4
        Can you show/explain how your data are set up? Can you also show the commands you used? I cannot really understand how you obtain standard errors and CIs if there is only one subject (i.e., video) that is rated/classified.

        From what I understand, there are two ways to record your data (the raw ratings, I assume).

        1. You have 16 variables and 40 observations. One variable records the method being used; the other 15 variables represent the physicians and hold their respective ratings. Observations 1-20 represent ratings of the 20 videos using method "A" (for those, the first variable would take on the value indicating "A"), and observations 21-40 represent ratings of the same 20 videos using method "B".

        2. You have 30 variables and 20 observations. Variables 1-15 hold the physicians' ratings using method "A" and variables 16-30 hold their ratings using method "B". The observations represent the videos being rated. (A sketch of the corresponding kappaetc calls follows below.)
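
        The following sketch shows how kappaetc might be called under either layout; all variable names (method, rater1-rater15, a_rater1-a_rater15, b_rater1-b_rater15) are placeholders, not taken from the thread.

        Code:
        * Layout 1: 40 observations (20 videos x 2 methods), 16 variables.
        * "method" is assumed to be a string variable marking the scale;
        * rater1-rater15 hold the physicians' ratings.
        kappaetc rater1-rater15 if method == "A"
        kappaetc rater1-rater15 if method == "B"

        * Layout 2: 20 observations (the videos), 30 variables.
        kappaetc a_rater1-a_rater15    // scale A ratings
        kappaetc b_rater1-b_rater15    // scale B ratings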


        Best
        Daniel
        Last edited by daniel klein; 06 Jun 2018, 14:38.



        • #5
          Maybe I have a solution. I could build two datasets:
          - dataset A = 15 variables (physicians) and 20 observations (videos) --> kappa for score A
          - dataset B = 15 variables (physicians) and 20 observations (videos) --> kappa for score B

          I think this should work.

          Can I use Fleiss' kappa for both scores, or something else?
          Thank you



          • #6
            If you feel more comfortable with two separate datasets, then go for it; as I explained in #4, that is not necessary, though.

            Now that we have pretty much clarified the technical details of data preparation, let us turn to your question (and one of your statements).

            Can you use Fleiss' kappa for both cases?

            I would say that, in principle, yes, you can use Fleiss' kappa (or any other agreement coefficient) in both cases. I would say so because in both cases you have a set of predefined, distinct rating categories. One alternative might be an intraclass correlation coefficient (ICC) for interval-scale measures. While scale B can probably be regarded as an interval scale, it seems questionable whether scale A would qualify as such. Either way, since your goal is to compare agreement between the two cases, you should use comparable (ideally the same) measures/statistics for both cases.

            Sticking with an agreement coefficient, say kappa, you still need to think about the level of measurement of both scales and what it implies. For example, using scale A, you would probably have good reasons to assume that values 3 and 4 are more in agreement than values 1 and 4. Perhaps you want some sort of weights to account for such partial agreement (ordinal weights might be appropriate here). Likewise, using scale B, values 41 and 42 are more similar to each other than 39 and 42. Here, you almost definitely want a weighted kappa, because with only 20 subjects (videos) the raters cannot even assign half of the 50 possible values of the scale; you would not expect "exact" (i.e., unweighted) agreement to be high in such a situation. There are more tricky questions: Is the difference between 3 and 4 on scale A similar to the difference between 41 and 42 on scale B? Should the same set of weights be used in both cases?
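
            If I recall the kappaetc syntax correctly, weighting is requested through the wgt() option; here is a hedged sketch using the placeholder variable names from #4 (check help kappaetc for the exact argument names).

            Code:
            * Ordinal weights for the 0-4 scale A, quadratic weights for the 0-50 scale B.
            kappaetc a_rater1-a_rater15, wgt(ordinal)
            kappaetc b_rater1-b_rater15, wgt(quadratic)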

            Concerning your statement that the two kappa values are independent (not correlated), I disagree. The two coefficients are based on the exact same subjects (videos) so they are definitely correlated.

            Best
            Daniel
            Last edited by daniel klein; 07 Jun 2018, 03:59.



            • #7
              There are more tricky questions: Is the difference between 3 and 4 on scale A similar to the difference between 41 and 42 on scale B? Should the same set of weights be used in both cases?
              I think not

              Concerning your statement that the two kappa values are independent (not correlated), I disagree. The two coefficients are based on the exact same subjects (videos) so they are definitely correlated.
              What do you suggest?

              Thank you



              • #8
                Your first quote of my post includes two questions; so is it an "I think not" to both? If so, I tend to agree. For starters, I would probably use ordinal weights for scale A and linear or quadratic weights for scale B. Perhaps you have a good understanding of the rating scales in question and can come up with suitable customized weights. Either way, I would play around with different sets of weights a bit, to get a feeling for the impact of a particular choice on the substantive conclusions.
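
                One way to "play around" with the weights, sketched with the same placeholder names as above: the loop simply reruns kappaetc under different weighting schemes so the resulting coefficients can be compared side by side.

                Code:
                * Compare agreement on scale B under a few weighting schemes.
                foreach w in linear quadratic ordinal {
                    display _newline as text "--- wgt(`w') ---"
                    kappaetc b_rater1-b_rater15, wgt(`w')
                }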

                Concerning the dependent kappa values, you need to decide whether this is a problem. It could be a problem if you wish to perform a statistical test of (mean) differences. In this case, I suggest you read Gwet (2016) and see whether this is a way to proceed. kappaetc has this procedure implemented, but it will not allow you to test coefficients that are based on different rating categories and/or weights for (dis)agreement. I can copy a work-around from the certification script that I use to certify the results that kappaetc provides.

                Best
                Daniel


                Gwet, K. L. 2016. Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. Educational and Psychological Measurement 76: 609-637.
                Last edited by daniel klein; 07 Jun 2018, 05:40.



                • #9
                  Originally posted by daniel klein
                  For starters, I would probably use ordinal weights for scale A and linear or quadratic weights for scale B. Perhaps you have a good understanding of the rating scales in question and can come up with suitable customized weights. Either way, I would play around with different sets of weights a bit, to get a feeling for the impact of a particular choice on the substantive conclusions.
                  Can you help me with the code? Thank you



                  • #10
                    Could you be more specific, please? What exactly did you try/type? What exactly did Stata do in response? Why do you feel it is not what you wanted?

                    The way you have asked the last two questions is not exactly motivating for me; I would still help you, but when you ask for (detailed/customized) code you need to supply details.

                    The best way to answer the above questions is to provide an example using dataex.
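
                    For completeness, a sketch of what posting such an example could look like; the variable names are placeholders again, and on Stata 14 dataex may need to be installed first.

                    Code:
                    * Install dataex if it is not already available, then post an
                    * extract of the raw ratings (here: the first 10 videos).
                    ssc install dataex
                    dataex a_rater1-a_rater15 in 1/10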

                    Best
                    Daniel

