
  • How to determine interrater agreement with many missing values?

    Dear all,

    I want to calculate the interrater agreement (e.g. Fleiss' kappa or Krippendorff's alpha); however, my dataset contains many missing values. A profile (varname=profileid) has been rated on average by 5 raters on trustworthiness (varname=trustworth7, etc.). In total, I have 189 raters and 259 profiles. A profile has many missing values because it has only been rated by a small subset of the total population of raters. When I use the command kappaetc I get very low values (e.g. Krippendorff's alpha = 0.0061), which I think has to do with all the missing values. Does anyone know how to tackle this?

    See below for an example of my dataset with some profiles and the trustworthiness scores of three raters.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int profileid float(trustworth7 trustworth8 trustworth9)
      1 .        .         .
      3 .        .         .
      5 .        .         .
      6 .        .         .
      7 .        .         .
      8 .        .         .
      9 .        .         .
     10 .        .         .
     11 .        .         .
     13 .        .         .
     15 .        .         .
     17 .        .         .
     18 .        .         .
     19 .        .         .
     20 .        .         .
     21 .        .         .
     22 .        .         .
     23 .        .         .
     24 .        .         .
     25 .        .         .
     26 .        .         .
     27 .        .         .
     28 .        .         .
     29 .        .         .
     30 .        .         .
     31 .        .         .
     32 .        .         .
     34 .        .         .
     35 .        .         .
     36 .        .         .
     37 .        .         .
     40 .        .         .
     41 .        .         .
     42 .        .         .
     45 .        .         .
     47 .        .         .
     49 .        .         .
     50 .        .         .
     51 .        .         .
     52 .        .         .
     53 .        .         .
     54 .        .         .
     57 .        .         .
     58 .        .         .
     60 .        .         .
     61 .        .         .
     63 .        .         .
     64 .        .         .
     65 .        .  2.833333
     66 .        .         .
     69 .        .         .
     70 .        .         .
     72 .        .         .
     74 .        .         .
     75 .        .         .
     76 .        .         .
     80 .        .         .
     82 .        .         .
     83 .        .         .
     84 .        .         .
     87 .        .  3.666667
     88 .        .         .
     89 .        .         .
     92 .        .         .
     94 .        .         .
     99 .        .         .
    102 .        .         .
    103 .        .         .
    104 .        .         .
    105 .        .         .
    106 .        .         .
    108 .        .         .
    109 .        .         .
    110 .        .         .
    112 .        .         .
    113 .        .         .
    114 .        .         .
    115 .        .         .
    116 .        .         .
    117 .        .         .
    120 .        .         .
    122 .        .         .
    124 2 5.666667         .
    125 .        .         .
    126 .        .         .
    127 .        .         .
    128 .        .         .
    129 2 5.666667         .
    130 .        .         .
    131 .        .         .
    132 .        . 2.1666667
    133 .        .         .
    135 .        .         .
    137 .        .         .
    139 .        .         .
    142 2 5.666667         .
    143 .        .         .
    144 .        .         .
    145 .        .         .
    146 .        .         .
    end

  • #2
    Maarten:
    with such a large number of missing values, I would first wonder whether any statistical procedure makes sense (even -mi-, if feasible, would probably be questionable).

    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thanks Carlo for your observation. Would there be a way to handle this issue in order to calculate a reliable kappa?



      • #4

        Maarten:
        unfortunately, I do not think so.
        Perhaps you can try a sort of scenario approach (underlining its limitations and the likely lack of representativeness of the data-generating process):
        - calculate the kappa for complete cases only (first scenario; see the sketch below);
        - determine the mechanism underlying the missingness of your data, deal with it accordingly, and then recalculate the kappa on the completed cases (I say completed and not imputed because if your data are missing not at random you should consider other research strategies: see https://www.crcpress.com/Flexible-Im...viewContainer; a second edition has recently been released) (second scenario).
        The amount of missing values obviously plays a role in how defensible the above is.
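        For the first scenario, a minimal sketch using only official Stata commands to restrict the estimation to complete cases; the trustworth* stub is taken from the -dataex- excerpt, and with this design very few (if any) complete cases may remain:
        Code:
        * Hedged sketch of the complete-case (first) scenario.
        egen n_miss = rowmiss(trustworth*)    // number of missing ratings per profile
        kappaetc trustworth* if n_miss == 0   // keep only profiles rated by every rater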
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Hi Carlo,

          Thanks for your suggestions. The reason why the raters did not rate all the profiles is that an individual rater only rated 10 profiles from the pool of 259 profiles. That's the reason why a profile has so many missing values. On average, a profile has been rated by 5 raters (out of 189) and for the rest there are missing values. Perhaps I should arrange my data differently?



          • #6
            Maarten:
            thanks for providing further clarifications.
            If there's a study-protocol reason for the missing values (which, all in all, are not missing in the technical sense), things are more defensible.
            The issue now is: is there a way to arrange your data so that each profile has 5 ratings made by whichever raters happened to pick up the profile at random?
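            For illustration, one possible arrangement as a hedged sketch only: it assumes that rater identity can be ignored (as for Fleiss-type coefficients and Krippendorff's alpha) and that all rater variables share the trustworth* stub shown in the -dataex- excerpt.
            Code:
            * Hedged sketch: collapse the sparse rater columns into numbered rating
            * slots per profile, ignoring which rater produced which score.
            reshape long trustworth, i(profileid) j(rater)
            drop if missing(trustworth)
            bysort profileid (rater): generate slot = _n
            drop rater
            reshape wide trustworth, i(profileid) j(slot)
            rename trustworth* rating*
            * The slot variables rating1, rating2, ... can then be passed to kappaetc.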
            Last edited by Carlo Lazzaro; 29 Aug 2018, 06:12.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Thanks Carlo for pinning down the issue at hand. I guess that is the challenge now. I would welcome any suggestions on this.



              • #8
                To the discussion above, let me add that it is implicit in the documentation for kappaetc (user-written, from SSC) as well as for kappa (part of base Stata) that ratings are categorical variables. Your ratings appear to be multiples of 1/6, whereas in Stata categorical data are usually represented by integers, which ensures that rounding issues do not cause problems. It is possible, but by no means certain, that representing ratings by fractions is causing problems.

                Perhaps creating transformed ratings
                Code:
                generate tw7 = round(6*trustworth7)
                would yield different results.
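                Since there are many rater variables, the same transformation could be applied in a loop; a hedged sketch, assuming all rater variables share the trustworth* stub from the -dataex- excerpt:
                Code:
                * Hedged sketch: apply the rounding transform to every rater column.
                foreach v of varlist trustworth* {
                    generate tw_`v' = round(6*`v')
                }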

                Note also that help kappaetc tells you
                Code:
                    kappaetc also assumes that all possible rating categories are observed in the data. This
                    assumption is crucial. If some of the rating categories are not used by any of the raters, the
                    full set of conceivable ratings must be specified in the categories() option. Failing to do so
                    might produce incorrect results for all weighted agreement coefficients; Brennan and Prediger's
                    coefficient and Gwet's AC will be incorrectly estimated, even if no weights are used.
                You leave us to guess what your precise command was, but this may also be a source of concern.
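                For illustration, a hedged sketch of how the full set of conceivable ratings could be declared; the tw* stub and the 0/42 range are purely hypothetical here and must be replaced by the actual variables and rating scale:
                Code:
                * Hedged illustration: declare every conceivable rating value explicitly.
                kappaetc tw*, categories(0/42)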

                Finally, it appears that the maximum rating is at least 34/6. Let's assume the minimum is 0. That's at least 35 possible rating values. I'd expect virtually no agreement with that many possibilities, and note that "close" doesn't count in kappa.

                I'm sorry to say that your rater agreement techniques are likely inappropriate for your data, which consist of 189 raters each assigning one of 35+ values to just 10 of 259 profiles.



                • #9
                  Thanks William for your feedback. Indeed, rounding the values is something I should have done, as well as indicating the categories (1-7). After applying your solution I get a percentage agreement of 0.3 and a kappa of 0.0362. I'm just wondering what this means for my study: on the one hand, I question whether I have handled the analysis correctly (e.g. how the data are organised); on the other hand, the raters may simply have a low level of agreement. Also, given that a profile is judged by 5 raters on average, it seems to me that reaching a high percentage of agreement is not easy.



                  • #10
                    I am afraid William's conclusion in #8 might be true. However, let me comment on some statements for (hopefully) further clarification.

                    The reason why the raters did not rate all the profiles is that an individual rater only rated 10 profiles from the pool of 259 profiles. That's the reason why a profile has so many missing values
                    This sounds like it is plausible to assume that the mechanism producing the missing values is MCAR. In this case, you should not base the estimation on the complete cases only. Use all available cases (the default in kappaetc).

                    let me add that it is implicit in the documentation for kappaetc (user-written from SSC) [...] that ratings are categorical variables.
                    [...]
                    and note the "close" doesn't count in kappa .
                    There might be a slight misunderstanding here, although William's conclusion might still be correct. For agreement coefficients such as kappa, Krippendorff's alpha, etc., you would typically have categorical variables; that is, however, not required. You may (and should) use weights for (dis)agreement that are appropriate for the data's level of measurement. If you have data on the nominal scale, use identity weights, which is the same as using no weights. If your data are on an ordinal, interval or ratio scale, use ordinal, linear or quadratic, or ratio weights, respectively. These weights are designed to make "close" count where that is sensible.
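                    For illustration, hedged sketches of these weighting choices; the rnd_trustworth* variable names and the 1-7 category range follow the later posts in this thread, and the exact wgt() spellings should be checked against help kappaetc:
                    Code:
                    * Hedged sketches; pick the weights that match the level of measurement.
                    kappaetc rnd_trustworth*, categories(1/7)                 // nominal: identity weights (the default)
                    kappaetc rnd_trustworth*, categories(1/7) wgt(ordinal)    // ordinal
                    kappaetc rnd_trustworth*, categories(1/7) wgt(quadratic)  // interval
                    kappaetc rnd_trustworth*, categories(1/7) wgt(ratio)      // ratio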

                    Using weights for (dis)agreement, theoretically, makes agreement coefficients applicable to any data level of measurement. I say theoretically because it is assumed that the rating categories are predetermined, i.e., known before the rating process starts. As an example, suppose you want raters to agree on the number of trees in different pictures. The number of trees is clearly measured on a ratio scale. Yet, if you have collected all pictures, you will know the maximum and minimum of the number of trees in these pictures. Thus, the conceivable rating categories are predetermined.

                    The paragraph that William quotes from the help of kappaetc is about such predetermined rating categories. Although it might (falsely) be implied, rating "categories" do not mean categorical ratings. Whether the ratings are predetermined in Maarten's case is not clear from what he describes.

                    If ratings are not predetermined, as might often be the case for interval or ratio data, Maarten should consider using an ICC as an estimate of rater reliability. This is implemented in kappaetc, too. I have to admit, however, that I am not quite sure that the test statistics for the ICCs are valid with this huge amount of missing data; in fact, I doubt they are, but this is not entirely clear from the literature that underlies kappaetc.
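                    If the ICC route is taken, a hedged sketch only; the icc() option and its model keyword are assumptions to be verified against help kappaetc, and the appropriate model depends on the rating design:
                    Code:
                    * Hedged sketch; check -help kappaetc- for the exact icc() syntax and model choice.
                    kappaetc trustworth*, icc(random)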

                    Best
                    Daniel



                    • #11
                      Dear Daniel,

                      Thank you so much for your additional explanation. I think I'm getting close now. I indeed have ordinal categories (a 7-point Likert scale) and I have added this to the syntax, see below. I think the output, which I also added below, makes more sense now, although I have some questions regarding its interpretation and use. According to the output, there is a high percentage of agreement between the raters (i.e. 90.42%), yet the kappas are close to zero. I guess this might be due to the 'paradox of kappa', which describes how a high percentage of agreement can go together with a low kappa when raters rarely use certain rating categories. In that case, would it be better to use a different statistic, such as Gwet's AC?

                      Code:
                      kappaetc rnd_trustworth7 rnd_trustworth8, categories(1/7) wgt(ordinal)
                      Interrater agreement                       Number of subjects =      259
                      (weighted analysis)                      Ratings per subject: min =        4
                                                                                    avg =   6.7799
                                                                                    max =       17
                                                        Number of rating categories =        7
                      ------------------------------------------------------------------------------
                                           |   Coef.  Std. Err.      t    P>|t|  [95% Conf. Interval]
                      ---------------------+--------------------------------------------------------
                         Percent Agreement |  0.9042     0.0037  246.06   0.000    0.8969    0.9114
                      Brennan and Prediger |  0.6088     0.0150   40.57   0.000    0.5792    0.6383
                      Cohen/Conger's Kappa |  0.0872     0.0333    2.62   0.009    0.0216    0.1528
                       Scott/Fleiss' Kappa |  0.0899     0.0232    3.87   0.000    0.0441    0.1356
                                 Gwet's AC |  0.7087     0.0139   50.97   0.000    0.6813    0.7361
                      Krippendorff's Alpha |  0.0932     0.0207    4.50   0.000    0.0525    0.1339



                      • #12
                        Originally posted by Maarten Huurne:
                        According to the output, there is a high percentage of agreement between the raters (i.e. 90.42%), yet the kappas are close to zero. I guess this might be due to the 'paradox of kappa', which describes how a high percentage of agreement can go together with a low kappa when raters rarely use certain rating categories. In that case, would it be better to use a different statistic, such as Gwet's AC?
                        Such possible differences between agreement coefficients are one of the reasons why kappaetc reports more than one of them; the implication is to think about and discuss the underlying reasons for the observed differences, just as you have done here. Instead of picking one coefficient and reporting it, consider adding relevant references for further reading and including a brief discussion in the paper (or report, or whatever you are writing). A discussion of this sort makes your research much more transparent for others.

                        Best
                        Daniel



                        • #13
                          Thanks Daniel for your answer!
