
  • Calculating Fleiss Kappa


    Each subject represents a rater. I want to know the agreement among the raters for each test. Why am I getting negative values for Fleiss' kappa for each of the 9 tests? The score for each test is between 1 and 9.
    [two screenshots attached]

  • #2
    Originally posted by Lucy Kay
    Each subject represents a rater. [...]
    Why am I getting negatives for the Fleiss' kappa for each of the 9
    tests?
    Both kap and kappa want observations to represent subjects to be rated. The variables either hold the ratings (kap) or the number of raters that have classified the subject into a category (kappa).

    I am not entirely sure about your setup because I am having difficulties mapping the terms "raters" and "tests" to the usual terms "subjects" and "raters". It appears as if you are using the term raters to denote what is usually referred to as subjects (as also indicated by the variable name). Subjects are classified into categories or assigned scores, sometimes by (human) raters/coders/judges and sometimes by (medical/psychological) tests. Thus, the term test would usually denote the raters/coders/judges.

    Depending on what exactly the raters and tests are in your data, you might need to xpose. If your data records ratings/test-scores, you want kap, not kappa.
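
    For illustration, here is a minimal sketch of the two layouts; the data, the 1-3 scale, and the names rater1-rater3 and cat1-cat3 are made up for this example:

    Code:
    // hypothetical example: 4 subjects rated by 3 raters on a 1-3 scale
    clear
    input subject rater1 rater2 rater3
    1 1 1 2
    2 3 3 3
    3 2 1 2
    4 1 1 1
    end
    
    // layout for kap: one observation per subject, one rating variable per rater
    kap rater1-rater3
    
    // layout for kappa: one variable per category holding the number of raters
    // that placed the subject into that category
    egen cat1 = anycount(rater1-rater3) , values(1)
    egen cat2 = anycount(rater1-rater3) , values(2)
    egen cat3 = anycount(rater1-rater3) , values(3)
    kappa cat1-cat3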

    For your next posting, please review the FAQ, especially regarding screen-shots and how to present data and code.



    • #3
      Hi, you are correct in that I mean "tests" when I say subjects and "subjects" when I say test1, test2, etc. Could you expand on when I would use "xpose" and how I could apply it to my dataset in practice? I will change kappa to kap, since my data records ratings, and I have also reviewed the FAQ. Thank you for your time!

      Edit: I tried this sequence, but the test1-test9 variables are no longer found:

      Code:
      . xpose, clear varname
      
      . kap test1-test9
      variable test1 not found
      r(111);
      Last edited by Lucy Kay; 02 Jun 2020, 10:13.



      • #4
        Just type:

        Code:
        help xpose
        Best regards,

        Marcos



        • #5
          Also, check the name of the variables after xpose.
          Best regards,

          Marcos



          • #6
            I tried the xpose, clear varname code to convert my data; subject1-subject9 now show up as variables in my do-file, but the variable subject1 is still not recognized. Also, I am not really sure what the purpose of the xpose code is in this case. For the record, I am trying to calculate Fleiss' kappa to find inter-rater agreement for each of my 9 tests; my data has 55 raters scoring 9 tests on an integer scale from 1 to 9.

            [screenshot attached]
            Last edited by Lucy Kay; 02 Jun 2020, 11:45.



            • #7
              I believe adding a little context might help to further clarify the data setup. I will stick with a slightly modified version of the first example for Stata's kap command. Suppose that three radiologists have classified 5 xeromammograms into one of four categories: normal, benign disease, suspicion, cancer. As I understand it, you would have the xeromammograms as variables and the radiologists as observations. kap expects the setup the other way round.

              Here is what (I think) your data looks like:

              Code:
              // Step 0: example data
              clear
              input rad xm1 xm2 xm3 xm4 xm5
              1 1 4 3 2 1
              2 2 4 2 2 1
              3 1 3 3 1 1
              end
              
              // here is the dataset
              list
              To get the data into shape, we use reshape twice:

              Code:
              // Step 1: get into shape
              reshape long xm , i(rad)
              
              // look what this has done
              list
              
              // now adjust the names
              rename (rad _j xm) (_j xm rad)
              
              // and look again
              list
              
              // now reshape back
              reshape wide rad , i(xm) j(_j)
              
              // and look at the final result
              list
              We could have replaced the two reshape commands with one xpose command. However, I believe that the reshape approach is more instructive. Also, you will use reshape much more often than xpose.
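
              Purely as a sketch, the xpose route could look like this, run on the Step 0 example data (reload it first). Note that xpose names the transposed variables v1, v2, ..., which is likely also why test1 was not found after xpose in #3; the renames below are only illustrative:

              Code:
               // alternative to Step 1: transpose with xpose
               xpose , clear varname
               
               // the old variable names are kept in the string variable _varname;
               // drop the row that held the radiologist identifier
               drop if _varname == "rad"
               
               // rename the transposed variables to match the reshape result
               rename (v1 v2 v3) (rad1 rad2 rad3)
               rename _varname xm
               
               // compare with the final result of the reshape approach
               // (here the xm column is a string, e.g. "xm1", rather than a number)
               list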

              Now, calculate the kappa coefficient

              Code:
              // Step 2: calculate kappa
              kap rad1-rad3
               Run the code snippets above in order and try to follow along.

               A couple of additional thoughts: with 9 rating categories, you might want to use weights for (dis-)agreement. I believe that Stata's kap command will not let you use weights with more than two raters; at least older releases had that limitation. If you want weighted agreement coefficients, download kappaetc from SSC. For example, linear weights could be applied by typing

              Code:
              ssc install kappaetc
              kappaetc rad1-rad3 , wgt(linear)
              where, obviously, you type the first line only once.

              Last, please stop posting screenshots. Review the FAQ and use [CODE] delimiters to show code and output, as I have done above.
              Last edited by daniel klein; 02 Jun 2020, 12:10. Reason: formatting



              • #8
                Thanks, I tried what you said and have gotten a combined kappa value and a kappa value for each of the 9 tests in my questionnaire. Do you know why I am getting negative kappa values for 3 of the 9 tests? Is this expected (I have read that a negative kappa represents agreement worse than expected, i.e., disagreement), or is there an error in my calculations in Stata?

                Edit: Never mind, apparently kappa values can range from -1 to 1, so this is normal.
                Last edited by Lucy Kay; 02 Jun 2020, 13:20.



                • #9
                  Originally posted by Lucy Kay
                  [...]
                  I have read that a negative kappa represents agreement worse than expected
                  Yes, that follows from the mathematical definition:

                  \[
                  \kappa = \frac{p_o-p_e}{1-p_e}
                  \]

                  where \(p_o\) is observed agreement and \(p_e\) is expected agreement.
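
                   As a quick numeric illustration (the numbers are made up): with \(p_o = 0.30\) and \(p_e = 0.40\),

                   Code:
                   display (0.30 - 0.40) / (1 - 0.40)
                   returns about -0.17; observed agreement below chance-expected agreement yields a negative kappa.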

                  I [...] have gotten [...] a kappa value for each of the 9 tests in my questionnaire
                  No, you have not. The individual kappa values do not refer to the 'tests' but to the rating categories! More precisely, the values represent the kappa values that would be obtained if all categories except one were combined. You might get confused here because the number of rating categories happens to be the same as the number of 'tests' in your data.
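
                   If you want to check this for one category, a quick sketch (using the radiologist example from #7, so the names rad1-rad3 are assumed) is to dichotomize the ratings into "category 1" versus "any other category" and re-run kap; the resulting kappa should match the row that kap rad1-rad3 reports for outcome 1:

                   Code:
                   // dichotomize: category 1 vs. all other categories
                   foreach v of varlist rad1-rad3 {
                       generate byte `v'_bin = (`v' == 1)
                   }
                   
                   // compare with the outcome-1 row of: kap rad1-rad3
                   kap rad1_bin rad2_bin rad3_bin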
                  Last edited by daniel klein; 02 Jun 2020, 13:33.



                  • #10
                    But I wanted the agreement for the 55 raters for each of the 9 tests, not for each rating category! I don't think the latter gives me any important information.

                    What do you think about just using this code with my original dataset to get the agreement for the 55 raters for each of the 9 tests:
                    Code:
                    kap test1-test9
                    Last edited by Lucy Kay; 02 Jun 2020, 14:49.



                    • #11
                      Technically, agreement is calculated row-wise. Thus, sticking with your initial attempt would give you something like the average intra-rater agreement, i.e., it would measure how much a rater's score on test1 agrees with the same rater's scores on test2 to test9.
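
                      To see this with the radiologist example from #7: run on the original, untransposed Step 0 data (one observation per radiologist, one variable per xeromammogram), kap treats the radiologists as subjects and the xeromammograms as raters, so it measures agreement across images within a radiologist rather than across radiologists within an image:

                      Code:
                      // on the untransposed Step 0 data: each row is a radiologist
                      kap xm1-xm5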

                      I think we need more context here. What or who are the raters and what are the tests? Tell us a bit more about the substantive question that you are trying to answer.



                      • #12
                        The raters are 55 experts who were asked to rate the importance of 9 health outcomes on an integer scale of 1-9. I want to figure out how much the raters' scores agree with each other for each of the 9 health outcomes.

                        My study is exploratory so I am just trying to get some basic statistics on my survey results. I think looking at agreement between raters is important to see how much we can trust the median for each health outcome.



                        • #13
                          I do not understand the part about the median of the health outcomes, but that might not be important here.

                          I believe that most concepts of chance-agreement are not really applicable to the case of only one subject being rated. For example, Fleiss bases his notion of chance-agreement on the frequencies with which the rating categories are used by the raters. If there is only one subject, there is no intra-rater variation in the rating categories. Mathematically, there are only two possible values for Fleiss' kappa in this situation: the upper bound is 0, and it is reached if all raters choose one category. In that case, observed agreement equals expected agreement. Both kap and kappaetc will exit with error in that case because there is no variation in the data whatsoever; I forgot how to get the lower bound of kappa and cannot derive it quickly now.

                          However, assume you have 9 observations, representing your health outcomes, and a variable, subject, that is numbered 1, 2, ..., 9. Further, the expert ratings are held in variables rater1, rater2, ..., rater55. You can then get the lower kappa value by typing

                          Code:
                          bysort subject , rc0 : kappaetc rater1-rater55
                          This code will give you a kappa value for each subject (test). As stated above, I do not believe this is useful information because you could derive the two possible values based on the number of subjects, raters, and rating categories without even observing a single rating. If you really wanted this, you should probably either report observed agreement or the Brennan and Prediger coefficient (which you may call PABAK if you like that better).
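
                           If you do go the observed-agreement route, a rough sketch for computing it per subject by hand might look as follows; it assumes the layout described above, with the ratings held in rater1-rater55 on a 1-9 scale:

                           Code:
                           // observed agreement per subject = agreeing rater pairs / all rater pairs
                           egen nrated = rownonmiss(rater1-rater55)
                           generate double p_o = 0
                           forvalues k = 1/9 {
                               egen nk = anycount(rater1-rater55) , values(`k')
                               replace p_o = p_o + max(0, comb(nk, 2))
                               drop nk
                           }
                           replace p_o = p_o / comb(nrated, 2)
                           list subject p_o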

                          However, typically, your interest would be in the classification process as a whole.
                          Last edited by daniel klein; 02 Jun 2020, 16:45. Reason: removed incorrect calculation



                          • #14
                            I could not let go. I am sure there is a simpler way, but here is the lower limit for observed agreement:

                            Code:
                            // number of raters
                            local R 55
                            
                            // number of categories
                            local C 9
                            
                            // ---------------------------------------------------------------------
                            
                             // minimum number of raters per category when the raters are spread as evenly as possible
                             local full_categories = floor(`R'/`C')
                             // number of categories that receive one extra rater (the leftover raters)
                             local left_categories = mod(`R', `C')
                             // agreeing pairs within categories holding the minimum number of raters
                             local full = (`C'-`left_categories')*max(0, comb(`full_categories', 2))
                             // agreeing pairs within categories holding one extra rater
                             local left = `left_categories'*max(0, comb((`full_categories'+1), 2))
                            // fraction of total
                            display (`full'+`left') / comb(`R', 2)
                            which yields

                            Code:
                            . // fraction of total
                            . display (`full'+`left') / comb(`R', 2)
                            .09494949
                            meaning that with 55 raters choosing among 9 rating categories, it is not possible to observe less than about 10 percent agreement.



                            • #15
                              Why does Stata only show me 4 outcomes (i.e., 1, 2, 3, 7) when I have 9 health outcomes in my survey? I want to see the kappa for agreement among the 55 expert raters for each of the 9 health outcomes.

                              Data [there are 55 tests (55 experts in my survey) and 9 subjects (9 health outcomes being rated); I have only pasted a sample of the data]:
                              Code:
                              subject    test1    test2    test3    test4    test5
                              1    2    3    2    3    2
                              2    2    2    2    2    3
                              3    2    2    2    2    3
                              Code I used:
                              Code:
                              kap test1-test55
                              Stata output:
                              Code:
                              There are 55 raters per subject:
                              
                                       Outcome |    Kappa          Z     Prob>Z
                              -----------------+-------------------------------
                                             1 |    0.0583       6.74    0.0000
                                             2 |    0.0060       0.69    0.2447
                                             3 |    0.0742       8.57    0.0000
                                             7 |   -0.0020      -0.23    0.5925
                              -----------------+-------------------------------
                                      combined |    0.0440       6.65    0.0000

