  • Calculating inter-rater reliability/ICC for ratings of clock time

    I have three consecutive nights [variable name: night] on which three different measurement systems [variable name: system] each provided the clock time at which an event occurred (e.g., 5:42 AM, 5:43 AM, 5:42 AM) and a duration (e.g., 407 minutes, 413 minutes, 436 minutes; variable name: duration). I want to test the reliability of the systems in determining the time of the event. I converted my HH:MM time variables to decimals [decimaltime] in Excel, and I also recoded the clock times into integers in Stata using . gen double integertime = clock(time, "hm"). That doesn't seem to make a difference.
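    For reference, a minimal sketch of that conversion done entirely in Stata, assuming the raw times are strings such as "5:42" in a variable named time (the variable names follow the post; the rest is an assumption about the data layout):

    ```stata
    * Sketch: convert an HH:MM string to minutes since midnight,
    * then to a fraction of a 24-hour day
    gen double integertime = clock(time, "hm")      // milliseconds since 01jan1960
    gen double minutes     = mm(integertime) + 60*hh(integertime)
    gen double decimaltime = minutes/1440           // fraction of a day
    format integertime %tc
    ```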

    I tried computing the ICC three different ways and still have some issues with interpretation. Here are the means by system for reference:

    -----------------------------------------------------------------
               Over |       Mean   Std. Err.     [95% Conf. Interval]
    ----------------+------------------------------------------------
    decimaltime     |
            system1 |    .244397    .0041795      .23605     .252744
            system2 |   .2409406    .0041865    .2325795    .2493017
            system3 |    .242519    .0042903    .2339508    .2510873
    ----------------+------------------------------------------------
    duration        |
            system1 |   407.5909    11.99015    383.6449    431.5369
            system2 |   413.9146      7.2453    399.4447    428.3845
            system3 |   436.1818    4.518426    427.1579    445.2057
    -----------------------------------------------------------------

    #1) I combined the subject id and the repeated-measure night to create an idbynight variable. I'm pretty sure this approach is wrong because I have repeated measures.
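    (For what it's worth, the combined identifier can be built with egen's group() function; a one-line sketch, assuming the variables are named id and night:)

    ```stata
    * Sketch: one distinct value per subject-night combination
    egen idbynight = group(id night)
    ```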

    Time
    icc decimaltime system idbynight, abs

    Intraclass correlations
    Two-way random-effects model
    Absolute agreement

    Random effects: system          Number of targets =  3
    Random effects: idbynight       Number of raters  = 22

    --------------------------------------------------------------
        decimaltime |        ICC       [95% Conf. Interval]
    ----------------+---------------------------------------------
         Individual |   .0019337       -.003774    .2284833
            Average |    .040882      -.0901752    .8669374
    --------------------------------------------------------------
    F test that ICC=0.00: F(2.0, 42.0) = 1.34   Prob > F = 0.273

    Duration
    . icc duration system idbynight, abs

    Intraclass correlations
    Two-way random-effects model
    Absolute agreement

    Random effects: system          Number of targets =  3
    Random effects: idbynight       Number of raters  = 22

    --------------------------------------------------------------
           duration |        ICC       [95% Conf. Interval]
    ----------------+---------------------------------------------
         Individual |   .0998525       .0051437    .8478667
            Average |   .7093394       .1021304      .99191
    --------------------------------------------------------------
    F test that ICC=0.00: F(2.0, 42.0) = 4.58   Prob > F = 0.016



    #2) I tried a mixed model with estat icc. This approach seems the most accurate from what I've read. The results for decimaltime make sense given the mixed-model output, but the duration results seem really poor given how similar the systems were. I am also not sure how to determine F and p values from this output.

    Time
    mixed decimaltime system##night || id:

    Performing EM optimization:

    Performing gradient-based optimization:

    Iteration 0: log likelihood = 222.61294
    Iteration 1: log likelihood = 222.61294

    Computing standard errors:

    Mixed-effects ML regression                     Number of obs     =         66
    Group variable: id                              Number of groups  =          8

                                                    Obs per group:
                                                                  min =          6
                                                                  avg =        8.3
                                                                  max =          9

                                                    Wald chi2(8)      =      10.83
    Log likelihood =  222.61294                     Prob > chi2       =     0.2114

    ---------------------------------------------------------------------------------
        decimaltime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ----------------+----------------------------------------------------------------
             system |
            system2 |  -.0016926   .0032166    -0.53   0.599     -.007997    .0046118
            system3 |  -.0018227   .0032166    -0.57   0.571    -.0081272    .0044817
                    |
              night |
                  2 |   .0004009   .0033478     0.12   0.905    -.0061607    .0069626
                  3 |  -.0002163   .0033478    -0.06   0.948    -.0067779    .0063453
                    |
       system#night |
          system2#2 |   .0014446   .0047086     0.31   0.759    -.0077841    .0106733
          system2#3 |  -.0069878   .0047086    -1.48   0.138    -.0162165    .0022409
          system3#2 |  -.0002605   .0047086    -0.06   0.956    -.0094892    .0089682
          system3#3 |    .000087   .0047086     0.02   0.985    -.0091417    .0093157
                    |
              _cons |   .2437065   .0068456    35.60   0.000     .2302894    .2571236
    ---------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    id: Identity                 |
                      var(_cons) |   .0003335   .0001694      .0001232    .0009025
    -----------------------------+------------------------------------------------
                   var(Residual) |   .0000414   7.69e-06      .0000288    .0000596
    ------------------------------------------------------------------------------
    LR test vs. linear model: chibar2(01) = 111.07   Prob >= chibar2 = 0.0000

    . estat icc

    Residual intraclass correlation

    ------------------------------------------------------------------------------
                           Level |        ICC   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                              id |   .8896069   .0532137       .73589    .9588595
    ------------------------------------------------------------------------------

    Duration
    . mixed duration system##night || id:

    Performing EM optimization:

    Performing gradient-based optimization:

    Iteration 0: log likelihood = -326.76031
    Iteration 1: log likelihood = -326.76031

    Computing standard errors:

    Mixed-effects ML regression                     Number of obs     =         66
    Group variable: id                              Number of groups  =          8

                                                    Obs per group:
                                                                  min =          6
                                                                  avg =        8.3
                                                                  max =          9

                                                    Wald chi2(8)      =      15.75
    Log likelihood = -326.76031                     Prob > chi2       =     0.0461

    ------------------------------------------------------------------------------
        duration |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          system |
         system2 |   3.386726   15.46029     0.22   0.827    -26.91488    33.68834
         system3 |    25.0625   15.46029     1.62   0.105     -5.23911    55.36411
                 |
           night |
               2 |  -24.08278   16.07087    -1.50   0.134     -55.5811    7.415554
               3 |   13.57833   16.07087     0.84   0.398      -17.92     45.07666
                 |
    system#night |
       system2#2 |   21.12467   22.63155     0.93   0.351    -23.23234    65.48169
       system2#3 |  -11.89424   22.63155    -0.53   0.599    -56.25126    32.46277
       system3#2 |   21.72321   22.63155     0.96   0.337     -22.6338    66.08023
       system3#3 |  -10.63393   22.63155    -0.47   0.638    -54.99094    33.72309
                 |
           _cons |    411.125   13.48498    30.49   0.000     384.6949    437.5551
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    id: Identity                 |
                      var(_cons) |   498.6761   309.0827      147.9921    1680.346
    -----------------------------+------------------------------------------------
                   var(Residual) |   956.0821   177.5108      664.4419    1375.731
    ------------------------------------------------------------------------------
    LR test vs. linear model: chibar2(01) = 14.56   Prob >= chibar2 = 0.0001

    . estat icc

    Residual intraclass correlation

    ------------------------------------------------------------------------------
                           Level |        ICC   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                              id |   .3427897   .1485578      .1252824    .6551052
    ------------------------------------------------------------------------------
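    Regarding F and p values from the mixed models above: estat icc itself does not report a test, but a joint Wald test of the system terms can be obtained after mixed with testparm; a sketch (note this is a chi2 test of systematic mean differences between systems, which is related to, but not the same as, a test of the ICC):

    ```stata
    * After fitting, e.g.:  mixed duration system##night || id:
    testparm i.system                      // joint test of the system main effects
    testparm i.system i.system#i.night     // include the system-by-night terms
    ```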


    #3) I reshaped the data and used kappaetc. Same problem with the duration results as in #2, except that the F and p values are much easier to find.
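    The reshape step might look something like this (a sketch, assuming id, night, and system are numeric identifiers as in the post):

    ```stata
    * Sketch: one column per system, one row per subject-night
    keep id night system decimaltime duration
    reshape wide decimaltime duration, i(id night) j(system)
    ```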

    Time
    . kappaetc decimaltime* , icc(mixed) i(idbynight)

    Interrater reliability                      Number of subjects  =  22
    Two-way mixed-effects model                 Ratings per subject =   3
    ------------------------------------------------------------------------------
                   |   Coef.      F     df1     df2     P>F   [95% Conf. Interval]
    ---------------+--------------------------------------------------------------
          ICC(3,1) |  0.8744   21.89   21.00   42.00   0.000     0.7651     0.9411
    ---------------+--------------------------------------------------------------
           sigma_s |  0.0185
           sigma_e |  0.0070
    ------------------------------------------------------------------------------

    Duration
    . kappaetc duration* , icc(mixed) i(idbynight)

    Interrater reliability                      Number of subjects  =  22
    Two-way mixed-effects model                 Ratings per subject =   3
    ------------------------------------------------------------------------------
                   |   Coef.      F     df1     df2     P>F   [95% Conf. Interval]
    ---------------+--------------------------------------------------------------
          ICC(3,1) |  0.3176    2.40   21.00   42.00   0.008     0.0563     0.5925
    ---------------+--------------------------------------------------------------
           sigma_s | 22.4659
           sigma_e | 32.9276
    ------------------------------------------------------------------------------


    Am I missing something, or am I just in denial about the poor inter-rater reliability for duration? Thanks in advance.

  • #2
    Thanks for providing output and (some) code. For your next posts, try using CODE delimiters (the hash button in the menu bar) to further improve readability.

    Originally posted by Jaime Devine View Post
    2) I combined the subject id and the repeated measure night to create an idbynight variable. I'm pretty sure this way is wrong because I have repeated measures.
    At least for kappaetc (from SSC, I suppose?), that setup seems to mess up the data layout. You are supposed to have the three systems (the "raters") as variables, while the subjects (the "events") should be represented as (repeated/multiple) observations. In Stata terminology, the raters are in wide form and the repeated measurements are in long form (similar to the xt commands). The i() option would typically specify a variable that has repeated values. The output you show indicates that kappaetc does not recognize repeated measurements; if set up correctly, the header should report the "number of replications". Whether you really want repeated measurements depends on whether the "events" that occur are regarded as identical across nights/days.
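    A sketch of the intended call (the wide variable names decimaltime1-decimaltime3 are hypothetical, e.g. as produced by reshape wide):

    ```stata
    * Sketch: raters (systems) in wide form, one row per subject-night;
    * i() names the subject identifier so nights count as replications
    kappaetc decimaltime1 decimaltime2 decimaltime3 , icc(mixed) i(id)
    ```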

    Edit:

    Thinking about it, clock time can be a bit tricky. For example, 11:50 PM is arguably closer to 12:10 AM (20 minutes) than to 9 PM (nearly three hours), yet a naive numeric coding of clock time suggests the opposite. Therefore, converting to a format that accurately reflects these differences seems crucial. I understand you did that.
    Last edited by daniel klein; 11 Jun 2020, 09:35.



    • #3
      Thank you, that clarifies the problem. The number of replications I care about is id, not idbynight or night. Using i(id), my results make sense for both clock time and duration.

      I work with clock-time analyses a lot, and recoding time into a number that computers understand is always difficult. Fortunately for this analysis, the three systems all rated within a minute of each other, and the rating times didn't cross midnight. However, my ICCs may also have been wonky because "a low ICC could not only reflect the low degree of rater or measurement agreement but also relate to the lack of variability among the sampled subjects, the small number of subjects, and the small number of raters being tested" (Koo and Li, 2016). I have the event time in hours and minutes, but not seconds, so the clock-time measurement may not have been precise enough in this instance.
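      Had the times crossed midnight, one workaround would be to express each time in minutes relative to an anchor such as noon of the preceding day; a sketch (the noon anchor and the variable names are assumptions, with integertime a %tc datetime):

    ```stata
    * Sketch: without anchoring, 23:50 and 00:10 look ~23.7 hours apart.
    * Shifting times before noon forward by 24 hours puts one night's
    * times on a single continuous scale (00:10 -> 1450, 23:50 -> 1430).
    gen double minutes = mm(integertime) + 60*hh(integertime)
    gen double adjmin  = cond(minutes < 720, minutes + 1440, minutes)
    ```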

      Thanks again for the help!
