  • Calculating inter-rater reliability/ICC for ratings of clock time

    I have three consecutive nights [variable name: night] on which three different measurement systems [variable name: system] each provided the clock time at which an event occurred (e.g., 5:42 AM, 5:43 AM, 5:42 AM) and a duration (e.g., 407 minutes, 413 minutes, 436 minutes; variable name: duration). I want to test the reliability of the systems in determining the time of the event. I converted my HH:MM time variables to decimals [decimaltime] in Excel, and I also recoded the clock times into integers in Stata using . gen double integertime = clock(time, "hm"). That doesn't seem to make a difference.
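    For reference, a minimal sketch of that conversion done entirely in Stata, assuming the raw times are strings such as "5:42" in a variable named time (the variable names follow the post; the rest is an assumption about the data layout):

    ```stata
    * Sketch: convert an HH:MM string to minutes since midnight,
    * then to a fraction of a 24-hour day
    gen double integertime = clock(time, "hm")      // milliseconds since 01jan1960
    gen double minutes     = mm(integertime) + 60*hh(integertime)
    gen double decimaltime = minutes/1440           // fraction of a day
    format integertime %tc
    ```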

    I tried computing the ICC three different ways and still have some issues with interpretation. Here are the means by system for reference:

    -----------------------------------------------------------------
               Over |       Mean   Std. Err.     [95% Conf. Interval]
    ----------------+------------------------------------------------
    decimaltime     |
            system1 |    .244397    .0041795      .23605     .252744
            system2 |   .2409406    .0041865    .2325795    .2493017
            system3 |    .242519    .0042903    .2339508    .2510873
    ----------------+------------------------------------------------
    duration        |
            system1 |   407.5909    11.99015    383.6449    431.5369
            system2 |   413.9146      7.2453    399.4447    428.3845
            system3 |   436.1818    4.518426    427.1579    445.2057
    -----------------------------------------------------------------

    #1) I combined the subject id and the repeated-measure night to create an idbynight variable. I'm pretty sure this approach is wrong because I have repeated measures.
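    (For what it's worth, the combined identifier can be built with egen's group() function; a one-line sketch, assuming the variables are named id and night:)

    ```stata
    * Sketch: one distinct value per subject-night combination
    egen idbynight = group(id night)
    ```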

    Time
    icc decimaltime system idbynight, abs

    Intraclass correlations
    Two-way random-effects model
    Absolute agreement

    Random effects: system          Number of targets =  3
    Random effects: idbynight       Number of raters  = 22

    --------------------------------------------------------------
        decimaltime |        ICC       [95% Conf. Interval]
    ----------------+---------------------------------------------
         Individual |   .0019337       -.003774    .2284833
            Average |    .040882      -.0901752    .8669374
    --------------------------------------------------------------
    F test that ICC=0.00: F(2.0, 42.0) = 1.34   Prob > F = 0.273

    Duration
    . icc duration system idbynight, abs

    Intraclass correlations
    Two-way random-effects model
    Absolute agreement

    Random effects: system          Number of targets =  3
    Random effects: idbynight       Number of raters  = 22

    --------------------------------------------------------------
           duration |        ICC       [95% Conf. Interval]
    ----------------+---------------------------------------------
         Individual |   .0998525       .0051437    .8478667
            Average |   .7093394       .1021304      .99191
    --------------------------------------------------------------
    F test that ICC=0.00: F(2.0, 42.0) = 4.58   Prob > F = 0.016



    #2) I tried a mixed model with estat icc. This approach seems the most accurate from what I've read. The results for decimaltime make sense given the mixed-model output, but the duration results seem really poor given how similar the systems were. I am also not sure how to determine F and p values from this output.

    Time
    mixed decimaltime system##night || id:

    Performing EM optimization:

    Performing gradient-based optimization:

    Iteration 0: log likelihood = 222.61294
    Iteration 1: log likelihood = 222.61294

    Computing standard errors:

    Mixed-effects ML regression                     Number of obs     =         66
    Group variable: id                              Number of groups  =          8

                                                    Obs per group:
                                                                  min =          6
                                                                  avg =        8.3
                                                                  max =          9

                                                    Wald chi2(8)      =      10.83
    Log likelihood =  222.61294                     Prob > chi2       =     0.2114

    ---------------------------------------------------------------------------------
        decimaltime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ----------------+----------------------------------------------------------------
             system |
            system2 |  -.0016926   .0032166    -0.53   0.599     -.007997    .0046118
            system3 |  -.0018227   .0032166    -0.57   0.571    -.0081272    .0044817
                    |
              night |
                  2 |   .0004009   .0033478     0.12   0.905    -.0061607    .0069626
                  3 |  -.0002163   .0033478    -0.06   0.948    -.0067779    .0063453
                    |
       system#night |
          system2#2 |   .0014446   .0047086     0.31   0.759    -.0077841    .0106733
          system2#3 |  -.0069878   .0047086    -1.48   0.138    -.0162165    .0022409
          system3#2 |  -.0002605   .0047086    -0.06   0.956    -.0094892    .0089682
          system3#3 |    .000087   .0047086     0.02   0.985    -.0091417    .0093157
                    |
              _cons |   .2437065   .0068456    35.60   0.000     .2302894    .2571236
    ---------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    id: Identity                 |
                      var(_cons) |   .0003335   .0001694      .0001232    .0009025
    -----------------------------+------------------------------------------------
                   var(Residual) |   .0000414   7.69e-06      .0000288    .0000596
    ------------------------------------------------------------------------------
    LR test vs. linear model: chibar2(01) = 111.07   Prob >= chibar2 = 0.0000

    . estat icc

    Residual intraclass correlation

    ------------------------------------------------------------------------------
                           Level |        ICC   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                              id |   .8896069   .0532137       .73589    .9588595
    ------------------------------------------------------------------------------

    Duration
    . mixed duration system##night || id:

    Performing EM optimization:

    Performing gradient-based optimization:

    Iteration 0: log likelihood = -326.76031
    Iteration 1: log likelihood = -326.76031

    Computing standard errors:

    Mixed-effects ML regression                     Number of obs     =         66
    Group variable: id                              Number of groups  =          8

                                                    Obs per group:
                                                                  min =          6
                                                                  avg =        8.3
                                                                  max =          9

                                                    Wald chi2(8)      =      15.75
    Log likelihood = -326.76031                     Prob > chi2       =     0.0461

    ------------------------------------------------------------------------------
        duration |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          system |
         system2 |   3.386726   15.46029     0.22   0.827    -26.91488    33.68834
         system3 |    25.0625   15.46029     1.62   0.105     -5.23911    55.36411
                 |
           night |
               2 |  -24.08278   16.07087    -1.50   0.134     -55.5811    7.415554
               3 |   13.57833   16.07087     0.84   0.398      -17.92     45.07666
                 |
    system#night |
       system2#2 |   21.12467   22.63155     0.93   0.351    -23.23234    65.48169
       system2#3 |  -11.89424   22.63155    -0.53   0.599    -56.25126    32.46277
       system3#2 |   21.72321   22.63155     0.96   0.337     -22.6338    66.08023
       system3#3 |  -10.63393   22.63155    -0.47   0.638    -54.99094    33.72309
                 |
           _cons |    411.125   13.48498    30.49   0.000     384.6949    437.5551
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    id: Identity                 |
                      var(_cons) |   498.6761   309.0827      147.9921    1680.346
    -----------------------------+------------------------------------------------
                   var(Residual) |   956.0821   177.5108      664.4419    1375.731
    ------------------------------------------------------------------------------
    LR test vs. linear model: chibar2(01) = 14.56   Prob >= chibar2 = 0.0001

    . estat icc

    Residual intraclass correlation

    ------------------------------------------------------------------------------
                           Level |        ICC   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                              id |   .3427897   .1485578      .1252824    .6551052
    ------------------------------------------------------------------------------
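    Regarding F and p values from the mixed models above: estat icc itself does not report a test, but a joint Wald test of the system terms can be obtained after mixed with testparm; a sketch (note this is a chi2 test of systematic mean differences between systems, which is related to, but not the same as, a test of the ICC):

    ```stata
    * After fitting, e.g.:  mixed duration system##night || id:
    testparm i.system                      // joint test of the system main effects
    testparm i.system i.system#i.night     // include the system-by-night terms
    ```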


    #3) I reshaped the data and used kappaetc. Same problem with the duration results as in #2, except that the F and p values are much easier to find.
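    The reshape step might look something like this (a sketch, assuming id, night, and system are numeric identifiers as in the post):

    ```stata
    * Sketch: one column per system, one row per subject-night
    keep id night system decimaltime duration
    reshape wide decimaltime duration, i(id night) j(system)
    ```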

    Time
    . kappaetc decimaltime* , icc(mixed) i(idbynight)

    Interrater reliability                      Number of subjects  =  22
    Two-way mixed-effects model                 Ratings per subject =   3
    ------------------------------------------------------------------------------
                   |   Coef.      F     df1     df2     P>F   [95% Conf. Interval]
    ---------------+--------------------------------------------------------------
          ICC(3,1) |  0.8744   21.89   21.00   42.00   0.000     0.7651     0.9411
    ---------------+--------------------------------------------------------------
           sigma_s |  0.0185
           sigma_e |  0.0070
    ------------------------------------------------------------------------------

    Duration
    . kappaetc duration* , icc(mixed) i(idbynight)

    Interrater reliability                      Number of subjects  =  22
    Two-way mixed-effects model                 Ratings per subject =   3
    ------------------------------------------------------------------------------
                   |   Coef.      F     df1     df2     P>F   [95% Conf. Interval]
    ---------------+--------------------------------------------------------------
          ICC(3,1) |  0.3176    2.40   21.00   42.00   0.008     0.0563     0.5925
    ---------------+--------------------------------------------------------------
           sigma_s | 22.4659
           sigma_e | 32.9276
    ------------------------------------------------------------------------------


    Am I missing something, or am I just in denial about the poor inter-rater reliability for duration? Thanks in advance.

  • #2
    Thanks for providing output and (some) code. For your next posts, try using CODE delimiters (the hash button in the menu bar) to further improve readability.

    Originally posted by Jaime Devine View Post
    2) I combined the subject id and the repeated measure night to create an idbynight variable. I'm pretty sure this way is wrong because I have repeated measures.
    At least for kappaetc (from SSC, I suppose?), that setup seems to mess up the data layout. You are supposed to have the three systems (the "raters") as variables, while the subjects (the "events") should be represented as (repeated/multiple) observations. In Stata terminology, the raters are in wide form and the repeated measurements are in long form (similar to the xt commands). The i() option would typically specify a variable that has repeated values. The output you show indicates that kappaetc does not recognize repeated measurements; if set up correctly, the header should report the "number of replications". Whether you really want repeated measurements depends on whether the "events" that occur are regarded as identical across nights/days.
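    A sketch of the intended call (the wide variable names decimaltime1-decimaltime3 are hypothetical, e.g. as produced by reshape wide):

    ```stata
    * Sketch: raters (systems) in wide form, one row per subject-night;
    * i() names the subject identifier so nights count as replications
    kappaetc decimaltime1 decimaltime2 decimaltime3 , icc(mixed) i(id)
    ```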

    Edit:

    Thinking about it, clock time can be a bit tricky. For example, 11:50 PM is arguably closer to 12:10 AM (20 minutes) than to 9 PM (nearly three hours), yet a naive numeric coding of clock time suggests the opposite. Therefore, converting to a format that accurately reflects these differences seems crucial. I understand you did that.
    Last edited by daniel klein; 11 Jun 2020, 09:35.



    • #3
      Thank you, that clarifies the problem. The number of replications I care about is id, not idbynight or night. Using i(id), my results make sense for both clock time and duration.

      I work with clock-time analyses a lot, and recoding time into a number that computers understand is always difficult. Fortunately for this analysis, the three systems all rated within a minute of each other, and the rating times didn't cross midnight. However, my ICCs may also have been wonky because "a low ICC could not only reflect the low degree of rater or measurement agreement but also relate to the lack of variability among the sampled subjects, the small number of subjects, and the small number of raters being tested" (Koo and Li, 2016). I have the event time in hours and minutes, but not seconds, so the clock-time measurement may not have been precise enough in this instance.
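      Had the times crossed midnight, one workaround would be to express each time in minutes relative to an anchor such as noon of the preceding day; a sketch (the noon anchor and the variable names are assumptions, with integertime a %tc datetime):

    ```stata
    * Sketch: without anchoring, 23:50 and 00:10 look ~23.7 hours apart.
    * Shifting times before noon forward by 24 hours puts one night's
    * times on a single continuous scale (00:10 -> 1450, 23:50 -> 1430).
    gen double minutes = mm(integertime) + 60*hh(integertime)
    gen double adjmin  = cond(minutes < 720, minutes + 1440, minutes)
    ```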

      Thanks again for the help!
