Test-retest reliability?

Joseph Coveney

#16

21 May 2020, 18:26

Originally posted by Chris Martin

Is there a way to incorporate information about the order of waves here?

Yes. Use a set of indicator variables for the wave sequence number.

For a given person a score in five successive waves could be (2,3,4,5,6) which would suggest reasonably high test-retest reliability or (2,4,6,3,4) which would not, but I'm not sure that an icc would pick up the difference here.

It will. See below.

. version 16.1

. clear *

. set seed `=strreverse("1554434")'

. quietly set obs 250

. generate int pid = _n

. generate double pid_u = rnormal()

. quietly expand 5

. bysort pid: generate byte tim = _n

. generate double out = pid_u + rnormal()

. quietly mixed out i.tim || pid: , nolrtest nolog

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                         pid |   .5293876   .0293432      .4717407    .5862609
------------------------------------------------------------------------------

. quietly bysort pid (out): replace tim = _n

. quietly mixed out i.tim || pid: , nolrtest nolog

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                         pid |   .8612001   .0123388      .8352057    .8836654
------------------------------------------------------------------------------

. exit

end of do-file
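For reference, the quantity that estat icc reports after a random-intercept model like this one is the residual intraclass correlation. The formula below is the standard definition; the symbols \sigma_u^2 (between-person variance) and \sigma_e^2 (residual variance) are introduced here for illustration and do not appear in the log. In the simulation both pid_u and the added noise are standard normal, so the true value is 0.5.

```latex
\[
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2},
\qquad
\sigma_u^2 = \sigma_e^2 = 1
\;\Rightarrow\;
\rho = \frac{1}{1 + 1} = 0.5 .
\]
```

The first estimate (.529) is close to this value. After the scores are sorted within person, the wave indicators presumably absorb much of the within-person spread, so the residual variance shrinks and the estimate rises to .861.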

Christian Huesr


#17

23 Sep 2020, 17:17

Hello statalist,

My question is very similar to Karin Jensen's. We are trying to assess the reliability of a score on radiographs. In this experiment, three raters each assigned a score to 16 different radiographs (preoperative + postoperative) at two different time points. Each radiograph may be scored from 0 to 3. My objectives are to assess test-retest reliability, intra-rater reliability, and inter-rater reliability.

I used Pearson's correlation to assess the test-retest reliability of the scores given to preoperative radiographs and, separately, of the scores given to postoperative radiographs.

pwcorr week1 week2 if operative_time =="preop"
pwcorr week1 week2 if operative_time =="postop"

To do this I needed to structure my data as follows:

id	operative_time	judge	week1	week2
1	preop	1	1	1
2	preop	1	2	1
3	preop	1	0	0
4	preop	1	0	0
1	postop	1	2	2
2	postop	1	3	2
3	postop	1	3	3
4	postop	1	3	3
1	preop	2	1	1
2	preop	2	1	1
3	preop	2	0	0
4	preop	2	0	0
1	postop	2	2	2
2	postop	2	2	2
3	postop	2	0	2
4	postop	2	0	0

I would like to use a two-way fixed-effects ICC with absolute agreement to assess inter-rater reliability. However, I run into the issue of multiple observations per target and rater, and receive an error:
multiple observations per target and rater not allowed
r(498);

To circumvent this error, I created a new ID variable, id_week, by concatenating "id" and "week" with the following commands, which produce the dataset below:
* string(id) assumes id is numeric
gen id_week = string(id)
replace id_week = id_week + "_2" if week == 2
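A minimal sketch of the restructuring described above, assuming the wide layout shown earlier (id, operative_time, judge, week1, week2). The names wk, score, and target are hypothetical and not from the post. It builds a numeric target identifier with egen's group() function instead of a string.

```stata
* Sketch only: wk, score, and target are illustrative names, not from the post.
* Assumes one row per id x operative_time x judge, with week1 and week2.
reshape long week, i(id operative_time judge) j(wk)
rename week score

* One numeric target per radiograph-by-week combination
egen target = group(id wk)

* Each target now has one score per judge within operative_time
bysort operative_time: icc score target judge, absolute
```

Whether the two weeks should be treated as separate targets is the open question in this post; the sketch only shows the mechanics of getting one observation per target and rater.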

id	week	id_week	operative_time	judge	score

3	1	3	postop	3	0
3	2	3_2	postop	1	0
6	1	6	postop	2	1
8	1	8	postop	1	2
2	2	2_2	postop	1	1
8	2	8_2	postop	2	2
5	1	5	postop	3	2
4	1	4	postop	1	0
2	2	2_2	postop	2	1
1	1	1	postop	2	1
3	1	3	postop	2	0
2	1	2	postop	3	2

The results are below.

Code:

. bys operative_time: icc score id2 judge, absolute

--------------------------------------------------------------------
-> operative_time = postop

Intraclass correlations
Two-way random-effects model
Absolute agreement

Random effects: id_week              Number of targets =        16
Random effects: judge            Number of raters  =         3

--------------------------------------------------------------
                 score |        ICC       [95% Conf. Interval]
-----------------------+--------------------------------------
            Individual |   .7692308       .5554833    .9036698
               Average |   .9090909       .7894251    .9656863
--------------------------------------------------------------
F test that
  ICC=0.00: F(15.0, 30.0) = 10.45             Prob > F = 0.000

Note: ICCs estimate correlations between individual measurements
      and between average measurements made on the same target.

--------------------------------------------------------------------
-> operative_time = preop

Intraclass correlations
Two-way random-effects model
Absolute agreement

Random effects: id_week              Number of targets =        16
Random effects: judge            Number of raters  =         3

--------------------------------------------------------------
                 score |        ICC       [95% Conf. Interval]
-----------------------+--------------------------------------
            Individual |   .0442478       -.134425    .3443669
               Average |   .1219512      -.5515625    .6117605
--------------------------------------------------------------
F test that
  ICC=0.00: F(15.0, 30.0) = 1.19              Prob > F = 0.333

Note: ICCs estimate correlations between individual measurements
      and between average measurements made on the same target.

My questions are
1) Did I correctly assess the reliability of this new scale? I am unsure if treating the same radiographs assessed at different time points as different observations is correct.
2) If I wanted a measure of internal consistency, would the following syntax suffice?

Code:

. bys operative_time: icc score id_week judge, mixed absolute

Thank you all in advance. Please let me know if I should correct my post since this is my first time posting.

-Christian


daniel klein

#18

23 Sep 2020, 21:57

I am not sure that ICC is the most appropriate way of assessing reliability when there are only four predetermined choices (0 to 3) for the score. That is, of course, up to you to decide. An alternative might be kappa-like coefficients. For the latter and also for ICC with repeated measures, see kappaetc (SSC).
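As a rough sketch of that suggestion: kappaetc is a community-contributed command on SSC that expects one variable per rater, so long data would first be reshaped wide. The variable names (score, id_week, operative_time, judge) follow the earlier post, and the choice of ordinal weights is an assumption for a graded 0 to 3 score, not something stated in the thread.

```stata
* Install the community-contributed command from SSC (once)
ssc install kappaetc

* kappaetc wants one column per rater: score1, score2, score3
keep score id_week operative_time judge
reshape wide score, i(id_week operative_time) j(judge)

* Chance-corrected agreement coefficients with ordinal weights,
* computed separately for preoperative and postoperative radiographs
kappaetc score1 score2 score3 if operative_time == "preop", wgt(ordinal)
kappaetc score1 score2 score3 if operative_time == "postop", wgt(ordinal)
```

The ICC facilities mentioned above are described in the command's help file and are not shown here.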

Alaina DeBiasi


#19

25 Feb 2021, 07:50

I have a similar, but more nuanced setup.

The same 5 raters scored 30 street segments on a number of metrics. The first time, they scored the street segments in person. The second time, they scored the same street segments virtually. So each street segment received 2 scores from every rater.

Below is an example of what my data looks like in long format.
* Method_ID : In-person vs. Virtual
* Coder_ID : the unique ID for each coder (or judge)
* Segment_ID: the street segment that each coder rated
* Litter_Medium: A count of medium-sized piles of litter.

input byte(Method_ID Coder_ID) int Segment_ID byte Litter_Medium
1 1 4783 1
1 1 6891 0
1 1 7830 0
1 1 8559 3
1 1 8997 2
1 1 9399 0
1 1 10118 0
1 1 10133 0
1 1 10621 1
1 1 11116 0
1 1 12685 0
1 1 14560 0
1 1 14939 1
1 1 15418 0
1 1 15759 1
1 1 15763 0
1 1 16077 0
1 1 17411 0
1 1 17832 0
1 1 17856 1
1 1 18098 0
1 1 18262 0
1 1 18422 0
1 1 21492 1
1 1 21496 0
1 1 22224 0
1 2 4783 3
1 2 6891 2
1 2 7830 2
1 2 8559 2
1 2 8997 0
1 2 9399 0
1 2 10118 0
1 2 10133 0
1 2 10621 1
1 2 11116 0
1 2 12685 0
1 2 14560 0
1 2 14939 2
1 2 15418 1
1 2 15759 0
1 2 15763 0
1 2 16077 0
1 2 17411 0
1 2 17832 0
1 2 17856 0
1 2 18098 0
1 2 18262 0
1 2 18422 0
1 2 21492 0
1 2 21496 0
1 2 22224 0
1 3 4783 0
1 3 6891 1
1 3 7830 1
1 3 8559 0
1 3 8997 0
1 3 9399 0
1 3 10118 0
1 3 10133 0
1 3 10621 1
1 3 11116 0
1 3 12685 0
1 3 14560 0
1 3 14939 2
1 3 15418 0
1 3 15759 0
1 3 15763 0
1 3 16077 0
1 3 17411 0
1 3 17832 0
1 3 17856 0
1 3 18098 0
1 3 18262 0
1 3 18422 0
1 3 21492 1
1 3 21496 0
1 3 22224 0
1 4 4783 0
1 4 6891 2
1 4 7830 0
1 4 8559 2
1 4 8997 3
1 4 9399 0
1 4 10118 0
1 4 10133 0
1 4 10621 1
1 4 11116 0
1 4 12685 0
1 4 14560 1
1 4 14939 2
1 4 15418 1
1 4 15759 1
1 4 15763 0
1 4 16077 1
1 4 17411 0
1 4 17832 0
1 4 17856 0
1 4 18098 0
1 4 18262 0
1 4 18422 0
1 4 21492 1
1 4 21496 0
1 4 22224 0
1 5 4783 0
1 5 6891 2
1 5 7830 1
1 5 8559 1
1 5 8997 2
1 5 9399 0
1 5 10118 0
1 5 10133 0
1 5 10621 1
1 5 11116 0
1 5 12685 0
1 5 14560 0
1 5 14939 2
1 5 15418 1
1 5 15759 1
1 5 15763 0
1 5 16077 0
1 5 17411 0
1 5 17832 0
1 5 17856 0
1 5 18098 0
1 5 18262 0
1 5 18422 0
1 5 21492 0
1 5 21496 0
1 5 22224 0
2 1 4783 1
2 1 6891 3
2 1 7830 1
2 1 8559 3
2 1 8997 0
2 1 9399 0
2 1 10118 1
2 1 10133 0
2 1 10621 0
2 1 11116 0
2 1 12685 0
2 1 14560 0
2 1 14939 0
2 1 15418 0
2 1 15759 0
2 1 15763 0
2 1 16077 0
2 1 17411 0
2 1 17832 0
2 1 17856 0
2 1 18098 0
2 1 18262 0
2 1 18422 0
2 1 21492 0
2 1 21496 0
2 1 22224 0
2 2 4783 2
2 2 6891 3
2 2 7830 1
2 2 8559 1
2 2 8997 0
2 2 9399 0
2 2 10118 0
2 2 10133 0
2 2 10621 0
2 2 11116 0
2 2 12685 0
2 2 14560 0
2 2 14939 0
2 2 15418 1
2 2 15759 0
2 2 15763 1
2 2 16077 0
2 2 17411 0
2 2 17832 0
2 2 17856 0
2 2 18098 0
2 2 18262 0
2 2 18422 0
2 2 21492 0
2 2 21496 0
2 2 22224 0
2 3 4783 1
2 3 6891 0
2 3 7830 0
2 3 8559 1
2 3 8997 0
2 3 9399 0
2 3 10118 0
2 3 10133 0
2 3 10621 0
2 3 11116 0
2 3 12685 0
2 3 14560 0
2 3 14939 0
2 3 15418 0
2 3 15759 0
2 3 15763 0
2 3 16077 0
2 3 17411 2
2 3 17832 0
2 3 17856 0
2 3 18098 0
2 3 18262 0
2 3 18422 0
2 3 21492 0
2 3 21496 0
2 3 22224 0
2 4 4783 2
2 4 6891 1
2 4 7830 0
2 4 8559 3
2 4 8997 0
2 4 9399 0
2 4 10118 0
2 4 10133 0
2 4 10621 0
2 4 11116 0
2 4 12685 0
2 4 14560 0
2 4 14939 0
2 4 15418 0
2 4 15759 0
2 4 15763 0
2 4 16077 0
2 4 17411 1
2 4 17832 0
2 4 17856 0
2 4 18098 0
2 4 18262 0
2 4 18422 0
2 4 21492 0
2 4 21496 0
2 4 22224 0
2 5 4783 0
2 5 6891 1
2 5 7830 0
2 5 8559 3
2 5 8997 1
2 5 9399 0
2 5 10118 0
2 5 10133 0
2 5 10621 0
2 5 11116 0
2 5 12685 0
2 5 14560 0
2 5 14939 0
2 5 15418 0
2 5 15759 0
2 5 15763 1
2 5 16077 0
2 5 17411 1
2 5 17832 0
2 5 17856 0
2 5 18098 0
2 5 18262 0
2 5 18422 0
2 5 21492 0
2 5 21496 0
2 5 22224 0
end

Like the first poster who prompted this discussion, I would like to calculate an ICC for test-retest reliability. However, my setup is a bit more complicated: 5 judges, with 2 responses for each segment. Following an earlier suggestion, I used Stata's mixed command. I treat Segment_ID as a random effect and include main and interaction effects for Coder_ID and Method_ID, which are fixed. I then use the postestimation command estat icc to get the ICC.

. mixed Litter_Medium Method_ID##Coder_ID || Segment_ID:,

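Performing EM optimization:

Performing gradient-based optimization:

Iteration 0:   log likelihood = -240.92771
Iteration 1:   log likelihood = -240.92771

Computing standard errors:

Mixed-effects ML regression                     Number of obs     =        260
Group variable: Segment_ID                      Number of groups  =         26

                                                Obs per group:
                                                              min =         10
                                                              avg =       10.0
                                                              max =         10

                                                Wald chi2(9)      =      12.86
Log likelihood = -240.92771                     Prob > chi2       =     0.1689

------------------------------------------------------------------------------------
     Litter_Medium |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
       2.Method_ID |  -.0769231   .1528733    -0.50   0.615    -.3765493    .2227031
                   |
          Coder_ID |
                2  |   .0769231   .1528733     0.50   0.615    -.2227031    .3765493
                3  |  -.1923077   .1528733    -1.26   0.208    -.4919339    .1073185
                4  |   .1538462   .1528733     1.01   0.314    -.1457801    .4534724
                5  |   8.88e-16   .1528733     0.00   1.000    -.2996262    .2996262
                   |
Method_ID#Coder_ID |
              2 2  |  -.0769231   .2161955    -0.36   0.722    -.5006585    .3468124
              2 3  |  -2.44e-15   .2161955    -0.00   1.000    -.4237354    .4237354
              2 4  |  -.2307692   .2161955    -1.07   0.286    -.6545047    .1929662
              2 5  |  -.0769231   .2161955    -0.36   0.722    -.5006585    .3468124
                   |
             _cons |   .4230769   .1405497     3.01   0.003     .1476046    .6985492
------------------------------------------------------------------------------------

------------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------------
Segment_ID: Identity         |
                  var(_cons) |   .2097962   .0666724      .1125353    .3911166
-----------------------------+------------------------------------------------------
               var(Residual) |   .3038133   .0280875      .2534622    .3641668
------------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 82.76         Prob >= chibar2 = 0.0000

end of do-file

. do "C:\Users\Alaina\AppData\Local\Temp\STD00000000.tmp"

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                  Segment_ID |   .4084741   .0808686      .2638106    .5709436
------------------------------------------------------------------------------

end of do-file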

I also ran the margins command on the interaction term, followed by marginsplot.

. margins Method_ID#Coder_ID

Adjusted predictions                            Number of obs     =        260

Expression   : Linear prediction, fixed portion, predict()

------------------------------------------------------------------------------------
                   |            Delta-method
                   |     Margin  Std. Err.        z   P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
Method_ID#Coder_ID |
              1 1  |   .4230769   .1405497     3.01   0.003     .1476046    .6985492
              1 2  |         .5   .1405497     3.56   0.000     .2245277    .7754723
              1 3  |   .2307692   .1405497     1.64   0.101    -.0447031    .5062415
              1 4  |   .5769231   .1405497     4.10   0.000     .3014508    .8523954
              1 5  |   .4230769   .1405497     3.01   0.003     .1476046    .6985492
              2 1  |   .3461538   .1405497     2.46   0.014     .0706816    .6216261
              2 2  |   .3461538   .1405497     2.46   0.014     .0706816    .6216261
              2 3  |   .1538462   .1405497     1.09   0.274    -.1216261    .4293184
              2 4  |   .2692308   .1405497     1.92   0.055    -.0062415    .5447031
              2 5  |   .2692308   .1405497     1.92   0.055    -.0062415    .5447031
------------------------------------------------------------------------------------

. marginsplot, x(Coder_ID)

[Figure: marginsplot output, predicted Litter_Medium by Coder_ID for each Method_ID (Capture.JPG)]
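One way to read the margins table above: with only the two factors and their interaction as fixed effects, each margin is the model's predicted mean Litter_Medium for one method-by-coder cell, built up from the fixed-effect coefficients. A worked check for two cells, using the coefficients reported in the mixed output (the notation \hat\mu is introduced here for illustration):

```latex
\begin{align*}
\text{Method 1, Coder 1:}\quad & \hat\mu_{11} = 0.4230769 \\
\text{Method 2, Coder 2:}\quad & \hat\mu_{22}
  = 0.4230769 + (-0.0769231) + 0.0769231 + (-0.0769231) = 0.3461538
\end{align*}
```

The terms in the second line are, in order, the constant, 2.Method_ID, 2.Coder_ID, and the 2#2 interaction. Both results match the corresponding rows of the margins table. Because the design is balanced, these predictions should coincide with the raw cell means, though that is not checked here.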

My Questions:

Is this the right (or best) way to calculate a test-retest ICC for the setup that I described?

The ICC provided is for Individual and Consistency. I would like an ICC for Average and Absolute, and for Individual and Absolute. I know the formulas to calculate these figures by hand. I just don't know how to get Stata to show me all of the information I need.

Lastly, what is the margins output telling me? How should I interpret this output?
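On the second question, a hedged sketch of two routes to the other ICC types. Stata's icc command reports both Individual and Average ICCs and accepts an absolute option, but it needs one observation per target and rater, so it is run within strata here. Which variable plays the "rater" role, and the value k = 10 in part (c), are assumptions made for illustration. The parameter names in part (c) are those mixed stores for log standard deviations; they can be checked by replaying the model with the coeflegend option.

```stata
* Sketch only: which variable plays the "rater" role is an analyst's choice.

* (a) Inter-rater agreement among the five coders, separately by method.
*     icc reports both Individual and Average ICCs; absolute requests
*     absolute agreement rather than consistency.
bysort Method_ID: icc Litter_Medium Segment_ID Coder_ID, absolute

* (b) In-person versus virtual agreement, separately by coder,
*     treating the two methods as fixed "raters" (hence mixed).
bysort Coder_ID: icc Litter_Medium Segment_ID Method_ID, mixed absolute

* (c) Average-measure version of the ICC from the random-intercept model
*     in the post. k = 10 ratings per segment (5 coders x 2 methods)
*     is an assumption, not something stated in the post.
quietly mixed Litter_Medium Method_ID##Coder_ID || Segment_ID:
nlcom (icc_avg: exp(2*_b[lns1_1_1:_cons]) / ///
    (exp(2*_b[lns1_1_1:_cons]) + exp(2*_b[lnsig_e:_cons])/10))
```

As a check on the logic in (c), the individual-measure ratio from the variance estimates reported above, .2097962 / (.2097962 + .3038133), reproduces the .408 shown by estat icc. Because coder and method enter as fixed effects in that model, the result in (c) is still a consistency-type ICC.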
