
  • Using a survey

    So - I am trying to decide which resident to promote, using their "Performance" as one metric (about 50% of their entire score). My idea was to normalize their survey score and then use that score in their overall total rank.

    I realized I might have a problem.

    The scoring is 1-10 from 14 reviewers. My plan was to use the median to reduce the influence of potentially biased outlier results.

    I really am trying to be fair to all these hard-working people, so getting this right is important to me. Some reviewers gave no score (because they didn't work with that person), so there is missing data. That might give me trouble with interobserver reliability. There might be issues with right censoring too?

    Slight differences might mean the difference between a job and no job.

    Dashes mean missing data. Each column is the same reviewer.
    Resident 1: 9 7 6 7 8 - 6 7 4 6 2 5 6 7
    Resident 2: 10 7 8 10 8 8 10 8 10 9 10 6 8 9
    Resident 3: 9 7 6 - 8 - 8 6 7 6 4 7 8 7
    Resident 4: - 8 8 - 7 7 4 7 3 8 - 10
    Resident 5: 8 9 7 - 9 9 10 8 10 9 8 4 5 8
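
    Roughly, the calculation I have in mind would look something like this (just a sketch; the variable names are placeholders, and the z-score normalization is only one option I'm considering):

    Code:
    * one row per resident, one variable per reviewer; dashes entered as missing (.)
    * rowmedian ignores missing values, so reviewers who gave no score simply drop out
    egen med_score = rowmedian(score1-score14)

    * normalize the medians (z-scores) before folding them into the overall rank
    egen z_med = std(med_score)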
    Am I overthinking this?

  • #2
    No, you are not overthinking it. I don't know if the data you show is just a sample, or if these are all the residents and all of their ratings. Let me assume the latter.

    Code:
    . * Example generated by -dataex-. For more info, type help dataex
    . clear
    
    . input byte(resident_id score1 score2 score3 score4 score5 score6 score7 score8 score9 score10 score11 score12 score13 score14)
    
         reside~d    score1    score2    score3    score4    score5    score6    score7    score8    score9   score10   score11   score12   score13   score14
      1. 1  9 7 6  7 8 .  6 7  4 6  2  5 6 7
      2. 2 10 7 8 10 8 8 10 8 10 9 10  6 8 9
      3. 3  9 7 6  . 8 .  8 6  7 6  4  7 8 7
      4. 4  . 8 8  . 7 7  4 7  3 8  . 10 . .
      5. 5  8 9 7  . 9 9 10 8 10 9  8  4 5 8
      6. end
    
    .
    . reshape long score, i(resident_id) j(rater)
    (j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
    
    Data                               Wide   ->   Long
    -----------------------------------------------------------------------------
    Number of observations                5   ->   70          
    Number of variables                  15   ->   3           
    j variable (14 values)                    ->   rater
    xij variables:
                  score1 score2 ... score14   ->   score
    -----------------------------------------------------------------------------
    
    .
    . mixed score || resident_id:
    
    Performing EM optimization ...
    
    Performing gradient-based optimization:
    Iteration 0:   log likelihood = -120.32521  
    Iteration 1:   log likelihood = -120.32521  
    
    Computing standard errors ...
    
    Mixed-effects ML regression                     Number of obs     =         61
    Group variable: resident_id                     Number of groups  =          5
                                                    Obs per group:
                                                                  min =          9
                                                                  avg =       12.2
                                                                  max =         14
                                                    Wald chi2(0)      =          .
    Log likelihood = -120.32521                     Prob > chi2       =          .
    
    ------------------------------------------------------------------------------
           score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           _cons |   7.336447   .4056034    18.09   0.000     6.541479    8.131416
    ------------------------------------------------------------------------------
    
    ------------------------------------------------------------------------------
      Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
    -----------------------------+------------------------------------------------
    resident_id: Identity        |
                      var(_cons) |   .5958305   .5095361      .1114811    3.184523
    -----------------------------+------------------------------------------------
                   var(Residual) |   2.721216   .5132214      1.880296    3.938217
    ------------------------------------------------------------------------------
    LR test vs. linear model: chibar2(01) = 6.20          Prob >= chibar2 = 0.0064
    
    . estat icc
    
    Intraclass correlation
    
    ------------------------------------------------------------------------------
                           Level |        ICC   Std. err.     [95% conf. interval]
    -----------------------------+------------------------------------------------
                     resident_id |   .1796268   .1309354      .0369559     .555426
    ------------------------------------------------------------------------------
    That final result, ICC = 0.18 (95% CI 0.04-0.56), all rounded to 2 decimal places, is a measure of inter-rater agreement. The 0.18 value tells us that the scores are much more about the rater than about the resident being rated. Only 18% of the variance in ratings is actually due to differences among the residents. The rest is due to disagreement among the raters about those residents.
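
    To make the arithmetic explicit, the ICC reported by -estat icc- here is just the between-resident variance divided by the total variance from the random-effects table above, so you can reproduce the 0.1796 by hand:

    Code:
    * ICC = var(_cons) / (var(_cons) + var(Residual))
    display .5958305 / (.5958305 + 2.721216)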

    If the data you provided is just a sample, then you should carry out similar calculations with the full data.

    An evaluation with an ICC of 0.18 is really not suitable for decisions with important consequences. If you use it at all, it should account only for a very tiny percentage of their total score, nothing near 50%. I think different analysts would give you different advice about how high the ICC needs to be to be useful for your purposes, or might suggest other ways of looking at the data altogether, but I think everyone would agree that an ICC of 0.18 does not even come close to cutting it.

    While it is admirable to try to use data to build a merit-based system, it is important that the data be good quality. Using a rating system with this poor inter-rater agreement would just be a lottery masquerading as a merit-based system. If there is not enough useful information to truly support a merit-based evaluation, a lottery may well be the appropriate way to go. But in that case, make it an explicit, fully-random lottery.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
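
    For example, assuming you keep the wide layout shown above (one variable per reviewer), posting the data would amount to:

    Code:
    ssc install dataex
    dataex resident_id score1-score14

    and then copying the output, delimiters and all, into your post.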



    • #3
      Wow. Yes! This is the total data set! It's amazing … the other data we commonly use is also problematic (transcripts from various medical schools with varying grading methods, and test scores that can rank test-taking but not performance). I've tried so hard to be objective, but it seems it's better to recognize that might be impossible.
