
  • Comparing two devices on performance

    I have data collected from two devices. Device 1 is the traditional way of identifying accidents and fatalities from a database. We are trying to show that device 2 is better. After device 1 is done with its analysis, we feed the data into device 2 to see what device 1 has missed, so both of the device 2 columns contain only what was missed by device 1. Is there a way to show by what percentage device 2 improves the identification of accidents and fatalities? It could be that device 2 also missed some, so can I add up the columns for device 1 and device 2 and use that sum as the denominator?

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id device1accidents device2accidents device1fatalities device2fatalities)
      1  5 3 1 3
      2  4 1 2 2
      3  4 1 0 2
      4  5 1 0 0
      5  5 2 1 1
      6  3 3 1 1
      7  5 2 0 1
      8  4 2 1 4
      9  6 3 0 1
     10  7 2 1 0
     11  2 4 1 0
     12  5 4 0 2
     13  5 1 3 1
     14  6 2 1 2
     15  4 1 0 0
     16  6 2 1 2
     17  6 3 2 0
     18  1 2 2 2
     19  7 4 2 1
     20  2 3 1 2
     21 11 4 1 1
     22  4 1 2 0
     23  3 1 2 2
     24  8 1 2 0
     25  3 2 1 0
     26  3 2 4 0
     27  5 5 1 1
     28  8 5 1 3
     29  3 2 1 4
     30  2 2 0 1
     31  5 3 0 0
     32  3 2 1 3
     33  8 1 1 0
     34 10 2 0 0
     35  3 5 2 1
     36  2 5 2 0
     37  5 3 0 1
     38  7 4 0 1
     39  6 1 1 1
     40  6 3 3 1
     41  2 4 1 0
     42  4 3 1 1
     43  9 3 1 1
     44  7 3 3 3
     45 11 3 1 1
     46  8 5 0 0
     47  3 3 0 1
     48  2 5 0 1
     49  3 2 0 2
     50  4 2 0 0
     51  5 5 0 1
     52  5 3 0 0
     53  4 3 0 0
     54  3 3 2 0
     55  8 6 1 1
     56  2 6 2 2
     57  1 3 1 0
     58  4 5 2 0
     59  5 2 0 0
     60  4 4 1 2
     61  3 3 3 1
     62  4 2 0 2
     63  8 4 0 2
     64  2 1 0 1
     65  2 3 0 3
     66  2 1 1 1
     67  7 3 0 1
     68  6 3 4 0
     69  7 1 1 0
     70  6 2 1 0
     71  7 4 1 1
     72  6 2 0 1
     73  4 5 1 0
     74  2 0 1 2
     75  3 3 1 2
     76  4 2 0 3
     77  7 0 5 1
     78  4 2 0 0
     79  3 2 1 0
     80  2 4 0 2
     81  6 2 1 1
     82  3 1 0 0
     83  5 1 0 1
     84  9 0 1 0
     85  6 2 2 2
     86  6 4 2 1
     87  9 3 2 2
     88  4 5 4 0
     89  9 5 3 2
     90  2 3 2 2
     91 10 4 2 0
     92  6 1 1 2
     93  9 3 0 0
     94  4 4 1 1
     95  1 4 2 2
     96  6 3 0 2
     97  8 2 0 0
     98  6 2 3 1
     99  2 3 1 2
    100  3 3 1 1
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 100 observations
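
    For completeness, here is a minimal Stata sketch of the calculation asked about above: treating the sum of the device 1 and device 2 counts as the denominator and computing the share attributable to device 2. The variable names match the -dataex- example; as the replies below explain, this calculation does not by itself establish that one device performs better.

    Code:
    * Sketch only: share of all identified accidents (device 1 + device 2)
    * that were picked up by device 2, per the calculation described above
    generate total_accidents = device1accidents + device2accidents
    generate pct_device2_accidents = 100 * device2accidents / total_accidents
    generate total_fatalities = device1fatalities + device2fatalities
    generate pct_device2_fatalities = 100 * device2fatalities / total_fatalities
    * Division by zero yields missing, which -summarize- simply ignores
    summarize pct_device2_accidents pct_device2_fatalities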

  • #2
    I would characterize this as an inadequate way to assess the performance of the two devices. There are several problems with it:
    1. You do not have a reference criterion, i.e. a "ground truth" that determines the actual status of all the events that were assessed by the devices. Just because device2 calls something an accident that device1 didn't doesn't mean that device 2 was right and device1 was wrong--it could be the other way around. You need some kind of ground truth to which both devices can be compared.
    2. Even if you had a ground truth to compare both devices against, the fact that device2 classifies more of the events as accidents doesn't necessarily make it better. Often, even usually, when one device makes more "hits" than another, it also generates more false positives. Even if missed accidents are a more serious problem than false positive calls of accidents, if the number of false positives generated to avoid each miss is too large, the overall balance may then favor the device that misses more.
    3. In fact, if your only criterion is which device picks up more accidents, then don't bother with any device at all. Just call everything an accident (or fatality) and you'll never miss anything! You can see that a rational approach to comparing detection systems requires looking both at missed hits and false positives.



    • #3
      Thank you Clyde. Because this is sample data, I may not have given an accurate representation of the real data. In my real data, device 1 is actually the ground truth: the traditional method, based on manual assessment. Second, I used accidents and fatalities here, but the actual data are assessments of endocarditis, which is very tangible. So the device 1 data are a manual review for endocarditis nodes via surgery, and device 2 is CT-like imaging used to identify the number of endocarditis nodes.



      • #4
        So, this puts things in a very different light. And it makes the problem in some ways more difficult.

        Consider a situation where the surgical inspection reveals 5 nodes and the device identifies 4. At first glance, this looks like the device found 4 out of 5, or 80%. But, not so fast. It may be that the four "nodes" reported by the device are not actually nodes at all but 4 artifacts that it has mistaken for nodes, and, on top of that, it has missed all 5 of the actual nodes present. This would give it a 0% sensitivity (proportion of actual nodes found) and a 0% positive predictive value (proportion of findings that are actually nodes). Of course, that's the worst case scenario, but the actual performance could be anywhere between that and a best case of 80% sensitivity (4 of 5 found) with 100% positive predictive value (every finding was an actual node.) So it isn't possible to estimate the usual performance characteristics using count data like this. One would have to have more fine-grained data which characterized every individual finding as truly a node or not.
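
        As an illustration of the kind of fine-grained data that would be needed, here is a hypothetical Stata sketch. The variable names (truth, device2_call) and the toy data are purely illustrative, not from the posted dataset: truth marks whether an item really is a node, and device2_call marks whether device2 reported it. With data like this, sensitivity and positive predictive value fall out directly; with counts alone, they do not.

        Code:
        * Hypothetical finding-level data (illustrative only):
        * truth = 1 if the item is an actual node, device2_call = 1 if device2 reported it
        clear
        input byte(truth device2_call)
        1 1
        1 0
        1 1
        0 1
        1 1
        0 1
        end
        tabulate truth device2_call
        * Sensitivity: proportion of actual nodes that device2 found
        quietly summarize device2_call if truth == 1
        display "Sensitivity = " %5.3f r(mean)
        * Positive predictive value: proportion of device2 findings that are actual nodes
        quietly summarize truth if device2_call == 1
        display "PPV = " %5.3f r(mean)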

        There is another issue here. I notice that the ground truth never reports 0 nodes. I suppose this is because you wouldn't do surgery if there isn't actually any endocarditis. While this makes good clinical sense, it also makes it impossible to know how device2 performs when there are no endocarditis nodes at all. There are cases where the number of nodes "found" by device2 exceeds the true number, so we know that device2 does report false positives--but we are unable to know how many false positive nodes it typically reports when the truth is that there are none at all, which is a key question if one proposes to use device2 to detect the presence of endocarditis in the first place.

        All of that said, you may care more about the accuracy of the total number of nodes reported by device2, and less about the accuracy of each specific characterization. If you draw a scatterplot of device2accidents (vertical axis) vs device1accidents (horizontal axis), you can see that the points fill out the space of that graph pretty evenly. And a linear regression of device2accidents on device1accidents shows a coefficient of 0.015, with R2 = 0.007, and adjusted R2 negative. So this is the picture one would expect to see if the findings on device2 are actually independent of the ground truth. In other words, device2 is just generating results that have no meaningful connection to what's actually on the valve. (The numbers differ slightly for device2fatalities vs device1fatalities, but the results are qualitatively pretty much the same.) If the example data is representative, I feel comfortable concluding that device 2's quantitative findings are uninformative about the ground truth number of nodes.
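
        For anyone who wants to reproduce that check on the example data, the commands are just a scatterplot and a simple linear regression, run after loading the -dataex- data from the original post:

        Code:
        * Scatterplot and regression of device 2 counts on device 1 counts
        scatter device2accidents device1accidents
        regress device2accidents device1accidents
        * Same check for the fatality counts
        scatter device2fatalities device1fatalities
        regress device2fatalities device1fatalities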



        • #5
          Thank you Clyde for the detailed explanation. Having read it, I should clarify that my actual data does have some zeros in the ground truth. I generated a random dataset that does not line up very well with what I am actually working with.

