# Modeling the Binary Task {#binary-task-model}
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(ggplot2)
library(kableExtra)
```
## How much finished 95% {#binary-task-model-how-much-finished}
## Introduction {#binary-task-model-intro}
The previous chapter introduced measures of performance associated with the binary task. Described in this chapter is a two-parameter statistical model for the binary task, which shows how quantities like sensitivity and specificity can be predicted from the values of the model parameters. Introduced are the fundamental concepts of a **decision variable** and a **reporting threshold**, or simply **threshold**, which occur frequently in this book. It is shown that the reporting threshold can be altered by varying the experimental conditions. The receiver-operating characteristic (ROC) plot is introduced. It is shown how the dependence of sensitivity and specificity on the reporting threshold can be exploited by a measure of performance that is independent of the reporting threshold.
The dependence of the variability of the operating point on the numbers of cases is examined. The concept of random sampling is introduced and it is shown that the results become more stable with larger numbers of cases, i.e., larger sample sizes. These are perhaps intuitively obvious concepts but it is important to see them demonstrated; see Online Appendix 3.A. Formulae for confidence intervals for estimates of sensitivity and specificity are derived and the calculations are shown explicitly using `R` code.
## Decision variable and reporting threshold {#binary-task-model-z-sample-model}
The model for the binary task involves the following assumptions:
1. Each case has an associated decision variable sample.
1. The observer (e.g., radiologist) adopts a case-independent reporting threshold.
1. An adequate number of training sessions is used to bring the observer to a steady state.
1. The observer is "blinded" to the truth of each case while the researcher is not.
### Existence of a decision variable
**Assumption 1:** Each case presentation is associated with the occurrence (or realization) of a value of a random scalar *sensory variable* which is a unidirectional measure of evidence of disease.
* The sensory variable is sensed internally by the observer and as such cannot be directly measured. Alternative terminology includes “psychophysical variable”, “perceived variable”, “perceptual variable” or “confidence level”. The last term is the most common. It is a subjective variable since it depends on the observer: the same case shown to different observers could evoke different values of the sensory variable. Since one cannot measure it anyway, it would be a very strong assumption to assume that the sensations are identical. In this book the term “latent decision variable” or simply “decision variable” is used, which hopefully gets away from the semantics and focuses instead on what the variable is used for, namely making decisions. The symbol $Z$ is used and specific realized values are termed z-samples. It is a random variable in the sense that it varies randomly from case to case.
* The decision variable rank-orders cases with respect to evidence for presence of disease. Unlike a traditional rank-ordering scheme, where "1" is the highest rank, the scale is inverted with larger values corresponding to greater evidence of disease. Without loss of generality, one assumes that the decision variable ranges from $-\infty$ to $+\infty$ with large positive values indicative of strong evidence for presence of disease and large negative values indicative of strong evidence for absence of disease. The zero value indicates no evidence for presence or absence of disease. ^[The $-\infty$ to $+\infty$ scale is not an assumption. The decision variable scale could just as well range from a to b, where a < b; with appropriate re-scaling of the decision variable, there will be no changes in the rank-orderings, and the scale can be made to extend from $-\infty$ to $+\infty$.] In this book such a decision scale, with increasing values corresponding to increasing evidence of disease, is termed **positive-directed**.
### Existence of a reporting threshold
**Assumption 2:** The radiologist adopts a single case-independent reporting threshold $\zeta$ and states: "case is diseased" if $z \ge \zeta$ or "case is non-diseased" if $z < \zeta$.
* Unlike the random Z-sample, which varies from case to case, the reporting threshold is held fixed for the duration of the study. In some of the older literature the reporting threshold is referred to as "response bias". The term "bias" has a negative connotation, whereas, in fact, the choice of reporting threshold depends on a rational assessment of costs and benefits of different outcomes.
* The choice of reporting threshold depends on the conditions of the study: perceived or known disease prevalence, cost-benefit considerations, instructions regarding dataset characteristics, personal interpreting style, etc. There is a transient "learning curve" during which the observer is assumed to find the optimal threshold and henceforth holds it constant for the duration of the study. The learning is expected to stabilize after a sufficiently long training interval.
* Data should only be collected in the fixed threshold state, i.e., at the end of the training session.
* If a second study is conducted under different conditions, the observer is assumed to determine after a new training session the optimal threshold for the new conditions and henceforth holds it constant for the duration of the second study.
From Assumption 2, it follows that:
\begin{equation}
\text{1-Sp} \equiv \text{FPF}=P(Z\ge \zeta|T=1)
(\#eq:binary-task-model-fpf)
\end{equation}
\begin{equation}
\text{Se} \equiv \text{TPF}=P(Z\ge \zeta|T=2)
(\#eq:binary-task-model-tpf)
\end{equation}
**Explanation:** $P(Z\ge \zeta|T=1)$ is the probability that the Z-sample for a non-diseased case is greater than or equal to $\zeta$. According to Assumption 2 these cases are incorrectly classified as diseased, i.e., they are FP decisions, therefore the corresponding probability is false positive fraction $\text{FPF}$, which is the complement of specificity $\text{Sp}$. Likewise, $P(Z\ge \zeta|T=2)$ denotes the probability that the Z-sample for a diseased case is greater than or equal to $\zeta$. These cases are correctly classified as diseased, i.e., these are TP decisions and the corresponding probability is true positive fraction $\text{TPF}$, which is sensitivity $\text{Se}$.
There are several concepts implicit in Eqn. \@ref(eq:binary-task-model-fpf) and Eqn. \@ref(eq:binary-task-model-tpf).
* The Z-samples have an associated probability distribution. The diseased cases are not homogeneous: in some the disease is easy to detect, perhaps even obvious, in others the signs of disease are subtler, and in some, the disease is almost impossible to detect. Likewise, non-diseased cases are not homogeneous.
* The probability distributions depend on the truth state $T$. The distribution of the Z-samples for non-diseased cases is in general different from that for the diseased cases. Generally, the distribution for $T = 2$ is shifted to the right of that for $T = 1$ (assuming a **positive-directed** decision variable scale). Later in this chapter, specific distributional assumptions will be employed to obtain analytic expressions for the right hand sides of Eqn. \@ref(eq:binary-task-model-fpf) and Eqn. \@ref(eq:binary-task-model-tpf).
* The equations imply that via choice of the reporting threshold $\zeta$, both $\text{Se}$ and $\text{Sp}$ are under the control of the observer. The lower the reporting threshold the higher the sensitivity and the lower the specificity and the converse is also true. Ideally both sensitivity and specificity should be large, i.e., unity. The tradeoff between sensitivity and specificity says that there is no “free lunch”: the price paid for increased sensitivity is decreased specificity and vice-versa.
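The tradeoff just described can be demonstrated with a small simulation. The sketch below assumes, purely for illustration and anticipating the model of Section \@ref(binary-task-model-equal-variance-binormal-model), that non-diseased Z-samples are distributed $N(0,1)$ and diseased Z-samples $N(1.5,1)$; the three threshold values are arbitrary choices.
```{r}
# Illustration only: the assumed distributions anticipate the
# equal-variance binormal model introduced later in this chapter.
set.seed(1)
K1 <- 10000 # number of non-diseased cases (arbitrary)
K2 <- 10000 # number of diseased cases (arbitrary)
z1 <- rnorm(K1)       # non-diseased z-samples, assumed N(0,1)
z2 <- rnorm(K2) + 1.5 # diseased z-samples, assumed N(1.5,1)
for (zeta in c(-1, 0, 1)) {
  Se <- mean(z2 >= zeta) # sensitivity = TPF
  Sp <- mean(z1 < zeta)  # specificity = 1 - FPF
  cat("zeta =", zeta, ": Se =", Se, ", Sp =", Sp, "\n")
}
```
As the threshold is raised, sensitivity falls while specificity rises, the "no free lunch" behavior described above.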
### Adequacy of the training session
**Assumption 3:** The observer has complete knowledge of the distributions of actually non-diseased and actually diseased cases and makes rational decisions based on this knowledge. Knowledge of the distributions is entirely consistent with not knowing which distribution a specific z-sample is coming from.
How an observer can be induced to change the reporting threshold is the subject of the following two examples.
## Changing the reporting threshold: I {#binary-task-model-example-1}
Suppose that in the first study a radiologist interprets a set of cases subject to the instructions that it is important to identify actually diseased cases and to worry less about misdiagnosing actually non-diseased cases. One way to do this would be to reward the radiologist with $10 for each TP decision but only $1 for each TN decision. For simplicity, assume there is no penalty imposed for incorrect decisions and that the case set contains equal numbers of non-diseased and diseased cases and the radiologist is informed of these experimental conditions. It is also assumed that the radiologist is allowed to reach a steady state and responds rationally to the payoff agreement. Under these circumstances, the radiologist is expected to set the reporting threshold at a small value so that even slight evidence of presence of disease is enough to result in a "case is diseased" decision. The low reporting threshold also implies that considerable evidence of lack of disease is needed before a "case is non-diseased" decision is rendered. The radiologist is expected to achieve relatively high sensitivity but specificity will be low. As a concrete example, if there are 100 non-diseased cases and 100 diseased cases, assume the radiologist makes 90 TP decisions; since the threshold for presence of disease is small, this number is close to the maximum possible value, namely 100. Assume further that 10 TN decisions are made; since the implied threshold for evidence of absence of disease is large, this number is close to the minimum possible value, namely 0. Therefore, sensitivity is 90 percent and specificity is 10 percent. The radiologist earns 90 x $10 + 10 x $1 = $910 for participating in this study.
Next, suppose the study is repeated with the same cases but this time the payoff is $1 for each TP decision and $10 for each TN decision. Suppose, further, that sufficient time has elapsed from the previous study so that memory effects can be neglected. Now the roles of sensitivity and specificity are reversed. The radiologist's incentive is to be correct on actually non-diseased cases without worrying too much about missing actually diseased cases. The radiologist is expected to set the reporting threshold at a large value so that considerable evidence of disease-presence is required to result in a "case is diseased" decision but even slight evidence of absence of disease is enough to result in a "case is non-diseased" decision. This radiologist is expected to achieve relatively low sensitivity but specificity will be higher. Assume the radiologist makes 90 TN decisions and 10 TP decisions, earning $910 for the second study. The corresponding sensitivity is 10 percent and specificity is 90 percent.
The incentives in the first study caused the radiologist to accept low specificity in order to achieve high sensitivity. The incentives in the second study caused the radiologist to accept low sensitivity in order to achieve high specificity.
## Changing the reporting threshold: II {#binary-task-model-example-2}
Suppose one asks the same radiologist to interpret a set of cases, but this time the reward for a correct decision is always $1 regardless of the truth state of the case and, as before, there is no penalty for incorrect decisions. However, the radiologist is told that disease prevalence is only 0.005 and that this is the actual prevalence in the experimental study, i.e., the experimenter is not deceiving the radiologist. ^[Even if the experimenter attempts to deceive the radiologist, by claiming for example that there are roughly equal numbers of non-diseased and diseased cases, after interpreting a few tens of cases the radiologist will know that a deception is involved. Deception in such studies is not a good idea, as the observer’s performance is not being measured in a “steady state condition” – the observer’s performance will change as the observer “learns” the true disease prevalence.] In other words, only five out of every 1000 cases are actually diseased. This information will cause the radiologist to adopt a high threshold thereby becoming more reluctant to state: “case is diseased”. By simply diagnosing all cases as non-diseased, i.e., without using any image information, the radiologist will be correct on every disease absent case and earn $995, which is close to the maximum $1000 the radiologist can earn by using case information to the full and being correct on disease-present and disease-absent cases.
The example is not as contrived as might appear at first sight: disease prevalence in screening mammography is comparably low. However, the cost of missing a breast cancer, both in terms of loss of life and a possible malpractice suit, is usually perceived to be higher than the cost of a false positive. This can result in a shift towards higher sensitivity at the expense of lower specificity.
If a new study were conducted with a highly enriched set of cases, where the disease prevalence is 0.995 (i.e., only 5 out of every 1000 cases are actually non-diseased), then the radiologist would adopt a low threshold. By simply calling every case “diseased”, the radiologist earns $995.
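The earnings arithmetic in these prevalence examples is easily verified. The sketch below uses the same numbers as the text (1000 cases, prevalence 0.005, $1 per correct decision) and compares the "call everything non-diseased" strategy with a hypothetical perfect observer.
```{r}
# Earnings arithmetic for the low-prevalence example ($1 per correct decision)
nCases <- 1000
prevalence <- 0.005
# Strategy: diagnose every case as "non-diseased" (correct on non-diseased only)
earningsAllNonDiseased <- nCases * (1 - prevalence) * 1
# Hypothetical perfect observer, correct on every case
earningsPerfect <- nCases * 1
c(allNonDiseased = earningsAllNonDiseased, perfect = earningsPerfect)
```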
These examples show that by manipulating the relative costs of correct vs. incorrect decisions and / or by varying disease prevalence one can influence the radiologist’s reporting threshold. These examples apply to laboratory studies. Clinical interpretations are subject to different cost-benefit considerations that are generally not under the researcher's control: actual disease prevalence, the reputation of the radiologist, malpractice, etc.
## The equal-variance binormal model {#binary-task-model-equal-variance-binormal-model}
The notation $N(\mu,\sigma^2)$ denotes the normal (or "Gaussian") distribution with mean $\mu$ and variance $\sigma^2$. The model for the Z-samples is as follows:
1. The Z-samples for non-diseased cases are distributed $Z \sim N(0,1)$.
1. The Z-samples for diseased cases are distributed $Z \sim N(\mu,1)$ with $\mu \ge 0$.
1. A case is diagnosed as diseased if $z \ge \zeta$ and non-diseased otherwise.
The constraint $\mu \ge 0$ is needed so that the observer's performance is at least as good as chance. A large negative value for this parameter would imply an observer so predictably bad that the observer is good; one simply reverses the observer's decision ("diseased" to "non-diseased" and vice versa) to get near-perfect performance.
The model described above is termed the equal-variance binormal model. ^[If the common variance is not unity, one can re-scale the decision axis to achieve unit-variance without changing the predictions of the model.] A more general model termed the unequal-variance binormal model is generally used for modeling human observer data, discussed later, but for the moment, one does not need that complication. The equal-variance binormal model is defined by:
\begin{equation}
\left.
\begin{aligned}
Z_{k_tt} \sim& N(\mu_t,1) \\
\mu_1=& 0\\
\mu_2=& \mu
\end{aligned}
\right \}
(\#eq:binary-task-model-equal-variance)
\end{equation}
In Eqn. \@ref(eq:binary-task-model-equal-variance) the subscript $t$ denotes the truth, with $t = 1$ denoting a non-diseased case and $t = 2$ denoting a diseased case. The variable $Z_{k_tt}$ denotes the random Z-sample for case $k_t$ with truth state $t$, where $k_t$ is the index for cases with truth state $t$. For example, $k_1=21$ denotes the 21st non-diseased case, with Z-sample $Z_{21,1}$, and $k_2=3$ denotes the 3rd diseased case, with Z-sample $Z_{3,2}$. In the compound subscript $k_tt$ the label $k_t$ indexes the case while the label $t$ indicates the truth state of the case. The label $k_t$ ranges over $1,2,...,K_t$ where $K_t$ is the number of cases with truth state $t$.
As you can see I am departing from the usual convention in this field, which labels the cases with a single index $k$ ranging from 1 to $K_1+K_2$, leaving one guessing as to the truth-state of each case. Also, the proposed notation extends readily to the FROC paradigm where two states of truth have to be distinguished, one at the case level and the other at the location level.
The first line in Eqn. \@ref(eq:binary-task-model-equal-variance) states that $Z_{k_tt}$ is a random sample from the $N(\mu_t,1)$ distribution, which has unit variance regardless of the value of $t$ (this is the reason for calling it the equal-variance binormal model). The remaining lines in Eqn. \@ref(eq:binary-task-model-equal-variance) define $\mu_1$ as zero and $\mu_2$ as $\mu$. Taken together, these equations state that non-diseased case Z-samples are distributed $N(0,1)$ and diseased case Z-samples are distributed $N(\mu,1)$. The name binormal arises from the two normal distributions underlying this model.
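The following is a minimal sketch of how Z-samples can be simulated from Eqn. \@ref(eq:binary-task-model-equal-variance) in `R`; the values of $\mu$, $K_1$ and $K_2$ are arbitrary illustrative choices.
```{r}
# Simulate z-samples from the equal-variance binormal model
seed <- 1;set.seed(seed)
mu <- 1.5 # separation parameter (arbitrary illustrative value)
K1 <- 5   # number of non-diseased cases (arbitrary)
K2 <- 7   # number of diseased cases (arbitrary)
z1 <- rnorm(K1, mean = 0, sd = 1)  # Z-samples for truth state t = 1
z2 <- rnorm(K2, mean = mu, sd = 1) # Z-samples for truth state t = 2
z1
z2
```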
A few facts concerning the normal distribution are summarized next.
## The normal distribution {#binary-task-model-normal-distribution}
A probability density function (pdf) or density of a continuous random variable is a function giving the relative chance that the random variable takes on a given value. For a continuous distribution, the probability of the random variable being exactly equal to a given value is zero. The probability of the random variable falling in a range of values is given by the integral of this variable’s pdf function over that range. For the normal distribution $N(\mu,\sigma^2)$ the pdf is denoted $\phi(z|\mu,\sigma)$.
By definition,
\begin{equation}
\phi\left ( z|\mu,\sigma \right )dz=P(z<Z<(z+dz)|Z \sim N(\mu,\sigma^2))
(\#eq:binary-task-model-phi-def)
\end{equation}
The right hand side of Eqn. \@ref(eq:binary-task-model-phi-def) is the probability that the random variable $Z$ sampled from $N(\mu,\sigma^2)$ falls between the fixed limits z and z + dz. For this reason $\phi(z|\mu,\sigma)$ is termed the probability density function.
The special case $N(0,1)$ is referred to as the **unit normal distribution**; it has zero mean and unit variance and the corresponding pdf is denoted $\phi(z)$. The defining equation for the pdf of the unit normal distribution is:
\begin{equation}
\phi\left ( z \right )=\frac{1}{\sqrt{2\pi}}\exp\left ( -\frac{z^2}{2} \right )
(\#eq:binary-task-model-phi)
\end{equation}
The integral of $\phi(t)$ from $-\infty$ to $z$, as in Eqn. \@ref(eq:binary-task-model-Phi), is the probability that a sample from the unit normal distribution is less than $z$. Regarded as a function of $z$, this is termed the cumulative distribution function (CDF) and is denoted by $\Phi$ (sometimes the term probability distribution function is used for what we are terming the CDF). The function $\Phi(z)$ is defined by:
\begin{equation}
\Phi\left ( z \right )=\int_{-\infty }^{z}\phi(t)dt
(\#eq:binary-task-model-Phi)
\end{equation}
```{r binary-task-model-plots1, fig.cap= "pdf-CDF plots for the unit normal distribution. The red curve is the pdf and the blue line is the CDF. The dashed line is the reporting threshold $\\zeta = 1$.", fig.show='hold', echo = FALSE}
x <- seq(-3,3,0.01)
pdfData <- data.frame(z = x, pdfcdf = dnorm(x))
cdfData <- data.frame(z = x, pdfcdf = pnorm(x))
pdfcdfPlot <- ggplot(
mapping = aes(x = z, y = pdfcdf)) +
geom_line(data = pdfData, color = "red") +
geom_line(data = cdfData, color = "blue") +
geom_vline(xintercept = 1, linetype = 2) +
xlab(label = "z") + ylab(label = "pdf/CDF")
print(pdfcdfPlot)
```
Fig. \@ref(fig:binary-task-model-plots1) shows plots, as functions of z, of the CDF and the pdf for the unit normal distribution. Since z-samples outside ±3 are unlikely, the plotted range, from -3 to +3, includes most of the distribution. The pdf is the familiar bell-shaped curve, centered at zero; the corresponding `R` function is `dnorm()`, the density of the unit normal distribution. The CDF $\Phi(z)$ increases monotonically from 0 to unity as z increases from $-\infty$ to $+\infty$. It is the sigmoid (S-shaped) curve in Fig. \@ref(fig:binary-task-model-plots1); the corresponding `R` function is `pnorm()`. The dashed line corresponds to the reporting threshold $\zeta = 1$. The area under the pdf to the left of $\zeta$ equals the value of the CDF at the selected $\zeta$ (`pnorm(1)` = 0.841).
A related function is the inverse of Eqn. \@ref(eq:binary-task-model-Phi). Suppose the left hand side of Eqn. \@ref(eq:binary-task-model-Phi) is denoted $p$, which is a probability in the range 0 to 1.
\begin{equation}
p \equiv \Phi\left ( z \right )=\int_{-\infty }^{z}\phi(t)dt
(\#eq:binary-task-model-Phi2)
\end{equation}
The inverse of $\Phi(z)$ is that function which when applied to $p$ yields the upper limit $z$ in Eqn. \@ref(eq:binary-task-model-Phi2), i.e.,
\begin{equation}
\Phi^{-1}(p) = z
(\#eq:binary-task-model-PhiInvDef)
\end{equation}
Since $p \equiv \Phi(z)$ it follows that
\begin{equation}
\Phi^{-1}(\Phi(z))=z
(\#eq:binary-task-model-PhiInvDef2)
\end{equation}
This nicely satisfies the property of an inverse function. The inverse is known in statistical terminology as the quantile function, implemented in `R` as `qnorm()`. Think of `pnorm()` as returning a probability and `qnorm()` as returning a value on the z-axis.
To summarize the convention used in `R`, `norm` implies the unit normal distribution, `p` denotes a probability distribution function or CDF, `q` denotes a quantile function and `d` denotes a density function.
```{r, attr.source = ".numberLines"}
qnorm(0.025)
qnorm(1-0.025)
pnorm(qnorm(0.025))
qnorm(pnorm(-1.96))
```
Line 1 demonstrates the identity:
\begin{equation}
\Phi^{-1}(0.025)=-1.959964
(\#eq:binary-task-model-Phi-Inv-alpha-by2)
\end{equation}
Line 3 demonstrates the identity:
\begin{equation}
\Phi^{-1}(1-0.025)=+1.959964
(\#eq:binary-task-model-phi-inv-alpha)
\end{equation}
Lines 5 and 7 demonstrate that `pnorm` and `qnorm`, applied in either order, are inverses of each other.
Eqn. \@ref(eq:binary-task-model-Phi-Inv-alpha-by2) means that the (rounded) value -1.96 is such that the area under the pdf to the left of this value is 0.025. Similarly, Eqn. \@ref(eq:binary-task-model-phi-inv-alpha) means that the (rounded) value +1.96 is such that the area under the pdf to the left of this value is 1-0.025 = 0.975. In other words, -1.96 captures, to its left, the 2.5th percentile of the unit-normal distribution, and 1.96 captures, to its left, the 97.5th percentile of the unit-normal distribution, Fig. \@ref(fig:binary-task-model-shaded-tails). Since between them they capture 95 percent of the unit-normal pdf, these two values can be used to estimate 95 percent confidence intervals.
```{r binary-task-model-shaded-tails, fig.cap= "Illustrating that 95 percent of the total area under the unit normal pdf is contained in the range |z| < 1.96; this is used to construct a 95 percent confidence interval for an estimate of a suitably normalized statistic. The area in each shaded tail is 0.025.", fig.show='hold', echo=FALSE}
mu <- 0;sigma <- 1
zeta <- -qnorm(0.025)
step <- 0.1
LL<- -3
UL <- mu + 3*sigma
x.values <- seq(zeta,UL,step)
cord.x <- c(zeta, x.values,UL)
cord.y <- c(0,dnorm(x.values),0)
z <- seq(LL, UL, by = step)
curveData <- data.frame(z = z, pdfs = dnorm(z))
shadeData <- data.frame(z = cord.x, pdfs = cord.y)
shadedTails <- ggplot(mapping = aes(x = z, y = pdfs)) +
geom_polygon(data = shadeData, color = "grey", fill = "grey")
zeta <- qnorm(0.025)
x.values <- seq(LL, zeta,step)
cord.x <- c(LL, x.values,zeta)
cord.y <- c(0,dnorm(x.values),0)
shadeData <- data.frame(z = cord.x, pdfs = cord.y)
shadedTails <- shadedTails +
geom_polygon(
data = shadeData, color = "grey", fill = "grey") +
xlab(label = "z")
shadedTails <- shadedTails +
geom_line(data = curveData, color = "black")
print(shadedTails)
```
> If one knows that a variable is distributed as a unit-normal random variable, then the observed value minus 1.96 defines the lower limit of its 95 percent confidence interval, and the observed value plus 1.96 defines the upper limit of its 95 percent confidence interval.
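As a concrete illustration of the preceding statement, suppose the observed value of such a statistic is 0.7 (a hypothetical number); its approximate 95 percent confidence interval then follows directly:
```{r}
observed <- 0.7 # hypothetical observed value of a unit-variance statistic
alpha <- 0.05
c(lower = observed - qnorm(1 - alpha/2),
  upper = observed + qnorm(1 - alpha/2))
```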
## Analytic expressions for specificity and sensitivity {#binary-task-model-sensitivity-specificity}
Specificity corresponding to threshold $\zeta$ is the probability that a Z-sample from a non-diseased case is smaller than $\zeta$. By definition, this is the CDF corresponding to the threshold $\zeta$. In other words:
\begin{equation}
\left.
\begin{aligned}
Sp\left ( \zeta \right ) \equiv& P\left ( Z_{k_11} < \zeta\mid Z_{k_11} \sim N\left ( 0,1 \right )\right ) \\
=& \Phi\left ( \zeta \right )
\end{aligned}
\right \}
(\#eq:binary-task-model-specificity)
\end{equation}
The expression for sensitivity can be derived by considering the random variable obtained by shifting the origin to $\mu$. A little thought should convince the reader that $Z_{k_22}-\mu$ is distributed as $N(0,1)$. Therefore, the desired probability is:
\begin{equation}
\left.
\begin{aligned}
Se\left ( \zeta \right ) \equiv& P\left ( Z_{k_22} \geq \zeta\right ) \\
=&P\left (\left ( Z_{k_22} -\mu \right ) \geq\left ( \zeta -\mu \right )\right ) \\
=&1-P\left (\left ( Z_{k_22} -\mu \right ) < \left ( \zeta -\mu \right )\right ) \\
=& 1-\Phi\left ( \zeta -\mu \right )
\end{aligned}
\right \}
(\#eq:binary-task-model-sensitivity)
\end{equation}
A little thought, based on the definition of the CDF function and the symmetry of the unit-normal pdf function, should convince the reader that:
\begin{equation}
\left.
\begin{aligned}
1-\Phi(\zeta)=& \Phi(-\zeta)\\
1-\Phi(\zeta-\mu) =& \Phi(\mu-\zeta)
\end{aligned}
\right \}
(\#eq:binary-task-model-compact)
\end{equation}
Instead of carrying the "1 minus" around one can use the more compact notation. Summarizing, the analytical formulae for the specificity and sensitivity for the equal-variance binormal model are:
\begin{equation}
Sp\left ( \zeta \right ) = \Phi(\zeta)\\
Se\left ( \zeta \right ) = \Phi(\mu-\zeta)
(\#eq:binary-task-model-se-sp)
\end{equation}
> In Eqn. \@ref(eq:binary-task-model-se-sp) the threshold $\zeta$ appears with different signs because specificity is the area under a pdf to the left of the threshold while sensitivity is the area to the right.
Sensitivity and specificity are restricted to the range 0 to 1. The observer’s performance could be characterized by specifying sensitivity *and* specificity, i.e., a pair of numbers. If both sensitivity and specificity of an imaging system are greater than the corresponding values for another system, then the first system is unambiguously better than the second. But if sensitivity is greater for the first but specificity is greater for the second the comparison is ambiguous. A scalar measure is desirable that combines sensitivity and specificity into a single measure of diagnostic performance.
The parameter $\mu$ satisfies the requirements of a scalar performance measure, termed a figure of merit (FOM). Eqn. \@ref(eq:binary-task-model-se-sp) can be solved for $\mu$ as follows. Inverting the equations yields:
\begin{equation}
\left.
\begin{aligned}
\zeta =&\Phi^{-1} \left (\text{Sp}\left ( \zeta \right ) \right )\\
\mu - \zeta =& \Phi^{-1} \left (\text{Se}\left ( \zeta \right ) \right )
\end{aligned}
\right \}
(\#eq:binary-task-model-solve-mu-zeta)
\end{equation}
Eliminating $\zeta$ yields:
\begin{equation}
\left.
\begin{aligned}
\mu =& \Phi^{-1} \left (\text{Sp}\left ( \zeta \right ) \right ) + \Phi^{-1} \left (\text{Se}\left ( \zeta \right ) \right )
\end{aligned}
\right \}
(\#eq:binary-task-model-solve-mu)
\end{equation}
This is a useful relation, as it converts a *pair* of numbers into a *scalar* performance measure. Now it is almost trivial to compare two modalities: the one with the higher $\mu$ is better. In reality, the comparison is not trivial since like sensitivity and specificity $\mu$ has to be estimated from a finite dataset and one must account for sampling variability.
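The following sketch checks Eqn. \@ref(eq:binary-task-model-se-sp) and Eqn. \@ref(eq:binary-task-model-solve-mu) numerically; the values of $\mu$ and $\zeta$ are arbitrary illustrative choices.
```{r}
mu <- 3;zeta <- 1 # illustrative values
Sp <- pnorm(zeta)      # specificity, Eqn. (se-sp)
Se <- pnorm(mu - zeta) # sensitivity, Eqn. (se-sp)
# recover mu from the (Se, Sp) pair, Eqn. (solve-mu)
muRecovered <- qnorm(Sp) + qnorm(Se)
c(Sp = Sp, Se = Se, muRecovered = muRecovered)
```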
```{r binary-task-model-shaded-plots, fig.cap= "The equal-variance binormal model for $\\mu = 3$ and $\\zeta = 1$; the blue curve, centered at zero, is the pdf of non-diseased cases and the red one, centered at $\\mu = 3$, is the pdf of diseased cases. The left edge of the blue shaded region represents the threshold $\\zeta = 1$. The red shaded area including the common portion with the vertical red lines is sensitivity. The blue shaded area including the common portion with the vertical red lines is 1-specificity.", fig.show='hold', echo=FALSE}
options(digits=3)
mu <- 3;sigma <- 1
zeta <- 1
step <- 0.1
lowerLimit<- -1 # lower limit
upperLimit <- mu + 3*sigma # upper limit
z <- seq(lowerLimit, upperLimit, by = step)
pdfs <- dnorm(z)
seqNor <- seq(zeta,upperLimit,step)
cord.x <- c(zeta, seqNor,upperLimit)
# need two y-coords at each end point of range;
# one at zero and one at value of function
cord.y <- c(0,dnorm(seqNor),0)
curveData <- data.frame(z = z, pdfs = pdfs)
shadeData <- data.frame(z = cord.x, pdfs = cord.y)
shadedPlots <- ggplot(mapping = aes(x = z, y = pdfs)) +
geom_line(data = curveData, color = "blue") +
geom_polygon(data = shadeData, color = "blue", fill = "blue")
crossing <- uniroot(function(x) dnorm(x) - dnorm(x,mu,sigma),
lower = 0, upper = 3)$root
crossing <- max(c(zeta, crossing))
seqAbn <- seq(crossing,upperLimit,step)
cord.x <- c(seqAbn, rev(seqAbn))
# reason for reverse
# we want to explicitly define the polygon
# we don't want R to close it
cord.y <- c()
for (i in seq(1,length(cord.x)/2)) {
cord.y <- c(cord.y,dnorm(cord.x[i],mu, sigma))
}
for (i in seq(1,length(cord.x)/2)) {
cord.y <- c(cord.y,dnorm(cord.x[length(cord.x)/2+i]))
}
pdfs <- dnorm(z, mu, sigma)
curveData <- data.frame(z = z, pdfs = pdfs)
shadeData <- data.frame(z = cord.x, pdfs = cord.y)
shadedPlots <- shadedPlots +
geom_line(data = curveData, color = "red") +
geom_polygon(data = shadeData, color = "red", fill = "red")
seqAbn <- seq(zeta,upperLimit,step)
for (i in seqAbn) {
# define xs and ys of two points, separated only along y-axis
vlineData <- data.frame(x1 = i,
x2 = i,
y1 = 0,
y2 = dnorm(i, mu, sigma))
# draw vertical line between them
shadedPlots <- shadedPlots +
geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2),
data = vlineData, color = "red")
}
shadedPlots <- shadedPlots + xlab(label = "z-sample")
print(shadedPlots)
```
Fig. \@ref(fig:binary-task-model-shaded-plots) shows the equal-variance binormal model for $\mu = 3$ and $\zeta = 1$. The blue-shaded area, including the "common" portion with the vertical red lines, is the probability that a z-sample from a non-diseased case exceeds $\zeta = 1$, which is the complement of specificity, i.e., false positive fraction, which is 1 - `pnorm(1)` = `r 1-pnorm(1)`. The red shaded area, including the "common" portion with the vertical red lines, is the probability that a z-sample from a diseased case exceeds $\zeta = 1$, which is sensitivity or true positive fraction, which is `pnorm(3-1)`= `r pnorm(3-1)`.
See Appendix \@ref(binary-task-model-demo) for a demonstration of the concepts of sensitivity and specificity using `R`.
## Inverse variation of sensitivity and specificity {#binary-task-model-sensitivity-specificity-inverse-variation}
The variation of sensitivity and specificity is modeled in the binormal model by the threshold parameter $\zeta$. From Eqn. \@ref(eq:binary-task-model-specificity) specificity at threshold $\zeta$ is $\text{Sp} = \Phi(\zeta)$ and sensitivity is $\text{Se} = \Phi(\mu-\zeta)$. Since the threshold $\zeta$ appears with different signs the dependence of sensitivity on $\zeta$ will be the opposite of that of specificity. In Fig. \@ref(fig:binary-task-model-shaded-plots), the left edge of the blue shaded region represents the threshold $\zeta = 1$. As $\zeta = 1$ is moved towards the left, specificity decreases but sensitivity increases. Specificity decreases because less of the non-diseased distribution lies to the left of the lowered threshold, in other words fewer non-diseased cases are correctly diagnosed as non-diseased. Sensitivity increases because more of the diseased distribution lies to the right of the lowered threshold, in other words more diseased cases are correctly diagnosed as diseased.
If Observer 1 has higher sensitivity than Observer 2 but lower specificity it is difficult, though not impossible [@RN2637], to unambiguously compare them. The comparison is difficult for the following reason: assuming Observer 2 can be coaxed into adopting a lower threshold, thereby decreasing specificity to match that of Observer 1, it is possible that Observer 2's sensitivity, formerly smaller, could now be greater than that of Observer 1 (and here is the ambiguity, because it might not happen).
A single figure of merit is preferable to the sensitivity-specificity pair. It is possible to leverage the inverse variation of sensitivity and specificity by combining them into a single scalar measure, as was done with the $\mu$ parameter in the previous section, Eqn. \@ref(eq:binary-task-model-solve-mu).
An equivalent way is by using the area under the ROC plot, discussed next.
## The ROC curve {#binary-task-model-roc-curve}
The receiver operating characteristic (ROC) is defined as the plot of sensitivity (y-axis) vs. 1-specificity (x-axis). Equivalently, it is the plot of TPF (y-axis) vs. FPF (x-axis). From Eqn. \@ref(eq:binary-task-model-se-sp) it follows that:
\begin{equation}
\left.
\begin{aligned}
\text{FPF}\left ( \zeta \right ) &\equiv 1 - \text{Sp}\left ( \zeta \right ) \\
&=\Phi\left ( -\zeta \right )\\
\text{TPF}\left ( \zeta \right ) &\equiv \text{Se}\left ( \zeta \right ) \\
&=\Phi\left (\mu -\zeta \right )\\
\end{aligned}
\right \}
(\#eq:binary-task-model-fpf-tpf)
\end{equation}
Specifying $\zeta$ selects a particular operating point on this curve and varying $\zeta$ from $+\infty$ to $-\infty$ causes the operating point to trace out the ROC curve from the origin (0,0) to (1,1). Note that as $\zeta$ increases the operating point moves down the curve. The operating point $O(\zeta|\mu)$ for the equal variance binormal model is:
\begin{equation}
O\left ( \zeta \mid \mu \right ) = \left ( \Phi(-\zeta), \Phi(\mu-\zeta) \right ) \\
(\#eq:binary-task-model-op-pt)
\end{equation}
The operating point predicted by the above equation lies exactly on the theoretical ROC curve. This condition can only be achieved with very large numbers of cases. With finite datasets the operating point will almost never be exactly on the theoretical curve.
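The distinction between the predicted and an empirical operating point can be illustrated with a small simulation; the numbers of cases below are arbitrary illustrative choices.
```{r}
seed <- 1;set.seed(seed)
mu <- 3;zeta <- 1
K1 <- 50;K2 <- 50 # finite, arbitrary numbers of cases
z1 <- rnorm(K1)      # non-diseased z-samples
z2 <- rnorm(K2) + mu # diseased z-samples
# predicted operating point, Eqn. (op-pt)
c(FPF = pnorm(-zeta), TPF = pnorm(mu - zeta))
# empirical operating point from the finite dataset:
# generally close to, but not exactly on, the theoretical curve
c(FPF = mean(z1 >= zeta), TPF = mean(z2 >= zeta))
```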
> The ROC curve is the locus of the operating point for fixed $\mu$ and variable $\zeta$. Fig. \@ref(fig:binary-task-model-rocs) shows examples of equal-variance binormal model ROC curves for different values of $\mu$. Each has the property that TPF is a monotonically increasing function of FPF and the slope decreases monotonically as the operating point moves up the curve. As $\mu$ increases the curves get progressively upward-left shifted, approaching the top-left corner of the ROC plot. In the limit $\mu = \infty$ the curve degenerates into two line segments, a vertical one connecting the origin to (0,1) and a horizontal one connecting (0,1) to (1,1) – the ROC plot for a perfect observer.
```{r binary-task-model-rocs, fig.cap= "ROC plots predicted by the equal variance binormal model for different values of $\\mu$. As $\\mu$ increases the intersection of the curve with the negative diagonal moves closer to the ideal operating point, (0,1) at which sensitivity and specificity are both equal to unity.", fig.show='hold', echo=FALSE}
mu <- 0;zeta <- seq(-5, mu + 5, 0.05)
FPF <- pnorm(-zeta)
rocPlot <- ggplot(mapping = aes(x = FPF, y = TPF))
for (mu in 0:3){
TPF <- pnorm(mu-zeta)
curveData <- data.frame(FPF = FPF, TPF = TPF)
rocPlot <- rocPlot +
geom_line(data = curveData, linewidth = 1) +
xlab("FPF")+ ylab("TPF" ) +
annotate("text",
x = pnorm(-mu/2) + 0.11,
y = pnorm(mu/2) + 0.0,
label = paste0("mu == ", mu),
parse = TRUE, size = 8)
}
rocPlot <- rocPlot +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
rocPlot <- rocPlot +
geom_abline(slope = -1,
intercept = 1,
linetype = 2,
linewidth = 1)
print(rocPlot)
```
### The chance diagonal {#binary-task-model-chance-diagonal}
In Fig. \@ref(fig:binary-task-model-rocs) the ROC curve for $\mu=0$ is the positive diagonal of the ROC plot, termed the **chance diagonal**. Along this curve $\text{TPF = FPF}$ and the observer’s performance is at chance level. For $\mu=0$ the pdf of the diseased distribution is identical to that of the non-diseased distribution: both are centered at the origin. Therefore, no matter the choice of threshold $\zeta$, $\text{TPF = FPF}$. Setting $\mu=0$ in Eqn. \@ref(eq:binary-task-model-fpf-tpf) yields:
$$\text{TPF}\left ( \zeta \right )=\text{FPF}\left ( \zeta \right )=\Phi\left ( -\zeta \right )$$
In this case the red and blue curves in Fig. \@ref(fig:binary-task-model-shaded-plots) coincide. The observer is unable to find any difference between the two distributions. This can happen if the cancers are of such low visibility that diseased cases are indistinguishable from non-diseased ones, or the observer’s skill level is so poor that the observer is unable to make use of distinguishing characteristics between diseased and non-diseased cases that do exist and which experts exploit.
### The guessing observer {#binary-task-model-guessing-observer}
If the cases are indeed impossibly difficult and/or the observer has zero skill at discriminating between them, the observer has no option but to guess. This rarely happens in the clinic, as too much is at stake; this paragraph is intended to make a pedagogical point: the observer can move the operating point along the chance diagonal. If there is no special incentive, the observer tosses a coin and if the coin lands head up, the observer states: “case is diseased” and otherwise states: “case is non-diseased”. When this procedure is averaged over many non-diseased and diseased cases, it will result in the operating point (0.5, 0.5). ^[Many cases are assumed as otherwise, due to sampling variability, the operating point will not be on the theoretical ROC curve.] To move the operating point downward, e.g., to (0.1, 0.1) the observer randomly selects an integer number between 1 and 10, equivalent to a 10-sided "coin". Whenever a one "shows up", the observer states “case is diseased” and otherwise the observer states “case is non-diseased”. To move the operating point to (0.2, 0.2) whenever a one or two "shows up", the observer states “case is diseased” and otherwise the observer states “case is non-diseased”. One can appreciate that simply by changing the probability of stating “case is diseased” the observer can place the operating point anywhere on the chance diagonal but wherever the operating point is placed, it will satisfy TPF = FPF.
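The guessing observer is easy to simulate. In the sketch below the probability of stating "case is diseased" is set to 0.1, an arbitrary choice; the resulting operating point is close to (0.1, 0.1) on the chance diagonal.
```{r}
seed <- 1;set.seed(seed)
p <- 0.1 # probability of stating "case is diseased" (arbitrary choice)
K1 <- 10000;K2 <- 10000
# the decision ignores the case entirely; only the "coin" matters
FPF <- mean(runif(K1) < p)
TPF <- mean(runif(K2) < p)
c(FPF = FPF, TPF = TPF) # both are close to p, i.e., TPF = FPF
```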
### Symmetry with respect to negative diagonal {#binary-task-model-symmetry-wrt-negative-diagonal}
A characteristic of the ROC curves shown in Fig. \@ref(fig:binary-task-model-rocs) is that they are symmetric with respect to the negative diagonal, i.e., the line joining (0,1) and (1,0) which is shown as the dotted straight line in Fig. \@ref(fig:binary-task-model-rocs). The symmetry property is due to the equal variance nature of the binormal model and is not true for models considered in later chapters. The intersection between the ROC curve and the negative diagonal corresponds to $\zeta = \mu/2$, in which case the operating point is:
\begin{equation}
\left.
\begin{aligned}
\text{FPF}\left ( \zeta \right ) &=\Phi\left ( -\mu/2 \right )\\
\text{TPF}\left ( \zeta \right ) &=\Phi\left (\mu/2 \right )\\
\end{aligned}
\right \}
(\#eq:binary-task-model-neg-diag)
\end{equation}
The first equation implies:
$$1-\text{FPF}\left ( \zeta \right ) =1-\Phi\left ( -\mu/2 \right )= \Phi\left ( \mu/2 \right )$$
Therefore,
\begin{equation}
\text{TPF}\left ( \zeta \right ) = 1-\text{FPF}\left ( \zeta \right )
(\#eq:binary-task-model-neg-diag2)
\end{equation}
This equation describes a straight line with unit intercept and slope equal to minus 1, which is the negative diagonal. Since $\text{TPF = Se}$ and $\text{FPF = 1-Sp}$ another way of stating this is that at the intersection with the negative diagonal sensitivity equals specificity.
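A quick numerical check of this result, using $\mu = 3$ as an illustrative value: at $\zeta = \mu/2$ sensitivity and specificity are equal.
```{r}
mu <- 3 # illustrative value
zeta <- mu/2
c(Se = pnorm(mu - zeta), Sp = pnorm(zeta)) # both equal pnorm(mu/2)
```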
### Area under the ROC curve {#binary-task-model-auc-roc-important}
>The area AUC (abbreviation for area under curve) under the ROC curve suggests itself as a measure of performance that is independent of threshold and therefore circumvents the ambiguity issue of comparing sensitivity/specificity pairs, and has other advantages.
It is defined by the following integrals:
\begin{equation}
\begin{aligned}
A_{z;\sigma = 1} &= \int_{0}^{1}TPF(\zeta)d(FPF(\zeta))\\
&=\int_{0}^{1}\left ( 1-FPF(\zeta) \right )d(TPF(\zeta))\\
\end{aligned}
(\#eq:binary-task-model-az-eq-var)
\end{equation}
Eqn. \@ref(eq:binary-task-model-az-eq-var) has the following equivalent interpretations:
* The first form performs the integration using thin vertical strips, e.g., extending from x to x + dx, where x is a temporary symbol for FPF. The area can be interpreted as the average TPF over all possible values of FPF.
* The second form performs the integration using thin horizontal strips, e.g., extending from y to y + dy, where y is a temporary symbol for TPF. The area can be interpreted as the average of $1-FPF$, i.e., the average specificity, over all possible values of TPF.
By convention, the symbol $A_z$ is used for the area under the unequal-variance binormal model predicted ROC curve. The more expressive term area under curve or AUC is used to include this and other methods of estimating the area under the ROC curve.
In Eqn. \@ref(eq:binary-task-model-az-eq-var), the extra subscript $\sigma = 1$ is necessary to distinguish it from the corresponding quantity for the unequal-variance binormal model, to be derived later. It can be shown that:
\begin{equation}
A_{z;\sigma = 1} = \Phi\left ( \frac{\mu} {\sqrt{2}} \right )
(\#eq:binary-task-model-az-var)
\end{equation}
Since the ROC curve is bounded by the unit square, $A_z$ must be between zero and one. If $\mu$ is non-negative, $A_{z;\sigma = 1}$ must be between 0.5 and 1. The chance diagonal, corresponding to $\mu = 0$, yields $A_{z;\sigma = 1} = 0.5$, while the perfect ROC curve, corresponding to $\mu = \infty$ yields $A_{z;\sigma = 1} = 1$.
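Eqn. \@ref(eq:binary-task-model-az-var) can be checked by numerical integration of the ROC curve; the value $\mu = 2$ below is an arbitrary illustrative choice.
```{r}
mu <- 2 # illustrative value
# FPF = pnorm(-zeta) implies zeta = -qnorm(FPF),
# so TPF as a function of FPF is pnorm(mu + qnorm(FPF))
rocY <- function(FPF) pnorm(mu + qnorm(FPF))
c(numerical = integrate(rocY, 0, 1)$value,
  analytical = pnorm(mu/sqrt(2)))
```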
> Since it is a scalar quantity, $A_z$ can be used to quantify performance less ambiguously than is possible using sensitivity-specificity pairs.
### Properties of the equal-variance binormal model ROC curve {#binary-task-model-properties-roc}
1. The ROC curve is completely contained within the unit square. This follows from the fact that both axes of the plot are probabilities.
1. The operating point rises monotonically from (0,0) to (1,1).
1. Since $\mu$ is positive, the slope of the equal-variance binormal model curve at the origin (0,0) is infinite and the slope at (1,1) is zero, and the slope along the curve is always non-negative and decreases monotonically as the operating point moves up the curve.
1. $A_z$ is a monotone increasing function of $\mu$. It varies from 0.5 to 1 as $\mu$ varies from zero to infinity.
### Comments {#binary-task-model-comments}
Property 2: since both coordinates of the operating point can be expressed in terms of $\Phi$ functions, which are monotone in their arguments, and in each case the argument $\zeta$ appears with a negative sign, it follows that as $\zeta$ is lowered both TPF and FPF increase. The operating point corresponding to $\zeta - d\zeta$ is to the upper right of that corresponding to $\zeta$ (assuming $d\zeta > 0$).
Property 3: The slope of the ROC curve can be derived by differentiation ($\mu$ is constant):
\begin{equation}
\left.
\begin{aligned}
\frac{d(TPF)}{d(FPF)}&=\frac{d(\Phi(\mu-\zeta))}{d(\Phi(-\zeta))}\\
&=\frac{\phi(\mu-\zeta)}{\phi(-\zeta)}\\
&=\exp(\mu(\zeta-\mu/2)) \propto \exp(\mu \zeta)\\
\end{aligned}
\right \}
(\#eq:binary-task-model-slope1)
\end{equation}
The above derivation uses the fact that the differential of the CDF function yields the pdf function, i.e.,
$$d\Phi(\zeta)=P\left ( \zeta < Z < \zeta + d \zeta \right ) = \phi(\zeta)d\zeta$$
Since the slope of the ROC curve can be expressed as a power of $e$ it is always non-negative. Provided $\mu > 0$, in the limit $\zeta\rightarrow \infty$ the slope at the origin approaches $\infty$. Eqn. \@ref(eq:binary-task-model-slope1) also implies that in the limit $\zeta\rightarrow -\infty$ the slope of the ROC curve at the end-point (1,1) approaches zero, i.e., the slope is a monotone increasing function of $\zeta$. As $\zeta$ decreases from $+\infty$ to $-\infty$, the slope decreases monotonically from $+\infty$ to 0.
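The slope expression in Eqn. \@ref(eq:binary-task-model-slope1) can be verified by a finite-difference approximation; the values of $\mu$ and $\zeta$ below are arbitrary illustrative choices.
```{r}
mu <- 3;zeta <- 1;dZeta <- 1e-6 # illustrative values
dTPF <- pnorm(mu - (zeta + dZeta)) - pnorm(mu - zeta)
dFPF <- pnorm(-(zeta + dZeta)) - pnorm(-zeta)
c(finiteDifference = dTPF/dFPF,
  analytical = exp(mu * (zeta - mu/2)))
```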
Fig. \@ref(fig:binary-task-model-analytical-roc) is the ROC curve for the equal-variance binormal model for $\mu = 3$. The entire curve is defined by varying $\zeta$. Specifying a particular value of $\zeta$ corresponds to specifying a particular point on the ROC curve. In Fig. \@ref(fig:binary-task-model-analytical-roc) the open circle corresponds to the operating point (0.159, 0.977) defined by $\zeta = 1$: `pnorm(-1)` = `r pnorm(-1)`; `pnorm(3-1)` = `r pnorm(3-1)`. The operating point lies exactly on the curve as this is a predicted operating point.
```{r binary-task-model-analytical-roc, fig.cap= "ROC curve predicted by equal variance binormal model for $\\mu = 3$. The circled operating point corresponds to $\\zeta = 1$. The operating point falls exactly on the curve, as these are analytical curves. With finite numbers of cases this is not observed in practice.", fig.show='hold', echo=FALSE}
mu <- 3;zeta <- seq(-4,mu+3,0.05)
FPF <- pnorm(-zeta)
TPF <- pnorm(mu -zeta)
FPF <- c(1, FPF, 0);TPF <- c(1, TPF, 0)
curveData <- data.frame(FPF = FPF, TPF = TPF)
OpX <- pnorm(-1)
OpY <- pnorm(mu-1)
pointData <- data.frame(FPF = OpX, TPF = OpY)
rocPlot <- ggplot(
mapping = aes(x = FPF, y = TPF)) +
xlab("FPF")+ ylab("TPF" ) +
geom_line(data = curveData, linewidth = 1) +
geom_point(data = pointData, size = 2) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
print(rocPlot)
```
### Physical interpretation of the mu-parameter {#binary-task-model-mu-parameter-intepretation}
The $\mu$ parameter is equivalent [@macmillan2004detection] to a signal detection theory variable denoted $d'$ in the literature (pronounced “dee-prime”). It can be thought of as the *perceptual signal to noise ratio* (pSNR) of diseased cases relative to non-diseased ones. It is a measure of reader expertise and / or ease of detectability of the disease. SNR is a term widely used in engineering, specifically in signal detection theory [@green1966signal; @egan1975book]. It dates to the early 1940s when one had the problem [@marcum1947statistical; @marcum1960statistical] of detecting faint radar reflections from a plane against a background of noise. The radar radio "receiver" is the origin of the term "receiver" in Receiver Operating Characteristic.
The reader may be aware of the "rule-of-thumb" that if SNR exceeds three the target is likely to be detected. It will be shown later that the area under the ROC curve is the probability that a diseased case Z-sample is greater than that of a non-diseased one. The following code snippet shows that for $\mu = 3$, the probability of detection is 98.3 percent.
```{r, echo=FALSE}
cat("pnorm(3/sqrt(2)) = ", pnorm(3/sqrt(2)))
```
For electrical signals, SNR can be measured with instruments but, in the context of decisions made by humans, what is measured is the *perceptual* SNR. Physical characteristics that differentiate non-diseased from diseased cases, and how well they are displayed, will affect it; in addition, the eye-sight of the observer is an obvious factor; not so obvious are how information is processed by the cognitive system and the role of the observer’s expertise.
> To this day I find it remarkable that an objective SNR-like quantity can be teased out of subjective observer decisions.
## Confidence intervals for an operating point {#binary-task-model-confidence-intervals}
* A $(1-\alpha)$ confidence interval (CI) of a statistic is the range that is expected to contain the true value with probability $(1-\alpha)$.
* It should be clear that a 99 percent CI is wider than a 95 percent CI, and that a 90 percent CI is narrower; in general, the higher the confidence that the interval contains the true value, the wider the range of the CI.
* Calculation of a parametric confidence interval requires a distributional assumption (non-parametric estimation methods, which use resampling methods, are described later). With a distributional assumption the parameters of the distribution can be estimated and since the distribution accounts for variability, the needed confidence interval estimate follows.
* With TPF and FPF, each of which involves a ratio of two integers, it is convenient to assume a *binomial* distribution for the following reason:
+ The diagnosis "non-diseased" vs. "diseased" represents a Bernoulli trial, i.e., one whose outcome is binary.
+ A Bernoulli trial is like a coin-toss, a special coin whose probability of landing "diseased" face up is $p$ which is not necessarily 0.5 as with a real coin.
+ It is a theorem in statistics that the total number of Bernoulli outcomes of one type, e.g., $n(FP)$, is a binomial-distributed random variable, with success probability $\widehat{FPF}$ and trial size $K_1$.
\begin{equation}
n(FP) \sim B\left ( K_1, \widehat{FPF} \right )
(\#eq:binary-task-model-BinDistrFPF)
\end{equation}
$B(n,p)$ denotes the binomial distribution with success probability $p$ and trial size $n$:
\begin{equation}
\left.
\begin{aligned}
k \sim& B\left ( n, p \right )\\
k=& 0,1,2,...,n\\
\end{aligned}
\right \}
(\#eq:binary-task-model-bin-distr)
\end{equation}
Eqn. \@ref(eq:binary-task-model-bin-distr) states that $k$ is a random sample from the binomial distribution $B(n,p)$. For reference, the probability mass function $\text{pmf}$ of $B(n,p)$ is defined by (the subscript $Bin$ denotes a binomial distribution):
\begin{equation}
\text{pmf}_{Bin}\left ( k;n,p \right )=\binom{n}{k}p^k(1-p)^{n-k}
(\#eq:binary-task-model-bin-distr2)
\end{equation}
For a discrete distribution one has a probability *mass* function, in contrast to a continuous distribution, where one has a probability *density* function.
The binomial coefficient $\binom{n}{k}$ appearing in Eqn. \@ref(eq:binary-task-model-bin-distr2), to be read as "$n$ choose $k$", is defined by:
\begin{equation}
\binom{n}{k}=\frac{n!}{k!(n-k)!}
(\#eq:binary-task-model-bin-coeff)
\end{equation}
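In `R` the binomial coefficient is available as `choose()` and the binomial pmf as `dbinom()`; a quick consistency check of Eqn. \@ref(eq:binary-task-model-bin-distr2), for arbitrary illustrative values, follows.
```{r}
n <- 10;k <- 3;p <- 0.2 # illustrative values
c(viaFormula = choose(n, k) * p^k * (1 - p)^(n - k),
  viaDbinom  = dbinom(k, n, p))
```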
From the properties of the binomial distribution the variance of n(FP) is given by:
\begin{equation}
\sigma_{n(FP)}^2=K_1\widehat{\text{FPF}}\left ( 1 - \widehat{\text{FPF}} \right )
(\#eq:binary-task-model-var-n-FP)
\end{equation}
It follows that $\text{FPF}$ has mean $\widehat{\text{FPF}}$ and variance $\sigma_{\text{FPF}}^2$ given by (since $Var(aX) = a^2 Var(X)$ where $a$ is a constant):
\begin{equation}
\sigma_{\text{FPF}}^2 = \frac{\widehat{\text{FPF}}\left ( 1 - \widehat{\text{FPF}} \right )}{K_1}
(\#eq:binary-task-model-var-fpf)
\end{equation}
For large $K_1$ the distribution of $\text{FPF}$ approaches a normal distribution:
\begin{equation}
\text{FPF} \sim N\left ( \widehat{\text{FPF}}, \sigma_{\text{FPF}}^2 \right )
(\#eq:binary-task-model-distr-fpf)
\end{equation}
Eqn. \@ref(eq:binary-task-model-distr-fpf) allows us to write down the approximate symmetric confidence interval for $\widehat{\text{FPF}}$, i.e., $\pm z_{\alpha/2} \times \sigma_{\text{FPF}}$ around $\widehat{\text{FPF}}$.
\begin{equation}
CI_{1-\alpha}^{\text{FPF}}=\left ( \widehat{\text{FPF}} - z_{\alpha/2} \sigma_{\text{FPF}}, \widehat{\text{FPF}} + z_{\alpha/2} \sigma_{\text{FPF}} \right )
(\#eq:binary-task-model-ci-fpf)
\end{equation}
In Eqn. \@ref(eq:binary-task-model-ci-fpf) $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the unit normal distribution: it is defined such that the area to the *right* under the unit normal distribution pdf from $z_{\alpha/2}$ to $+\infty$ equals $\alpha/2$. For example $z_{0.025} = 1.96$, see Fig. \@ref(fig:binary-task-model-shaded-tails). In general $z_{\alpha/2} = -\Phi^{-1}(\alpha/2)$. For example `-qnorm(0.025)` = `r -qnorm(0.025)`.
These relations involving $z_{\alpha/2}$ follow:
\begin{equation}
\left.
\begin{aligned}
z_{\alpha/2} &=\Phi^{-1}\left ( 1-\alpha/2 \right )\\
&= - \Phi^{-1}\left (\alpha/2 \right )\\
\alpha/2&=\int_{z_{\alpha/2}}^{\infty}\phi(z)dz\\
&= 1-\Phi(z_{\alpha/2})\\
&= \Phi(-z_{\alpha/2})\\
\end{aligned}
\right \}
(\#eq:binary-task-model-def-z-alpha2)
\end{equation}
The normal approximation is adequate if both of the following conditions are met: $K_1\widehat{\text{FPF}} > 10$ and $K_1(1-\widehat{\text{FPF}}) > 10$. This means, roughly, that $\widehat{\text{FPF}}$ cannot be too close to zero or unity.
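For example, the following minimal check (using hypothetical values $K_1 = 60$ and $\widehat{\text{FPF}} = 0.1$) shows the first condition failing, so the normal approximation would be suspect for this combination:

```{r}
K1 <- 60; FPF <- 0.1            # hypothetical values
c(K1 * FPF, K1 * (1 - FPF))     # first value is 6, i.e., below 10: approximation suspect
```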
Similarly, an approximate symmetric $(1-\alpha)$ confidence interval for $\text{TPF}$ is:
\begin{equation}
CI_{1-\alpha}^{\text{TPF}}=\left ( \widehat{\text{TPF}} - z_{\alpha/2} \sigma_{\text{TPF}}, \widehat{\text{TPF}} + z_{\alpha/2} \sigma_{\text{TPF}} \right )
(\#eq:binary-task-model-ci-tpf)
\end{equation}
In Eqn. \@ref(eq:binary-task-model-ci-tpf),
\begin{equation}
\sigma_{\text{TPF}}^2 = \frac{\widehat{\text{TPF}}\left ( 1 - \widehat{\text{TPF}} \right )}{K_2}
(\#eq:binary-task-model-var-tpf)
\end{equation}
The confidence intervals are widest when the probabilities (FPF or TPF) are close to 0.5 and their widths decrease inversely as the square root of the relevant number of cases. The symmetric intervals based on the normal approximation can stray outside the allowed range (0 to 1). Exact confidence intervals, which are asymmetric about the central value and guaranteed to lie within the allowed range, can be calculated; they are implemented in `R` in the function `binom.test()` and used below:
```{r, echo=FALSE}
options(digits=3)
seed <- 100;set.seed(seed)
alpha <- 0.05;K1 <- 99;K2 <- 111;mu <- 5;zeta <- mu/2
cat("alpha = ", alpha,
"\nK1 = ", K1,
"\nK2 = ", K2,
"\nmu = ", mu,
"\nzeta = ", zeta, "\n")
z1 <- rnorm(K1)
z2 <- rnorm(K2) + mu
nTN <- length(z1[z1 < zeta])
nTP <- length(z2[z2 >= zeta])
Sp <- nTN/K1;Se <- nTP/K2
cat("Specificity = ", Sp,
"\nSensitivity = ", Se, "\n")
# Approximate (Wald) CI for specificity
cat("Approx 95 percent CI for Specificity = ",
-abs(qnorm(alpha/2))*sqrt(Sp*(1-Sp)/K1)+Sp,
+abs(qnorm(alpha/2))*sqrt(Sp*(1-Sp)/K1)+Sp,"\n")
# Exact (Clopper-Pearson) CI for specificity via binom.test()
ret <- binom.test(nTN, K1, p = nTN/K1)
cat("Exact 95 percent CI for Specificity = ",
as.numeric(ret$conf.int),"\n")
# Approximate (Wald) CI for sensitivity
cat("Approx 95 percent CI for Sensitivity = ",
-abs(qnorm(alpha/2))*sqrt(Se*(1-Se)/K2)+Se,
+abs(qnorm(alpha/2))*sqrt(Se*(1-Se)/K2)+Se,"\n")
# Exact (Clopper-Pearson) CI for sensitivity via binom.test()
ret <- binom.test(nTP, K2, p = nTP/K2)
cat("Exact 95 percent CI for Sensitivity = ",
as.numeric(ret$conf.int),"\n")
```
Note that the approximate confidence intervals can stray outside the allowed range but the exact confidence intervals do not.
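The boundary problem is easy to demonstrate. The following sketch uses hypothetical counts, 19 true positives in $K_2 = 20$ diseased cases, i.e., $\widehat{\text{TPF}} = 0.95$: the approximate interval extends above unity while the exact interval does not.

```{r}
alpha <- 0.05; K2 <- 20; nTP <- 19   # hypothetical counts
Se <- nTP / K2
Se + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(Se * (1 - Se) / K2)  # upper limit exceeds 1
as.numeric(binom.test(nTP, K2)$conf.int)                       # exact interval stays inside (0, 1)
```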
## Variability: the Beam study {#binary-task-model-beam-study}
In this study [@beam1996variability] fifty accredited mammography centers were randomly sampled in the United States. “Accredited” is a legal/regulatory term implying, among other things, that the radiologists interpreting the breast cases were “board certified” by the American Board of Radiology. One hundred eight (108) certified radiologists from these centers gave blinded interpretations of a common set of 79 randomly sampled (stratified sampling) enriched screening cases, 45 of which contained cancers and the rest benign lesions. Ground truth for these women had been established either by biopsy or by 2-year follow-up.
The observed range of sensitivity (TPF) was 53 percent and the range of specificity was 63 percent; the corresponding range for AUC was 21 percent, Table \@ref(tab:binary-task-model-table-beam-study). Empirical AUC was estimated using 5-point BIRADS ratings of the images (the zero category was not allowed). Explanation of empirical AUC is deferred to Chapter \@ref(empirical-auc).
```{r, echo=FALSE}
results <- array(dim = c(3,3))
results[1,] <- c(46.7, 100, 53.3)
results[2,] <- c(36.3, 99.3, 63.0)
results[3,] <- c(0.74, 0.95, 0.21)
df <- as.data.frame(results)
rownames(df) <- c("Sensitivity","Specificity","AUC")
colnames(df) <- c("Min","Max","Range")
```
```{r binary-task-model-table-beam-study, echo=FALSE}
kable(df, caption = "The variability of 108 radiologists on a common dataset of screening mammograms. Note the reduced variability when one uses AUC which accounts for variations in reporting thresholds (AUC variability range is 21 percent compared to 53 percent for sensitivity and 63 percent for specificity).", escape = FALSE)
```
```{r beam-study-fig, echo=FALSE, out.width ="80%", fig.cap="Schematic patterned from the Beam et al study showing the ROC operating points of 108 mammographers. Wide variability in sensitivity and specificity are evident while AUC is less variable. See text below.", fig.show='hold'}
knitr::include_graphics("images/BeamStudy.png")
```
In Fig. \@ref(fig:beam-study-fig), if one looks at the points labeled (B) and (C) one can mentally construct a smooth ROC curve that starts at (0,0), passes roughly through these points and ends at (1,1). In this sense, the intrinsic performances (i.e., the AUCs or equivalently the $\mu$ parameters) of radiologists B and C are similar. The only difference between them is that radiologist B is using a lower threshold than radiologist C. Radiologist C is more concerned with minimizing FPs while radiologist B is more concerned with maximizing sensitivity. With appropriate feedback radiologist C can perhaps be induced to change the threshold to that of radiologist B. An example of feedback might be: “you are missing too many cancers and this could get us all into trouble; worry less about reduced specificity and more about increasing your sensitivity”.
In contrast, radiologist A has intrinsically superior performance to B or C. No change in threshold is going to get the other two to a similar level of performance as radiologist A. Extensive training will be needed to bring the under-performing radiologists to the expert level of radiologist A.
Fig. \@ref(fig:beam-study-fig) and Table \@ref(tab:binary-task-model-table-beam-study) illustrate several important principles.
1. Since an operating point is characterized by two values, unless both values are higher for one reader (e.g., radiologist A vs. B or C) it is difficult to compare two readers unambiguously.
2. While sensitivity and specificity depend on the reporting threshold, the area under the ROC plot is independent of it. Using the area under the ROC curve one can unambiguously compare two readers.
3. Combining sensitivity and the complement of specificity into a single AUC measure yields the additional benefit of lower variability. In Fig. \@ref(fig:beam-study-fig), the range for sensitivity is 53 percent while that for specificity is 63 percent. In contrast, the range for AUC is only 21 percent. This means that much of the observed variation in sensitivity and specificity is due to variation in reporting thresholds, and using AUC eliminates this source of variability. Decreased variability of a measure is highly desirable as it implies the measurement is more precise, making it easier to detect differences between readers.
## Discussion{#binary-task-model-discussion}
Sensitivity and specificity are widely used in the medical imaging literature. It is important to realize that they do not provide a complete picture of diagnostic performance, since they represent performance at a particular observer-dependent threshold. As demonstrated in Fig. \@ref(fig:beam-study-fig), expert observers can and do operate at different thresholds. When sensitivity and specificity are used, the dependence on the reporting threshold often makes it difficult to compare observers unambiguously, and the varying thresholds introduce an additional source of variability.
The ROC curve and AUC are completely defined by the $\mu$ parameter of the equal variance binormal model. Since both are independent of reporting threshold they overcome the ambiguity inherent in comparing sensitivity/specificity pairs. AUC is widely used in assessing imaging systems.
> It should impress the reader that a subjective internal sensory perception of disease presence and an equally subjective internal threshold can be translated into an objective performance measure, such as the area under an ROC curve or equivalently, the $\mu$ parameter. The latter has the physical meaning of a perceptual signal to noise ratio.
The properties of the unit normal distribution and the binomial distribution were used to derive parametric confidence intervals for sensitivity and specificity. These were compared to exact confidence intervals. An important study was reviewed showing wide variability in sensitivity and specificity for radiologists interpreting a common set of cases in screening mammography, but smaller variability in AUCs. This is because much of the variability in sensitivity and specificity is due to variation of the reporting threshold, which does not affect the area under the ROC curve. This is an important reason for preferring comparisons based on area under the ROC curve to those based on comparing sensitivity/specificity pairs.
## Appendix I {#binary-task-model-demo}
### Estimates from a finite sample
The following embedded code simulates 9 non-diseased and 11 diseased cases. The $\mu$ parameter is 1.5 and $\zeta$ is $\mu/2$. Shown are the estimates of sensitivity and specificity and $\mu$.
```{r echo=FALSE}
# simulate a binary-task dataset: K1 non-diseased and K2 diseased cases,
# separation parameter mu and reporting threshold zeta; returns estimates
# of specificity, sensitivity and mu
simulateDataset <- function(K1, K2, mu, zeta, seed) {
  set.seed(seed)
  z1 <- rnorm(K1)                 # ratings of non-diseased cases
  z2 <- rnorm(K2) + mu            # ratings of diseased cases
  nTN <- length(z1[z1 < zeta])    # number of true negatives
  nTP <- length(z2[z2 >= zeta])   # number of true positives
  Sp <- nTN/K1                    # estimated specificity
  Se <- nTP/K2                    # estimated sensitivity
  muEst <- qnorm(Sp) + qnorm(Se)  # estimated separation parameter
  return(
    list(Sp = Sp,
         Se = Se,
         mu = muEst))
}
```
```{r, echo=FALSE}
mu <- 1.5
zeta <- mu/2
seed <- 100
K1 <- 9
K2 <- 11
ds <- simulateDataset(K1, K2, mu, zeta, seed)
cat("seed = ", seed,
"\nmu true = ", mu,
"\nzeta true = ", zeta,
"\nK1 = ", K1,
"\nK2 = ", K2,
"\nSpecificity = ", ds$Sp,
"\nSensitivity = ", ds$Se,
"\nEst. of mu = ", ds$mu, "\n")
```
Since this is a finite sample the estimate of $\mu$ is not equal to the true value. In fact, all of the estimates, sensitivity, specificity and $\mu$ are subject to sampling variability.
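To visualize this sampling variability one can rerun the simulation many times. The following sketch is an illustration, using hypothetical seeds 1 through 1000 and the same case numbers and parameters as above; it summarizes the spread of the resulting sensitivity estimates.

```{r}
# rerun the simulation for seeds 1 to 1000 and collect the sensitivity estimates
Se_arr <- sapply(1:1000, function(s) simulateDataset(9, 11, 1.5, 0.75, s)$Se)
summary(Se_arr)   # wide spread, reflecting the small number of diseased cases
sd(Se_arr)        # standard deviation of the sensitivity estimates
```

The wide spread reflects the small number of diseased cases; it narrows as the case numbers are increased, as shown in the following sections.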
### Changing the seed variable
No matter how many times one runs the above code, one sees the same output. This is because the `seed` of the random number generator is set to a fixed value, namely 100. This is like having a perfectly reproducible reader repeatedly interpreting the same cases: one always gets the same results. Changing the `seed` to 101 yields:
```{r, echo=FALSE}
seed <- 101 # change
ds <- simulateDataset(K1, K2, mu, zeta, seed)
cat("seed = ", seed,
"\nmu true = ", mu,
"\nzeta true = ", zeta,
"\nK1 = ", K1,
"\nK2 = ", K2,
"\nSpecificity = ", ds$Sp,
"\nSensitivity = ", ds$Se,
"\nEst. of mu = ", ds$mu, "\n")
```
Changing `seed` is equivalent to sampling a completely new set of cases. The effect is quite large (estimated sensitivity falls from 0.909 to 0.545 and estimated $\mu$ falls from 2.56 to 0.879) because the size of the relevant case set, $K_2=11$ for sensitivity, is small.
### Increasing the numbers of cases
Here we increase $K_1$ and $K_2$, by a factor of 10 each, and reset the `seed` to 100.
```{r, echo=FALSE}
K1 <- 90 # change
K2 <- 110 # change
seed <- 100 # change
ds <- simulateDataset(K1, K2, mu, zeta, seed)
cat("seed = ", seed,
"\nmu true = ", mu,
"\nzeta true = ", zeta,
"\nK1 = ", K1,
"\nK2 = ", K2,
"\nSpecificity = ", ds$Sp,
"\nSensitivity = ", ds$Se,
"\nEst. of mu = ", ds$mu, "\n")
```
Next we change `seed` to 101.
```{r, echo=FALSE}
seed <- 101 # change
ds <- simulateDataset(K1, K2, mu, zeta, seed)
cat("seed = ", seed,
"\nmu true = ", mu,
"\nzeta true = ", zeta,
"\nK1 = ", K1,
"\nK2 = ", K2,
"\nSpecificity = ", ds$Sp,
"\nSensitivity = ", ds$Se,
"\nEst. of mu = ", ds$mu, "\n")
```
Notice that the values are now less sensitive to the choice of seed. Table \@ref(tab:binary-task-model-table) illustrates this trend with increasing sample size.
```{r, echo=FALSE}
results <- array(dim = c(9,6))
mu <- 1.5
zeta <- mu/2
results[9,] <- c(Inf, Inf, NA, pnorm(zeta), pnorm(mu-zeta), mu)
K1_arr <- c(9, 9, 90, 90, 900, 900, 9000, 9000, NA)
K2_arr <- c(11, 11, 110, 110, 1100, 1100, 11000, 11000, NA)
seed_arr <- c(100,101,100,101,100,101,100,101,NA)
for (i in 1:8) {
ds <- simulateDataset(K1_arr[i], K2_arr[i], mu, zeta, seed_arr[i])
results[i,] <- c(K1_arr[i], K2_arr[i], seed_arr[i], ds$Sp, ds$Se, ds$mu)
}
df <- as.data.frame(results)
colnames(df) <- c("K1","K2","seed","Sp","Se","mu")
```
```{r binary-task-model-table, echo=FALSE}
kable(df, caption = "Effect of sample size and seed on estimates of sensitivity, specificity and the mu-parameter.", escape = FALSE)
```
As the numbers of cases increase, the sensitivity and specificity converge to a common value, around 0.773, and the estimate of the separation parameter converges to the known value.
```{r}
pnorm(0.75) # example 1
2*qnorm(pnorm(zeta)) # example 2
```
Because the threshold is halfway between the two distributions, as in this example, sensitivity and specificity are identical. In words, with two unit variance distributions separated by 1.5, the area under the diseased distribution (centered at 1.5) above 0.75, namely sensitivity, equals the area under the non-diseased distribution (centered at zero) below 0.75, namely specificity, and the common value is $\Phi(0.75)= 0.773$, yielding the last row of Table \@ref(tab:binary-task-model-table), and example 1 in the above code snippet. Example 2 in the above code snippet illustrates Eqn. \@ref(eq:binary-task-model-solve-mu). The factor of two arises since in this example sensitivity and specificity are identical.
From Table \@ref(tab:binary-task-model-table), for the same numbers of cases but different seeds, comparing pairs of sensitivity and specificity values is more difficult, as two pairs of numbers (i.e., four numbers) are involved. Comparing a single pair of $\mu$ values is easier, as only two numbers are involved. The tendency of the estimates to become independent of the case sample is discernible with fewer cases for $\mu$ (around 90/110 cases) than for the sensitivity and specificity pairs.
The numbers in the table might appear disheartening in terms of the implied numbers of cases needed to detect a difference in specificity. Even with 200 cases (90 non-diseased plus 110 diseased), the difference in specificity for two seed values is 0.081, which is large considering that the scale extends from 0 to 1.0. A similar comment applies to differences in sensitivity. The situation is not quite that bad:
> One uses an area measure that combines sensitivity and specificity and hence yields less variability. One uses the ratings paradigm, which is more efficient than the binary paradigm used in this chapter. Finally, one takes advantage of the correlations that exist between matched-case, matched-reader interpretations in two modalities; this tends to decrease the variability of the AUC difference even further (most applications of ROC methods involve detecting differences in AUCs, not absolute values).
## Chapter References {#binary-task-model-references}