Let's look at the various metrics you can calculate using Valor.

If we're missing an important metric for your particular use case, please [write us a GitHub Issue ticket](https://github.com/Striveworks/valor/issues). We love hearing your suggestions.


## Classification Metrics

| Name | Description | Equation |
|:- | :- | :- |
| Precision | The number of true positives divided by the total number of positive predictions (i.e., the number of true positives plus the number of false positives). | $$\dfrac{\|TP\|}{\|TP\|+\|FP\|}$$ |
| Recall | The number of true positives divided by the total count of the class of interest (i.e., the number of true positives plus the number of false negatives). | $$\dfrac{\|TP\|}{\|TP\|+\|FN\|}$$ |
| F1 | The harmonic mean of precision and recall. | $$\frac{2 * Precision * Recall}{Precision + Recall}$$ |
| Accuracy | The number of true predictions divided by the total number of predictions. | $$\dfrac{\|TP\|+\|TN\|}{\|TP\|+\|TN\|+\|FP\|+\|FN\|}$$ |
| ROC AUC | The area under the Receiver Operating Characteristic (ROC) curve for the predictions generated by a given model. | See [ROCAUC methods](#binary-roc-auc). |
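
As a concrete illustration of the formulas above, the sketch below computes the four counting metrics from confusion-matrix totals. It is a minimal example for clarity, not Valor's implementation; the function name and inputs are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Computes precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Example: 8 TP, 2 FP, 85 TN, 5 FN -> precision 0.8, recall ~0.62, accuracy 0.93.
print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
```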

## Object Detection and Instance Segmentation Metrics[^1]

| Name | Description | Equation |
| :- | :- | :- |
| Average Precision (AP) | The weighted mean of precisions achieved at several different recall thresholds for a single Intersection over Union (IOU), grouped by class. | See [AP methods](#average-precision-ap). |
| AP Averaged Over IOUs | The average of several AP metrics, calculated at various IOUs, grouped by class. | $$\dfrac{1}{\text{number of thresholds}} \sum\limits_{iou \in thresholds} AP_{iou}$$ |
| Mean Average Precision (mAP) | The mean of several AP scores, calculated over various classes. | $$\dfrac{1}{\text{number of classes}} \sum\limits_{c \in classes} AP_{c}$$ |
| mAP Averaged Over IOUs | The mean of several averaged AP scores, calculated over various classes. | $$\dfrac{1}{\text{number of thresholds}} \sum\limits_{iou \in thresholds} mAP_{iou}$$ |

[^1]: When calculating IOUs for object detection metrics, Valor handles the necessary conversion between different types of geometric annotations. For example, if your model prediction is a polygon and your groundtruth is a raster, then the raster will be converted to a polygon prior to calculating the IOU.
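
To make the averaging in the table explicit, here is a hedged sketch of how per-class AP values at several IOU thresholds roll up into the aggregate metrics. The `ap_scores` layout, the class labels, and the values are assumptions for illustration, not Valor's data model.

```python
# Hypothetical per-class AP values keyed by IOU threshold: ap_scores[iou][label].
ap_scores = {
    0.5: {"cat": 0.91, "dog": 0.83},
    0.75: {"cat": 0.78, "dog": 0.64},
}

# AP averaged over IOUs, per class.
labels = sorted({label for scores in ap_scores.values() for label in scores})
ap_over_ious = {
    label: sum(scores[label] for scores in ap_scores.values()) / len(ap_scores)
    for label in labels
}

# mAP at each IOU threshold (mean over classes).
map_per_iou = {
    iou: sum(scores.values()) / len(scores) for iou, scores in ap_scores.items()
}

# mAP averaged over IOUs (mean over thresholds).
map_over_ious = sum(map_per_iou.values()) / len(map_per_iou)
```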

## Semantic Segmentation Metrics

| Name | Description | Equation |
| :- | :- | :- |
| Intersection Over Union (IOU) | A ratio between the groundtruth and predicted regions of an image, measured as a percentage, grouped by class. |$$\dfrac{area( prediction \cap groundtruth )}{area( prediction \cup groundtruth )}$$ |
| Mean IOU | The average of IOUs, calculated over several different classes. | $$\dfrac{1}{\text{number of classes}} \sum\limits_{c \in classes} IOU_{c}$$ |
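
For segmentation masks, the IOU above reduces to counting pixels. The sketch below assumes boolean NumPy arrays of equal shape and is meant only as an illustration of the formula, not as Valor's implementation; the per-class values in the mean IOU example are hypothetical.

```python
import numpy as np


def mask_iou(prediction: np.ndarray, groundtruth: np.ndarray) -> float:
    """Computes IOU between two boolean masks of the same shape."""
    intersection = np.logical_and(prediction, groundtruth).sum()
    union = np.logical_or(prediction, groundtruth).sum()
    return float(intersection / union) if union else 0.0


# Mean IOU is then the average of the per-class IOUs.
per_class_ious = {"road": 0.92, "sidewalk": 0.74}
mean_iou = sum(per_class_ious.values()) / len(per_class_ious)
```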

# Appendix: Metric Calculations

## Binary ROC AUC

### Receiver Operating Characteristic (ROC)

An ROC curve plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different confidence thresholds.

In Valor, we use the confidence scores sorted in decreasing order as our thresholds. Using these thresholds, we can calculate our TPR and FPR as follows:

#### Determining the Rate of Correct Predictions

| Element | Description |
| ------- | ------------ |
| True Positive (TP) | Prediction confidence score >= threshold and is correct. |
| False Positive (FP) | Prediction confidence score >= threshold and is incorrect. |
| True Negative (TN) | Prediction confidence score < threshold and is correct. |
| False Negative (FN) | Prediction confidence score < threshold and is incorrect. |

- $\text{True Positive Rate (TPR)} = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{|TP(threshold)|}{|TP(threshold)| + |FN(threshold)|}$

- $\text{False Positive Rate (FPR)} = \dfrac{|FP|}{|FP| + |TN|} = \dfrac{|FP(threshold)|}{|FP(threshold)| + |TN(threshold)|}$

We now use the confidence scores, sorted in decreasing order, as our thresholds in order to generate points on a curve.

$$
Point(score) = (FPR(score), \ TPR(score))
$$
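
For a binary label, one way to turn these definitions into curve points is sketched below. It assumes each score is the model's confidence in the positive class (so a score at or above the threshold is a positive prediction) and that both classes are present; it is an illustration, not Valor's implementation.

```python
def roc_points(scores: list[float], labels: list[bool]) -> list[tuple[float, float]]:
    """Builds (FPR, TPR) points, using each confidence score as a threshold."""
    points = []
    for threshold in sorted(scores, reverse=True):
        # Apply the TP/FP/TN/FN definitions from the table above.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points
```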

### Area Under the ROC Curve (ROC AUC)

After calculating the ROC curve, we find the ROC AUC metric by approximating the integral using the trapezoidal rule formula.

$$
\text{ROC AUC} = \sum_{i=1}^{|scores|} \frac{ \big( FPR(score_i) - FPR(score_{i-1}) \big) \cdot \big( TPR(score_i) + TPR(score_{i-1}) \big) }{2}
$$

where the curve is anchored at $Point(score_0) = (0, \ 0)$.

See [Classification: ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) for more information.
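
Given curve points like those produced by the sketch above, the trapezoidal areas between consecutive points can be accumulated directly; the origin is prepended so the first trapezoid is anchored at $(0, 0)$. This is illustrative rather than Valor's implementation, and the example points are hypothetical.

```python
def trapezoidal_auc(points: list[tuple[float, float]]) -> float:
    """Sums trapezoid areas between consecutive (FPR, TPR) points."""
    points = [(0.0, 0.0)] + sorted(points)
    return sum(
        (x1 - x0) * (y1 + y0) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical (FPR, TPR) points; the trapezoidal sum here evaluates to 5/6 ~ 0.833.
points = [(0.0, 1 / 3), (0.0, 2 / 3), (0.5, 2 / 3), (0.5, 1.0), (1.0, 1.0)]
print(trapezoidal_auc(points))
```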

## Average Precision (AP)

For object detection and instance segmentation tasks, average precision is calculated from the intersection-over-union (IOU) of geometric predictions and ground truths.

### Multiclass Precision and Recall

Tasks that predict geometries (such as object detection or instance segmentation) use intersection-over-union (IOU) to calculate precision and recall. IOU is the ratio of the intersecting area to the joint area spanned by the two geometries, as defined in the following equation.

$$Intersection \ over \ Union \ (IOU) = \dfrac{Area( prediction \cap groundtruth )}{Area( prediction \cup groundtruth )}$$
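
As a concrete example, the sketch below computes IOU for the simplest case of axis-aligned bounding boxes given as `(xmin, ymin, xmax, ymax)` tuples. It is an illustrative stand-in for the `calculate_iou` helper referenced in the snippets further down, not Valor's implementation, which also handles conversions between annotation types such as polygons and rasters.

```python
def calculate_iou(box_a: tuple, box_b: tuple) -> float:
    """Computes IOU for two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    ixmin, iymin = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ixmax, iymax = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ixmax - ixmin) * max(0.0, iymax - iymin)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


print(calculate_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ~ 0.143
```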

Using different IOU thresholds, we can determine whether to count a pairing between a prediction and a ground truth based on their overlap.

| Case | Description |
| :- | :- |
| True Positive (TP) | Prediction-GroundTruth pair exists with IOU >= threshold. |
| False Positive (FP) | Prediction-GroundTruth pair exists with IOU < threshold. |
| True Negative (TN) | Unused in multi-class evaluation. |
| False Negative (FN) | No Prediction with a matching label exists for the GroundTruth. |

- $Precision = \dfrac{|TP|}{|TP| + |FP|} = \dfrac{\text{Number of True Predictions}}{|\text{Predictions}|}$

- $Recall = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{\text{Number of True Predictions}}{|\text{Groundtruths}|}$
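
For example, if an image has 10 ground truths and a model produces 8 predictions, 6 of which pair with a ground truth at an IOU at or above the threshold, then precision is $6 / 8 = 0.75$ and recall is $6 / 10 = 0.6$.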

### Matching Ground Truths with Predictions

To properly evaluate a detection, we must first find the best pairings of predictions to ground truths. We start by iterating over our predictions, ordering them by highest scores first. We pair each prediction with the ground truth that has the highest calculated IOU. Both the prediction and ground truth are now considered paired and removed from the pool of choices.

```python
def rank_ious(
    groundtruths: list,
    predictions: list,
) -> list[float]:
    """Ranks IOUs by unique pairings."""

    retval = []
    groundtruths = set(groundtruths)
    # Iterate over predictions from highest to lowest confidence score.
    for prediction in sorted(predictions, key=lambda x: -x.score):
        if not groundtruths:
            break
        # Pair this prediction with the unmatched ground truth that maximizes IOU.
        groundtruth = max(groundtruths, key=lambda x: calculate_iou(x, prediction))
        groundtruths.remove(groundtruth)
        retval.append(calculate_iou(groundtruth, prediction))
    return retval
```

### Precision-Recall Curve

We can now compute the precision-recall curve using our previously ranked IOUs. We do this by iterating through the ranked IOUs and, at each step, recording a point from the cumulative recall and precision.

```python
def create_precision_recall_curve(
    number_of_groundtruths: int,
    ranked_ious: list[float],
    threshold: float,
) -> list[tuple[float, float]]:
    """Creates the precision-recall curve from a list of IOUs and a threshold."""

    retval = []
    count_tp = 0
    for i, iou in enumerate(ranked_ious):
        # A ranked pairing counts as a true positive when its IOU clears the threshold.
        if iou >= threshold:
            count_tp += 1
        precision = count_tp / (i + 1)
        recall = count_tp / number_of_groundtruths
        retval.append((recall, precision))
    return retval
```

### Calculating Average Precision

Average precision is defined as the area under the precision-recall curve.

We will use a 101-point interpolation of the curve to be consistent with the COCO evaluator. The intent behind interpolation is to reduce the fuzziness that results from ranking pairs.

$$
AP = \frac{1}{101} \sum\limits_{r\in\{ 0, 0.01, \ldots , 1 \}}\rho_{interp}(r)
$$

$$
\rho_{interp}(r) = \underset{\tilde{r}:\tilde{r} \ge r}{max} \ \rho (\tilde{r})
$$
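
The sketch below is a minimal version of this 101-point interpolation. It consumes a list of `(recall, precision)` points such as the curve built earlier, and it is meant only to illustrate the equations above, not to reproduce Valor's (or COCO's) exact implementation; the example curve is hypothetical.

```python
def interpolated_average_precision(
    precision_recall_curve: list[tuple[float, float]],
) -> float:
    """Averages interpolated precision over 101 evenly spaced recall values."""
    total = 0.0
    for i in range(101):
        recall_threshold = i / 100
        # Interpolated precision: the highest precision at any recall >= the threshold.
        candidates = [
            precision
            for recall, precision in precision_recall_curve
            if recall >= recall_threshold
        ]
        total += max(candidates) if candidates else 0.0
    return total / 101


# Hypothetical curve for 4 ground truths and 5 ranked predictions.
curve = [(0.25, 1.0), (0.5, 1.0), (0.5, 0.667), (0.75, 0.75), (0.75, 0.6)]
print(interpolated_average_precision(curve))  # ~0.69
```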

### References
- [MS COCO Detection Evaluation](https://cocodataset.org/#detection-eval)
- [The PASCAL Visual Object Classes (VOC) Challenge](https://link.springer.com/article/10.1007/s11263-009-0275-4)
- [Mean Average Precision (mAP) Using the COCO Evaluator](https://pyimagesearch.com/2022/05/02/mean-average-precision-map-using-the-coco-evaluator/)
