working on docs
czaloom committed Dec 12, 2024
1 parent abbf05a commit 1cbddaf
Showing 9 changed files with 87 additions and 98 deletions.
11 changes: 5 additions & 6 deletions Makefile
@@ -6,20 +6,19 @@ install:

install-dev:
pip install -e src/[all]
pre-commit install

pre-commit:
@echo "Running pre-commit..."
pre-commit install
pre-commit run --all

tests:
@echo "Running unit tests..."
poetry run pytest ./lite/tests/text_generation -v
pytest ./lite/tests/text_generation -v

external-tests:
integration-tests:
@echo "Running external integration tests..."
poetry run pytest ./lite/tests/text_generation -v
poetry run pytest ./integration_tests/external -v
pytest ./lite/tests/text_generation -v

clean:
@echo "Cleaning up temporary files..."
@@ -31,6 +30,6 @@ help:
@echo " install-dev Install valor_lite along with development tools."
@echo " pre-commit Run pre-commit."
@echo " tests Run unit tests."
@echo " external-tests Run external integration tests."
@echo " integration-tests Run external integration tests."
@echo " clean Remove temporary files."
@echo " help Show this help message."
16 changes: 8 additions & 8 deletions docs/classification/metrics.md
@@ -10,9 +10,9 @@
| Counts | A dictionary containing counts of true positives, false positives, true negatives, and false negatives for each label. | See [Counts](#counts). |
| Confusion Matrix | | See [Confusion Matrix](#confusion-matrix). |

# Appendix: Metric Calculations
## Appendix: Metric Calculations

## Counts
### Counts
Precision-recall curves offer insight into which confidence threshold you should pick for your production pipeline. The `PrecisionRecallCurve` metric includes the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. When using the Valor Python client, the output will be formatted as follows:

```python
@@ -45,15 +45,15 @@ print(pr_evaluation)
}]
```
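
To make the relationship between these counts and the reported precision, recall, and F1 values concrete, here is a minimal sketch of the arithmetic for a single label and confidence threshold (the function and field names are illustrative, not part of the client's output format):

```python
# Illustrative only: derive precision, recall, and F1 from raw counts for one
# (label, confidence threshold) combination. True negatives are not needed here.
def summarize_counts(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Example: 10 true positives, 2 false positives, 1 false negative.
print(summarize_counts(tp=10, fp=2, fn=1))
# {'precision': 0.833..., 'recall': 0.909..., 'f1_score': 0.869...}
```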

## Binary ROC AUC
### Binary ROC AUC

### Receiver Operating Characteristic (ROC)
#### Receiver Operating Characteristic (ROC)

An ROC curve plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different confidence thresholds.

In Valor, we use the confidence scores sorted in decreasing order as our thresholds. Using these thresholds, we can calculate our TPR and FPR as follows:

#### Determining the Rate of Correct Predictions
##### Determining the Rate of Correct Predictions

| Element | Description |
| ------- | ------------ |
@@ -70,19 +70,19 @@ We now use the confidence scores, sorted in decreasing order, as our thresholds

$Point(score) = (FPR(score), \ TPR(score))$

### Area Under the ROC Curve (ROC AUC)
#### Area Under the ROC Curve (ROC AUC)

After calculating the ROC curve, we find the ROC AUC metric by approximating the integral using the trapezoidal rule formula.

$\text{ROC AUC} = \sum_{i=1}^{|scores|} \frac{ \lVert Point(score_{i-1}) - Point(score_i) \rVert }{2}$

See [Classification: ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) for more information.
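
As an illustration of the calculation above (not Valor's internal implementation), the following sketch builds the ROC points at each descending score threshold and then applies the trapezoidal rule to the resulting curve:

```python
# Illustrative sketch: ROC points from scores/labels plus a trapezoidal AUC.
# Assumes at least one positive and one negative example; ties in score are
# not grouped, which is fine for an illustration.
def roc_auc(scores: list[float], labels: list[bool]) -> float:
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_positives = sum(labels)
    num_negatives = len(labels) - num_positives

    points = [(0.0, 0.0)]  # (FPR, TPR), starting at the origin
    tp = fp = 0
    for _score, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_negatives, tp / num_positives))

    # Trapezoidal rule over consecutive (FPR, TPR) points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


print(roc_auc([0.9, 0.8, 0.4, 0.3], [True, True, False, True]))  # 0.666...
```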

## Confusion Matrix
### Confusion Matrix

Valor also includes a more detailed version of `PrecisionRecallCurve` which can be useful for debugging your model's false positives and false negatives. When calculating `DetailedPrecisionCurve`, Valor will classify false positives as either `hallucinations` or `misclassifications` and your false negatives as either `missed_detections` or `misclassifications` using the following logic:

### Classification Tasks
#### Classification Tasks
- A **false positive** occurs when there is a qualified prediction (with `score >= score_threshold`) with the same `Label.key` as the ground truth on the datum, but the `Label.value` is incorrect.
- **Example**: if there's a photo with one ground truth label on it (e.g., `Label(key='animal', value='dog')`), and we predicted another label value (e.g., `Label(key='animal', value='cat')`) on that datum, we'd say it's a `misclassification` since the key was correct but the value was not.
- Similarly, a **false negative** occurs when there is a prediction with the same `Label.key` as the ground truth on the datum, but the `Label.value` is incorrect.
23 changes: 17 additions & 6 deletions docs/contributing.md
@@ -1,8 +1,10 @@
# Contributing to Valor
# Contributing & Development

## Contributing to Valor

We welcome all contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas aimed at improving Valor. This doc describes the high-level process for how to contribute to this repository. If you have any questions or comments about this process, please feel free to reach out to us on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email).

## On GitHub
### On GitHub

We use [Git](https://git-scm.com/doc) on [GitHub](https://github.com) to manage this repo, which means you will need to sign up for a free GitHub account to submit issues, ideas, and pull requests. We use Git for version control to allow contributors from all over the world to work together on this project.

@@ -12,7 +14,7 @@ If you are new to Git, these official resources can help bring you up to speed:
- [GitHub documentation for collaborating with pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests)
- [GitHub documentation for working with forks](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks)

## Contribution Workflow
### Contribution Workflow

Generally, the high-level workflow for contributing to this repo includes:

@@ -30,7 +32,7 @@ Generally, the high-level workflow for contributing to this repo includes:
For questions or comments on this process, please reach out to us at any time on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email).


## Development Tips and Tricks
## Development

### Setting Up Your Environment

@@ -42,7 +44,7 @@ python3 -m venv .env-valor
source .env-valor/bin/activate

# conda
conda create --name valor python=3.11
conda create --name valor python=3.10
conda activate valor
```

@@ -55,8 +57,17 @@ make install-dev

All of our tests are run automatically via GitHub Actions on every push, so it's important to double-check that your code passes all local tests before committing your code.

For linting and code formatting use:
```shell
make pre-commit
```

For unit and functional testing:
```shell
make tests
make external-tests
```

For integration testing:
```shell
make integration-tests
```
61 changes: 15 additions & 46 deletions docs/index.md
@@ -1,55 +1,24 @@
# Introduction

Valor is a centralized evaluation store that makes it easy to measure, explore, and rank model performance. Valor empowers data scientists and engineers to evaluate the performance of their machine learning pipelines and use those evaluations to make better modeling decisions in the future. To skip this textual introduction and dive right in, first go [here](installation.md) for instructions to set up the Valor service, and then check out the [sample notebooks](https://github.com/Striveworks/valor/blob/main/examples/).
Valor is a collection of evaluation methods that make it easy to measure, explore, and rank machine learning model performance. Valor empowers data scientists and engineers to evaluate the performance of their machine learning pipelines and use those evaluations to make better modeling decisions in the future. To skip this textual introduction and dive right in, first go [here](#installation) for basic installation instructions, and then check out the [example notebooks](https://github.com/Striveworks/valor/blob/main/examples/).

Valor is maintained by Striveworks, a cutting-edge machine learning operations (MLOps) company based out of Austin, Texas. We'd love to learn more about your interest in Valor and answer any questions you may have; please don't hesitate to reach out to us on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email) or [GitHub](https://github.com/striveworks/valor).

These docs are organized as follows:
## Installation

- **[Overview](index.md)** (this page): Provides an overview of what Valor is, why it's important, and how it works.
- **[Example Notebooks](https://github.com/Striveworks/valor/blob/main/examples/)**: Collection of descriptive Jupyter notebooks giving examples of how to evaluate model performance using Valor.
- **[Contributing and Development](contributing.md)**: Explains how you can build on and contribute to Valor.

# Overview

# Use Cases for a Containerized Evaluation Store

As we've worked with dozens of data scientists and engineers on their MLOps pipelines, we have identified three important questions that an effective evaluation store could help them answer. First, they wanted to understand: **"Of the various models I tested for a given dataset, which one performs best?"**. This is a very common and important use case—and one that is often solved on a model-to-model basis in a local Jupyter notebook. This focus on bespoke implementations limits traceability and makes it difficult to create apples-to-apples comparisons between new model runs and prior model runs.

Second, our users wanted to understand: **"How does the performance of a particular model vary across datasets?"**. We found that many practitioners use the same computer vision model (e.g., YOLOv8) for a variety of supervised learning tasks, and they needed a way to identify patterns where that particular model didn't meet expectations.

Finally, our users wanted to understand: **"How can I use my prior evaluations to pick the best model for a future ML pipeline?"**. This last question requires the ability to filter previous evaluations on granular metadata (e.g., time of day, geospatial coordinates, etc.) in order to provide tailored recommendations regarding which model to pick in the future.

With these three use cases in mind, we set out to build a centralized evaluation store that we later named Valor.

# Introducing Valor

Valor is a centralized evaluation store that makes it easy to measure, explore, and rank model performance. Our ultimate goal with Valor is to help data scientists and engineers pick the right ML model for their specific needs. To that end, we built Valor with three design principles in mind:

- **Valor works with any dataset or model:** We believe Valor should be able to handle any supervised learning task that you want to throw at it. Just pass in your ground truth annotations and predictions, describe your learning task (i.e., object detection), and Valor will do the rest. (Note: At launch, Valor will only support classification and computer vision (i.e., image segmentation and object detection) tasks. We're confident this framework will abstract well to other supervised learning tasks and plan to support them in later releases).
- **Valor can handle any type of image, model, or dataset metadata you throw at it:** Metadata is a critical component of any evaluation store as it enables the system to offer tailored model recommendations based on a user's specific needs. To that end, we built Valor to handle any metadata under the sun. Dates, geospatial coordinates, and even JSONs filled with configuration details are all on the table. This means you can slice and dice your evaluations any way you want: just pass in the right labels for your use case and define your filter (say a geographic bounding box), and you’ll get back results for your specific needs.
- **Valor standardizes the evaluation process:** The trickiest part of comparing two different model runs is avoiding apples-to-oranges comparisons. Valor helps you audit your metrics and avoid false comparisons by versioning your uploads, storing them in a centralized location, and ensuring that you only compare runs that used the exact same filters and metrics.
### PyPI
```shell
pip install valor-lite
```
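
One optional way to confirm the installation is to query the installed distribution; the distribution name below is assumed to match the `pip install` command above:

```python
# Optional sanity check after installation.
from importlib.metadata import version

print(version("valor-lite"))
```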

# How It Works: An Illustrative Example
### Source
```shell
git clone https://github.com/Striveworks/valor.git
cd valor
make install
```

Let’s walk through a quick example to bring Valor to life.
# Quick Links

Say that you're interested in using computer vision models to detect forest fires around the world using satellite imagery. You've just been tasked with building a new ML pipeline to detect fires in an unfamiliar region of interest. How might you leverage your evaluation metrics from prior ML pipelines to understand which model will perform best for this particular use case?

<img src="static/example_1.png" alt="A satellite image of forest fires.">

To answer this question, we'll start by passing in three pieces of information from each of our prior modeling runs:

- **GroundTruths:** First, we'll pass in human-annotated bounding boxes to tell Valor exactly where forest fires can be found across all of the satellite images used in prior runs.
- **Predictions:** Next, we'll pass machine-generated predictions for each image (also in the form of bounding boxes) so that Valor can evaluate how well each model did at predicting forest fires.
- **Labels:** Finally, we'll pass metadata to Valor describing each of our various images (e.g., the time of day the photo was taken, the geospatial coordinates of the forest in the photo, etc.). We'll use this metadata later on in order to identify the right model for our new use case.

Once we pass in these three ingredients, Valor will compare all of our `GroundTruths` and `Predictions` in order to calculate various evaluation metrics (i.e., mean average precision or mAP). These metrics, `Labels`, `GroundTruths`, and `Predictions`, will all be stored in Postgres, with PostGIS support for fast geospatial lookups and geometric comparisons at a later date.

Finally, once all of our previous pipeline runs and evaluations are stored in Valor, we can use Valor’s API to specify our exact filter criteria and get back its model rankings. In this case, we can ask Valor to find us the best model for detecting forest fires at night in a 50 mile radius around (42.36, -71.03), sorted by mAP. Valor will then filter all of our stored evaluation metrics, rank each model with evaluations that meet our criteria, and send back all relevant evaluation metrics to help us determine which model to use for our new modeling pipeline.

<img src="static/example_2.png" alt="A satellite image of forest fires.">

# Next Steps

We'd recommend reviewing our ["Getting Started" sample notebook](https://github.com/Striveworks/valor/blob/main/examples/getting_started.ipynb) to become further acquainted with Valor. For more detailed explanations of Valor's technical underpinnings, see our [technical concepts guide](technical_concepts.md).
- **[Example Notebooks](https://github.com/Striveworks/valor/blob/main/examples/)**: Collection of descriptive Jupyter notebooks giving examples of how to evaluate model performance using Valor.
- **[Contributing and Development](contributing.md)**: Explains how you can build on and contribute to Valor.
22 changes: 11 additions & 11 deletions docs/object_detection/metrics.md
@@ -14,15 +14,15 @@
| Precision-Recall Curves | | See [Precision-Recall Curve](#precision-recall-curve)|
| Confusion Matrix | | See [Confusion Matrix](#confusion-matrix)|

# Appendix: Metric Calculations
## Appendix: Metric Calculations

## Counts
### Counts

## Average Precision (AP)
### Average Precision (AP)

For object detection and instance segmentation tasks, average precision is calculated from the intersection-over-union (IOU) of geometric predictions and ground truths.

### Multiclass Precision and Recall
#### Multiclass Precision and Recall

Tasks that predict geometries (such as object detection or instance segmentation) use the intersection-over-union (IOU) ratio to calculate precision and recall. IOU is the ratio of the intersecting area to the joint area spanned by the two geometries, and is defined in the following equation.

@@ -41,7 +41,7 @@ Using different IOU thresholds, we can determine whether we count a pairing betw

- $Recall = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{\text{Number of True Positives}}{|\text{Groundtruths}|}$
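
As a concrete illustration of IOU and the threshold check, here is a minimal sketch for axis-aligned bounding boxes; the `(xmin, ymin, xmax, ymax)` box format and the 0.5 threshold are assumptions made for the example rather than Valor's API:

```python
# Illustrative IOU for two axis-aligned boxes given as (xmin, ymin, xmax, ymax).
def calculate_iou(box_a: tuple, box_b: tuple) -> float:
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A pairing only counts as a true positive if its IOU clears the chosen threshold.
iou = calculate_iou((0, 0, 10, 10), (5, 0, 15, 10))
print(iou, iou >= 0.5)  # 0.333..., False
```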

### Matching Ground Truths with Predictions
#### Matching Ground Truths with Predictions

To properly evaluate a detection, we must first find the best pairings of predictions to ground truths. We start by iterating over our predictions, ordering them by highest scores first. We pair each prediction with the ground truth that has the highest calculated IOU. Both the prediction and ground truth are now considered paired and removed from the pool of choices.

@@ -60,7 +60,7 @@ def rank_ious(
retval.append(calculate_iou(groundtruth, prediction))
```
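
As a self-contained sketch of the greedy matching described above (sorting predictions by score and pairing each with its best remaining ground truth), something like the following would work; the `Box` dataclass and inline IOU helper are illustrative assumptions, not Valor's actual signatures:

```python
from dataclasses import dataclass


@dataclass
class Box:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    score: float = 1.0  # ground truths can leave this at the default


def iou(a: Box, b: Box) -> float:
    # Same axis-aligned IOU as in the earlier sketch.
    inter = max(0.0, min(a.xmax, b.xmax) - max(a.xmin, b.xmin)) * max(
        0.0, min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    )
    union = (
        (a.xmax - a.xmin) * (a.ymax - a.ymin)
        + (b.xmax - b.xmin) * (b.ymax - b.ymin)
        - inter
    )
    return inter / union if union > 0 else 0.0


def rank_ious(groundtruths: list[Box], predictions: list[Box]) -> list[float]:
    """Pair predictions (highest score first) with their best remaining ground truth."""
    unmatched = list(groundtruths)
    ranked = []
    for prediction in sorted(predictions, key=lambda p: p.score, reverse=True):
        if not unmatched:
            ranked.append(0.0)  # nothing left to pair with: counts as a miss
            continue
        best = max(unmatched, key=lambda gt: iou(gt, prediction))
        unmatched.remove(best)
        ranked.append(iou(best, prediction))
    return ranked
```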

### Precision-Recall Curve
#### Precision-Recall Curve

We can now compute the precision-recall curve using our previously ranked IOUs. We do this by iterating through the ranked IOUs and creating points cumulatively using recall and precision.

@@ -82,7 +82,7 @@ def create_precision_recall_curve(
retval.append((recall, precision))
```
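
A hedged reconstruction of the cumulative loop described above could look like the following, where `ranked_ious` comes from a matching step like the one sketched earlier and `iou_threshold` / `total_groundtruths` are assumed inputs:

```python
def create_precision_recall_curve(
    ranked_ious: list[float],
    iou_threshold: float,
    total_groundtruths: int,
) -> list[tuple[float, float]]:
    """Accumulate (recall, precision) points over the score-ranked pairings."""
    retval = []
    true_positives = 0
    for idx, paired_iou in enumerate(ranked_ious, start=1):
        if paired_iou >= iou_threshold:
            true_positives += 1
        precision = true_positives / idx  # idx = true + false positives so far
        recall = true_positives / total_groundtruths
        retval.append((recall, precision))
    return retval
```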

### Calculating Average Precision
#### Calculating Average Precision

Average precision is defined as the area under the precision-recall curve.

@@ -92,12 +92,12 @@ $AP = \frac{1}{101} \sum\limits_{r\in\{ 0, 0.01, \ldots , 1 \}}\rho_{interp}(r)$

$\rho_{interp}(r) = \underset{\tilde{r}:\tilde{r} \ge r}{\max} \ \rho (\tilde{r})$
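
Putting the two formulas together, a small sketch of the 101-point interpolation might look like this, reusing the `(recall, precision)` tuples produced by the sketch above:

```python
def average_precision(curve: list[tuple[float, float]]) -> float:
    """101-point interpolated AP over a list of (recall, precision) points."""
    total = 0.0
    for i in range(101):
        r = i / 100
        # rho_interp(r): the best precision achieved at any recall >= r.
        candidates = [precision for recall, precision in curve if recall >= r]
        total += max(candidates) if candidates else 0.0
    return total / 101
```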

### References
#### References
- [MS COCO Detection Evaluation](https://cocodataset.org/#detection-eval)
- [The PASCAL Visual Object Classes (VOC) Challenge](https://link.springer.com/article/10.1007/s11263-009-0275-4)
- [Mean Average Precision (mAP) Using the COCO Evaluator](https://pyimagesearch.com/2022/05/02/mean-average-precision-map-using-the-coco-evaluator/)

## Average Recall (AR)
### Average Recall (AR)

To calculate Average Recall (AR), we:

@@ -111,7 +111,7 @@ Note that this metric differs from COCO's calculation in two ways:
- COCO averages across classes when calculating AR, whereas we calculate AR separately for each class. Our AR calculation matches the original FAIR definition of AR, while our mAR calculation matches what COCO calls AR.
- COCO calculates three different AR metrics (AR@1, AR@5, AR@100) by considering only the top 1/5/100 most confident predictions during the matching process. Valor, on the other hand, allows users to input a `recall_score_threshold` value that will prevent low-confidence predictions from being counted as true positives when calculating AR.
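
As a rough sketch of the averaging step, assuming recall has already been computed from ranked IOUs per class (with low-confidence predictions filtered out beforehand, e.g. via `recall_score_threshold`) and borrowing the COCO-style 0.50 to 0.95 threshold grid:

```python
IOU_GRID = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95


def recall_at(ranked_ious: list[float], iou_threshold: float, n_groundtruths: int) -> float:
    true_positives = sum(1 for value in ranked_ious if value >= iou_threshold)
    return true_positives / n_groundtruths


def average_recall(ranked_ious: list[float], n_groundtruths: int) -> float:
    # Mean recall across the IOU thresholds, for a single class.
    return sum(
        recall_at(ranked_ious, t, n_groundtruths) for t in IOU_GRID
    ) / len(IOU_GRID)
```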

## Precision-Recall Curve
### Precision-Recall Curve
Precision-recall curves offer insight into which confidence threshold you should pick for your production pipeline. The `PrecisionRecallCurve` metric includes the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. When using the Valor Python client, the output will be formatted as follows:

```python
@@ -151,7 +151,7 @@ The `PrecisionRecallCurve` values differ from the precision-recall curves used t
- The `PrecisionRecallCurve` values visualize how precision and recall change as confidence thresholds vary from 0.05 to 0.95 in increments of 0.05. In contrast, the precision-recall curves used to calculate Average Precision are non-uniform; they vary over the actual confidence scores for each ground truth-prediction match.
- If your pipeline predicts a label on an image, but that label doesn't exist on any ground truths in that particular image, then the `PrecisionRecallCurve` values will consider that prediction to be a false positive, whereas the other detection metrics will ignore that particular prediction.

## Confusion Matrix
### Confusion Matrix

Valor also includes a more detailed version of `PrecisionRecallCurve` which can be useful for debugging your model's false positives and false negatives. When calculating `DetailedPrecisionCurve`, Valor will classify false positives as either `hallucinations` or `misclassifications` and your false negatives as either `missed_detections` or `misclassifications` using the following logic:
