From 1cbddaf03ae49e6c834ff364d995e0591f438aaf Mon Sep 17 00:00:00 2001 From: Charles Zaloom Date: Thu, 12 Dec 2024 12:38:14 -0600 Subject: [PATCH] working on docs --- Makefile | 11 ++-- docs/classification/metrics.md | 16 ++--- docs/contributing.md | 23 +++++-- docs/index.md | 61 +++++-------------- docs/object_detection/metrics.md | 22 +++---- docs/semantic_segmentation/metrics.md | 8 +-- docs/text_generation/metrics.md | 34 +++++------ examples/README.md | 10 +++ ...detection.ipynb => object_detection.ipynb} | 0 9 files changed, 87 insertions(+), 98 deletions(-) create mode 100644 examples/README.md rename examples/{object-detection.ipynb => object_detection.ipynb} (100%) diff --git a/Makefile b/Makefile index d25bfe375..9478c0632 100644 --- a/Makefile +++ b/Makefile @@ -6,20 +6,19 @@ install: install-dev: pip install -e src/[all] - pre-commit install pre-commit: @echo "Running pre-commit..." + pre-commit install pre-commit run --all tests: @echo "Running unit tests..." - poetry run pytest ./lite/tests/text_generation -v + pytest ./lite/tests/text_generation -v -external-tests: +integration-tests: @echo "Running external integration tests..." - poetry run pytest ./lite/tests/text_generation -v - poetry run pytest ./integration_tests/external -v + pytest ./lite/tests/text_generation -v clean: @echo "Cleaning up temporary files..." @@ -31,6 +30,6 @@ help: @echo " install-dev Install valor_lite along with development tools." @echo " pre-commit Run pre-commit." @echo " tests Run unit tests." - @echo " external-tests Run external integration tests." + @echo " integration-tests Run external integration tests." @echo " clean Remove temporary files." @echo " help Show this help message." 
\ No newline at end of file diff --git a/docs/classification/metrics.md b/docs/classification/metrics.md index 9d3328eb8..3a3337b11 100644 --- a/docs/classification/metrics.md +++ b/docs/classification/metrics.md @@ -10,9 +10,9 @@ | Counts | A dictionary containing counts of true positives, false positives, true negatives, false negatives, for each label. | See [Counts](#counts). | | Confusion Matrix | | See [Confusion Matrix](#confusion-matrix). | -# Appendix: Metric Calculations +## Appendix: Metric Calculations -## Counts +### Counts Precision-recall curves offer insight into which confidence threshold you should pick for your production pipeline. The `PrecisionRecallCurve` metric includes the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination. When using the Valor Python client, the output will be formatted as follows: ```python @@ -45,15 +45,15 @@ print(pr_evaluation) }] ``` -## Binary ROC AUC +### Binary ROC AUC -### Receiver Operating Characteristic (ROC) +#### Receiver Operating Characteristic (ROC) An ROC curve plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different confidence thresholds. In Valor, we use the confidence scores sorted in decreasing order as our thresholds. Using these thresholds, we can calculate our TPR and FPR as follows: -#### Determining the Rate of Correct Predictions +##### Determining the Rate of Correct Predictions | Element | Description | | ------- | ------------ | @@ -70,7 +70,7 @@ We now use the confidence scores, sorted in decreasing order, as our thresholds $Point(score) = (FPR(score), \ TPR(score))$ -### Area Under the ROC Curve (ROC AUC) +#### Area Under the ROC Curve (ROC AUC) After calculating the ROC curve, we find the ROC AUC metric by approximating the integral using the trapezoidal rule formula. 
@@ -78,11 +78,11 @@ $ROC AUC = \sum_{i=1}^{|scores|} \frac{ \lVert Point(score_{i-1}) - Point(scor See [Classification: ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) for more information. -## Confusion Matrix +### Confusion Matrix Valor also includes a more detailed version of `PrecisionRecallCurve` which can be useful for debugging your model's false positives and false negatives. When calculating `DetailedPrecisionCurve`, Valor will classify false positives as either `hallucinations` or `misclassifications` and your false negatives as either `missed_detections` or `misclassifications` using the following logic: -### Classification Tasks +#### Classification Tasks - A **false positive** occurs when there is a qualified prediction (with `score >= score_threshold`) with the same `Label.key` as the ground truth on the datum, but the `Label.value` is incorrect. - **Example**: if there's a photo with one ground truth label on it (e.g., `Label(key='animal', value='dog')`), and we predicted another label value (e.g., `Label(key='animal', value='cat')`) on that datum, we'd say it's a `misclassification` since the key was correct but the value was not. - Similarly, a **false negative** occurs when there is a prediction with the same `Label.key` as the ground truth on the datum, but the `Label.value` is incorrect. diff --git a/docs/contributing.md b/docs/contributing.md index 6f0a4748a..ab23bc33f 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -1,8 +1,10 @@ -# Contributing to Valor +# Contributing & Development + +## Contributing to Valor We welcome all contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas aimed at improving Valor. This doc describes the high-level process for how to contribute to this repository.
If you have any questions or comments about this process, please feel free to reach out to us on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email). -## On GitHub +### On GitHub We use [Git](https://git-scm.com/doc) on [GitHub](https://github.com) to manage this repo, which means you will need to sign up for a free GitHub account to submit issues, ideas, and pull requests. We use Git for version control to allow contributors from all over the world to work together on this project. @@ -12,7 +14,7 @@ If you are new to Git, these official resources can help bring you up to speed: - [GitHub documentation for collaborating with pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests) - [GitHub documentation for working with forks](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks) -## Contribution Workflow +### Contribution Workflow Generally, the high-level workflow for contributing to this repo includes: @@ -30,7 +32,7 @@ Generally, the high-level workflow for contributing to this repo includes: For questions or comments on this process, please reach out to us at any time on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email). -## Development Tips and Tricks +## Development ### Setting Up Your Environment @@ -42,7 +44,7 @@ python3 -m venv .env-valor source .env-valor/bin/activate # conda -conda create --name valor python=3.11 +conda create --name valor python=3.10 conda activate valor ``` @@ -55,8 +57,17 @@ make install-dev All of our tests are run automatically via GitHub Actions on every push, so it's important to double-check that your code passes all local tests before committing your code. 
+For linting and code formatting, use: ```shell make pre-commit +``` + +For unit and functional testing: +```shell make tests -make external-tests ``` + +For integration testing: +```shell +make integration-tests +``` \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index b31a9dc37..72a38e32e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,55 +1,24 @@ # Introduction -Valor is a centralized evaluation store that makes it easy to measure, explore, and rank model performance. Valor empowers data scientists and engineers to evaluate the performance of their machine learning pipelines and use those evaluations to make better modeling decisions in the future. To skip this textual introduction and dive right in, first go [here](installation.md) for instructions to setup the Valor service, and then checkout the [sample notebooks](https://github.com/Striveworks/valor/blob/main/examples/). +Valor is a collection of evaluation methods that make it easy to measure, explore, and rank machine learning model performance. Valor empowers data scientists and engineers to evaluate the performance of their machine learning pipelines and use those evaluations to make better modeling decisions in the future. To skip this textual introduction and dive right in, first go [here](#installation) for basic installation instructions, and then check out the [example notebooks](https://github.com/Striveworks/valor/blob/main/examples/). Valor is maintained by Striveworks, a cutting-edge machine learning operations (MLOps) company based out of Austin, Texas. We'd love to learn more about your interest in Valor and answer any questions you may have; please don't hesitate to reach out to us on [Slack](https://striveworks-public.slack.com/join/shared_invite/zt-1a0jx768y-2J1fffN~b4fXYM8GecvOhA#/shared-invite/email) or [GitHub](https://github.com/striveworks/valor).
-These docs are organized as follows: +## Installation -- **[Overview](index.md)** (this page): Provides an overview of what Valor is, why it's important, and how it works. -- **[Example Notebooks](https://github.com/Striveworks/valor/blob/main/examples/)**: Collection of descriptive Jupyter notebooks giving examples of how to evaluate model performance using Valor. -- **[Contributing and Development](contributing.md)**: Explains how you can build on and contribute to Valor. - -# Overview - -# Use Cases for a Containerized Evaluation Store - -As we've worked with dozens of data scientists and engineers on their MLOps pipelines, we have identified three important questions that an effective evaluation store could help them answer. First, they wanted to understand: **"Of the various models I tested for a given dataset, which one performs best?"**. This is a very common and important use case—and one that is often solved on a model-to-model basis in a local Jupyter notebook. This focus on bespoke implementations limits traceability and makes it difficult to create apples-to-apples comparisons between new model runs and prior model runs. - -Second, our users wanted to understand: **"How does the performance of a particular model vary across datasets?"**. We found that many practitioners use the same computer vision model (e.g., YOLOv8) for a variety of supervised learning tasks, and they needed a way to identify patterns where that particular model didn't meet expectations. - -Finally, our users wanted to understand: **"How can I use my prior evaluations to pick the best model for a future ML pipeline?"**. This last question requires the ability to filter previous evaluations on granular metadata (e.g., time of day, geospatial coordinates, etc.) in order to provide tailored recommendations regarding which model to pick in the future. - -With these three use cases in mind, we set out to build a centralized evaluation store that we later named Valor. 
- -# Introducing Valor - -Valor is a centralized evaluation store that makes it easy to measure, explore, and rank model performance. Our ultimate goal with Valor is to help data scientists and engineers pick the right ML model for their specific needs. To that end, we built Valor with three design principles in mind: - -- **Valor works with any dataset or model:** We believe Valor should be able to handle any supervised learning task that you want to throw at it. Just pass in your ground truth annotations and predictions, describe your learning task (i.e., object detection), and Valor will do the rest. (Note: At launch, Valor will only support classification and computer vision (i.e., image segmentation and object detection) tasks. We're confident this framework will abstract well to other supervised learning tasks and plan to support them in later releases). -- **Valor can handle any type of image, model, or dataset metadata you throw at it:** Metadata is a critical component of any evaluation store as it enables the system to offer tailored model recommendations based on a user's specific needs. To that end, we built Valor to handle any metadata under the sun. Dates, geospatial coordinates, and even JSONs filled with configuration details are all on the table. This means you can slice and dice your evaluations any way you want: just pass in the right labels for your use case and define your filter (say a geographic bounding box), and you’ll get back results for your specific needs. -- **Valor standardizes the evaluation process:** The trickiest part of comparing two different model runs is avoiding apples-to-oranges comparisons. Valor helps you audit your metrics and avoid false comparisons by versioning your uploads, storing them in a centralized location, and ensuring that you only compare runs that used the exact same filters and metrics. 
+### PyPI +```shell +pip install valor-lite +``` - -# How It Works: An Illustrative Example +### Source +```shell +git clone https://github.com/Striveworks/valor.git +cd valor +make install +``` -Let’s walk through a quick example to bring Valor to life. +## Quick Links -Say that you're interested in using computer vision models to detect forest fires around the world using satellite imagery. You've just been tasked with building a new ML pipeline to detect fires in an unfamiliar region of interest. How might you leverage your evaluation metrics from prior ML pipelines to understand which model will perform best for this particular use case? - -A satellite image of forest fires. - -To answer this question, we'll start by passing in three pieces of information from each of our prior modeling runs: - -- **GroundTruths:** First, we'll pass in human-annotated bounding boxes to tell Valor exactly where forest fires can be found across all of the satellite images used in prior runs. -- **Predictions:** Next, we'll pass machine-generated predictions for each image (also in the form of bounding boxes) so that Valor can evaluate how well each model did at predicting forest fires. -- **Labels:** Finally, we'll pass metadata to Valor describing each of our various images (e.g., the time of day the photo was taken, the geospatial coordinates of the forest in the photo, etc.). We'll use this metadata later on in order to identify the right model for our new use case. Once we pass in these three ingredients, Valor will compare all of our `GroundTruths` and `Predictions` in order to calculate various evaluation metrics (i.e., mean average precision or mAP). These metrics, `Labels`, `GroundTruths`, and `Predictions`, will all be stored in Postgres, with PostGIS support for fast geospatial lookups and geometric comparisons at a later date.
- -Finally, once all of our previous pipeline runs and evaluations are stored in Valor, we can use Valor’s API to specify our exact filter criteria and get back its model rankings. In this case, we can ask Valor to find us the best model for detecting forest fires at night in a 50 mile radius around (42.36, -71.03), sorted by mAP. Valor will then filter all of our stored evaluation metrics, rank each model with evaluations that meet our criteria, and send back all relevant evaluation metrics to help us determine which model to use for our new modeling pipeline. - -A satellite image of forest fires. - -# Next Steps - -We'd recommend reviewing our ["Getting Started" sample notebook](https://github.com/Striveworks/valor/blob/main/examples/getting_started.ipynb) to become further acquainted with Valor. For more detailed explanations of Valor's technical underpinnings, see our [technical concepts guide](technical_concepts.md). +- **[Example Notebooks](https://github.com/Striveworks/valor/blob/main/examples/)**: Collection of descriptive Jupyter notebooks giving examples of how to evaluate model performance using Valor. +- **[Contributing and Development](contributing.md)**: Explains how you can build on and contribute to Valor. diff --git a/docs/object_detection/metrics.md b/docs/object_detection/metrics.md index 88ea9b166..19e076434 100644 --- a/docs/object_detection/metrics.md +++ b/docs/object_detection/metrics.md @@ -14,15 +14,15 @@ | Precision-Recall Curves | | See [Precision-Recall Curve](#precision-recall-curve)| | Confusion Matrix | | See [Confusion Matrix](#confusion-matrix)| -# Appendix: Metric Calculations +## Appendix: Metric Calculations -## Counts +### Counts -## Average Precision (AP) +### Average Precision (AP) For object detection and instance segmentation tasks, average precision is calculated from the intersection-over-union (IOU) of geometric predictions and ground truths. 
-### Multiclass Precision and Recall +#### Multiclass Precision and Recall Tasks that predict geometries (such as object detection or instance segmentation) use the ratio intersection-over-union (IOU) to calculate precision and recall. IOU is the ratio of the intersecting area over the joint area spanned by the two geometries, and is defined in the following equation. @@ -41,7 +41,7 @@ Using different IOU thresholds, we can determine whether we count a pairing betw - $Recall = \dfrac{|TP|}{|TP| + |FN|} = \dfrac{\text{Number of True Predictions}}{|\text{Groundtruths}|}$ -### Matching Ground Truths with Predictions +#### Matching Ground Truths with Predictions To properly evaluate a detection, we must first find the best pairings of predictions to ground truths. We start by iterating over our predictions, ordering them by highest scores first. We pair each prediction with the ground truth that has the highest calculated IOU. Both the prediction and ground truth are now considered paired and removed from the pool of choices. @@ -60,7 +60,7 @@ def rank_ious( retval.append(calculate_iou(groundtruth, prediction)) ``` -### Precision-Recall Curve +#### Precision-Recall Curve We can now compute the precision-recall curve using our previously ranked IOU's. We do this by iterating through the ranked IOU's and creating points cumulatively using recall and precision. @@ -82,7 +82,7 @@ def create_precision_recall_curve( retval.append((recall, precision)) ``` -### Calculating Average Precision +#### Calculating Average Precision Average precision is defined as the area under the precision-recall curve. 
@@ -92,12 +92,12 @@ $AP = \frac{1}{101} \sum\limits_{r\in\{ 0, 0.01, \ldots , 1 \}}\rho_{interp}(r)$ $\rho_{interp} = \underset{\tilde{r}:\tilde{r} \ge r}{max \ \rho (\tilde{r})}$ -### References +#### References - [MS COCO Detection Evaluation](https://cocodataset.org/#detection-eval) - [The PASCAL Visual Object Classes (VOC) Challenge](https://link.springer.com/article/10.1007/s11263-009-0275-4) - [Mean Average Precision (mAP) Using the COCO Evaluator](https://pyimagesearch.com/2022/05/02/mean-average-precision-map-using-the-coco-evaluator/) -## Average Recall (AR) +### Average Recall (AR) To calculate Average Recall (AR), we: @@ -111,7 +111,7 @@ Note that this metric differs from COCO's calculation in two ways: - COCO averages across classes when calculating AR, whereas we calculate AR separately for each class. Our AR calculations match the original FAIR definition of AR, while our mAR calculations match what COCO calls AR. - COCO calculates three different AR metrics (AR@1, AR@5, AR@100) by considering only the top 1/5/100 most confident predictions during the matching process. Valor, on the other hand, allows users to input a `recall_score_threshold` value that will prevent low-confidence predictions from being counted as true positives when calculating AR. -## Precision-Recall Curve +### Precision-Recall Curve Precision-recall curves offer insight into which confidence threshold you should pick for your production pipeline. The `PrecisionRecallCurve` metric includes the true positives, false positives, true negatives, false negatives, precision, recall, and F1 score for each (label key, label value, confidence threshold) combination.
When using the Valor Python client, the output will be formatted as follows: ```python @@ -151,7 +151,7 @@ The `PrecisionRecallCurve` values differ from the precision-recall curves used t - The `PrecisionRecallCurve` values visualize how precision and recall change as confidence thresholds vary from 0.05 to 0.95 in increments of 0.05. In contrast, the precision-recall curves used to calculate Average Precision are non-uniform; they vary over the actual confidence scores for each ground truth-prediction match. - If your pipeline predicts a label on an image, but that label doesn't exist on any ground truths in that particular image, then the `PrecisionRecallCurve` values will consider that prediction to be a false positive, whereas the other detection metrics will ignore that particular prediction. -## Confusion Matrix +### Confusion Matrix Valor also includes a more detailed version of `PrecisionRecallCurve` which can be useful for debugging your model's false positives and false negatives. When calculating `DetailedPrecisionCurve`, Valor will classify false positives as either `hallucinations` or `misclassifications` and your false negatives as either `missed_detections` or `misclassifications` using the following logic: diff --git a/docs/semantic_segmentation/metrics.md b/docs/semantic_segmentation/metrics.md index 654ee57b7..6f2c1f55e 100644 --- a/docs/semantic_segmentation/metrics.md +++ b/docs/semantic_segmentation/metrics.md @@ -9,12 +9,12 @@ | Mean IOU | The average of IOU across labels, grouped by label key. 
| $\dfrac{1}{\text{number of labels}} \sum\limits_{label \in labels} IOU_{c}$ | | Confusion Matrix | | See [Confusion Matrix](#confusion-matrix) | -# Appendix: Metric Calculations +## Appendix: Metric Calculations -## Confusion Matrix +### Confusion Matrix Description -### Unmatched Predictions +#### Unmatched Predictions -### Unmatched Ground Truths \ No newline at end of file +#### Unmatched Ground Truths \ No newline at end of file diff --git a/docs/text_generation/metrics.md b/docs/text_generation/metrics.md index ef68dc889..5aadeac9c 100644 --- a/docs/text_generation/metrics.md +++ b/docs/text_generation/metrics.md @@ -22,15 +22,15 @@ | Summary Coherence | Rates the coherence of a textual summary relative to some source text using a score from 1 to 5, where 5 means "This summary is extremely coherent based on the information provided in the source text". | See [appendix](#summary-coherence-llm-guided) for details. | | Toxicity | The proportion of opinions in the predicted text that are toxic. | $\dfrac{\textnormal{Number of Toxic Opinions}}{\textnormal{Total Number of Opinions}}$ | -# Appendix: Metric Calculations +## Appendix: Metric Calculations -## General Text Generation Metrics +### General Text Generation Metrics The general text generation metrics apply to a broad set of text generation tasks. These metrics don't compare to ground truths and don't require context. The metrics are evaluated purely based on the predicted text. Some of the general text generation metrics are not necessarily useful in all tasks, but still can be used. For example, the bias and toxicity metrics evaluate opinions in the predicted text for bias/toxicity. If a task should have few/no opinions, then these metrics might not be useful. However bias and toxicity can still be evaluated on the predicted text, and if there are no opinions, then the bias/toxicity scores should be 0, indicating that there were no biased/toxic opinions. 
-### Bias (LLM-Guided) +#### Bias (LLM-Guided) Uses - Prediction - Annotation.text @@ -62,7 +62,7 @@ In DeepEval, bias is defined according to the following rubric: - Biased: Crime rates are always higher in those big cities. - Not Biased: Studies show a correlation between population density and certain types of crime. -### Toxicity (LLM-Guided) +#### Toxicity (LLM-Guided) Uses - Prediction - Annotation.text @@ -101,7 +101,7 @@ In DeepEval, whether an opinion is toxic is defined according to the following r Question and Answering (Q&A) is a subcategory of text generation tasks in which the datum is a query/question, and the prediction is an answer to that query. In this setting we can evaluate the predicted text based on properties such as relevance to the query or the correctness of the answer. These metrics will not apply to all text generation tasks. For example, not all text generation tasks have a single correct answer. -### Answer Correctness (LLM-Guided) +#### Answer Correctness (LLM-Guided) Uses - GroundTruth - Annotation.text @@ -121,7 +121,7 @@ If there are multiple ground truth answers for a datum, then the answer correctn Our implementation was adapted from [RAGAS's implementation](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py). We follow a similar prompting strategy and computation, however we do not do a weighted sum with an answer similarity score using embeddings. RAGAS's answer correctness metric is a weighted sum of the f1 score described here with the answer similarity score. RAGAS computes answer similarity by embedding both the ground truth and prediction and taking their inner product. They use default weights of 0.75 for the f1 score and 0.25 for the answer similarity score. In Valor, we decided to implement answer correctness as just the f1 score, so that users are not required to supply an embedding model. 
-### Answer Relevance (LLM-Guided) +#### Answer Relevance (LLM-Guided) Uses - Datum.text @@ -135,11 +135,11 @@ $$AnswerRelevance = \frac{\textnormal{Number of Relevant Statements}}{\textnorma Our implementation closely follows [DeepEval's implementation](https://github.com/confident-ai/deepeval/tree/main/deepeval/metrics/answer_relevancy). We use the same two step prompting strategy and modified DeepEval's instructions. -## RAG Metrics +### RAG Metrics Retrieval Augmented Generation (RAG) is a subcategory of Q&A where the model retrieves contexts from a database, then uses the retrieved contexts to aid in generating an answer. RAG models can be evaluated with Q&A metrics (AnswerCorrectness and AnswerRelevance) that evaluate the quality of the generated answer to the query, but RAG models can also be evaluated with RAG specific metrics. Some RAG metrics (Faithfulness and Hallucination) evaluate the quality of the generated answer relative to the retrieved contexts. Other RAG metrics (ContextPrecision, ContextRecall and ContextRelevance) evaluate the retrieval mechanism by evaluating the quality of the retrieved contexts relative to the query and/or ground truth answers. -### Context Precision (LLM-Guided) +#### Context Precision (LLM-Guided) Uses - Datum.text @@ -170,7 +170,7 @@ If multiple ground truth answers are provided for a datum, then the verdict for Our implementation uses the same computation as both [RAGAS](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html) and [DeepEval](https://docs.confident-ai.com/docs/metrics-contextual-precision). Our instruction is loosely adapted from [DeepEval's instruction](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/contextual_precision/template.py). 
-### Context Recall (LLM-Guided) +#### Context Recall (LLM-Guided) Uses - GroundTruth - Annotation.text @@ -186,7 +186,7 @@ If multiple ground truth answers are provided for a datum, then the context reca Our implementation loosely follows [RAGAS](https://docs.ragas.io/en/latest/concepts/metrics/context_recall.html). The example in Valor's instruction was adapted from the example in [RAGAS's instruction](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py). -### Context Relevance (LLM-Guided) +#### Context Relevance (LLM-Guided) Uses - Datum.text @@ -200,7 +200,7 @@ $$Context Relevance = \frac{\textnormal{Number of Relevant Contexts}}{\textnorma Our implementation follows [DeepEval's implementation](https://github.com/confident-ai/deepeval/tree/main/deepeval/metrics/context_relevancy). The LLM instruction was adapted from DeepEval's instruction. -### Faithfulness (LLM-Guided) +#### Faithfulness (LLM-Guided) Uses - Prediction - Annotation.text @@ -216,7 +216,7 @@ Our implementation loosely follows and combines the strategies of [DeepEval](htt We follow [DeepEval's prompting strategy](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/faithfulness) as this strategy is closer to the other prompting strategies in Valor, however we heavily modify the instructions. Most notably, we reword the instructions and examples to follow RAGAS's definition of faithfulness. -### Hallucination (LLM-Guided) +#### Hallucination (LLM-Guided) Uses - Prediction - Annotation.text @@ -232,13 +232,13 @@ Note the differences between faithfulness and hallucination. First, for hallucin Our implementation follows [DeepEval's implementation](https://github.com/confident-ai/deepeval/tree/main/deepeval/metrics/hallucination). -## Summarization Metrics +### Summarization Metrics Summarization is the task of generating a shorter version of a piece of text that retains the most important information. 
Summarization metrics evaluate the quality of a summary by comparing it to the original text. Note that Datum.text is used differently for summarization than for Q&A and RAG tasks. For summarization, the Datum.text should be the text that was summarized and the prediction text should be the generated summary. This is different than Q&A and RAG where the Datum.text is the query and the prediction text is the generated answer. -### Summary Coherence (LLM-Guided) +#### Summary Coherence (LLM-Guided) Uses - Datum.text @@ -250,11 +250,11 @@ An LLM is prompted to evaluate the collective quality of a summary given the tex Valor's implementation of the summary coherence metric uses an instruction that was adapted from appendix A of DeepEval's paper G-EVAL: [NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/pdf/2303.16634). The instruction in appendix A of DeepEval's paper is specific to news articles, but Valor generalized this instruction to apply to any text summarization task. -## Non-LLM-Guided Text Comparison Metrics +### Non-LLM-Guided Text Comparison Metrics This section contains non-LLM-guided metrics for comparing a predicted text to one or more ground truth texts. These metrics can be run without specifying any LLM api parameters. -### ROUGE +#### ROUGE Uses - GroundTruth - Annotation.text @@ -276,7 +276,7 @@ In Valor, the ROUGE output value is a dictionary containing the following elemen Behind the scenes, we use [Hugging Face's `evaluate` package](https://huggingface.co/spaces/evaluate-metric/rouge) to calculate these scores. Users can pass `rouge_types` and `rouge_use_stemmer` to EvaluationParameters in order to gain access to additional functionality from this package. 
-### BLEU +#### BLEU Uses - GroundTruth - Annotation.text diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 000000000..ecae23cfe --- /dev/null +++ b/examples/README.md @@ -0,0 +1,10 @@ +# Examples + +This folder contains various examples of Valor usage. + +| File | Description | +| --- | --- | +| [tabular_classification.ipynb](tabular_classification.ipynb) | Evaluate a scikit-learn classification model. | +| [object_detection.ipynb](object_detection.ipynb) | Evaluate YOLOv8 over the COCO-panoptic dataset. | +| [text_generation.ipynb](text_generation.ipynb) | Evaluate Meta's Llama-3.2-1B-Instruct. | +| [benchmarking.ipynb](benchmarking.ipynb) | Benchmark valor-lite. | \ No newline at end of file diff --git a/examples/object-detection.ipynb b/examples/object_detection.ipynb similarity index 100% rename from examples/object-detection.ipynb rename to examples/object_detection.ipynb