docs(blog): classification metrics on the backend #10501

Merged (16 commits) on Dec 5, 2024
210 changes: 210 additions & 0 deletions docs/posts/classification-metrics-on-the-backend/index.qmd
@@ -0,0 +1,210 @@
---
title: "Classification metrics on the backend"
author: "Tyler White"
date: "2024-11-15"
image: thumbnail.png
categories:
- blog
- machine learning
- portability
---

A review of binary classification models, the metrics used to evaluate them, and how to
calculate those metrics with Ibis.

We're going to explore common classification metrics such as accuracy, precision, recall,
and F1 score, demonstrating how to compute each one using Ibis. In this example, we'll
use DuckDB, the default Ibis backend, but we could use this same code to execute
against another backend such as Postgres or Snowflake. This capability is useful as it
offers an easy and performant way to evaluate model performance without extracting data
from the source system.

## Classification models

In machine learning, classification entails categorizing data into different groups.
Binary classification, which is what we'll be covering in this post, specifically
involves sorting data into only two distinct groups. For example, a model could
differentiate between whether or not an email is spam.

## Model evaluation

It's important to validate the performance of the model to ensure it makes correct
predictions consistently and doesn’t only perform well on the data it was trained on.
These metrics help us understand not just the errors the model makes, but also the
types of errors. For example, we might want to know if the model is more likely to
predict a positive outcome when the actual outcome is negative.

The easiest way to break down how this works is to look at a confusion matrix.

### Confusion matrix

A confusion matrix is a table used to describe the performance of a classification
model on a set of data for which the true values are known. As binary classification
only involves two categories, the confusion matrix is a simple 2x2 table where each
cell shows the count of true positives, false positives, false negatives, and true
negatives.

![](confusion_matrix.png)

Here's a breakdown of the terms with examples.

True Positives (TP)
: Correctly predicted positive examples.

We guessed it was a spam email, and it was. This email is going straight to the junk
folder.

False Positives (FP)
: Incorrectly predicted as positive.

We guessed it was a spam email, but it actually wasn’t. Hopefully, the recipient
doesn’t miss anything important as this email is going to the junk folder.

False Negatives (FN)
: Incorrectly predicted as negative.

We didn't guess it was a spam email, but it really was. Hopefully, the recipient
doesn’t click any links!

True Negatives (TN)
: Correctly predicted negative examples.

We guessed it was not a spam email, and it actually was not. The recipient can read
this email as intended.
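
As a minimal sketch of the four definitions above, here is a plain-Python helper that labels a single email. The function name `outcome` and the convention that `1` means spam and `0` means not spam are assumptions made for illustration; they are not part of Ibis.

```python
def outcome(actual: int, prediction: int) -> str:
    """Label one (actual, prediction) pair, where 1 = spam and 0 = not spam."""
    if prediction == 1:
        # We guessed spam: correct if it really was spam, otherwise a false alarm.
        return "TP" if actual == 1 else "FP"
    # We guessed not spam: a miss if it really was spam, otherwise correct.
    return "FN" if actual == 1 else "TN"


# One example pair per cell of the confusion matrix.
labels = [outcome(a, p) for a, p in [(1, 1), (0, 1), (1, 0), (0, 0)]]
print(labels)
```

The first letter says whether the guess was right (True/False), and the second says what the guess was (Positive/Negative).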

### Building a confusion matrix

#### Sample dataset

Let's create a sample dataset that includes twelve rows with three columns: `id`,
`actual`, and `prediction`. The `actual` column contains the true values, and the
`prediction` column contains the model's predictions.

```{python}

from ibis.interactive import *

t = ibis.memtable(
    {
        "id": range(1, 13),
        "actual": [1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1],
        "prediction": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
    }
)

t
```

We can use the `case` function to create a new column that categorizes the outcomes.

```{python}

case_expr = (
    ibis.case()
    .when((_.actual == 0) & (_.prediction == 0), "TN")
    .when((_.actual == 0) & (_.prediction == 1), "FP")
    .when((_.actual == 1) & (_.prediction == 0), "FN")
    .when((_.actual == 1) & (_.prediction == 1), "TP")
    .end()
)

t = t.mutate(outcome=case_expr)

t
```

To create the confusion matrix, we'll group by the outcome, count the occurrences, and
use `pivot_wider`. Widening our data makes it possible to perform column-wise
operations on the table expression for metric calculations.

```{python}

cm = (
    t.group_by("outcome")
    .agg(counted=_.count())
    .pivot_wider(names_from="outcome", values_from="counted")
    .select("TP", "FP", "FN", "TN")
)

cm
```
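
To sanity-check the counts, the same tally can be reproduced in plain Python without any backend. This is only a cross-check sketch that uses the standard library's `Counter` in place of the Ibis group-by and pivot; the `names` lookup dictionary is introduced here for illustration.

```python
from collections import Counter

# The same twelve rows as the sample dataset.
actual = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
prediction = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Map each (actual, prediction) pair to its confusion-matrix cell.
names = {(1, 1): "TP", (0, 1): "FP", (1, 0): "FN", (0, 0): "TN"}

# Count how many rows fall into each cell.
counts = Counter(names[(a, p)] for a, p in zip(actual, prediction))
print(counts)
```

For the sample data this gives 4 true positives, 2 false positives, 3 false negatives, and 3 true negatives, which add up to the twelve rows. A `Counter` reports 0 for an outcome that never occurs; in the pivoted table, such an outcome would likely have no column at all, so the final `select` may need adjusting for data where a cell is empty.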

Now that we've built a confusion matrix, we're able to more easily calculate a few
common classification metrics.

### Metrics

Here are the metrics we'll calculate as well as a brief description of each.

Accuracy
: The proportion of correct predictions out of all predictions made. This measures the
overall effectiveness of the model across all classes.

Precision
: The proportion of true positive predictions out of all positive predictions made.
This tells us how many of the predicted positives were actually correct.

Recall
: The proportion of true positive predictions out of all actual positive examples. This
measures how well the model identifies all actual positives.

F1 Score
: A metric that combines precision and recall into a single score by taking their
harmonic mean. This balances the trade-off between precision and recall, making it
especially useful for imbalanced datasets.
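
Written in terms of the confusion matrix counts, these definitions correspond to the standard formulas below. The last equality for $F_1$ follows from substituting the precision and recall formulas and simplifying.

$$
\begin{aligned}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
\end{aligned}
$$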

We can calculate these metrics using the columns from the confusion matrix we created
earlier.

```{python}

accuracy_expr = (_.TP + _.TN) / (_.TP + _.TN + _.FP + _.FN)
precision_expr = _.TP / (_.TP + _.FP)
recall_expr = _.TP / (_.TP + _.FN)
f1_score_expr = 2 * (precision_expr * recall_expr) / (precision_expr + recall_expr)

metrics = cm.select(
    accuracy=accuracy_expr,
    precision=precision_expr,
    recall=recall_expr,
    f1_score=f1_score_expr,
)

metrics
```
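
To see what values to expect, here is the same arithmetic worked by hand in plain Python with exact fractions, using the counts for the twelve sample rows (TP=4, FP=2, FN=3, TN=3). The `ratio` helper and its `None` result for a zero denominator are assumptions for illustration; they do not describe how Ibis or any particular backend treats division by zero.

```python
from fractions import Fraction

# Confusion-matrix counts for the twelve sample rows.
tp, fp, fn, tn = 4, 2, 3, 3


def ratio(numerator: int, denominator: int):
    """Return an exact fraction, or None when the metric is undefined."""
    return Fraction(numerator, denominator) if denominator else None


accuracy = ratio(tp + tn, tp + tn + fp + fn)
precision = ratio(tp, tp + fp)
recall = ratio(tp, tp + fn)
# Precision and recall are both defined for this sample, so F1 is too.
f1_score = 2 * precision * recall / (precision + recall)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1_score", f1_score),
]:
    print(f"{name}: {value} = {float(value):.4f}")
```

The exact values are 7/12, 2/3, 4/7, and 8/13, or roughly 0.583, 0.667, 0.571, and 0.615. A model that never predicts the positive class has `TP + FP == 0`, in which case precision is undefined, so that case is worth handling deliberately on real data.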

## A more efficient approach

In the illustrative example above, we used a case expression and pivoted the data to
demonstrate where the values would fall in the confusion matrix, and then performed our
metric calculations using the pivoted data. We can actually skip this step using column
aggregation, since `actual` and `prediction` only contain 0s and 1s.

```{python}
tp = (t.actual * t.prediction).sum()
fp = t.prediction.sum() - tp
fn = t.actual.sum() - tp
tn = t.actual.count() - tp - fp - fn

accuracy_expr = (tp + tn) / (tp + tn + fp + fn)
precision_expr = tp / (tp + fp)
recall_expr = tp / (tp + fn)
f1_score_expr = 2 * (precision_expr * recall_expr) / (precision_expr + recall_expr)

t.select(
    accuracy=accuracy_expr,
    precision=precision_expr,
    recall=recall_expr,
    f1_score=f1_score_expr,
).limit(1)
```

Review discussion on the code block above:

@IndexSeek (Member Author), Nov 16, 2024:

Is there a better way we could render these results? I was fiddling around with:

`print(f"{accuracy_expr=}, {precision_expr=}, {recall_expr=}, {f1_score_expr=}")`

But it wasn't rendering nicely.

Contributor:

`.execute()` should work (or `.to_pyarrow().as_py()` or some of the other `.to_*` export methods)

Member Author:

I ended up using `to_pyarrow().as_py()`. I suspect some readers may like to see that we can bring this to a Python object.
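
The aggregation shortcut relies on a few identities that only hold when both columns contain nothing but 0s and 1s: the product `actual * prediction` is 1 exactly for true positives, the sum of `prediction` counts every predicted positive, and the sum of `actual` counts every actual positive. The plain-Python sketch below checks those identities against direct per-row counting on the sample data; it is a cross-check under that 0/1 assumption, not a replacement for the Ibis expressions.

```python
# The same twelve rows as the sample dataset.
actual = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
prediction = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
rows = list(zip(actual, prediction))

# Aggregation identities (valid only for 0/1 labels).
tp = sum(a * p for a, p in rows)   # product is 1 only when both are 1
fp = sum(prediction) - tp          # predicted positives that were wrong
fn = sum(actual) - tp              # actual positives that were missed
tn = len(rows) - tp - fp - fn      # everything that is left

# Direct per-row counting, for comparison.
direct = tuple(rows.count(pair) for pair in [(1, 1), (0, 1), (1, 0), (0, 0)])

assert (tp, fp, fn, tn) == direct
print(tp, fp, fn, tn)
```

If the labels were encoded differently, for example as strings or as -1/1, the sums and products would no longer count rows, and the case-expression approach would be the safer choice.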

## Conclusion

Because the computation is pushed down to the backend, performance is only limited by
the backend we're connected to. This capability allows us to easily scale to different
backends without modifying any code.

We hope you give this a try and let us know how it goes. If you have any questions or
feedback, please reach out to us on [GitHub](https://github.com/ibis-project) or
[Zulip](https://ibis-project.zulipchat.com/).