Merge branch 'koaning:main' into examples
anopsy authored Mar 25, 2024
2 parents 5097b30 + d321198 commit a6cec13
Showing 19 changed files with 174 additions and 61 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature-request-template.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
name: Feature Request Template
name: New Feature Request
about: This is a template for a Feature Request
title: "[FEATURE]"
labels: enhancement
11 changes: 6 additions & 5 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,24 +1,25 @@
Before working on a large PR, please check with @koaning or @MBrouns that they agree with the direction of the PR. This discussion should take place in a Github issue before working on the PR, unless it's a minor change like spelling in the docs.
Before working on a large PR, please check with @FBruzzesi or @koaning to confirm that they agree with the direction of the PR. This discussion should take place in a [Github issue](https://github.com/koaning/scikit-lego/issues/new/choose) before working on the PR, unless it's a minor change like spelling in the docs.

# Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.

Fixes # (issue)
Fixes #(issue)

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)


# Checklist:
## Checklist:

- [ ] My code follows the style guidelines (flake8)
- [ ] My code follows the style guidelines (ruff)
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation (also to the readme.md)
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added tests to check whether the new feature adheres to the sklearn convention
- [ ] New and existing unit tests pass locally with my changes

If you feel your PR is ready for a review, ping @koaning or @mbrouns.
If you feel your PR is ready for a review, ping @FBruzzesi or @koaning.
13 changes: 7 additions & 6 deletions .github/workflows/dependencies.yml
@@ -15,33 +15,34 @@ jobs:
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install dependencies
run: |
python -m pip install --upgrade pip setuptools wheel
python -m pip install pytest
uv pip install pytest setuptools wheel --system
- name: Run Base Install
run: |
python -m pip install -e .
uv pip install -e . --system
- name: Run Checks
run: |
python tests/scripts/check_pip.py missing cvxpy
python tests/scripts/check_pip.py installed scikit-learn
python tests/scripts/import_all.py
- name: Install cvxpy
run: |
python -m pip install -e ".[cvxpy]"
uv pip install -e ".[cvxpy]" --system
- name: Run Checks
run: |
python tests/scripts/check_pip.py installed cvxpy scikit-learn
python tests/scripts/import_all.py
- name: Install All
run: |
python -m pip install -e ".[all]"
uv pip install -e ".[all]" --system
- name: Run Checks
run: |
python tests/scripts/check_pip.py installed cvxpy formulaic scikit-learn umap-learn
- name: Docs can Build
run: |
sudo apt-get update && sudo apt-get install pandoc
python -m pip install -e ".[docs]"
uv pip install -e ".[docs]" --system
mkdocs build
17 changes: 12 additions & 5 deletions .github/workflows/schedule-dependencies.yml
@@ -1,8 +1,10 @@
name: Cron Test Dependencies

on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"


jobs:
cron:
@@ -15,17 +17,22 @@ jobs:
steps:
- name: Checkout source code
uses: actions/checkout@v4
- name: Install uv (Unix)
if: runner.os != 'Windows'
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install uv (Windows)
if: runner.os == 'Windows'
run: powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install wheel
pip install ${{ matrix.pre-release-dependencies }} scikit-lego
pip freeze
uv pip install wheel --system
uv pip install ${{ matrix.pre-release-dependencies }} scikit-lego --system
uv pip freeze
- name: Test with pytest
run: |
pip install -e ".[test]"
uv pip install -e ".[test]" --system
make test
13 changes: 8 additions & 5 deletions .github/workflows/test.yml
@@ -22,14 +22,17 @@ jobs:
steps:
- name: Checkout source code
uses: actions/checkout@v4
- name: Install uv (Unix)
if: runner.os != 'Windows'
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Install uv (Windows)
if: runner.os == 'Windows'
run: powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip --no-cache-dir
python -m pip install -e ".[test]"
run: uv pip install -e ".[test]" --system
- name: Test with pytest
run: |
make test
run: make test
25 changes: 25 additions & 0 deletions docs/_scripts/cross-validation.py
@@ -204,3 +204,28 @@ def print_folds(cv, X, y, groups):
grid.best_estimator_.get_params()["reg__alpha"]
# 0.8
# --8<-- [end:grid-search]



######################################## ClusterKfold ####################################
##########################################################################################

# --8<-- [start:cluster-fold-start]
from sklego.model_selection import ClusterFoldValidation
from sklearn.cluster import KMeans

clusterer = KMeans(n_clusters=5, random_state=42)
folder = ClusterFoldValidation(clusterer)
# --8<-- [end:cluster-fold-start]


# --8<-- [start:cluster-fold-plot]
import matplotlib.pylab as plt
import numpy as np

X_orig = np.random.uniform(0, 1, (1000, 2))
for i, split in enumerate(folder.split(X_orig)):
    x_train, x_valid = split
    plt.scatter(X_orig[x_valid, 0], X_orig[x_valid, 1], label=f"split {i}")
plt.legend();
# --8<-- [end:cluster-fold-plot]
Binary file added docs/_static/cross-validation/kfold.png
2 changes: 1 addition & 1 deletion docs/contribution.md
@@ -20,7 +20,7 @@ This means we're usually open to ideas to add here but there are a few things to

When writing a new feature there's some more
[details with regard to how scikit learn likes to have its parts implemented][scikit-develop].
We will display the a sample implementation of the `ColumnSelector` below. Please review all comments marked as Important.
We will display a sample implementation of the `ColumnSelector` below. Please review all comments marked as Important.

```py hl_lines="19-22 24-28 46-51 65-69 77-78 83-85" linenums="1"
from sklearn.base import BaseEstimator, TransformerMixin, MetaEstimatorMixin
35 changes: 35 additions & 0 deletions docs/user-guide/cross-validation.md
@@ -127,5 +127,40 @@ To use `GroupTimeSeriesSplit` with sklearn's [GridSearchCV](https://scikit-learn
--8<-- "docs/_scripts/cross-validation.py:grid-search"
```

## Cluster-Kfold

The [ClusterFoldValidation][clusterfold-api] object is a cross-validator that splits the data into `n_splits` folds, where each fold is determined by a clustering algorithm. This is not a common pattern, and it is probably closer to an anti-pattern, but it can be useful when you want to make sure that the train and test sets are very distinct. You can see it as a way to make it harder for the algorithm to perform well, because the training sets are sampled differently from the test sets.

### Example

Here's how you could set up a cross validator that uses KMeans.

```py title="Using Kmeans to generate folds"
--8<-- "docs/_scripts/cross-validation.py:cluster-fold-start"
```

You can also use other clustering methods, but the nice thing about Kmeans is that it demos well. Here's how it would generate folds on a uniform dataset.

```py title="Plotting the folds generated by Kmeans"
--8<-- "docs/_scripts/cross-validation.py:cluster-fold-plot"
```

![example-1](../_static/cross-validation/kfold.png)

As you can see, each split will focus on a cluster of the data. Hopefully this also makes it clear that this method will ensure that each validation set will be rather distinct from the train set. These sets are not only exclusive, but they are also from a different region of the data by design.

Note that this image is mostly for illustrative purposes because you typically won't directly generate these folds yourself. Instead you'd use a helper function like `cross_val_score` or `GridSearchCV` to do this for you.

```py title="More realistic example"
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklego.model_selection import ClusterFoldValidation

# Given an existing pipeline and X,y dataset, you probably would do something like this:
fold_method = ClusterFoldValidation(
    KMeans(n_clusters=5, random_state=42)
)
cross_val_score(pipeline, X, y, cv=fold_method)
```
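
To make the fold logic concrete, here is a minimal, library-free sketch of how cluster labels could be turned into train/validation splits. It is an illustration only and not the `ClusterFoldValidation` implementation: the helper name `folds_from_cluster_labels` and the example labels are hypothetical, and it assumes that each distinct cluster label becomes the validation set of exactly one fold.

```python
from collections import defaultdict


def folds_from_cluster_labels(labels):
    """Yield (train_indices, valid_indices) pairs, one per distinct cluster label.

    Hypothetical helper: assumes every cluster is used once as the validation set.
    """
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    for label in sorted(members):
        valid = members[label]
        valid_set = set(valid)
        # Everything outside the current cluster is used for training.
        train = [i for i in range(len(labels)) if i not in valid_set]
        yield train, valid


# Hypothetical labels, in the shape a fitted clusterer might assign to 7 rows.
labels = [0, 1, 0, 2, 1, 2, 0]
folds = list(folds_from_cluster_labels(labels))

for train, valid in folds:
    print("train:", train, "valid:", valid)
```

Because every validation set is a whole cluster, the train and validation rows never come from the same region of the data, which is the property the plot above illustrates.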

[time-gap-split-api]: ../../api/model-selection#sklego.model_selection.TimeGapSplit
[group-ts-split-api]: ../../api/model-selection#sklego.model_selection.GroupTimeSeriesSplit
[clusterfold-api]: ../../api/model-selection#sklego.model_selection.ClusterFoldValidation
2 changes: 1 addition & 1 deletion docs/user-guide/linear-models.md
@@ -105,7 +105,7 @@ We've turned the array into a dataframe so that we can apply the [`ColumnSelecto

--8<-- "docs/_static/linear-models/grid.html"

You can see that the `ProbWeightRegression` indeeds sums to one.
You can see that the `ProbWeightRegression` indeed sums to one.

```py
--8<-- "docs/_scripts/linear-models.py:prob-weight-coefs"
4 changes: 3 additions & 1 deletion docs/user-guide/mixture-methods.md
@@ -4,7 +4,7 @@ Gaussian Mixture Models (GMMs) are flexible building blocks for other machine le

This is in part because they are great approximations for general probability distributions but also because they remain somewhat interpretable even when the dataset gets very complex.

This package makes use of GMMs to construct other algorithms.
This package makes use of GMMs to construct other algorithms. In addition to the [GMMClassifier][gmm-classifier-api] and [GMMOutlierDetector][gmm-outlier-detector-api], this library also features a [BayesianGMMClassifier][bayes_gmm-classifier-api] and a [BayesianGMMOutlierDetector][bayes_gmm-outlier-detector-api]. These offer pretty much the same API, but they use internal methods to figure out how many components to estimate. They tend to take significantly more time to train, so as an alternative you may also run a proper grid search to find the best number of components for your use case.

## Classification

@@ -59,4 +59,6 @@ As a sidenote: this image was generated with some dummy data, but its code can b
```

[gmm-classifier-api]: ../../api/mixture#sklego.mixture.gmm_classifier.GMMClassifier
[bayes_gmm-classifier-api]: ../../api/mixture#sklego.mixture.bayesian_gmm_classifier.BayesianGMMClassifier
[gmm-outlier-detector-api]: ../../api/mixture#sklego.mixture.gmm_outlier_detector.GMMOutlierDetector
[bayes_gmm-outlier-detector-api]: ../../api/mixture#sklego.mixture.bayesian_gmm_detector.BayesianGMMOutlierDetector
9 changes: 9 additions & 0 deletions docs/user-guide/preprocessing.md
@@ -253,6 +253,14 @@ Now let's see what occurs when we add a constraint that enforces the feature to

If these features are now passed to a model that supports monotonicity constraints then we can build models with guarantees.

## Outlier Removal

The [`OutlierRemover`][outlier-remover-api] class is a transformer that uses an outlier detector estimator to remove outliers from your dataset, and it does so during training only. This can be useful in scenarios where outliers in the training data negatively impact the performance of your model. By removing these outliers during training, your model can learn from a "clean" dataset, which may lead to better performance.

It's important to note that this transformer only removes outliers during training. This means that when you use your trained model to predict on new data, the new data will not have any outliers removed. This is useful because in a real-world scenario, new data may contain outliers and you would want your model to be able to handle these cases.

The `OutlierRemover` class is initialized with an `outlier_detector` estimator, and a boolean flag `refit`. The outlier detector should be a scikit-learn compatible estimator that implements `.fit()` and `.predict()` methods. The refit flag determines whether the underlying estimator is fitted during `OutlierRemover.fit()`.
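
The train-only behaviour described above can be sketched in plain Python. This is a hedged illustration, not the `OutlierRemover` implementation: `ZScoreDetector`, `TrainOnlyOutlierFilter` and the method name `transform_train` are hypothetical names chosen for this sketch, and it assumes the scikit-learn convention that an outlier detector's `.predict()` returns `-1` for outliers and `1` for inliers.

```python
from statistics import mean, pstdev


class ZScoreDetector:
    """Hypothetical detector: predict() returns 1 for inliers and -1 for outliers."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for constant data
        return self

    def predict(self, values):
        return [
            1 if abs(v - self.mean_) / self.std_ <= self.threshold else -1
            for v in values
        ]


class TrainOnlyOutlierFilter:
    """Hypothetical stand-in: drops outliers at training time, passes new data through."""

    def __init__(self, outlier_detector, refit=True):
        self.outlier_detector = outlier_detector
        self.refit = refit

    def fit(self, values):
        # Only (re)fit the wrapped detector when asked to.
        if self.refit:
            self.outlier_detector.fit(values)
        return self

    def transform_train(self, values):
        # Training path: keep only the rows the detector labels as inliers.
        flags = self.outlier_detector.predict(values)
        return [v for v, flag in zip(values, flags) if flag == 1]

    def transform(self, values):
        # Prediction path: new data is returned unchanged.
        return list(values)


train = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 50.0]
remover = TrainOnlyOutlierFilter(ZScoreDetector(threshold=2.0)).fit(train)
clean_train = remover.transform_train(train)
new_data = remover.transform([1.0, 50.0])
```

In this sketch the extreme value is dropped from the training data but kept when it shows up in new data, which mirrors the distinction made in the paragraphs above.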

[estimator-transformer-api]: ../../api/meta#sklego.meta.estimator_transformer.EstimatorTransformer
[meta-module]: ../../api/meta
[id-transformer-api]: ../../api/preprocessing#sklego.preprocessing.identitytransformer.IdentityTransformer
@@ -261,6 +269,7 @@ If these features are now passed to a model that supports monotonicity constrain
[rbf-api]: ../../api/preprocessing#sklego.preprocessing.repeatingbasis.RepeatingBasisFunction
[interval-encoder-api]: ../../api/preprocessing#sklego.preprocessing.intervalencoder.IntervalEncoder
[decay-section]: ../../user-guide/meta#decayed-estimation
[outlier-remover-api]: ../../api/preprocessing#sklego.preprocessing.outlier_remover.OutlierRemover

[formulaic-docs]: https://matthewwardrop.github.io/formulaic/
[formulaic-formulas]: https://matthewwardrop.github.io/formulaic/formulas/
10 changes: 5 additions & 5 deletions sklego/mixture/bayesian_gmm_detector.py
@@ -12,7 +12,11 @@ class BayesianGMMOutlierDetector(OutlierMixin, BaseEstimator):
"""The `BayesianGMMOutlierDetector` trains a Bayesian Gaussian Mixture model on a dataset `X`. Once a density is
trained we can evaluate the likelihood scores to see if it is deemed likely.
By giving a threshold this model might then label outliers if their likelihood score is too low.
By providing a `threshold` this model might then label outliers if their likelihood score is too low.
!!! note
The parameters other than `threshold` and `method` are an exact copy of the parameters in
[sklearn.mixture.BayesianGaussianMixture]( https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html).
Parameters
----------
@@ -28,10 +32,6 @@ class BayesianGMMOutlierDetector(OutlierMixin, BaseEstimator):
If you select `method="stddev"` then the threshold value represents the
numbers of standard deviations before calling something an outlier.
!!! note
The other parameters are an exact copy of the parameters in
[sklearn.mixture.BayesianGaussianMixture]( https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html).
Attributes
----------
gmm_ : BayesianGaussianMixture
10 changes: 5 additions & 5 deletions sklego/mixture/gmm_outlier_detector.py
@@ -12,7 +12,11 @@ class GMMOutlierDetector(OutlierMixin, BaseEstimator):
"""The `GMMOutlierDetector` trains a Gaussian Mixture model on a dataset `X`. Once a density is trained we can evaluate the
likelihood scores to see if it is deemed likely.
By giving a threshold this model might then label outliers if their likelihood score is too low.
By providing a `threshold` this model might then label outliers if their likelihood score is too low.
!!! note
The parameters other than `threshold` and `method` are an exact copy of the parameters in
[sklearn.mixture.GaussianMixture]( https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html).
Parameters
----------
@@ -28,10 +32,6 @@ class GMMOutlierDetector(OutlierMixin, BaseEstimator):
If you select `method="stddev"` then the threshold value represents the
numbers of standard deviations before calling something an outlier.
!!! note
The other parameters are an exact copy of the parameters in
[sklearn.mixture.GaussianMixture]( https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html).
Attributes
----------
gmm_ : GaussianMixture
16 changes: 14 additions & 2 deletions sklego/model_selection.py
@@ -248,8 +248,20 @@ def get_split_info(X, indices, j, part, summary):
return pd.DataFrame(summary)


class KlusterFoldValidation:
"""KlusterFold cross validator. Create folds based on provided cluster method
def KlusterFoldValidation(*args, **kwargs):
    # Accept positional arguments too, so calls that pass the cluster method positionally keep working.
    warn(
        "Please use `ClusterFoldValidation` instead of `KlusterFoldValidation`. "
        "We will use correct spelling going forward and `KlusterFoldValidation` will be deprecated.",
        DeprecationWarning,
    )
    return ClusterFoldValidation(*args, **kwargs)


class ClusterFoldValidation:
"""Cross validator that creates folds based on provided cluster method.
This ensures that data points in the same cluster are not split across different folds.
!!! info "New in version 0.9.0"
Parameters
----------
3 changes: 2 additions & 1 deletion sklego/pandas_utils.py
@@ -141,7 +141,8 @@ def log_step_extra(
**log_func_kwargs: dict
Keyword arguments to be passed to `log_functions`
Returns:
Returns
-------
Callable
The decorated function.