Commit

feat: Narwhals for dataframe-agnostic codebase (#671)
* placeholder to develop narwhals features

* feat: make `ColumnDropper` dataframe-agnostic (#655)

* feat: make ColumnDropped dataframe-agnostic

* use narwhals[polars] in pyproject.toml, link to list of supported libraries

* note that narwhals is used for cross-dataframe support

* test refactor

* docstrings

---------

Co-authored-by: FBruzzesi <[email protected]>

* feat: make ColumnSelector dataframe-agnostic (#659)

* columnselector with test rufformatted

* adding whitespace

* fixed the fit and transform

* removed intendation in examples

* font:false

* feat: make `add_lags` dataframe-agnostic (#661)

* make add_lags dataframe-agnostic

* try getting tests to run?

* patch: cvxpy 1.5.0 support (#663)

---------

Co-authored-by: Francesco Bruzzesi <[email protected]>

* Make `RegressionOutlier` dataframe-agnostic (#665)

* make regression outlier df-agnostic

* need to use eager-only for this one

* pass native to check_array

* remove cudf, link to check_X_y

* feat: Make InformationFilter dataframe-agnostic

* Make Timegapsplit dataframe-agnostic (#668)

* make timegapsplit dataframe-agnostic

* actually, include cuDF

* feat: make FairClassifier data-agnostic (#669)

* start all over

* fixture working

* wip

* passing tests - again

* pre-commit complaining

* changed fixture on test_demographic_parity

* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)

* make pandas dtype selector df-agnostic

* bump version

* 3.8 compat

* Update sklego/preprocessing/pandastransformers.py

Co-authored-by: Francesco Bruzzesi <[email protected]>

* fixup pyproject.toml

* unify (and test!) error message

* deprecate

* update readme

* undo contribution.md change

---------

Co-authored-by: Francesco Bruzzesi <[email protected]>

* format typeselector and bump version

* feat: Make grouped and hierarchical dataframe-agnostic (#667)

* feat: make grouped and hierarchical dataframe-agnostic

* add pyarrow

* narwhals grouped_transformer

* grouped transformer eureka

* hierarchical narwhalified

* so close but so far

* return series instead of DataFrame for y

* grouped WIP

* merge branch and fix grouped

* future annotations

* format

* handling negative indices

* solve conflicts

* hacking C

* fairness: change C values in tests

---------

Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>
4 people authored May 24, 2024
1 parent 6a9654f commit fbb8e57
Showing 35 changed files with 1,158 additions and 736 deletions.
1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -4,6 +4,7 @@ on:
pull_request:
branches:
- main
- narwhals-development

jobs:
test:
5 changes: 5 additions & 0 deletions docs/api/preprocessing.md
@@ -64,3 +64,8 @@
options:
show_root_full_path: true
show_root_heading: true

:::sklego.preprocessing.pandastransformers.TypeSelector
options:
show_root_full_path: true
show_root_heading: true
2 changes: 1 addition & 1 deletion docs/contribution.md
@@ -174,7 +174,7 @@ When a new feature is introduced, it should be documented, and typically there a
- [x] A user guide in the `docs/user-guide/` folder.
- [x] A python script in the `docs/_scripts/` folder to generate plots and code snippets (see [next section](#working-with-pymdown-snippets-extension))
- [x] Relevant static files, such as images, plots, tables and html's, should be saved in the `docs/_static/` folder.
- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation.
- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation.

### Working with pymdown snippets extension

14 changes: 12 additions & 2 deletions docs/this.md
@@ -37,10 +37,20 @@ not everything needs to be built, not everything needs to be explored.
Change everything and you'll soon be a jerk,
you may invent a new tool, not a way to work.
Some problems cannot be solved in a single day,
but if you ignore them, they sometimes go away.
but if you can ignore them, they sometimes go away.

So as we forge ahead, let's remember the creed,
simplicity over complexity, our library's seed.
In the maze of features, let's not lose sight,
of the end goal in mind shining bright.

With each new feature, a temptation to craft,
but elegance is found in what we choose to subtract.
For every line of code, let's ask ourselves twice,
does it add clarity or is it a vice?

There's a lot of power in simplicity,
it keeps you approach strong,
it keeps the approach strong,
if you understand the solution better than the problem,
you're doing it wrong.
```
4 changes: 1 addition & 3 deletions mkdocs.yaml
@@ -21,9 +21,7 @@ theme:
name: material
logo: _static/logo.png
favicon: _static/logo.png
font:
text: Ubuntu
code: Ubuntu Mono
font: false
highlightjs: true
hljs_languages:
- bash
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "scikit-lego"
version = "0.8.2"
version = "0.9.0"
description="A collection of lego bricks for scikit-learn pipelines"

license = {file = "LICENSE"}
@@ -20,6 +20,7 @@ maintainers = [
]

dependencies = [
"narwhals>=0.8.13",
"pandas>=1.1.5",
"scikit-learn>=1.0",
"importlib-metadata >= 1.0; python_version < '3.8'",
@@ -61,6 +62,8 @@ docs = [
]

test = [
"narwhals[polars]",
"pyarrow",
"pytest>=6.2.5",
"pytest-xdist>=1.34.0",
"pytest-cov>=2.6.1",
@@ -111,4 +114,3 @@ markers = [
"formulaic: tests that require formulaic (deselect with '-m \"not formulaic\"')",
"umap: tests that require umap (deselect with '-m \"not umap\"')"
]

2 changes: 1 addition & 1 deletion readme.md
@@ -120,7 +120,7 @@ Here's a list of features that this library currently offers:
- `sklego.preprocessing.InformationFilter` transformer that can de-correlate features
- `sklego.preprocessing.IdentityTransformer` returns the same data, allows for concatenating pipelines
- `sklego.preprocessing.OrthogonalTransformer` makes all features linearly independent
- `sklego.preprocessing.PandasTypeSelector` selects columns based on pandas type
- `sklego.preprocessing.TypeSelector` selects columns based on type
- `sklego.preprocessing.RandomAdder` adds randomness in training
- `sklego.preprocessing.RepeatingBasisFunction` repeating feature engineering, useful for timeseries
- `sklego.preprocessing.DictMapper` assign numeric values on categorical columns
2 changes: 1 addition & 1 deletion sklego/common.py
@@ -58,7 +58,7 @@ def transform_train(self, X, y=None):
"""

_HASHERS = {
pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values).hexdigest(),
pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).to_numpy()).hexdigest(),
np.ndarray: lambda X: hash(X.data.tobytes()),
np.memmap: lambda X: hash(X.data.tobytes()),
}
22 changes: 11 additions & 11 deletions sklego/datasets.py
@@ -112,8 +112,8 @@ def load_penguins(return_X_y=False, as_frame=False):
"body_mass_g",
"sex",
]
].values,
df["species"].values,
].to_numpy(),
df["species"].to_numpy(),
)
if return_X_y:
return X, y
@@ -162,8 +162,8 @@ def load_arrests(return_X_y=False, as_frame=False):
if as_frame:
return df
X, y = (
df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].values,
df["released"].values,
df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].to_numpy(),
df["released"].to_numpy(),
)
if return_X_y:
return X, y
@@ -208,7 +208,7 @@ def load_chicken(return_X_y=False, as_frame=False):
df = pd.read_csv(filepath)
if as_frame:
return df
X, y = df[["time", "diet", "chick"]].values, df["weight"].values
X, y = df[["time", "diet", "chick"]].to_numpy(), df["weight"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -265,8 +265,8 @@ def load_abalone(return_X_y=False, as_frame=False):
"shell_weight",
"rings",
]
].values
y = df["sex"].values
].to_numpy()
y = df["sex"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -304,8 +304,8 @@ def load_heroes(return_X_y=False, as_frame=False):
df = pd.read_csv(filepath)
if as_frame:
return df
X = df[["health", "attack"]].values
y = df["attack_type"].values
X = df[["health", "attack"]].to_numpy()
y = df["attack_type"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -377,8 +377,8 @@ def load_hearts(return_X_y=False, as_frame=False):
"ca",
"thal",
]
].values
y = df["target"].values
].to_numpy()
y = df["target"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
6 changes: 3 additions & 3 deletions sklego/linear_model.py
@@ -9,8 +9,8 @@
from inspect import signature
from warnings import warn

import narwhals as nw
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.special._ufuncs import expit
from sklearn.base import BaseEstimator, RegressorMixin
@@ -493,8 +493,8 @@ def fit(self, X, y):
raise ValueError(f"penalty should be either 'l1' or 'none', got {self.penalty}")

self.sensitive_col_idx_ = self.sensitive_cols

if isinstance(X, pd.DataFrame):
X = nw.from_native(X, eager_only=True, strict=False)
if isinstance(X, nw.DataFrame):
self.sensitive_col_idx_ = [i for i, name in enumerate(X.columns) if name in self.sensitive_cols]
X, y = check_X_y(X, y, accept_large_sparse=False)
sensitive = X[:, self.sensitive_col_idx_]
78 changes: 41 additions & 37 deletions sklego/meta/_grouped_utils.py
@@ -1,55 +1,59 @@
from typing import Tuple
from __future__ import annotations

import numpy as np
from typing import List

import narwhals as nw
import pandas as pd
from scipy.sparse import issparse
from sklearn.utils import check_array
from sklearn.utils.validation import _ensure_no_complex_data


def _split_groups_and_values(
X, groups, name="", min_value_cols=1, check_X=True, **kwargs
) -> Tuple[pd.DataFrame, np.ndarray]:
_data_format_checks(X, name=name)
check_array(X, ensure_min_features=min_value_cols, dtype=None, force_all_finite=False)
def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
"""Converts X, y to narwhals dataframe.
try:
if isinstance(X, pd.DataFrame):
X_group = X.loc[:, groups]
X_value = X.drop(columns=groups).values
else:
X = np.asarray(X) # deals with `_NotAnArray` case
X_group = pd.DataFrame(X[:, groups])
pos_indexes = range(X.shape[1])
X_value = np.delete(X, [pos_indexes[g] for g in groups], axis=1)
except (KeyError, IndexError):
raise ValueError(f"Could not drop groups {groups} from columns of X")
If it is not a supported dataframe, it uses pandas constructor as a fallback.
X_group = _check_grouping_columns(X_group, **kwargs)
Additionally, data checks are performed.
"""
# Check raw X
_data_format_checks(X)

if check_X:
X_value = check_array(X_value, **kwargs)
# Convert X to Narwhals frame
X = nw.from_native(X, strict=False, eager_only=True)
if not isinstance(X, nw.DataFrame):
X = nw.from_native(pd.DataFrame(X))

return X_group, X_value
# Check groups and feaures values
if groups is not None:
_validate_groups_values(X, groups)

if check_X:
check_array(X.drop(groups), **kwargs)

def _data_format_checks(X, name):
_ensure_no_complex_data(X)
# Convert y and assign it to the frame
n_samples = X.shape[0]
native_space = nw.get_native_namespace(X)

y_native = native_space.Series([None] * n_samples) if y is None else native_space.Series(y)
return X.with_columns(__sklego_target__=nw.from_native(y_native, allow_series=True))

if issparse(X): # sklearn.validation._ensure_sparse_format to complicated
raise ValueError(f"The estimator {name} does not work on sparse matrices")

def _validate_groups_values(X: nw.DataFrame, groups: List[int] | List[str]) -> None:
X_cols = X.columns
unexisting_cols = [g for g in groups if g not in X_cols]

def _check_grouping_columns(X_group, **kwargs) -> pd.DataFrame:
"""Do basic checks on grouping columns"""
# Do regular checks on numeric columns
X_group_num = X_group.select_dtypes(include="number")
if X_group_num.shape[1]:
check_array(X_group_num, **kwargs)
if len(unexisting_cols):
raise ValueError(f"The following groups are not available in X: {unexisting_cols}")

# Only check missingness in object columns
if X_group.select_dtypes(exclude="number").isnull().any(axis=None):
raise ValueError("X has NaN values")
if X.select(nw.col(groups).is_null().any()).to_numpy().squeeze().any():
raise ValueError("Groups values have NaN")

# The grouping part we always want as a DataFrame with range index
return X_group.reset_index(drop=True)

def _data_format_checks(X):
"""Checks that X is not sparse nor has complex dtype"""
_ensure_no_complex_data(X)

if issparse(X): # sklearn.validation._ensure_sparse_format to complicated
msg = "Estimator does not work on sparse matrices"
raise ValueError(msg)
19 changes: 13 additions & 6 deletions sklego/meta/_shrinkage_utils.py
@@ -1,9 +1,10 @@
from functools import partial

import narwhals as nw
import numpy as np
from sklearn.utils.validation import check_is_fitted

from sklego.common import expanding_list
from sklego.common import as_list, expanding_list


def constant_shrinkage(group_sizes, alpha: float) -> np.ndarray:
@@ -193,20 +194,26 @@ def _fit_shrinkage_factors(self, frame, groups, most_granular_only=False):
Whether to return only the shrinkage factors for the most granular group values.
"""
check_is_fitted(self, ["estimators_", "shrinkage_function_"])
counts = frame.groupby(groups).size().rename("counts")
counts = frame.group_by(groups).agg(nw.len().alias("counts"))
all_grp_values = list(self.estimators_.keys())

if most_granular_only:
all_grp_values = [grp_value for grp_value in all_grp_values if len(grp_value) == len(groups)]
all_grp_values = [grp_value for grp_value in all_grp_values if len(as_list(grp_value)) == len(groups)]

hierarchical_counts = {
grp_value: [counts.loc[subgroup].sum() for subgroup in expanding_list(grp_value, tuple)]
grp_value: [
# As zip is "zip shortest" and filter works with comma separate conditions:
counts.filter(*[nw.col(c) == v for c, v in zip(groups, subgroup)])
.select(nw.sum("counts"))
.to_numpy()[0][0]
for subgroup in expanding_list(grp_value, tuple)
]
for grp_value in all_grp_values
}

shrinkage_factors = {
grp_value: self.shrinkage_function_(counts, **self.shrinkage_kwargs)
for grp_value, counts in hierarchical_counts.items()
grp_value: self.shrinkage_function_(counts_, **self.shrinkage_kwargs)
for grp_value, counts_ in hierarchical_counts.items()
}

# Normalize and pad
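
The `hierarchical_counts` comprehension in the hunk above relies on `zip` stopping at the shorter of its two inputs: a subgroup with one value constrains only the first group column, a subgroup with two values constrains the first two, and so on. The following is a minimal pure-Python sketch of that idea, not the library's implementation. It assumes rows are plain dicts, and `prefixes` is a hypothetical stand-in for `expanding_list(grp_value, tuple)`.

```python
from __future__ import annotations

from typing import Any, Sequence


def prefixes(values: Sequence[Any]) -> list[tuple]:
    """Hypothetical stand-in for expanding_list(values, tuple): all non-empty prefixes."""
    return [tuple(values[: i + 1]) for i in range(len(values))]


def hierarchical_counts(rows: list[dict], groups: list[str]) -> dict[tuple, list[int]]:
    """For each most-granular group value, count the rows matching each of its prefixes."""
    leaves = {tuple(row[g] for g in groups) for row in rows}
    return {
        leaf: [
            # zip(groups, subgroup) is truncated to len(subgroup), so only the
            # first len(subgroup) group columns are compared for this level.
            sum(all(row[c] == v for c, v in zip(groups, subgroup)) for row in rows)
            for subgroup in prefixes(leaf)
        ]
        for leaf in sorted(leaves)
    }


rows = [
    {"country": "NL", "city": "AMS"},
    {"country": "NL", "city": "AMS"},
    {"country": "NL", "city": "RTM"},
    {"country": "BE", "city": "BRU"},
]
counts = hierarchical_counts(rows, ["country", "city"])
```

Each list runs from the coarsest level to the most granular one, which is the shape the shrinkage function receives in the diff.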

5 comments on commit fbb8e57

@Manohar0077
@Manohar0077 Manohar0077 commented on fbb8e57 Jun 6, 2024

I am getting this error when trying to use RepeatingBasisFunction.

    from sklego.preprocessing import RepeatingBasisFunction
  File "/home/ec2-user/anaconda3/envs/testenv/lib/python3.11/site-packages/sklego/preprocessing/__init__.py", line 24, in <module>
    from sklego.preprocessing.pandastransformers import ColumnDropper, ColumnSelector, PandasTypeSelector, TypeSelector
  File "/home/ec2-user/anaconda3/envs/testenv/lib/python3.11/site-packages/sklego/preprocessing/pandastransformers.py", line 5, in <module>
    import narwhals as nw
ModuleNotFoundError: No module named 'narwhals'

@MarcoGorelli
Contributor

hi @Manohar0077 - how did you install scikit-lego? Narwhals is listed as a dependency, so it should bring it in for you

dependencies = [
"narwhals>=0.8.13",
"pandas>=1.1.5",
"scikit-learn>=1.0",
"importlib-metadata >= 1.0; python_version < '3.8'",
"importlib-resources; python_version < '3.9'",
]

@FBruzzesi
Collaborator Author

Hey @Manohar0077, as Marco mentioned, if you installed scikit-lego directly then narwhals comes as a dependency.
A couple of considerations:

  • If the problem persists, please consider opening an issue instead of discussing it in here
  • A more general version of RepeatingBasisFunction made it into scikit-learn, consider taking a look at SplineTransformer.

@MarcoGorelli
Contributor

> if you installed scikit-lego directly then narwhals comes as a dependency

looks like this might have been forgotten for the conda-forge package? conda-forge/scikit-lego-feedstock#28

@Manohar0077

I installed it with the command `conda install conda-forge::scikit-lego`. But it should also install the dependencies under it, right?
Anyway I will open an issue on this.
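
For anyone hitting the same `ModuleNotFoundError`, a quick way to see what the installed Python metadata declares, and whether narwhals is actually present in the environment, is the standard library's `importlib.metadata`. The snippet below is a minimal sketch, and the helper names `declared_requirements` and `is_installed` are made up for illustration. Note that it reads the Python package metadata only. A conda recipe lists its requirements separately, so the two can disagree, which appears to be what happened in this thread.

```python
from __future__ import annotations

from importlib import metadata


def declared_requirements(dist_name: str) -> list[str]:
    """Return the requirement strings a distribution declares, or [] if it is not installed."""
    try:
        # metadata.requires returns None when no requirements are declared.
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []


def is_installed(dist_name: str) -> bool:
    """Check whether a distribution is installed in the current environment."""
    try:
        metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


if __name__ == "__main__":
    print("scikit-lego declares:", declared_requirements("scikit-lego"))
    print("narwhals installed:", is_installed("narwhals"))
```

If scikit-lego declares narwhals but `is_installed("narwhals")` is false, installing narwhals by hand (for example with pip, or with conda if a package is available on your channel) should work around the problem until the package recipe is fixed.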
