feat: Narwhals for dataframe-agnostic codebase #671

Merged: 16 commits, May 24, 2024
1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -4,6 +4,7 @@ on:
pull_request:
branches:
- main
- narwhals-development
Owner commented:
Can this go away once we have it in main?


jobs:
test:
5 changes: 5 additions & 0 deletions docs/api/preprocessing.md
@@ -64,3 +64,8 @@
options:
show_root_full_path: true
show_root_heading: true

:::sklego.preprocessing.pandastransformers.TypeSelector
options:
show_root_full_path: true
show_root_heading: true
2 changes: 1 addition & 1 deletion docs/contribution.md
@@ -174,7 +174,7 @@ When a new feature is introduced, it should be documented, and typically there a
- [x] A user guide in the `docs/user-guide/` folder.
- [x] A python script in the `docs/_scripts/` folder to generate plots and code snippets (see [next section](#working-with-pymdown-snippets-extension))
- [x] Relevant static files, such as images, plots, tables and html's, should be saved in the `docs/_static/` folder.
- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation.
- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation.

### Working with pymdown snippets extension

14 changes: 12 additions & 2 deletions docs/this.md
@@ -37,10 +37,20 @@ not everything needs to be built, not everything needs to be explored.
Change everything and you'll soon be a jerk,
you may invent a new tool, not a way to work.
Some problems cannot be solved in a single day,
but if you ignore them, they sometimes go away.
but if you can ignore them, they sometimes go away.

So as we forge ahead, let's remember the creed,
simplicity over complexity, our library's seed.
In the maze of features, let's not lose sight,
of the end goal in mind shining bright.

With each new feature, a temptation to craft,
but elegance is found in what we choose to subtract.
For every line of code, let's ask ourselves twice,
does it add clarity or is it a vice?

There's a lot of power in simplicity,
it keeps you approach strong,
it keeps the approach strong,
if you understand the solution better than the problem,
you're doing it wrong.
```
4 changes: 1 addition & 3 deletions mkdocs.yaml
@@ -21,9 +21,7 @@ theme:
name: material
logo: _static/logo.png
favicon: _static/logo.png
font:
text: Ubuntu
code: Ubuntu Mono
font: false
highlightjs: true
hljs_languages:
- bash
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "scikit-lego"
version = "0.8.2"
version = "0.9.0"
description="A collection of lego bricks for scikit-learn pipelines"

license = {file = "LICENSE"}
@@ -20,6 +20,7 @@ maintainers = [
]

dependencies = [
"narwhals>=0.8.13",
"pandas>=1.1.5",
"scikit-learn>=1.0",
"importlib-metadata >= 1.0; python_version < '3.8'",
@@ -61,6 +62,8 @@ docs = [
]

test = [
"narwhals[polars]",
"pyarrow",
"pytest>=6.2.5",
"pytest-xdist>=1.34.0",
"pytest-cov>=2.6.1",
@@ -111,4 +114,3 @@ markers = [
"formulaic: tests that require formulaic (deselect with '-m \"not formulaic\"')",
"umap: tests that require umap (deselect with '-m \"not umap\"')"
]

2 changes: 1 addition & 1 deletion readme.md
@@ -120,7 +120,7 @@ Here's a list of features that this library currently offers:
- `sklego.preprocessing.InformationFilter` transformer that can de-correlate features
- `sklego.preprocessing.IdentityTransformer` returns the same data, allows for concatenating pipelines
- `sklego.preprocessing.OrthogonalTransformer` makes all features linearly independent
- `sklego.preprocessing.PandasTypeSelector` selects columns based on pandas type
- `sklego.preprocessing.TypeSelector` selects columns based on type
- `sklego.preprocessing.RandomAdder` adds randomness in training
- `sklego.preprocessing.RepeatingBasisFunction` repeating feature engineering, useful for timeseries
- `sklego.preprocessing.DictMapper` assign numeric values on categorical columns
2 changes: 1 addition & 1 deletion sklego/common.py
@@ -58,7 +58,7 @@ def transform_train(self, X, y=None):
"""

_HASHERS = {
pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values).hexdigest(),
pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).to_numpy()).hexdigest(),
np.ndarray: lambda X: hash(X.data.tobytes()),
np.memmap: lambda X: hash(X.data.tobytes()),
}
22 changes: 11 additions & 11 deletions sklego/datasets.py
@@ -112,8 +112,8 @@ def load_penguins(return_X_y=False, as_frame=False):
"body_mass_g",
"sex",
]
].values,
df["species"].values,
].to_numpy(),
df["species"].to_numpy(),
)
if return_X_y:
return X, y
@@ -162,8 +162,8 @@ def load_arrests(return_X_y=False, as_frame=False):
if as_frame:
return df
X, y = (
df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].values,
df["released"].values,
df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].to_numpy(),
df["released"].to_numpy(),
)
if return_X_y:
return X, y
@@ -208,7 +208,7 @@ def load_chicken(return_X_y=False, as_frame=False):
df = pd.read_csv(filepath)
if as_frame:
return df
X, y = df[["time", "diet", "chick"]].values, df["weight"].values
X, y = df[["time", "diet", "chick"]].to_numpy(), df["weight"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -265,8 +265,8 @@ def load_abalone(return_X_y=False, as_frame=False):
"shell_weight",
"rings",
]
].values
y = df["sex"].values
].to_numpy()
y = df["sex"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -304,8 +304,8 @@ def load_heroes(return_X_y=False, as_frame=False):
df = pd.read_csv(filepath)
if as_frame:
return df
X = df[["health", "attack"]].values
y = df["attack_type"].values
X = df[["health", "attack"]].to_numpy()
y = df["attack_type"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
@@ -377,8 +377,8 @@ def load_hearts(return_X_y=False, as_frame=False):
"ca",
"thal",
]
].values
y = df["target"].values
].to_numpy()
y = df["target"].to_numpy()
if return_X_y:
return X, y
return {"data": X, "target": y}
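
Review note: the `.values` to `.to_numpy()` swap repeated across the loaders in `sklego/datasets.py` should be behavior-preserving for plain pandas frames like these. A minimal sketch, assuming a small hypothetical frame (not one of the bundled CSV files) with column names borrowed from `load_heroes`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a bundled dataset; the values are made up.
df = pd.DataFrame(
    {
        "health": [10, 20, 30],
        "attack": [1.5, 2.5, 3.5],
        "attack_type": ["a", "b", "a"],
    }
)

# Old spelling (attribute) and new spelling (method) of the same conversion.
X_old = df[["health", "attack"]].values
X_new = df[["health", "attack"]].to_numpy()
y_new = df["attack_type"].to_numpy()

# Both spellings yield the same NumPy array for this frame.
same = np.array_equal(X_old, X_new)
```

The pandas documentation recommends `to_numpy()` over `.values`, which is presumably the motivation for the change.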
6 changes: 3 additions & 3 deletions sklego/linear_model.py
@@ -9,8 +9,8 @@
from inspect import signature
from warnings import warn

import narwhals as nw
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.special._ufuncs import expit
from sklearn.base import BaseEstimator, RegressorMixin
@@ -493,8 +493,8 @@ def fit(self, X, y):
raise ValueError(f"penalty should be either 'l1' or 'none', got {self.penalty}")

self.sensitive_col_idx_ = self.sensitive_cols

if isinstance(X, pd.DataFrame):
X = nw.from_native(X, eager_only=True, strict=False)
if isinstance(X, nw.DataFrame):
self.sensitive_col_idx_ = [i for i, name in enumerate(X.columns) if name in self.sensitive_cols]
X, y = check_X_y(X, y, accept_large_sparse=False)
sensitive = X[:, self.sensitive_col_idx_]
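
Review note: the list comprehension in this hunk turns column names into positional indexes, so the columns can still be sliced after `check_X_y` has converted the frame to an array. A plain-Python sketch of that step, where `resolve_sensitive_idx` and the sample data are hypothetical and plain lists stand in for the dataframe and the array:

```python
def resolve_sensitive_idx(columns, sensitive_cols):
    """Map column names to positional indexes, following the frame's column order."""
    return [i for i, name in enumerate(columns) if name in sensitive_cols]


# Hypothetical column names and rows.
columns = ["age", "gender", "income", "race"]
idx = resolve_sensitive_idx(columns, sensitive_cols=["race", "gender"])

# Once names are gone, positional slicing is all that is left.
rows = [[31, 0, 4.2, 1], [45, 1, 3.9, 0]]
sensitive = [[row[i] for i in idx] for row in rows]
```

Note that the resulting indexes follow the order of the frame's columns, not the order in which the names were listed in `sensitive_cols`.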
78 changes: 41 additions & 37 deletions sklego/meta/_grouped_utils.py
@@ -1,55 +1,59 @@
from typing import Tuple
from __future__ import annotations

import numpy as np
from typing import List

import narwhals as nw
import pandas as pd
from scipy.sparse import issparse
from sklearn.utils import check_array
from sklearn.utils.validation import _ensure_no_complex_data


def _split_groups_and_values(
X, groups, name="", min_value_cols=1, check_X=True, **kwargs
) -> Tuple[pd.DataFrame, np.ndarray]:
_data_format_checks(X, name=name)
check_array(X, ensure_min_features=min_value_cols, dtype=None, force_all_finite=False)
def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
"""Converts X, y to narwhals dataframe.

try:
if isinstance(X, pd.DataFrame):
X_group = X.loc[:, groups]
X_value = X.drop(columns=groups).values
else:
X = np.asarray(X) # deals with `_NotAnArray` case
X_group = pd.DataFrame(X[:, groups])
pos_indexes = range(X.shape[1])
X_value = np.delete(X, [pos_indexes[g] for g in groups], axis=1)
except (KeyError, IndexError):
raise ValueError(f"Could not drop groups {groups} from columns of X")
If it is not a supported dataframe, it uses pandas constructor as a fallback.

X_group = _check_grouping_columns(X_group, **kwargs)
Additionally, data checks are performed.
"""
# Check raw X
_data_format_checks(X)

if check_X:
X_value = check_array(X_value, **kwargs)
# Convert X to Narwhals frame
X = nw.from_native(X, strict=False, eager_only=True)
if not isinstance(X, nw.DataFrame):
X = nw.from_native(pd.DataFrame(X))

return X_group, X_value
# Check groups and features values
if groups is not None:
_validate_groups_values(X, groups)

if check_X:
check_array(X.drop(groups), **kwargs)

def _data_format_checks(X, name):
_ensure_no_complex_data(X)
# Convert y and assign it to the frame
n_samples = X.shape[0]
native_space = nw.get_native_namespace(X)

y_native = native_space.Series([None] * n_samples) if y is None else native_space.Series(y)
return X.with_columns(__sklego_target__=nw.from_native(y_native, allow_series=True))

if issparse(X): # sklearn.validation._ensure_sparse_format too complicated
raise ValueError(f"The estimator {name} does not work on sparse matrices")

def _validate_groups_values(X: nw.DataFrame, groups: List[int] | List[str]) -> None:
X_cols = X.columns
unexisting_cols = [g for g in groups if g not in X_cols]

def _check_grouping_columns(X_group, **kwargs) -> pd.DataFrame:
"""Do basic checks on grouping columns"""
# Do regular checks on numeric columns
X_group_num = X_group.select_dtypes(include="number")
if X_group_num.shape[1]:
check_array(X_group_num, **kwargs)
if len(unexisting_cols):
raise ValueError(f"The following groups are not available in X: {unexisting_cols}")

# Only check missingness in object columns
if X_group.select_dtypes(exclude="number").isnull().any(axis=None):
raise ValueError("X has NaN values")
if X.select(nw.col(groups).is_null().any()).to_numpy().squeeze().any():
raise ValueError("Groups values have NaN")

# The grouping part we always want as a DataFrame with range index
return X_group.reset_index(drop=True)

def _data_format_checks(X):
"""Checks that X is not sparse nor has complex dtype"""
_ensure_no_complex_data(X)

if issparse(X): # sklearn.validation._ensure_sparse_format too complicated
msg = "Estimator does not work on sparse matrices"
raise ValueError(msg)
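
Review note: `_validate_groups_values` performs two checks, that every group column exists and that no group column holds missing values. A plain-Python sketch of the same two checks, assuming a dict of lists as a stand-in for the narwhals frame; `validate_groups` is a hypothetical name, and treating both `None` and `NaN` as missing is an assumption of this sketch rather than a statement about how `is_null` behaves on each backend:

```python
import math


def validate_groups(data, groups):
    """Raise ValueError if a group column is absent or contains missing values."""
    missing_cols = [g for g in groups if g not in data]
    if missing_cols:
        raise ValueError(f"The following groups are not available in X: {missing_cols}")

    for g in groups:
        # Assumption: None and float NaN both count as missing here.
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in data[g]):
            raise ValueError("Groups values have NaN")


# Hypothetical data: "store" has a missing entry, "region" does not exist.
data = {"store": ["a", "b", None], "sales": [1.0, 2.0, 3.0]}

errors = []
for groups in (["region"], ["store"]):
    try:
        validate_groups(data, groups)
    except ValueError as exc:
        errors.append(str(exc))

# A complete, existing column passes silently.
validate_groups(data, ["sales"])
```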
19 changes: 13 additions & 6 deletions sklego/meta/_shrinkage_utils.py
@@ -1,9 +1,10 @@
from functools import partial

import narwhals as nw
import numpy as np
from sklearn.utils.validation import check_is_fitted

from sklego.common import expanding_list
from sklego.common import as_list, expanding_list


def constant_shrinkage(group_sizes, alpha: float) -> np.ndarray:
@@ -193,20 +194,26 @@ def _fit_shrinkage_factors(self, frame, groups, most_granular_only=False):
Whether to return only the shrinkage factors for the most granular group values.
"""
check_is_fitted(self, ["estimators_", "shrinkage_function_"])
counts = frame.groupby(groups).size().rename("counts")
counts = frame.group_by(groups).agg(nw.len().alias("counts"))
all_grp_values = list(self.estimators_.keys())

if most_granular_only:
all_grp_values = [grp_value for grp_value in all_grp_values if len(grp_value) == len(groups)]
all_grp_values = [grp_value for grp_value in all_grp_values if len(as_list(grp_value)) == len(groups)]

hierarchical_counts = {
grp_value: [counts.loc[subgroup].sum() for subgroup in expanding_list(grp_value, tuple)]
grp_value: [
# zip stops at the shortest input, and filter accepts several comma-separated conditions:
counts.filter(*[nw.col(c) == v for c, v in zip(groups, subgroup)])
.select(nw.sum("counts"))
.to_numpy()[0][0]
for subgroup in expanding_list(grp_value, tuple)
]
for grp_value in all_grp_values
}

shrinkage_factors = {
grp_value: self.shrinkage_function_(counts, **self.shrinkage_kwargs)
for grp_value, counts in hierarchical_counts.items()
grp_value: self.shrinkage_function_(counts_, **self.shrinkage_kwargs)
for grp_value, counts_ in hierarchical_counts.items()
}

# Normalize and pad
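
Review note: the rewritten `hierarchical_counts` builds, for each group value, the number of rows at every level of the hierarchy, from the coarsest prefix down to the full group. A plain-Python sketch of that idea, assuming rows are given as tuples of group values; `expanding_prefixes` is a hypothetical stand-in for `sklego.common.expanding_list`, and a `Counter` stands in for the narwhals `group_by(...).agg(...)` frame:

```python
from collections import Counter


def expanding_prefixes(values):
    """Return growing prefixes: (a,), (a, b), (a, b, c), ..."""
    return [tuple(values[: i + 1]) for i in range(len(values))]


def hierarchical_counts(rows, grp_value):
    """Count the rows matching each prefix of grp_value, coarsest level first."""
    counts = Counter(rows)  # rows per most granular group
    return [
        # A short prefix constrains only the leading group columns, which is
        # the effect of zipping `groups` with a shorter `subgroup` in the diff.
        sum(n for key, n in counts.items() if key[: len(prefix)] == prefix)
        for prefix in expanding_prefixes(grp_value)
    ]


# Hypothetical group columns (country, segment), one tuple per row.
rows = [("NL", "A"), ("NL", "A"), ("NL", "B"), ("BE", "A")]
result = hierarchical_counts(rows, ("NL", "A"))  # [3, 2]
```

Here three rows fall under `"NL"` overall and two under `("NL", "A")`, and a list of this shape is what the shrinkage function receives for each group value.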