feat: Narwhals for dataframe-agnostic codebase (#671)

* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <[email protected]>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <[email protected]>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>
koaning · May 24, 2024 · fbb8e57 · fbb8e57 · Manohar0077 · Jun 6, 2024
1 parent 6a9654f
commit fbb8e57
Showing 35 changed files with 1,158 additions and 736 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -4,6 +4,7 @@ on:
   pull_request:
     branches:
     - main
+    - narwhals-development
 
 jobs:
   test:

diff --git a/docs/api/preprocessing.md b/docs/api/preprocessing.md
@@ -64,3 +64,8 @@
     options:
         show_root_full_path: true
         show_root_heading: true
+
+:::sklego.preprocessing.pandastransformers.TypeSelector
+    options:
+        show_root_full_path: true
+        show_root_heading: true
diff --git a/docs/contribution.md b/docs/contribution.md
@@ -174,7 +174,7 @@ When a new feature is introduced, it should be documented, and typically there a
 - [x] A user guide in the `docs/user-guide/` folder.
 - [x] A python script in the `docs/_scripts/` folder to generate plots and code snippets (see [next section](#working-with-pymdown-snippets-extension))
 - [x] Relevant static files, such as images, plots, tables and html's, should be saved in the `docs/_static/` folder.
-- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation. 
+- [x] Edit the `mkdocs.yaml` file to include the new pages in the navigation.
 
 ### Working with pymdown snippets extension
 

diff --git a/docs/this.md b/docs/this.md
@@ -37,10 +37,20 @@ not everything needs to be built, not everything needs to be explored.
 Change everything and you'll soon be a jerk,
 you may invent a new tool, not a way to work.
 Some problems cannot be solved in a single day,
-but if you ignore them, they sometimes go away.
+but if you can ignore them, they sometimes go away.
+
+So as we forge ahead, let's remember the creed,
+simplicity over complexity, our library's seed.
+In the maze of features, let's not lose sight,
+of the end goal in mind shining bright.
+
+With each new feature, a temptation to craft,
+but elegance is found in what we choose to subtract.
+For every line of code, let's ask ourselves twice,
+does it add clarity or is it a vice?
 
 There's a lot of power in simplicity,
-it keeps you approach strong,
+it keeps the approach strong,
 if you understand the solution better than the problem,
 you're doing it wrong.
 ```
diff --git a/mkdocs.yaml b/mkdocs.yaml
@@ -21,9 +21,7 @@ theme:
   name: material
   logo: _static/logo.png
   favicon: _static/logo.png
-  font:
-    text: Ubuntu
-    code: Ubuntu Mono
+  font: false
   highlightjs: true
   hljs_languages:
     - bash

diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "scikit-lego"
-version = "0.8.2"
+version = "0.9.0"
 description="A collection of lego bricks for scikit-learn pipelines"
 
 license = {file = "LICENSE"}
@@ -20,6 +20,7 @@ maintainers = [
 ]
 
 dependencies = [
+    "narwhals>=0.8.13",
     "pandas>=1.1.5",
     "scikit-learn>=1.0",
     "importlib-metadata >= 1.0; python_version < '3.8'",
@@ -61,6 +62,8 @@ docs = [
 ]
 
 test = [
+    "narwhals[polars]",
+    "pyarrow",
     "pytest>=6.2.5",
     "pytest-xdist>=1.34.0",
     "pytest-cov>=2.6.1",
@@ -111,4 +114,3 @@ markers = [
     "formulaic: tests that require formulaic (deselect with '-m \"not formulaic\"')",
     "umap: tests that require umap (deselect with '-m \"not umap\"')"
 ]
-
diff --git a/readme.md b/readme.md
@@ -120,7 +120,7 @@ Here's a list of features that this library currently offers:
 - `sklego.preprocessing.InformationFilter` transformer that can de-correlate features
 - `sklego.preprocessing.IdentityTransformer` returns the same data, allows for concatenating pipelines
 - `sklego.preprocessing.OrthogonalTransformer` makes all features linearly independent
-- `sklego.preprocessing.PandasTypeSelector` selects columns based on pandas type
+- `sklego.preprocessing.TypeSelector` selects columns based on type
 - `sklego.preprocessing.RandomAdder` adds randomness in training
 - `sklego.preprocessing.RepeatingBasisFunction` repeating feature engineering, useful for timeseries
 - `sklego.preprocessing.DictMapper` assign numeric values on categorical columns

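Review note: the readme hunk above renames `PandasTypeSelector` to `TypeSelector`, and the commit message mentions a deprecation. The diff for `pandastransformers.py` is not included here, so the following is only a minimal sketch of one common way to keep an old class name importable while warning users. The class bodies, constructor arguments and warning text are assumptions for illustration, not the library's actual code.

```python
import warnings


class TypeSelector:
    """Hypothetical stand-in for the renamed transformer."""

    def __init__(self, include=None, exclude=None):
        self.include = include
        self.exclude = exclude


class PandasTypeSelector(TypeSelector):
    """Old name kept as a thin subclass that warns on construction."""

    def __init__(self, include=None, exclude=None):
        # Assumed wording; the real message may differ.
        warnings.warn(
            "PandasTypeSelector is deprecated, use TypeSelector instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(include=include, exclude=exclude)


# Record warnings so the deprecation can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    selector = PandasTypeSelector(include="number")

print(isinstance(selector, TypeSelector), caught[0].category.__name__)
```

Subclassing keeps `isinstance` checks against the new name working for code that still constructs the old one.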
diff --git a/sklego/common.py b/sklego/common.py
@@ -58,7 +58,7 @@ def transform_train(self, X, y=None):
     """
 
     _HASHERS = {
-        pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values).hexdigest(),
+        pd.DataFrame: lambda X: hashlib.sha256(pd.util.hash_pandas_object(X, index=True).to_numpy()).hexdigest(),
         np.ndarray: lambda X: hash(X.data.tobytes()),
         np.memmap: lambda X: hash(X.data.tobytes()),
     }

diff --git a/sklego/datasets.py b/sklego/datasets.py
@@ -112,8 +112,8 @@ def load_penguins(return_X_y=False, as_frame=False):
                 "body_mass_g",
                 "sex",
             ]
-        ].values,
-        df["species"].values,
+        ].to_numpy(),
+        df["species"].to_numpy(),
     )
     if return_X_y:
         return X, y
@@ -162,8 +162,8 @@ def load_arrests(return_X_y=False, as_frame=False):
     if as_frame:
         return df
     X, y = (
-        df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].values,
-        df["released"].values,
+        df[["colour", "year", "age", "sex", "employed", "citizen", "checks"]].to_numpy(),
+        df["released"].to_numpy(),
     )
     if return_X_y:
         return X, y
@@ -208,7 +208,7 @@ def load_chicken(return_X_y=False, as_frame=False):
     df = pd.read_csv(filepath)
     if as_frame:
         return df
-    X, y = df[["time", "diet", "chick"]].values, df["weight"].values
+    X, y = df[["time", "diet", "chick"]].to_numpy(), df["weight"].to_numpy()
     if return_X_y:
         return X, y
     return {"data": X, "target": y}
@@ -265,8 +265,8 @@ def load_abalone(return_X_y=False, as_frame=False):
             "shell_weight",
             "rings",
         ]
-    ].values
-    y = df["sex"].values
+    ].to_numpy()
+    y = df["sex"].to_numpy()
     if return_X_y:
         return X, y
     return {"data": X, "target": y}
@@ -304,8 +304,8 @@ def load_heroes(return_X_y=False, as_frame=False):
     df = pd.read_csv(filepath)
     if as_frame:
         return df
-    X = df[["health", "attack"]].values
-    y = df["attack_type"].values
+    X = df[["health", "attack"]].to_numpy()
+    y = df["attack_type"].to_numpy()
     if return_X_y:
         return X, y
     return {"data": X, "target": y}
@@ -377,8 +377,8 @@ def load_hearts(return_X_y=False, as_frame=False):
             "ca",
             "thal",
         ]
-    ].values
-    y = df["target"].values
+    ].to_numpy()
+    y = df["target"].to_numpy()
     if return_X_y:
         return X, y
     return {"data": X, "target": y}

diff --git a/sklego/linear_model.py b/sklego/linear_model.py
@@ -9,8 +9,8 @@
 from inspect import signature
 from warnings import warn
 
+import narwhals as nw
 import numpy as np
-import pandas as pd
 from scipy.optimize import minimize
 from scipy.special._ufuncs import expit
 from sklearn.base import BaseEstimator, RegressorMixin
@@ -493,8 +493,8 @@ def fit(self, X, y):
             raise ValueError(f"penalty should be either 'l1' or 'none', got {self.penalty}")
 
         self.sensitive_col_idx_ = self.sensitive_cols
-
-        if isinstance(X, pd.DataFrame):
+        X = nw.from_native(X, eager_only=True, strict=False)
+        if isinstance(X, nw.DataFrame):
             self.sensitive_col_idx_ = [i for i, name in enumerate(X.columns) if name in self.sensitive_cols]
         X, y = check_X_y(X, y, accept_large_sparse=False)
         sensitive = X[:, self.sensitive_col_idx_]

diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
@@ -1,55 +1,59 @@
-from typing import Tuple
+from __future__ import annotations
 
-import numpy as np
+from typing import List
+
+import narwhals as nw
 import pandas as pd
 from scipy.sparse import issparse
 from sklearn.utils import check_array
 from sklearn.utils.validation import _ensure_no_complex_data
 
 
-def _split_groups_and_values(
-    X, groups, name="", min_value_cols=1, check_X=True, **kwargs
-) -> Tuple[pd.DataFrame, np.ndarray]:
-    _data_format_checks(X, name=name)
-    check_array(X, ensure_min_features=min_value_cols, dtype=None, force_all_finite=False)
+def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
+    """Converts X, y to narwhals dataframe.
 
-    try:
-        if isinstance(X, pd.DataFrame):
-            X_group = X.loc[:, groups]
-            X_value = X.drop(columns=groups).values
-        else:
-            X = np.asarray(X)  # deals with `_NotAnArray` case
-            X_group = pd.DataFrame(X[:, groups])
-            pos_indexes = range(X.shape[1])
-            X_value = np.delete(X, [pos_indexes[g] for g in groups], axis=1)
-    except (KeyError, IndexError):
-        raise ValueError(f"Could not drop groups {groups} from columns of X")
+    If it is not a supported dataframe, it uses pandas constructor as a fallback.
 
-    X_group = _check_grouping_columns(X_group, **kwargs)
+    Additionally, data checks are performed.
+    """
+    # Check raw X
+    _data_format_checks(X)
 
-    if check_X:
-        X_value = check_array(X_value, **kwargs)
+    # Convert X to Narwhals frame
+    X = nw.from_native(X, strict=False, eager_only=True)
+    if not isinstance(X, nw.DataFrame):
+        X = nw.from_native(pd.DataFrame(X))
 
-    return X_group, X_value
+    # Check groups and features values
+    if groups is not None:
+        _validate_groups_values(X, groups)
 
+        if check_X:
+            check_array(X.drop(groups), **kwargs)
 
-def _data_format_checks(X, name):
-    _ensure_no_complex_data(X)
+    # Convert y and assign it to the frame
+    n_samples = X.shape[0]
+    native_space = nw.get_native_namespace(X)
+
+    y_native = native_space.Series([None] * n_samples) if y is None else native_space.Series(y)
+    return X.with_columns(__sklego_target__=nw.from_native(y_native, allow_series=True))
 
-    if issparse(X):  # sklearn.validation._ensure_sparse_format to complicated
-        raise ValueError(f"The estimator {name} does not work on sparse matrices")
 
+def _validate_groups_values(X: nw.DataFrame, groups: List[int] | List[str]) -> None:
+    X_cols = X.columns
+    unexisting_cols = [g for g in groups if g not in X_cols]
 
-def _check_grouping_columns(X_group, **kwargs) -> pd.DataFrame:
-    """Do basic checks on grouping columns"""
-    # Do regular checks on numeric columns
-    X_group_num = X_group.select_dtypes(include="number")
-    if X_group_num.shape[1]:
-        check_array(X_group_num, **kwargs)
+    if len(unexisting_cols):
+        raise ValueError(f"The following groups are not available in X: {unexisting_cols}")
 
-    # Only check missingness in object columns
-    if X_group.select_dtypes(exclude="number").isnull().any(axis=None):
-        raise ValueError("X has NaN values")
+    if X.select(nw.col(groups).is_null().any()).to_numpy().squeeze().any():
+        raise ValueError("Groups values have NaN")
 
-    # The grouping part we always want as a DataFrame with range index
-    return X_group.reset_index(drop=True)
+
+def _data_format_checks(X):
+    """Checks that X is not sparse nor has complex dtype"""
+    _ensure_no_complex_data(X)
+
+    if issparse(X):  # sklearn.validation._ensure_sparse_format too complicated
+        msg = "Estimator does not work on sparse matrices"
+        raise ValueError(msg)
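Review note: `_validate_groups_values` above performs two checks, that every group column exists and that no group column contains nulls. The sketch below mirrors those two checks on a plain dict of columns so it runs without any dataframe library. The `validate_groups` helper and the sample `table` are assumptions for illustration, and `None` stands in for a missing value.

```python
def validate_groups(table, groups):
    """Raise ValueError if a group column is missing or contains a null."""
    missing = [g for g in groups if g not in table]
    if missing:
        raise ValueError(f"The following groups are not available in X: {missing}")

    if any(value is None for g in groups for value in table[g]):
        raise ValueError("Groups values have NaN")


table = {"store": ["a", "b", None], "sales": [1.0, 2.0, 3.0]}

errors = []
for groups in (["region"], ["store"], ["sales"]):
    try:
        validate_groups(table, groups)
        errors.append(None)  # validation passed
    except ValueError as exc:
        errors.append(str(exc))

print(errors)
```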
diff --git a/sklego/meta/_shrinkage_utils.py b/sklego/meta/_shrinkage_utils.py
@@ -1,9 +1,10 @@
 from functools import partial
 
+import narwhals as nw
 import numpy as np
 from sklearn.utils.validation import check_is_fitted
 
-from sklego.common import expanding_list
+from sklego.common import as_list, expanding_list
 
 
 def constant_shrinkage(group_sizes, alpha: float) -> np.ndarray:
@@ -193,20 +194,26 @@ def _fit_shrinkage_factors(self, frame, groups, most_granular_only=False):
             Whether to return only the shrinkage factors for the most granular group values.
         """
         check_is_fitted(self, ["estimators_", "shrinkage_function_"])
-        counts = frame.groupby(groups).size().rename("counts")
+        counts = frame.group_by(groups).agg(nw.len().alias("counts"))
         all_grp_values = list(self.estimators_.keys())
 
         if most_granular_only:
-            all_grp_values = [grp_value for grp_value in all_grp_values if len(grp_value) == len(groups)]
+            all_grp_values = [grp_value for grp_value in all_grp_values if len(as_list(grp_value)) == len(groups)]
 
         hierarchical_counts = {
-            grp_value: [counts.loc[subgroup].sum() for subgroup in expanding_list(grp_value, tuple)]
+            grp_value: [
+                # As zip is "zip shortest" and filter works with comma separate conditions:
+                counts.filter(*[nw.col(c) == v for c, v in zip(groups, subgroup)])
+                .select(nw.sum("counts"))
+                .to_numpy()[0][0]
+                for subgroup in expanding_list(grp_value, tuple)
+            ]
             for grp_value in all_grp_values
         }
 
         shrinkage_factors = {
-            grp_value: self.shrinkage_function_(counts, **self.shrinkage_kwargs)
-            for grp_value, counts in hierarchical_counts.items()
+            grp_value: self.shrinkage_function_(counts_, **self.shrinkage_kwargs)
+            for grp_value, counts_ in hierarchical_counts.items()
         }
 
         # Normalize and pad
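Review note: the comment in the hunk above relies on `zip` stopping at the shorter input, so a subgroup tuple with fewer values than `groups` only constrains the leading group columns. The sketch below reproduces that idea with plain Python rows instead of a Narwhals frame. The sample data is invented, and `expanding_prefixes` is a hypothetical stand-in for what `expanding_list(grp_value, tuple)` is assumed to return.

```python
# Hypothetical per-group row counts, one dict per (country, city) pair.
counts = [
    {"country": "NL", "city": "Amsterdam", "counts": 3},
    {"country": "NL", "city": "Utrecht", "counts": 2},
    {"country": "IT", "city": "Rome", "counts": 4},
]
groups = ["country", "city"]


def expanding_prefixes(values):
    """Assumed behaviour: ("NL", "Amsterdam") -> [("NL",), ("NL", "Amsterdam")]."""
    return [tuple(values[: n + 1]) for n in range(len(values))]


def hierarchical_counts(grp_value):
    """Sum the counts at every level of the hierarchy leading to grp_value."""
    totals = []
    for subgroup in expanding_prefixes(grp_value):
        # zip stops at the shorter input, so a 1-tuple only constrains "country".
        conditions = list(zip(groups, subgroup))
        totals.append(
            sum(row["counts"] for row in counts if all(row[c] == v for c, v in conditions))
        )
    return totals


result = hierarchical_counts(("NL", "Amsterdam"))
print(result)  # [5, 3]
```

The first total covers every row for `"NL"`, the second only the `("NL", "Amsterdam")` rows, which is the per-level count list that the shrinkage function receives.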