feat: Narwhals for dataframe-agnostic codebase (#671)
* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <[email protected]>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <[email protected]>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>
1 parent 6a9654f, commit fbb8e57
Showing 35 changed files with 1,158 additions and 736 deletions.
@@ -4,6 +4,7 @@ on:
   pull_request:
     branches:
       - main
+      - narwhals-development

 jobs:
   test:
@@ -1,55 +1,59 @@
-from typing import Tuple
+from __future__ import annotations

-import numpy as np
+from typing import List
+
+import narwhals as nw
 import pandas as pd
 from scipy.sparse import issparse
 from sklearn.utils import check_array
 from sklearn.utils.validation import _ensure_no_complex_data


-def _split_groups_and_values(
-    X, groups, name="", min_value_cols=1, check_X=True, **kwargs
-) -> Tuple[pd.DataFrame, np.ndarray]:
-    _data_format_checks(X, name=name)
-    check_array(X, ensure_min_features=min_value_cols, dtype=None, force_all_finite=False)
+def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
+    """Converts X, y to narwhals dataframe.

-    try:
-        if isinstance(X, pd.DataFrame):
-            X_group = X.loc[:, groups]
-            X_value = X.drop(columns=groups).values
-        else:
-            X = np.asarray(X)  # deals with `_NotAnArray` case
-            X_group = pd.DataFrame(X[:, groups])
-            pos_indexes = range(X.shape[1])
-            X_value = np.delete(X, [pos_indexes[g] for g in groups], axis=1)
-    except (KeyError, IndexError):
-        raise ValueError(f"Could not drop groups {groups} from columns of X")
+    If it is not a supported dataframe, it uses pandas constructor as a fallback.

-    X_group = _check_grouping_columns(X_group, **kwargs)
+    Additionally, data checks are performed.
+    """
+    # Check raw X
+    _data_format_checks(X)

-    if check_X:
-        X_value = check_array(X_value, **kwargs)
+    # Convert X to Narwhals frame
+    X = nw.from_native(X, strict=False, eager_only=True)
+    if not isinstance(X, nw.DataFrame):
+        X = nw.from_native(pd.DataFrame(X))

-    return X_group, X_value
+    # Check groups and feaures values
+    if groups is not None:
+        _validate_groups_values(X, groups)

+        if check_X:
+            check_array(X.drop(groups), **kwargs)

-def _data_format_checks(X, name):
-    _ensure_no_complex_data(X)
+    # Convert y and assign it to the frame
+    n_samples = X.shape[0]
+    native_space = nw.get_native_namespace(X)
+
+    y_native = native_space.Series([None] * n_samples) if y is None else native_space.Series(y)
+    return X.with_columns(__sklego_target__=nw.from_native(y_native, allow_series=True))

-    if issparse(X):  # sklearn.validation._ensure_sparse_format to complicated
-        raise ValueError(f"The estimator {name} does not work on sparse matrices")

+def _validate_groups_values(X: nw.DataFrame, groups: List[int] | List[str]) -> None:
+    X_cols = X.columns
+    unexisting_cols = [g for g in groups if g not in X_cols]

-def _check_grouping_columns(X_group, **kwargs) -> pd.DataFrame:
-    """Do basic checks on grouping columns"""
-    # Do regular checks on numeric columns
-    X_group_num = X_group.select_dtypes(include="number")
-    if X_group_num.shape[1]:
-        check_array(X_group_num, **kwargs)
+    if len(unexisting_cols):
+        raise ValueError(f"The following groups are not available in X: {unexisting_cols}")

-    # Only check missingness in object columns
-    if X_group.select_dtypes(exclude="number").isnull().any(axis=None):
-        raise ValueError("X has NaN values")
+    if X.select(nw.col(groups).is_null().any()).to_numpy().squeeze().any():
+        raise ValueError("Groups values have NaN")

-    # The grouping part we always want as a DataFrame with range index
-    return X_group.reset_index(drop=True)
+
+def _data_format_checks(X):
+    """Checks that X is not sparse nor has complex dtype"""
+    _ensure_no_complex_data(X)
+
+    if issparse(X):  # sklearn.validation._ensure_sparse_format to complicated
+        msg = "Estimator does not work on sparse matrices"
+        raise ValueError(msg)
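The two checks performed by `_validate_groups_values` in this diff can be illustrated without any dataframe library. The snippet below is a simplified, standard-library-only stand-in and not the scikit-lego implementation: it models a frame as a hypothetical dict mapping column names to lists of values, `validate_groups` and `_is_missing` are illustrative names, and treating both `None` and float NaN as missing is an assumption of the sketch (backends differ on this).

```python
from __future__ import annotations

import math
from typing import Any


def _is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_groups(frame: dict[str, list[Any]], groups: list[str]) -> None:
    """Raise ValueError if a group column is absent or holds missing values.

    `frame` is a hypothetical dict-of-lists stand-in for a dataframe.
    """
    # Check 1: every requested group column must exist in the frame.
    missing_cols = [g for g in groups if g not in frame]
    if missing_cols:
        raise ValueError(f"The following groups are not available in X: {missing_cols}")

    # Check 2: no group column may contain a missing value.
    if any(_is_missing(v) for g in groups for v in frame[g]):
        raise ValueError("Groups values have NaN")
```

Calling `validate_groups({"g": ["a", "b"], "x": [1.0, 2.0]}, ["g"])` returns silently, while a missing column name or a `None` inside a group column raises `ValueError` with the same messages as in the diff.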
fbb8e57
I am getting this error when trying to use RepeatingBasisFunction.
from sklego.preprocessing import RepeatingBasisFunction
File "/home/ec2-user/anaconda3/envs/testenv/lib/python3.11/site-packages/sklego/preprocessing/__init__.py", line 24, in <module>
from sklego.preprocessing.pandastransformers import ColumnDropper, ColumnSelector, PandasTypeSelector, TypeSelector
File "/home/ec2-user/anaconda3/envs/testenv/lib/python3.11/site-packages/sklego/preprocessing/pandastransformers.py", line 5, in <module>
import narwhals as nw
ModuleNotFoundError: No module named 'narwhals'
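One way to confirm what this traceback reports is to check, from the same environment, whether each module can be located at all. This is a minimal sketch using only the standard library; `is_importable` is an illustrative helper name and not part of scikit-lego.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("sklego", "narwhals"):
        status = "found" if is_importable(name) else "MISSING"
        print(f"{name}: {status}")
```

If `sklego` is found but `narwhals` is missing, the package was installed without one of its dependencies, which matches the error above.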
fbb8e57
hi @Manohar0077 - how did you install scikit-lego? Narwhals is listed as a dependency, so it should bring it in for you
scikit-lego/pyproject.toml
Lines 22 to 28 in 0964662
fbb8e57
Hey @Manohar0077, as Marco mentioned, if you installed scikit-lego directly then narwhals comes as a dependency.
A couple of considerations:
fbb8e57
looks like this might have been forgotten for the conda-forge package? conda-forge/scikit-lego-feedstock#28
fbb8e57
I installed it with the command `conda install conda-forge::scikit-lego`. But it should also install the dependencies under it, right?
Anyway I will open an issue on this.
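For anyone debugging a similar install, the requirements that an installed distribution declares in its own metadata can be listed with the standard library. A hedged sketch: `declared_requirements` is an illustrative helper name, and it only reports what the installed Python package metadata declares. A conda package resolves dependencies from its recipe, so the two can disagree.

```python
from __future__ import annotations

from importlib import metadata


def declared_requirements(dist_name: str) -> list[str] | None:
    """Return the requirement strings declared by an installed distribution.

    Returns None when the distribution is not installed in this environment.
    """
    try:
        # metadata.requires returns None when no requirements are declared.
        return list(metadata.requires(dist_name) or [])
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(declared_requirements("scikit-lego"))
```

If the printed list mentions narwhals while `import narwhals` still fails, the package metadata is fine and the gap is in how the environment was resolved.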