Make RegressionOutlier dataframe-agnostic #665

MarcoGorelli · 2024-05-11T07:20:16Z

Description

Fairly simple 👌

Fixes #(issue)

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the style guidelines (ruff)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (also to the readme.md)
I have added tests that prove my fix is effective or that my feature works
I have added tests to check whether the new feature adheres to the sklearn convention
New and existing unit tests pass locally with my changes

MarcoGorelli · 2024-05-11T07:21:13Z

sklego/meta/regression_outlier_detector.py

@@ -112,7 +129,8 @@ def fit(self, X, y=None):
        ValueError
            If the `model` is not a regression estimator.
        """
-        self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, pd.DataFrame) else self.column
+        X = nw.from_native(X, strict=False)


the question of how expensive this is came up on the livestream

on my laptop, it's about 1-2 microseconds

Can we pitch that as being significantly less than converting to pandas 😁?

oh yeah! even for just a small <1000 row dataframe, to_pandas takes 700-1200 times longer (depending on whether you use pyarrow)

and that's without counting that converting to pandas would break polars' optimisations, use much more memory, "force" users to have pandas+pyarrow as dependencies...I think we can sell this well :)

FBruzzesi

Easy peasy! 👌
I left a question for my understanding of the (internal) behavior

FBruzzesi · 2024-05-11T09:26:01Z

sklego/meta/regression_outlier_detector.py

@@ -112,7 +129,8 @@ def fit(self, X, y=None):
        ValueError
            If the `model` is not a regression estimator.
        """
-        self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, pd.DataFrame) else self.column
+        X = nw.from_native(X, eager_only=True, strict=False)
+        self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, nw.DataFrame) else self.column
        X = check_array(X, estimator=self)


How does check_array behave on a narwhals frame? Does it convert it to a numpy array?

That's a good point actually, thanks for asking! I just dug into it, but admittedly I should have checked it more carefully in the first place

Yes, just like for pandas/polars input passed directly, it converts to a numpy array. BUT check_array has some pandas-specific logic in there. So, we can just pass nw.to_native(X, strict=False), and then we can be sure there'll be no difference with respect to the status quo
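For readers following the `idx_` line in the hunk above, here is a minimal, dependency-free sketch of the name-to-position lookup it performs. `resolve_column_index` is a hypothetical helper written for illustration only; it is not part of sklego, narwhals, or scikit-learn, and passing `None` for array input is an assumed convention. One deliberate difference from the hunk: `np.argmax` over an all-False mask returns 0, whereas this sketch raises on an unknown name.

```python
def resolve_column_index(columns, column):
    """Return the positional index of `column`.

    `columns` is a list of column names for dataframe input, or None for
    array-like input (a convention assumed for this sketch only).
    """
    if columns is None:
        # Array-like input: `column` is already a positional index.
        return column
    # list.index raises ValueError for an unknown name, unlike
    # np.argmax over an all-False mask, which would return 0.
    return list(columns).index(column)
```

With named columns, `resolve_column_index(["a", "b", "y"], "y")` gives the position of `"y"`; with array input the integer is passed through unchanged.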

FBruzzesi · 2024-05-11T13:05:28Z

sklego/meta/regression_outlier_detector.py

-        X = check_array(X, estimator=self)
+        X = nw.from_native(X, eager_only=True, strict=False)
+        self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, nw.DataFrame) else self.column
+        X = check_array(nw.to_native(X, strict=False), estimator=self)


I assume the input is now forced to be anything compatible with check_array itself - I don't know whether Modin and cuDF make the cut. In case they don't, let's remove those from the docstring

let's check:

✅ modin:

In [8]: import modin.pandas as pd

In [9]: pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).__array__()
UserWarning: Distributing <class 'dict'> object. This may take some time.
Out[9]:
array([[1, 4],
       [2, 5],
       [3, 6]])

🚫 cuDF:

df = cudf.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
check_X_y(df, df['a'])

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

I just assumed they would work, but I'm surprised that cuDF disallows it - thanks for doing due diligence here, I've removed that one from the docstring
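The two checks above boil down to whether a frame permits implicit conversion through `__array__`. Below is a rough, dependency-free sketch of that probe. `HostFrame`, `DeviceFrame`, and `allows_implicit_array` are hypothetical names made up for illustration; they only mimic the modin and cuDF behaviour observed above and are not real library classes.

```python
class HostFrame:
    """Hypothetical stand-in for a frame that permits implicit host conversion."""

    def __init__(self, rows):
        self._rows = rows

    def __array__(self, dtype=None, copy=None):
        # Real libraries return a numpy.ndarray here; a nested list keeps
        # this sketch dependency-free.
        return [list(row) for row in self._rows]


class DeviceFrame:
    """Hypothetical stand-in for a frame that refuses implicit conversion."""

    def __array__(self, dtype=None, copy=None):
        raise TypeError(
            "Implicit conversion to a host NumPy array via __array__ is not allowed"
        )


def allows_implicit_array(obj):
    """Return True if obj.__array__() succeeds, False if missing or it raises TypeError."""
    to_array = getattr(obj, "__array__", None)
    if to_array is None:
        return False
    try:
        to_array()  # note: on a real frame this materialises the data
    except TypeError:
        return False
    return True
```

A probe like this is only a quick sanity check; actually running `check_array` or `check_X_y` on the real frame, as done above, remains the authoritative test.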

FBruzzesi · 2024-05-11T15:54:34Z

Grazie Marco, amazing work!
Sorry for delegating the checks but I don't have a GPU setup ready 😁

* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <[email protected]>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <[email protected]>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>

make regression outlier df-agnostic

f589349

MarcoGorelli changed the title ~~Make RegressionOutlier~~ Make RegressionOutlier dataframe-agnostic May 11, 2024

MarcoGorelli commented May 11, 2024

MarcoGorelli marked this pull request as ready for review May 11, 2024 07:24

need to use eager-only for this one

579d1e6

MarcoGorelli force-pushed the df-agnostic-regression-outlier branch from a3a748e to 579d1e6 Compare May 11, 2024 07:33

FBruzzesi approved these changes May 11, 2024

pass native to check_array

38ca19c

FBruzzesi reviewed May 11, 2024

remove cudf, link to check_X_y

19c4293

FBruzzesi merged commit 94cf506 into koaning:narwhals-development May 11, 2024
16 checks passed

FBruzzesi mentioned this pull request May 11, 2024

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed

MarcoGorelli mentioned this pull request May 12, 2024

Make Timegapsplit dataframe-agnostic #668

Merged

9 tasks

FBruzzesi mentioned this pull request May 18, 2024

feat: Narwhals for dataframe-agnostic codebase #671

Merged
