Make RegressionOutlier dataframe-agnostic #665
@@ -112,7 +129,8 @@ def fit(self, X, y=None):
ValueError
If the `model` is not a regression estimator.
"""
self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, pd.DataFrame) else self.column
X = nw.from_native(X, strict=False)
The question of how expensive this `nw.from_native` call is came up on the livestream. On my laptop, it takes about 1-2 microseconds.
Can we pitch that as being significantly less than converting to pandas 😁?
Oh yeah! Even for a small dataframe (<1000 rows), `to_pandas` takes 700-1200 times longer (depending on whether you use pyarrow).
And that's without counting that converting to pandas would break polars' optimisations, use much more memory, and "force" users to have pandas+pyarrow as dependencies... I think we can sell this well :)
a3a748e to 579d1e6
Easy peasy! 👌
I left a question for my understanding of the (internal) behavior
@@ -112,7 +129,8 @@ def fit(self, X, y=None):
ValueError
If the `model` is not a regression estimator.
"""
self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, pd.DataFrame) else self.column
X = nw.from_native(X, eager_only=True, strict=False)
self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, nw.DataFrame) else self.column
X = check_array(X, estimator=self)
How does `check_array` behave on a narwhals frame? Does it convert it to a numpy array?
That's a good point actually, thanks for asking! I just dug into it, though admittedly I should have checked it more carefully in the first place.
Yes, just like for pandas/polars input passed directly, it converts to a numpy array. BUT they have some pandas-specific logic in there. So, we can just pass `nw.to_native(X, strict=False)`, and then we can be sure there'll be no difference with respect to the status quo.
X = check_array(X, estimator=self)
X = nw.from_native(X, eager_only=True, strict=False)
self.idx_ = np.argmax([i == self.column for i in X.columns]) if isinstance(X, nw.DataFrame) else self.column
X = check_array(nw.to_native(X, strict=False), estimator=self)
I assume the input is now forced to be anything compatible with `check_array` itself. I am not sure whether Modin and cuDF make the cut. If they don't, let's remove them from the docstring.
let's check:
✅ modin:
In [8]: import modin.pandas as pd
In [9]: pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).__array__()
UserWarning: Distributing <class 'dict'> object. This may take some time.
Out[9]:
array([[1, 4],
       [2, 5],
       [3, 6]])
🚫 cuDF:
import cudf
from sklearn.utils import check_X_y
df = cudf.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
check_X_y(df, df['a'])
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().
I just assumed they would work, but I'm surprised that cuDF disallows it. Thanks for doing due diligence here; I've removed that one from the docstring.
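The difference between the two results above comes down to what each library does when NumPy calls `__array__`. Here is a rough sketch of a pre-flight check for that, using only NumPy. `HostFrame` and `DeviceFrame` are hypothetical stand-ins written for this illustration (they are not Modin or cuDF classes), and `supports_implicit_numpy` is likewise an assumed helper name.

```python
import numpy as np


class HostFrame:
    """Hypothetical stand-in for a frame that allows implicit conversion."""

    def __init__(self, rows):
        self._rows = rows

    def __array__(self, dtype=None, copy=None):
        return np.array(self._rows, dtype=dtype)


class DeviceFrame:
    """Hypothetical stand-in for a frame that refuses implicit conversion."""

    def __init__(self, rows):
        self._rows = rows

    def __array__(self, dtype=None, copy=None):
        raise TypeError(
            "Implicit conversion to a host NumPy array via __array__ is not allowed"
        )

    def to_numpy(self):
        # Explicit, opt-in conversion remains available.
        return np.array(self._rows)


def supports_implicit_numpy(obj) -> bool:
    """Return True if np.asarray(obj) succeeds without a TypeError."""
    try:
        np.asarray(obj)
    except TypeError:
        return False
    return True


rows = [[1, 4], [2, 5], [3, 6]]
host_ok = supports_implicit_numpy(HostFrame(rows))
device_ok = supports_implicit_numpy(DeviceFrame(rows))
```

A check along these lines could back up a docstring claim about supported inputs, though running the real libraries, as done above, is the more reliable test.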
Thanks, Marco, amazing work!
* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <[email protected]>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <[email protected]>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>
Description
Fairly simple 👌