Make Timegapsplit dataframe-agnostic #668

MarcoGorelli · 2024-05-12T13:33:24Z

Description

This was quite a fun one! As a precursor, I added nw.maybe_align_index and nw.maybe_set_index, so that the automated index alignment can be preserved for pandas users, but Polars can still be supported (so long as date_serie and X / y are the same length), as discussed here

Another step towards #658

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the style guidelines (ruff)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (also to the readme.md)
I have added tests that prove my fix is effective or that my feature works
I have added tests to check whether the new feature adheres to the sklearn convention
New and existing unit tests pass locally with my changes

MarcoGorelli · 2024-05-12T14:13:21Z

sklego/model_selection.py

-            X_train_df = X_index_df[
-                (X_index_df["__date__"] >= start_date) & (X_index_df["__date__"] < current_date + self.train_duration)
-            ]
-            X_valid_df = X_index_df[
-                (X_index_df["__date__"] >= current_date + self.train_duration + self.gap_duration)
-                & (
-                    X_index_df["__date__"]
-                    < current_date + self.train_duration + self.valid_duration + self.gap_duration
-                )
-            ]
+            X_train_df = X_index_df.filter(
+                nw.col("__date__") >= start_date, nw.col("__date__") < current_date + self.train_duration
+            )
+            X_valid_df = X_index_df.filter(
+                nw.col("__date__") >= current_date + self.train_duration + self.gap_duration,
+                nw.col("__date__") < current_date + self.train_duration + self.valid_duration + self.gap_duration,
+            )


I find this diff so pleasing 😍
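
The window arithmetic behind the two `filter` calls can be sketched with the standard library alone. This is only an illustration of the half-open intervals used in the diff: the helper names `split_windows` and `in_window` and the sample dates are hypothetical, and the real implementation filters a dataframe rather than a Python list.

```python
from datetime import datetime, timedelta


def split_windows(start_date, current_date, train_duration, valid_duration, gap_duration):
    """Return (train_window, valid_window) as half-open [lower, upper) intervals."""
    train_end = current_date + train_duration
    valid_start = train_end + gap_duration
    valid_end = valid_start + valid_duration
    return (start_date, train_end), (valid_start, valid_end)


def in_window(date, window):
    """Mirror the two predicates in the diff: lower <= date < upper."""
    lower, upper = window
    return lower <= date < upper


# Ten consecutive daily timestamps (illustration data).
dates = [datetime(2024, 1, day) for day in range(1, 11)]

train_window, valid_window = split_windows(
    start_date=dates[0],
    current_date=dates[0],
    train_duration=timedelta(days=5),
    valid_duration=timedelta(days=3),
    gap_duration=timedelta(days=1),
)

train_idx = [i for i, d in enumerate(dates) if in_window(d, train_window)]
valid_idx = [i for i, d in enumerate(dates) if in_window(d, valid_window)]
```

With these inputs the sixth day falls in the gap and belongs to neither set.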

MarcoGorelli · 2024-05-12T14:33:06Z

sklego/model_selection.py

-        if not date_serie.index.is_unique:
-            raise ValueError("date_serie doesn't have a unique index")


this is checked as part of nw.maybe_align_index

FBruzzesi

Another very satisfying migration 🙌🏼
I left only one pedantic question 🙃

FBruzzesi · 2024-05-12T16:03:17Z

sklego/model_selection.py

+    - Modin
+


Is CuDF breaking for some reason? It seems we are not doing any checking on X (check_array I am looking at you)

again, thanks for asking - indeed, cuDF should be in the list, as to_numpy() is called explicitly here, and they do support that https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/api/cudf.dataframe.to_numpy/

it's __array__ (which gets triggered when you call np.asarray on an object) which they don't support, see #665 (comment). So, indeed, they should be included here
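
The distinction between explicit `to_numpy()` and implicit `__array__` conversion can be sketched without a GPU. The class below is a hypothetical stand-in that mimics the behaviour described in this thread; it is not cuDF's actual implementation.

```python
import numpy as np


class ExplicitOnlyFrame:
    """Hypothetical frame that allows explicit conversion but refuses implicit conversion."""

    def __init__(self, rows):
        self._rows = rows

    def to_numpy(self):
        # Explicit conversion: the caller asked for a NumPy array by name.
        return np.array(self._rows)

    def __array__(self, dtype=None, copy=None):
        # Implicit conversion: this is what np.asarray(obj) triggers.
        raise TypeError("implicit conversion to a NumPy array is not allowed")


frame = ExplicitOnlyFrame([[1, 2], [3, 4]])

explicit = frame.to_numpy()  # works

try:
    np.asarray(frame)
    implicit_ok = True
except TypeError:
    implicit_ok = False
```

Code that calls `to_numpy()` directly therefore works with such a library, while code that relies on `np.asarray` does not.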

* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <[email protected]>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <[email protected]>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <[email protected]>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>

make timegapsplit dataframe-agnostic

25b04f9

MarcoGorelli force-pushed the timegapsplit-agnostic branch from 4eda210 to 25b04f9 Compare May 12, 2024 14:09

MarcoGorelli changed the title ~~WIP Make Timegapsplit dataframe-agnostic~~ Make Timegapsplit dataframe-agnostic May 12, 2024

MarcoGorelli marked this pull request as ready for review May 12, 2024 14:14

FBruzzesi approved these changes May 12, 2024

FBruzzesi reviewed May 12, 2024

FBruzzesi mentioned this pull request May 12, 2024

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed

actually, include cuDF

5edcaec

FBruzzesi merged commit d09fba5 into koaning:narwhals-development May 12, 2024
16 checks passed
