Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Timegapsplit dataframe-agnostic #668

Merged

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented May 12, 2024

Description

This was quite a fun one! As a precursor, I added nw.maybe_align_index and nw.maybe_set_index, so that the automated index alignment can be preserved for pandas users, but Polars can still be supported (so long as date_serie and X / y are the same length), as discussed here

Another step towards #658

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (ruff)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

@MarcoGorelli MarcoGorelli force-pushed the timegapsplit-agnostic branch from 4eda210 to 25b04f9 Compare May 12, 2024 14:09
@MarcoGorelli MarcoGorelli changed the title WIP Make Timegapsplit dataframe-agnostic Make Timegapsplit dataframe-agnostic May 12, 2024
Comment on lines -170 to +192
X_train_df = X_index_df[
(X_index_df["__date__"] >= start_date) & (X_index_df["__date__"] < current_date + self.train_duration)
]
X_valid_df = X_index_df[
(X_index_df["__date__"] >= current_date + self.train_duration + self.gap_duration)
& (
X_index_df["__date__"]
< current_date + self.train_duration + self.valid_duration + self.gap_duration
)
]
X_train_df = X_index_df.filter(
nw.col("__date__") >= start_date, nw.col("__date__") < current_date + self.train_duration
)
X_valid_df = X_index_df.filter(
nw.col("__date__") >= current_date + self.train_duration + self.gap_duration,
nw.col("__date__") < current_date + self.train_duration + self.valid_duration + self.gap_duration,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this diff so pleasing 😍

@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 12, 2024 14:14
Comment on lines -85 to -86
if not date_serie.index.is_unique:
raise ValueError("date_serie doesn't have a unique index")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is checked as part of nw.maybe_align_index

Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another very satisfying migration 🙌🏼
I left only one pedantic question 🙃

Comment on lines 80 to 81
- Modin

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is CuDF breaking for some reason? It seems we are not doing any checking on X (check_array I am looking at you)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, thanks for asking - indeed, cuDF should be in the list, as to_numpy() is called explicitly here, and they do support that https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/api/cudf.dataframe.to_numpy/

it's __array__ (which gets triggered when you call np.asarray on an object) which they don't support they don't support #665 (comment) . So, indeed, they should be included here

@FBruzzesi FBruzzesi merged commit d09fba5 into koaning:narwhals-development May 12, 2024
16 checks passed
koaning pushed a commit that referenced this pull request May 24, 2024
* placeholder to develop narwhals features

* feat: make `ColumnDropper` dataframe-agnostic (#655)

* feat: make ColumnDropped dataframe-agnostic

* use narwhals[polars] in pyproject.toml, link to list of supported libraries

* note that narwhals is used for cross-dataframe support

* test refactor

* docstrings

---------

Co-authored-by: FBruzzesi <[email protected]>

* feat: make ColumnSelector dataframe-agnostic (#659)

* columnselector with test rufformatted

* adding whitespace

* fixed the fit and transform

* removed intendation in examples

* font:false

* feat: make `add_lags` dataframe-agnostic (#661)

* make add_lags dataframe-agnostic

* try getting tests to run?

* patch: cvxpy 1.5.0 support (#663)

---------

Co-authored-by: Francesco Bruzzesi <[email protected]>

* Make `RegressionOutlier` dataframe-agnostic (#665)

* make regression outlier df-agnostic

* need to use eager-only for this one

* pass native to check_array

* remove cudf, link to check_X_y

* feat: Make InformationFilter dataframe-agnostic

* Make Timegapsplit dataframe-agnostic (#668)

* make timegapsplit dataframe-agnostic

* actually, include cuDF

* feat: make FairClassifier data-agnostic (#669)

* start all over

* fixture working

* wip

* passing tests - again

* pre-commit complaining

* changed fixture on test_demographic_parity

* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)

* make pandas dtype selector df-agnostic

* bump version

* 3.8 compat

* Update sklego/preprocessing/pandastransformers.py

Co-authored-by: Francesco Bruzzesi <[email protected]>

* fixup pyproject.toml

* unify (and test!) error message

* deprecate

* update readme

* undo contribution.md change

---------

Co-authored-by: Francesco Bruzzesi <[email protected]>

* format typeselector and bump version

* feat: Make grouped and hierarchical dataframe-agnostic (#667)

* feat: make grouped and hierarchical dataframe-agnostic

* add pyarrow

* narwhals grouped_transformer

* grouped transformer eureka

* hierarchical narwhalified

* so close but so far

* return series instead of DataFrame for y

* grouped WIP

* merge branch and fix grouped

* future annotations

* format

* handling negative indices

* solve conflicts

* hacking C

* fairness: change C values in tests

---------

Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Magdalena Anopsy <[email protected]>
Co-authored-by: Dea María Léon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants