-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC Use dataframe api to also support Polars and other dataframe libraries #597
Conversation
ba21681
to
ff53223
Compare
ff53223
to
5e25d71
Compare
Hi @MarcoGorelli, as mentioned privately, this is a very exciting shift for the data ecosystem, and thank you for showcasing how to approach it. My take on scikit-lego is that the adjustments are fairly tiny and quite easy to implement. The general concern is regarding what should be the process if some functionality is not implemented as api-standard. |
I'm also in favor of exploring this some more. While there are some dumps to discuss with other features (logging on a dataframe in Polars is different if the DF is lazy, I'd also need to double-check our fairness tools) ... but this initial change seems like a fair place to start! |
Thanks! Some parts may require a slight refactor, like if isinstance(df, pd.DataFrame):
# todo: check that `'__index__'` isn't already a column name
df.reset_index().rename(columns={df.index: '__index__'}) and then do a join based on the column I'll try taking this further then, let's see how far it can go!
I'd suggest a 3-phase approach:
|
c73d918
to
f14ba1c
Compare
f14ba1c
to
7d7fdfd
Compare
.github/workflows/dependencies.yml
Outdated
python-version: 3.8 | ||
python-version: 3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lowest version pandas to have __dataframe_consortium_standard__
is 2.1, which only support python 3.9+ - would python3.9+ be acceptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never know how reliable this is, yet from pypi stats it seems that the majority of downloads comes from 3.7 and 3.8 🫥
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah interesting - OK, keeping 3.8 then, we can work around that with a helper function, and only use the __dataframe_consortium_standard__
for 3.9+
timeline-wise, we're aiming for Feburary to make the first non-beta release
…to use-dataframe-api
…go into use-dataframe-api
this has gone quite out-of-sync, will update soon-ish. trying to get some updates into the api design itself first |
Description
Discussed with @FBruzzesi a bit on Discord, so I thought I'd try opening a PR to show what this would look like. Just starting off with the
pandas_utils
function, curious to get some initial feedback and to know if there'd be interest in this before moving on to the restThe idea is that, instead of hard-coding support for pandas, you could make use of the DataFrame Consortium API, which is (wip in progress!) defined here: https://data-apis.org/dataframe-api/draft/index.html
Then, any function which adheres to that API will:
Example
If you run
then it will print
, just like it does now. No change here.
However, you can now also run
and you'll get
(note how I had to add
collect
before printing the result, as the result is a Polars LazyFrame)Dependencies
Note how in sklego/dataframe_utils.py, it's now possible to remove
import pandas as pd
. If this API were used throughout the package, then you wouldn't even need pandas as a required dependency - scikit-lego would be truly dataframe-agnostic, and would use whichever dataframe package the user passes.All that would be needed would the dataframe-api-compat package. Note that it's light as a feather:
__dataframe_consortium_standard__
on)Checklist:
If you feel your PR is ready for a review, ping @koaning or @MBrouns.