RFC, WIP: demo: use dataframe api in datetimeencoder #786
Technically, `nanosecond` isn't part of the API standard (yet?). But I think you're primarily interested in pandas-Polars compatibility anyway (please correct me if I'm wrong).
Could you move the `nanosecond` handling into the dataframe API? It would make more sense to have it there and would help your users, IMHO.
What do you mean by "into the dataframe API"? Into the standard, or into the `dataframe-api-compat` package? It cannot be done without having the stakeholders of the specification process agree on that first. Moving it into the `dataframe-api-compat` package would be helpful, but maybe that package does not want to expose extensions to the standard in its public API, so it would have to wait for this new method to become part of the standard first. I think keeping this code where it currently is remains pragmatic for the time being (to avoid introducing a regression w.r.t. what is already done in the current code base).
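For illustration, a minimal sketch of what such a local fallback could look like, dispatching on the native column type. The helper below is hypothetical and not the code in this PR; note also that pandas and Polars give `nanosecond` different semantics, which a real implementation would have to reconcile:

```python
import pandas as pd
import polars as pl


def _nanoseconds(native_col):
    """Nanosecond component of a datetime column, via the native namespaces.

    Hypothetical fallback, since the dataframe API standard does not (yet)
    expose a `nanosecond` method.
    """
    if isinstance(native_col, pd.Series):
        # pandas: sub-microsecond part only, in the range 0-999.
        return native_col.dt.nanosecond
    if isinstance(native_col, pl.Series):
        # Polars: nanoseconds within the second, in the range 0-999_999_999.
        return native_col.dt.nanosecond()
    raise TypeError(f"Unsupported column type: {type(native_col)}")
```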
`dataframe-api-compat` currently only handles implementations of the dataframe API for pandas and Polars. Some other libraries, like cudf, plan to expose the dataframe API methods from their main namespace directly, so they wouldn't have a separate package. I've added `nanosecond` to `dataframe-api-compat`, but unless `nanosecond` is added to https://data-apis.org/dataframe-api/draft/index.html, there's no guarantee that other libraries (e.g. cudf) would add it. I'll try to get nanoseconds added to the spec though; that would make things cleaner.
Thank you for the clarification @MarcoGorelli
This requires some discussion too. Currently, you're allowing numpy arrays as inputs, but you're doing operations which numpy doesn't support (e.g. `dayofweek`). If the input is a numpy array, is it OK to just pick a dataframe library and construct the input using that? Here I'm just picking pandas.
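To illustrate the point (the array and column name below are made up), wrapping the numpy input in a pandas DataFrame is what gives access to operations like `dayofweek`, which plain `datetime64` arrays don't have:

```python
import numpy as np
import pandas as pd

# A plain numpy datetime64 input, as users may currently pass to the encoder.
X = np.array(["2021-01-01", "2021-01-02", "2021-01-05"], dtype="datetime64[ns]")

# numpy itself has no `dayofweek`; pandas (picked arbitrarily here) does.
df = pd.DataFrame({"datetime": X})
day_of_week = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
```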
That's a good point. We're refactoring the DatetimeEncoder in #784, and we don't plan to have efficient support for Polars in this PR since we're converting the input to numpy right away. So, to answer your question, you won't need to call a dataframe constructor. However, we will revisit this so that Polars and pandas can be used without numpy conversions.
I am wondering: do you expect many users to use skrub with non-dataframe inputs? If not, maybe it's still time to stop accepting numpy arrays as input and only accept dataframes. That might make the code simpler and easier to maintain.
What is the idea behind `persist()`?
It makes sure that computations prior to this step won't be repeated: https://data-apis.org/dataframe-api/draft/API_specification/dataframe_object.html#dataframe_api.DataFrame.persist. In the end, the consortium preferred this to `collect`. It does pretty much the same thing anyway (in Polars it does `.collect().lazy()`).

Below, `if df.col(col_name).std() > 0` will trigger computation because it calls `(df.col(col_name).std() > 0).__bool__()`. That call produces a Python scalar, so it's necessary to trigger compute here (as opposed to keeping things lazy). By putting `persist` here, we ensure that the compute which gets triggered only happens from this point until the `__bool__` call.

Just saw you've already done a lot of work here to support Polars - well done! I'll rebase then.
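As a condensed sketch of the pattern being described (assuming `df` is already a standard-compliant dataframe and `col_name` an existing numeric column; neither is defined here):

```python
# Mark this point so that everything computed before it is not re-run.
df = df.persist()

# The comparison stays lazy until the `if` calls __bool__ on the scalar;
# that is what triggers compute, and only for the work since persist() above.
if df.col(col_name).std() > 0:
    ...  # e.g. keep this column as a feature
```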