RFC, WIP: demo: use dataframe api in datetimeencoder #786
Conversation
Hey @MarcoGorelli, thank you for this proof of concept!
Sorry, I just meant that the column names need to be strings. In pandas you can have literally anything as a column name, but in polars (and some other packages) column names have to be strings, so that's what the dataframe api requires. Anyway, I could always work around this in the pandas implementation so that it would be a strict superset of the dataframe api standard. Sorry for the noise then, this point may have been irrelevant 😄
IMO it's reasonable to require column names to be strings. pandas supporting any hashable object as a column name is rarely useful and, in my experience, sometimes a source of bugs in client code.
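For illustration, the workaround mentioned above might look like this (a sketch, not skrub code):

```python
import pandas as pd

# pandas accepts non-string column names, so cast them to strings before
# handing the frame to the dataframe api (or to polars)
df = pd.DataFrame({0: [1, 2], 1: [3, 4]})     # integer column names
df.columns = [str(c) for c in df.columns]     # now "0", "1" - valid everywhere
```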
Fine with me, but ideally leave an object with more open inputs that enables renaming the columns (our future "SelectCols" object).
I've updated based on recent Dataframe API changes, and added some commentary
The hardest part about the Dataframe API is `.collect`. For now I'm just doing

    if hasattr(X, 'collect'):
        X = X.collect()

if/when the Consortium agrees on something, we can use that instead.
Also, I've updated `dataframe-api-compat` to allow for non-string column names, so long as they're unique - technically outside the standard, but it means adoption would result in less breakage.
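Put together, the workaround sketched above looks roughly like this (illustrative only; the `api_version` string is the one used later in this PR):

```python
# materialise lazy frames first, then enter the consortium-standard namespace
if hasattr(X, "collect"):     # e.g. a polars LazyFrame
    X = X.collect()
X = X.__dataframe_consortium_standard__(api_version="2023.11-beta")
```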
setup.cfg
Outdated
@@ -28,7 +28,8 @@ install_requires =
     scikit-learn>=1.2.1
     numpy>=1.23.5
     scipy>=1.9.3
-    pandas>=1.5.3
+    pandas>=2.1.0
+    dataframe-api-compat>=0.1.23
This is extremely light - it's just a bunch of pure-Python files which wrap around the pandas and Polars APIs: https://pypi.org/project/dataframe-api-compat/0.1.23/#files
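For context, with dataframe-api-compat installed, both pandas (>=2.1) and Polars frames can enter the standard namespace through the same entry point (a sketch; the version string follows the one used later in this PR):

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3]})
plf = pl.DataFrame({"a": [1, 2, 3]})

# same entry point for both libraries, backed by dataframe-api-compat
std_pd = pdf.__dataframe_consortium_standard__(api_version="2023.11-beta")
std_pl = plf.__dataframe_consortium_standard__(api_version="2023.11-beta")
```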
if hasattr(date_series, 'nanosecond'):
    return date_series.nanosecond()
else:
    raise AttributeError(
        "`nanosecond` is not part of the DataFrame API and so support is not guaranteed across all libraries. "
        f"In particular, it is not supported for {date_series.__class__.__name__}"
    )
Technically, `nanosecond` isn't part of the API standard (yet?). But I think you're primarily interested in pandas-Polars compatibility anyway (please correct me if I'm wrong).
Could you move the `nanosecond` handling into the dataframe API? It would make more sense to have it there and would help your users, IMHO.
What do you mean by "into the dataframe API"? Into the standard? Or the `dataframe-api-compat` package? It cannot be done without having the stakeholders of the specification process agree on that first.
To move it into the `dataframe-api-compat` package would be helpful, but maybe that package does not want to expose extensions to the standard in its public API. So it would have to wait for this new method to become part of the standard first.
I think keeping this code there is pragmatic for the time being (to avoid introducing a regression w.r.t. what is already done in the current code base).
`dataframe-api-compat` currently only handles implementations of the Dataframe API for pandas and Polars. Some other libraries, like cudf, plan to expose the dataframe API methods from their main namespace directly, so they wouldn't have a separate package.
I've added `nanosecond` to `dataframe-api-compat`, but unless `nanosecond` is added to https://data-apis.org/dataframe-api/draft/index.html, there's no guarantee that other libraries (e.g. cudf) would add it.
I'll try to get nanoseconds added to the spec though, it would make things cleaner.
Thank you for the clarification @MarcoGorelli
skrub/_datetime_encoder.py
Outdated
if hasattr(X, 'collect'):
    X = X.collect()
Now this is where it gets properly controversial, because `.collect` isn't part of the Dataframe API.
There's a discussion here if you're interested: data-apis/dataframe-api#294 (comment) - hope we can find a resolution soon-ish. Any input there would be greatly appreciated.
The reason `collect` is necessary here is that, a few lines down, you're calling `float` on a scalar.
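To illustrate why (in plain polars rather than the standard namespace): a lazy frame holds no values, so nothing can be turned into a Python float until it is collected.

```python
import polars as pl

lf = pl.LazyFrame({"total_time": [1.0, 2.0, 3.0]})
# float(lf.select(pl.col("total_time").std()))   # would fail: nothing has been computed yet
value = float(lf.collect()["total_time"].std())  # materialise first, then reduce to a float
```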
This makes sense. You're anticipating the reflection on the lazy API on our part, which is great!
Ideally, we would like to release skrub soon with weak support of Polars (meaning we will accept it as input but may convert it to numpy or pandas). Later, we'll move to stronger support (limiting Polars conversions to the strictly necessary, e.g., before calling a scikit-learn estimator).
skrub/_datetime_encoder.py
Outdated
-if np.nanstd(self._extract_from_date(X[:, i], "total_time")) > 0:
+if float(self._extract_from_date(column, "total_time").std()) > 0:
In the Dataframe API, column reductions (e.g. `Column.std`) return ducktyped Scalars. If you want to use them within Python (e.g. here, within an if statement), then you need to call `float` on them to materialise them to Python.
That's very enlightening. I guess this will encourage us to use tricks like branchless code when possible to keep the laziness.
Thanks - correction here, this can be simplified, because if we write

    if self._extract_from_date(column, "total_time").std():

then Python will implicitly call

    if self._extract_from_date(column, "total_time").std().__bool__():

which will materialise the scalar to Python anyway.
Yes, in general, if there's a way to avoid working with Python scalars, that's better.
But my suspicion is that, in general, `fit` methods will need to materialise at some point (counter-examples welcome), whereas `transform` methods may be able to stay lazy (at least, the ones that don't go via numpy).
skrub/_datetime_encoder.py
Outdated
if hasattr(X, 'collect'):
    X = X.collect()
Note how this time, in `transform`, I only needed to call `.collect` much later, right at the end of the function. I need to call it because you want to return a numpy array, but I can put it as late as possible, and the preceding operations can be done lazily.
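The shape of that pattern, sketched in plain polars for illustration (skrub's actual code goes through the standard namespace, and the column name is made up):

```python
import polars as pl

def transform(frame):  # polars DataFrame or LazyFrame with a "date" column
    out = frame.with_columns(pl.col("date").dt.year().alias("year"))  # stays lazy for LazyFrames
    if hasattr(out, "collect"):
        out = out.collect()   # materialise only now, right before leaving dataframe land
    return out.to_numpy()     # the encoder's contract here is to return a numpy array
```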
OK, there's a discussion we need to have on returning dataframes by default, as we do with `Joiner` or `AggJoiner`. If a user inputs a LazyFrame, they expect a LazyFrame in return when possible. When it's not possible, we should follow the scikit-learn design and return a numpy array by default but allow the user to set the output (either pandas or Polars).
Since outputting Polars dataframes is still under discussion at scikit-learn/scikit-learn#25896, we could implement a workaround in skrub.
> When it's not possible, we should follow the scikit-learn design and return a numpy array by default but allow the user to set the output (either pandas or Polars).
Since skrub does not have to deal with backward compat and will primarily be used with dataframe inputs, maybe its default policy could be to return an output with the same container type as its input (at least for the estimators for which it would be efficient to do so)?
Are there transformers in skrub that output data structures that could not efficiently be written by composing column-wise dataframe operations? (e.g. row-wise operations on contiguous rows in numpy arrays)
Are there transformers in skrub that cannot output dataframes (e.g. because they can only output scipy sparse matrices that cannot be efficiently represented by a dataframe)?
If you could return a dataframe as output, that would be 110% awesome, as then `.transform` could be completely lazy (on the other hand, I think `.fit` would practically always need to materialise data at some point, at least for the parts of skrub I've looked at so far).
Vincent and I were just talking about this. It would be nice if skrub could work on dataframes as much as possible and delay converting to numpy until it is really necessary (eg in a scikit-learn estimator), so that:
- columns of different types can move through the pipeline without needing to cast everything to a common type
- when the dataframe contains several pandas blocks or arrow arrays we don't need to copy everything into a numpy array
- in the long term there is some chance of taking advantage of polars LazyFrames
> maybe its default policy could be to return an output with the same container type as its input (at least for the estimators for which it would be efficient to do so)?
yes and ideally wherever possible use the original container internally rather than converting to a different one (and most likely incurring a copy) and then converting back
> Are there transformers in skrub that output data structures that could not efficiently be written by composing column-wise dataframe operations? (e.g. row-wise operations on contiguous rows in numpy arrays)
ATM I don't see any but I guess there will be a point in most pipelines where a numpy array is needed, eg when we reach some scikit-learn estimator. Still it would be useful to delay that creation of a (contiguous, homogeneous) numpy array as much as possible
> Are there transformers in skrub that cannot output dataframes (e.g. because they can only output scipy sparse matrices that cannot be efficiently represented by a dataframe)?
It could be the case for the TableVectorizer, as it allows users to choose the transformers - e.g. a user could use a TfidfVectorizer as the `high_card_cat_transformer`. But I think (I may be wrong) that dense data is the main envisioned use case; at least the default transformers output dense arrays. Also, IIUC, the TableVectorizer is likely to go later in a pipeline than some other skrub transformers, as its goal is to transform the input dataframe into a numerical array that can then be fed to scikit-learn.
With Jérôme, we identified two main groups of estimators in skrub: encoders (GapEncoder, MinHashEncoder, SimilarityEncoder), which could have been implemented in scikit-learn, and "assemblers" (naming TBD), which only work with dataframes as input (Joiner, TableVectorizer, ColsSelector).
if not hasattr(X, "__dataframe_consortium_standard__"):
    X = check_input(X)
    X = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])])
This requires some discussion too.
Currently, you're allowing numpy arrays as inputs, but you're doing operations which numpy doesn't support (e.g. `dayofweek`) - if the input is a numpy array, is it OK to just pick a dataframe library and construct the input with that? Here I'm just picking pandas.
That's a good point. We're refactoring the DatetimeEncoder in #784, and we don't plan to have efficient support of Polars in this PR since we're converting the input to numpy right away.
So, to answer your question, you won't need to call a dataframe constructor. However, we will revisit this so that Polars and Pandas can be used without numpy conversions.
I am wondering: do you expect many users to use skrub with non-dataframe inputs? If not, maybe it's still time to stop accepting numpy arrays as input and only accept dataframes. That might make the code simpler and easier to maintain.
Hey Marco, thank you for the heads-up :) I think it's only a matter of time before we jump on the dataframe-api bandwagon. Is the join method functional already?
Yup, should be! At least, it's in both the spec and the implementation.
if hasattr(X, 'collect'):
    X = X.collect()
X = X.__dataframe_consortium_standard__(api_version='2023.11-beta')
X = X.persist()
What is the idea behind `persist()`?
It makes sure that computations prior to this step won't be repeated: https://data-apis.org/dataframe-api/draft/API_specification/dataframe_object.html#dataframe_api.DataFrame.persist . In the end, the consortium preferred this to `collect`. It does pretty much the same thing anyway (in Polars it does `.collect().lazy()`).
Below, `if df.col(col_name).std() > 0` will trigger computation because it calls `(df.col(col_name).std() > 0).__bool__()`. That command produces a Python scalar, so it's necessary to trigger compute here (as opposed to keeping things lazy).
By putting `persist` here, we ensure that the compute which gets triggered only happens from this point until the `__bool__` call.
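A compact sketch of that sequencing, using only names that appear in this PR (the "total_time" column is illustrative):

```python
X = X.__dataframe_consortium_standard__(api_version="2023.11-beta")
X = X.persist()                      # work before this point is computed once, not repeated
if X.col("total_time").std() > 0:    # ducktyped scalar; the `if` calls __bool__ and
    has_variance = True              # triggers compute only from the persist() point onward
```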
Just saw you've already done a lot of work here to support Polars - well done! I'll rebase then
Closing for now, as this is really out-of-date. I'm trying to turn things around in the dataframe api because I'm not exactly ecstatic about how it's going (data-apis/dataframe-api#346): trying to squash the concepts of "column" and "expression" into a single API leads to some very awkward code and error messages, as you've seen. Thanks anyway for your feedback - if you're still interested, we can revisit if/when it's in a better shape.
Hi @MarcoGorelli, I'm really sorry for the late reply! Somehow I missed this comment.
From your comments here and our experimentations with the dataframe API (eg #885, #827), it seems it might be a little while before we can heavily rely on the dataframe API in skrub -- we'll definitely keep monitoring its progress closely. In the meantime, support for polars remains a high priority. As in the short term we are targeting pandas and polars only, it should be manageable to just write the utilities we need once for polars and once for pandas. So the proposed solution for now is to write 2 implementations for skrub internal functions that need to be specialized and use single dispatch (#888); I'd be interested to know what you think.
As we extend our support for polars dataframes, I think we may need to adapt a bit the way data is processed in skrub / scikit-learn pipelines to be more compatible with polars expressions and have a chance of eventually benefiting from the lazy API where feasible. This is super vague 😅 but I know we could benefit a lot from your advice on how to use polars well in skrub, and I'll try to ping you with more specific questions in other issues/PRs.
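As a rough illustration of the single-dispatch idea mentioned above (the function name and the "extract the year" operation are made up for the example):

```python
from functools import singledispatch
import pandas as pd
import polars as pl

@singledispatch
def extract_year(col):
    raise TypeError(f"unsupported column type: {type(col).__name__}")

@extract_year.register
def _(col: pd.Series):
    return col.dt.year      # pandas: property on the .dt accessor

@extract_year.register
def _(col: pl.Series):
    return col.dt.year()    # polars: method on the .dt namespace
```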
Hey @jeromedockes ! I'm afraid I've given up on the dataframe standard project (see data-apis/dataframe-api#346)
Great! And always feel free to book some time if you'd like to chat about anything: https://calendly.com/marcogorelli
> I'm afraid I've given up on the dataframe standard project (see data-apis/dataframe-api#346)
I'm really sorry to hear that.
> Great! And always feel free to book some time if you'd like to chat about anything: https://calendly.com/marcogorelli
thanks, I will!
I'm not expecting this to be merged, but following some conversation with @Vincent-Maladiere I just wanted to demo how the dataframe-api could work here, and this can get the ball rolling for how to proceed.
Some questions which I thought of while doing this:
- I need to call `.collect` in order to be able to materialise the dataframe to an array - but I can do that as late as possible in the function. See `transform` for an example.
- (`namespace.col`)
- (`X.select`), and the insertion can happen in parallel in some implementations (e.g. Polars)

Note that, in Polars, converting a single column to a 1-d ndarray can be zero-copy (numeric data with no `None`s), but converting an entire dataframe to a 2-d ndarray always copies data.
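For illustration of that last point (a sketch in plain polars):

```python
import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
col = df["a"].to_numpy()   # single numeric column with no nulls: can be zero-copy
mat = df.to_numpy()        # whole frame to a 2-d ndarray: always copies
```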