Add example of a sklearn like pipeline #294
@kkraus14 just to clarify about the scalar issue - should this be written with `float(...)` instead, so that `__float__` is called and each implementation knows what to do (e.g. trigger compute, raise, return an eager value, ...)?
No, that would force materialization unnecessarily. The `self.scalings_` items get defined via `scalings.get_column_by_name(column_name).get_value(0)`, which could return a ducktyped lazy scalar, and then this division could be lazy as well.

I.e. how this works in cuDF today is that the `get_value(0)` call returns a `cudf.Scalar` object which lives on the GPU, and then the `__div__` operation with the column can use that on-GPU value directly. Adding `float(...)` requires returning a Python float, which then prevents doing things like the above.
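The lazy-scalar idea described above can be pictured in plain Python. This is an illustrative duck type only, not cuDF's actual classes or the standard's API:

```python
# Illustrative sketch - not cuDF's real implementation. A duck-typed lazy
# scalar participates in division without ever becoming a Python float,
# so computation is deferred until someone explicitly asks for a value.
class LazyScalar:
    def __init__(self, compute):
        self._compute = compute  # zero-arg callable producing the value

    def __rtruediv__(self, other):
        # other / LazyScalar builds another deferred node instead of computing
        return LazyScalar(lambda: other / self._compute())

    def __float__(self):
        # the only place where compute is actually triggered
        return float(self._compute())

s = LazyScalar(lambda: 4.0)   # e.g. what get_value(0) might return
expr = 10.0 / s               # still lazy - nothing computed yet
result = float(expr)          # triggers compute: 2.5
```

Wrapping the result in `float(...)` at the call site is exactly the step that forces materialization; leaving the ducktyped object in place keeps the whole chain deferred.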
Thanks for explaining, OK with not using `float` here.
Sorry to belabour the point, but the "lazy scalar" here is the part which I'm just not getting.

A fairly common pattern I used to work with when I did data science was:

- train a model (the `fit` part, potentially expensive)
- make predictions (the `transform`/`predict` part, usually much cheaper)

E.g.: train a model somewhere, and save some weights. Then, say on a device (like a smart watch), make predictions using that model (and the saved weights).

By the time you're doing inference (`.transform`), the training data doesn't exist any more - at least, not in the environment you're doing inference on. So how can something calculated during the `fit` phase stay lazy?

I think this is what @cbourjau was getting at as well, saying that for `fit`, they need to materialise values.
I think that's an implementation detail of the library in question? The output of `fit` is effectively `self.scalings_`, which is a dictionary of ducktyped scalars. Those ducktyped scalars could effectively be an expression, and then the `__div__` in `.transform` just continues to build expressions until something triggers materialization, whether that's a call to something like `__float__` or `__bool__` or something else.
The issue I have is that I would typically do:

- `fit` in one environment
- `predict` in a different environment (e.g. on a smartwatch)

If the "something triggers materialization" happens during `predict`, then the expression is going to fail, because the dataframe which was used for training isn't available any more.

What I want to do is:

- `fit` in one environment, and force materialisation so I can export the model weights
- `predict` in a different environment, using the model weights calculated during the `fit` phase

The way the code is written in this PR achieves this by calling `float(...)` during the `fit` phase. And we may have to be OK with this if we can't agree on a solution which can be part of the standard.
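The export-the-weights workflow being described could look something like the following minimal sketch. The class and column names are made up for illustration; this is not the dataframe-standard API:

```python
# Hypothetical sketch: materialize during fit, so the stored weights are
# plain Python floats that can be serialized and shipped to a different
# environment for transform/predict. All names here are illustrative.
import json
import statistics

class Scaler:
    def fit(self, columns):
        # force materialization now, while the training data still exists
        self.scalings_ = {name: float(statistics.stdev(col))
                          for name, col in columns.items()}
        return self

    def transform(self, columns):
        # can run anywhere the exported weights are available - the
        # training data is no longer needed
        return {name: [v / self.scalings_[name] for v in col]
                for name, col in columns.items()}

train = {"x": [1.0, 2.0, 3.0]}
weights = json.dumps(Scaler().fit(train).scalings_)  # portable artifact
```

If `scalings_` instead held lazy expressions referencing the training dataframe, the `json.dumps` step (or any later `transform`) would be the point of failure once that dataframe is gone.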
I don't think that's the only option. I think we want a `.may_execute()` type method (ignore the exact method name here), which in the example above (and also @cbourjau's ONNX library, which has lazy arrays) can simply be a no-op, while Polars could alias it to `.collect()`. It would address the concern about implicit materialization by making it explicit.

It wouldn't be super ergonomic to write it as `.may_execute().to_array()`, but that's probably better than throwing - which isn't really a solution. This solution is kind of circling back to a small part of Marco's lazy/eager design, without the separate classes or expression objects.

Other thought on Polars: it could also choose to do an optimization pass, where e.g. for a block of code with multiple `may_execute`'s it figures out which ones have to collect and which ones don't. That would actually be generically useful outside of dataframe API support, because the manual "must execute" imperative nature of `.collect` is not so easy to write optimally (as we already discovered in another discussion on this repo).
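One way to picture the `.may_execute()` idea (the method name is just the thread's placeholder): an eager implementation makes it a no-op, while a lazy one aliases it to its collect step. A minimal sketch, with invented class names:

```python
# Illustrative sketch only - not any library's real API.
class EagerFrame:
    def __init__(self, data):
        self._data = data

    def may_execute(self):
        return self  # already materialized: nothing to do

    def to_list(self):
        return self._data

class LazyFrame:
    def __init__(self, thunk):
        self._thunk = thunk  # deferred computation

    def may_execute(self):
        # a Polars-style implementation would alias this to .collect()
        return EagerFrame(self._thunk())

# downstream library code is identical for both implementations:
def materialize(frame):
    return frame.may_execute().to_list()
```

The point is that materialization becomes explicit in the caller's code, while implementations with nothing to trigger (eager libraries, or lazy-array backends like ONNX) can satisfy the contract trivially.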
`may_execute` sounds fine, thanks.

btw, yesterday I saw a colleague write this. If I was -1 on automated collection before, now I'm more like -10 😄
Note that in `skrub.DatetimeEncoder.fit`, even Dask would throw an error telling the user to call `.compute`.

I've added an example of this in #307, as well as a proposal to add `maybe_execute` as a hint rather than as a directive.
If we have a 1:1 relationship between the dataframe and the array API standard, then we have successfully kicked the can down the road (i.e. tech stack) into the array API standard. Calling `to_array` will always trigger materialization if the associated array-API implementation is eager; if the associated array-API implementation is lazy, then it will remain lazy. Would this be acceptable for the pandas/Polars-backed implementations? I'm afraid I don't quite see the intention behind `may_execute`?

The above-described semantics would work well with ONNX as far as the dataframe API is concerned. We are left with the same question of explicit materialization on the array API level, though.
I'm afraid not, sorry: #307 (comment)