Separate eager and lazy APIs #249
Conversation
Thanks @MarcoGorelli for the detailed proposal + prototype! Here are some of the bigger picture questions that came to mind:
Please remember that to a large extent this is in the eye of the beholder. To me, your preferred version uses both a method name and syntax that I don't yet understand and would therefore have to look up. This, although more verbose, is much easier to understand for me - and arguably also for the average non-initiated reader:

```python
col_width = plant_statistics.get_column_by_name("sepal_width")
col_height = plant_statistics.get_column_by_name("sepal_height")
col_species = plant_statistics.get_column_by_name("species")
rows = plant_statistics.get_rows_by_mask((col_width > col_height) | col_species)
```

So while I'm not at all opposed to changes to naming for those who have to read it more, I'd keep in mind that this is not at all clear cut, and by itself not the strongest rationale for introducing new classes/objects.

That said, I don't think this particular rationale is needed at all here. The stronger rationale is that an expression is nicer to work with as an execution-independent abstraction, so it seems useful and preferable to a separate lazy-only expression object. Also, it was encouraging to hear that something expression-like may be in the works for Pandas.

Current impression: this is quite promising. There may be some things to clarify regarding semantics, but I like it so far.
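For comparison, a sketch of the expression-based spelling under discussion (using the proposal's `col` function; `filter` here stands in for `get_rows_by_mask`):

```python
col = plant_statistics.__dataframe_namespace__().col

rows = plant_statistics.filter(
    (col("sepal_width") > col("sepal_height")) | col("species")
)
```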
```python
            If any of the expression's expressions is not boolean.
        """

    def any(self, *, skip_nulls: bool = True) -> Expression:
```
Is it correct for reductions to be methods on `Expression`? The PR description suggests it isn't, because it calls out reductions as being specific to `EagerColumn`. Or, if this is meant to be a "lazy scalar" type of return, doesn't that give problems with returning an expression object on which most methods cannot be considered valid/callable?
Yup - an expression acts as a function from a dataframe to a column, and so here, it would just return a length-1 column. Then, broadcasting rules (which I've added to the docs) define the rest.

Here are some examples:
```python
In [1]: df
Out[1]:
     sepal_length  nice_plant
0             5.1       False
1             4.9       False
2             4.7       False
3             4.6       False
4             5.0       False
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2       False
149           5.9       False

[150 rows x 2 columns]

# just select a single length-1 column, so the result is a length-1 dataframe
In [2]: df.select(col('nice_plant').any())
Out[2]:
   nice_plant
0        True

# col('sepal_length') inserts a length-150 column, and
# col('nice_plant').any() inserts a length-1 column. So, the
# second one gets broadcast to the same length as the first one
In [3]: df.select(col('sepal_length'), col('nice_plant').any())
Out[3]:
     sepal_length  nice_plant
0             5.1        True
1             4.9        True
2             4.7        True
3             4.6        True
4             5.0        True
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2        True
149           5.9        True

[150 rows x 2 columns]

# both columns we insert are length-1, so the output is too
In [4]: df.select(col('sepal_length').mean(), col('nice_plant').any())
Out[4]:
   sepal_length  nice_plant
0      5.843333        True

# both columns are length-150, so the output is too
In [5]: df.select(col('sepal_length'), col('nice_plant'))
Out[5]:
     sepal_length  nice_plant
0             5.1       False
1             4.9       False
2             4.7       False
3             4.6       False
4             5.0       False
..            ...         ...
145           6.7        True
146           6.3        True
147           6.5        True
148           6.2       False
149           5.9       False

[150 rows x 2 columns]
```
Thanks! This is a huge relief, I kind of needed to hear something like that
Let's try :) I'll try to write down what this would look like.

We can try this out with a concrete pandas dataframe:

```python
In [11]: df = pd.DataFrame({'a': [1, 4, 2, 3, 5], 'b': [9, 1, 4, 2, 5]})

In [12]: df.loc[(lambda df: (lambda df: df.loc[:, 'a'])(df) > (lambda df: df.loc[:, 'b'])(df))(df)].reset_index(drop=True)
Out[12]:
   a  b
0  4  1
1  3  2
```

The usual rules would still apply - filtering with something not boolean would still raise.

I am indeed materialising the whole dataframe though: if I were only selecting two columns, then it would have been better to first select those two columns, then filter.

I expect the maxim "use collect as late as possible, and ideally just once" to be close-to-optimal for most cases - counterexamples would be very welcome (and very interesting!)
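A sketch of that maxim using this proposal's names (`col`, `collect`; the column names are illustrative):

```python
# good: stay lazy as long as possible, collect once at the end
result = df.filter(col('score') > 0).select('a', 'b').collect()

# worse: collecting first materialises everything, including columns
# that the subsequent select would have dropped
result = df.collect().filter(col('score') > 0).select('a', 'b')
```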
Not sure I see how this would work at all actually, even with infinite engineering resources. Say I do

```python
df: DataFrame
x = df.filter(col('score') > 0).select('a', 'b').collect().to_array_object()
```

I now have an array assigned to `x`.

It depends. Problem is, if we're defining a Python API, then we're not in a purely compiled language, and so we can't see what comes after any given call.
I'll try to make this clearer, thanks.
Not saying it needs to do everything eagerly - happy to let implementations figure out the details here - but it does need to have the ability to step out of lazy mode and produce something eager.
That's kind of the same argument as is sometimes used against Python's garbage collector, but it's a thin one.

I think this is why PyTorch/JAX/TF (and Numba for NumPy) have "lazy modes" that work via tracing/compiling decorators (`jax.jit`, `tf.function`, `numba.jit`, ...) rather than via a separate lazy API.

I think the "just once" is important, but there's a little more to it - it's not hard to write code that's worse than eager mode would be. In your example, you'd want to do

```python
eager_df = df.collect()
del df
```

Either that or avoid touching `df` again after collecting it.
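To illustrate the tracing-decorator point: this is how JAX (for example) sees a whole function before running it, rather than one call at a time (standard `jax.jit` usage, not specific to this proposal):

```python
import jax
import jax.numpy as jnp

@jax.jit  # the function body is traced as a whole, then compiled
def relu_sum(x):
    return jnp.where(x > 0, x, 0.0).sum()

print(relu_sum(jnp.array([1.0, -2.0, 3.0])))  # 4.0
```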
Sure, but what if it's not been allocated? E.g. if I did

```python
x = df.select(col('a').sort().head(5)).collect()
```

then the dataframe could optimise it to

```python
x = df.select(col('a').top_k(5)).collect()
```

under-the-hood, and never even allocate the full sorted column.

OK yes, it would if you then called `collect` again. Maybe we can be more precise in the guidance then - only call `collect` once.
If it's never been materialized in memory before there's nothing to cache. But that's not really a counterpoint, it's only a "there wasn't an opportunity to miss here, hence it wasn't missed".
This sounds like a good rule to me. No need for a linter, I'd say - it can be implemented simply by keeping track of an `_already_called` flag.
If I write

```python
df: DataFrame
x = df.select(col('a').sort().head(5)).collect().to_array_object(Int64())
```

then should the result be cached?

It depends on what comes next. But we don't know which case we're in - and so long as we're defining a Python API, then I don't know how we can. Anyway, sounds like we're in agreement on just documenting that `collect` should be called as little and as late as possible.

It would be a bit more complex, as it would need propagating - for example:

```python
df: DataFrame                    # _already_called is False
x = some_function(df).collect()  # output of some_function(df) gets _already_called True
return df.std().collect()       # allowed, as _already_called on df is still False
```

In any case, I don't think we can enforce that all implementations add such a flag - but we could make a dataframe-api-linter independently of what implementations do.

Anyway, we're getting a bit sidetracked here, and it sounds like we're in agreement on what to do with regards to `collect`.

@shwina any thoughts on this proposal? (no hurry)
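A minimal sketch of that bookkeeping (a hypothetical wrapper, not part of the standard; `_inner` and the warning text are made up for illustration):

```python
import warnings

class TrackedDataFrame:
    """Wraps an implementation's dataframe and flags repeated collection."""

    def __init__(self, inner, already_called: bool = False):
        self._inner = inner
        self._already_called = already_called

    def collect(self):
        if self._already_called:
            warnings.warn("collect() called more than once on the same lineage")
        self._already_called = True
        return self._inner.collect()
```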
```python
df1: DataFrame
df2: DataFrame
# problematic call:
df1.filter(df1.get_column_by_name('a') > df2.get_column_by_name('b'))
```

@MarcoGorelli Is this problematic call still possible using `PermissiveColumn`s?
Polars already has an eager dataframe class (`polars.DataFrame`) from which you can extract eager columns (`polars.Series`). On those objects, that call has no issue - no, there's no problem here. The call is only problematic for `polars.LazyFrame`.
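A small illustration of the eager case being fine (plain polars syntax; the frames and data are made up):

```python
import polars as pl

df1 = pl.DataFrame({'a': [1, 2, 3]})
df2 = pl.DataFrame({'b': [3, 2, 1]})

# both sides are concrete Series of equal length, so this boolean
# mask is well-defined and the filter just works
print(df1.filter(df1['a'] > df2['b']))
```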
P.S. I can see there have already been a lot of discussions about lazy computation, so I could easily have missed the one where this question was addressed - sorry for that. If there is one, feel free to just post a link to the right discussion (if that's more convenient).
That would be a performance footgun:

```python
In [1]: df1 = pl.LazyFrame({'a': np.random.randn(100_000_000)})

In [2]: df2 = pl.LazyFrame({'a': np.random.randn(100_000_000)})

In [3]: %timeit _ = df1.filter(pl.col('a')>0).collect()
146 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df1.with_row_count('idx').join(df2.with_row_count('idx'), on='idx').filter(pl.col('a_right')>0).drop('a_right', 'idx').collect()
821 ms ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

PS. no worries, I don't think there's an existing discussion with this question (but it might have been asked during the latest call - recordings are at https://github.com/data-apis/workgroup/issues/10, let us know if you need access). Overall, please do feel welcome to ask all the questions you'd like! 🙏
```python
def __gt__(self, other):
    if [some condition that checks whether `self` and `other` come from different dataframes]:
        result = self.join(other)
    [comparison implementation]
```

In the case where the columns do not require a join, the function is still as fast as before (unless the condition requires a lot of computational resources). In the case where a join is needed, we would execute it implicitly inside the function (without forcing the user to execute it explicitly before the comparison). In some situations this may be extremely inefficient, but for that we could provide a warning mode (off by default) which warns that an implicit join is being made and that it may make sense to do it explicitly in order to reuse the result. This way the code remains more attractive/familiar for beginners, but there are hints (via warnings) on how to write the code more efficiently. However, I don't know whether the use of warnings can be documented in the standard in principle.
Thanks for your suggestion - it's true that we could do that, but I'd like to make the case that we shouldn't.

The whole "it's fast in some cases and painfully slow in others and I have no idea why" thing in pandas is a common pain-point for users - it's often hard to guess which of two similar-looking operations will be fast. In general, the Consortium is serious about avoiding performance footguns (e.g., #131 (comment), forbidding iterating over the elements of a column).

If filtering one dataframe by another ends up being a common use-case, then we can always consider allowing it later on - but it's much easier to start strict and relax later than to do it the other way round.
Have updated, and done some renamings as discussed last time.

cc @jorisvandenbossche in case you have an opinion here (getting this right will help sort out some related issues).
```diff
@@ -166,11 +191,117 @@ def dataframe_from_2d_array(array: Any, *, names: Sequence[str], dtypes: Mapping
     """
     ...

+def any_rowwise(*columns: str | Column | PermissiveColumn, skip_nulls: bool = True) -> Column:
```
Having to type things as `Column | PermissiveColumn` feels like a design flaw to me, and will make downstream usage potentially annoying. I.e., for code that agnostically handles columns and dataframes, there are now 4 objects that people will need to type-check to understand what they're working with.

Given that `PermissiveColumn` is a superset of `Column`, maybe we could use `Column` as a base class and inherit from it in `PermissiveColumn`? Similarly for `DataFrame` and `PermissiveFrame`?
Is type checking the only concern here? If so, we could define (and export) a type alias, like we do for `DType`?

> code that agnostically handles columns and dataframes

If you have completely agnostic code, then I'd suggest just accepting `DataFrame` and leaving it up to the caller to convert to `DataFrame`. After all:

- if an end user passes `df_non_standard` to your function, then `df_non_standard.__dataframe_consortium_standard__()` returns a `DataFrame`
- if the function is only used internally, then you can control what you pass it. If you have a `PermissiveFrame`, you can call `.relax` to convert it to `DataFrame`
```diff
@@ -50,6 +60,21 @@
 implementation of the dataframe API standard.
 """

+def col(name: str) -> Column:
```
Since this isn't bound to a DataFrame, for libraries other than polars and ibis this will be a new concept that will require implementation and maintenance. Do you have a sense for what this would look like for Pandas for example?
Yup, take a look here: data-apis/dataframe-api-compat#13. It's surprisingly simple to just add the syntax.
This doesn't really look simple to me: https://github.com/data-apis/dataframe-api-compat/blob/76284fa158ffe0f21ab1758f46caf523427077a3/dataframe_api_compat/pandas_standard/pandas_standard.py#L81-L417
My concern is that we went from pushing for changes in polars to now pushing for changes in most other dataframe libraries. I would love some thoughts from other dataframe library maintainers here.
It's just a matter of recording some lambda calls and then unpacking them - e.g.

```python
df: DataFrame
col = df.__dataframe_namespace__().col
df = df.filter(col('a') > col('b')*2)
```

becomes

```python
df: pd.DataFrame
col_a = lambda df: df.loc[:, 'a']
col_b = lambda df: df.loc[:, 'b']
col_b_doubled = lambda df: col_b(df) * 2
mask = lambda df: col_a(df) > col_b_doubled(df)
df = df.loc[mask(df)]
```

If this lives in a separate namespace, then there's not really any extra maintenance that needs doing in the main implementation.
> I would love some thoughts from other dataframe library maintainers here.

For a start, it might be added to pandas (regardless of what the Consortium does) - check Joris' lightning talk from EuroSciPy: https://youtu.be/g2JsyNQgcoU?si=ax0ZINFQINf9a5jv&t=512
> My concern is that we went from pushing for changes in polars to now pushing for changes in most other dataframe libraries
Also, this isn't "apples to apples": asking Polars to add complexity to the query optimiser isn't comparable to keeping track of lazy column calls in a separate namespace.

Anyway, thanks for your input, and I hope you're keeping well on parental leave!
Sorry for the late response! I am looking at this proposal through the lens of an inherently lazy, ONNX-based implementation that can never evaluate eagerly.

I am worried that the current proposal encourages users to write code with eager collections in order to optimize the performance of some particular implementations. The plotting example is an instance where the collection is unavoidable, since the core business logic needs to process eager values. The other examples can technically remain lazy as far as I can tell.

It seems to me that we have two different kinds of "collections". One is a "collection hint" for the underlying implementation, and the other is an actual collection resulting in values in memory. From our perspective, it would be very beneficial to express this difference in the API. We would be happy to ignore any "collection hints", and would throw exceptions for any true "collect" call.

Another concern for me is the predictability and semver stability of a library's API. It would be great if we could type-hint on function boundaries that a function taking a `DataFrame` never eagerly computes a value. This way, changing a lazy implementation to an eager one would require an explicit semver-breaking API change.
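A sketch of the kind of signature-level contract being asked for (the function names are hypothetical; only the `DataFrame`/`PermissiveFrame` types come from the proposal):

```python
# Promises to stay lazy-compatible: nothing in here may materialise values.
def engineer_features(df: DataFrame) -> DataFrame:
    ...

# Signals that values may be materialised (e.g. for plotting or model fitting).
def fit_model(df: PermissiveFrame) -> None:
    ...
```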
```diff
@@ -4,11 +4,12 @@
 def my_dataframe_agnostic_function(df_non_standard: SupportsDataFrameAPI) -> Any:
     df = df_non_standard.__dataframe_consortium_standard__(api_version='2023.09-beta')
+    xp = df.__dataframe_namespace__()
```
For consistency with the other examples:

```diff
-xp = df.__dataframe_namespace__()
+namespace = df.__dataframe_namespace__()
```
```diff
 for column_name in df.column_names:
     if column_name == 'species':
         continue
-    new_column = df.get_column_by_name(column_name)
+    new_column = xp.col(column_name)
```
```diff
-new_column = xp.col(column_name)
+new_column = namespace.col(column_name)
```
```python
    df = expensive_feature_engineering(df)

    df_permissive = df.collect()
```
My understanding of the current discussion is that a call to `collect` may, but is not forced to, do any eager computation. Is this correct?
```python
    x_train = train.drop_columns("y").to_array_object(namespace.Float64())
    y_train = train.get_column_by_name("y").to_array_object(namespace.Float64())
    x_val = val.drop_columns("y").to_array_object(namespace.Float64())
    y_val = val.get_column_by_name("y").to_array_object(namespace.Float64())
```
The proposal reads:

> PermissiveFrame
> Like DataFrame, but must be able to support eager evaluation when evaluating `to_array_object`

This would render this function unusable for fully lazy implementations such as the ONNX-based one that we are working on, but there is no technical reason why this function can't be lazy. Note that our lazy data frame implementation is built on top of a lazy Array API implementation. We would therefore like to return a lazy array from the `to_array_object` call.

Sorry, I overlooked the call to `fit` further down 😨. Fitting will usually indeed require actual values. For the ONNX use case, we are primarily interested in lazily evaluating `predict` and `transform` calls.
```python
        ]
    )
    results_df: DataFrame = namespace.dataframe_from_2d_array(
        results,
```
Unrelated to this proposal, but for our use case we would also need a way to instantiate a lazy data frame from a lazy `results` tensor.

A real-world example that would take advantage of "cached" results would be our ONNX-based implementation. Consider the following (pseudo-code) example:

```python
# Construct a lazy data frame.
# The input columns are of type int64 and will be provided as user input at inference time
df = lazy_standard_df_from_user_input({"a": "int64", "b": "int64"})

features = expensive_feature_engineering(df)
std = features.std()

# Build the ONNX model
model = build_onnx_artifact(inputs=df, outputs={"features": features, "std": std})

import onnx
onnx.save(model, "my_model.onnx")
```

Running this ONNX model with user-provided input will run the feature engineering step only once in order to produce both outputs.
Thanks for the very fruitful discussion we just had! I'd like to provide an example that illustrates how function signatures in the current proposal can be used to indicate that functions require eager implementations, but not to signify that they are compatible with lazy implementations. Let's consider the eager case first:

```python
def do_things(df: PermissiveDataFrame) -> PermissiveDataFrame:
    ...
```

A function of this type would make it very clear that whatever happens inside it may require eager computation. This is fine, since some tasks such as plotting or fitting a model are usually inherently "eager". However, the current proposal would also allow the following implementation:

```python
def do_things_sneaky(df: DataFrame) -> DataFrame:
    return do_things(df.collect()).relax()
```

This is rather unfortunate for two reasons:

Might it be possible to have an inheritance-based solution where lazy and eager variants are distinguished at the type level?
Thanks for your comments and discussion @cbourjau, really productive and helpful! From yesterday's call, I think (please correct me if I'm wrong) that we're in agreement that there needs to be a way to say "values need materialising". In the ONNX-based dataframe case, you'd be able to use that to determine whether a given `collect` is just an optimisation hint (which you could ignore), or whether values really do need materialising (in which case you'd raise).

An example of when the latter is needed could be fitting an xgboost model (or indeed, as of 2023, any ML model... I'm not aware of any library which can train a lazy model with a lazy array, but please do correct me if I'm wrong).

This is a good point. Would just removing `relax` address it?

Do you have any thoughts on the rest of the proposal?
Technically, the ONNX standard does have provisions for training, but I'd say it is out of scope. I have never used it, but I think it would essentially require the Array/Dataframe API to standardize training-specific functions as well. Either way, your takeaway is correct. It would be great if it were possible to differentiate those two points by looking at a function signature.

I think it looks good, and given that it works for Polars I don't see an issue for our lazy use case.
Yes, I would also love to have this. Gonna think about this one a little more 🤔
@cbourjau just to clarify - would a type hint suffice, or would you need it to be a different type at runtime?

I'd really appreciate it if you could contribute an example to the repo (#281) of some functions you'd envisage people using the API for - I'm particularly interested in examples you may have of functions for which your library (what's it called, btw?) could stay completely lazy and have everything exported at the end.
Thank you @MarcoGorelli for the exploration. Admittedly, it is daunting to review a 2000+ line PR and I have not done a line-by-line review. It's my fault for not following this PR more closely since its earlier iterations. That being said, thank you for the details in the PR description - it does help understand the motivation behind this change, and gives a good idea about the implementation. While I can see how the changes introduced here solve the problem posed, I do feel they negatively impact the complexity and teachability of the standard. From a cuDF perspective, our long-term hope is not to maintain a plethora of APIs (native, standard DataFrame, standard PermissiveDataFrame), but rather a single API that looks more and more like the standard over time. I worry that introducing lazy semantics into the standard will make that increasingly difficult/impossible.
I am more than appreciative of, and awed at, the effort that has gone into this solution, so I do not suggest an alternative lightly. The goal is to disallow operations between columns originating from different DataFrames, à la Polars. As you say, it can be a performance footgun. Are folks open to simply saying that the operation is undefined by the existing standard? Implementations can choose to raise. The "blessed" way to do this would be:

```python
lhs = df1.select("a").to_array()
rhs = df2.select("b").to_array()
df.filter(column_from_1d_array(lhs > rhs))
```

It closes no doors for anyone and allows eager implementations to remain relatively untouched.
Thanks for your feedback.

This wouldn't quite work - `select` returns a dataframe, not something that converts to a 1D array.
Ah, right you are. I meant `get_column_by_name`.
Just for my understanding, are you expecting that future cuDF syntax will look like this

```python
lineitem = lineitem.assign(
    [
        (
            lineitem.get_column_by_name("l_extended_price")
            * (1 - lineitem.get_column_by_name("l_discount"))
        ).rename("l_disc_price"),
        (
            lineitem.get_column_by_name("l_extended_price")
            * (1 - lineitem.get_column_by_name("l_discount"))
            * (1 + lineitem.get_column_by_name("l_tax"))
        ).rename("l_charge"),
    ]
)
```

even though pandas (main namespace) won't*?

*it's not my decision. You're welcome to propose this to pandas; I just thought it would be right to give you a heads up that this is almost certainly going to be unanimously rejected by other pandas devs - please don't shoot the messenger.
There's more to it than that:

I'm open to that. In fact, I suggested it here: #224, but the question went unanswered.
There's no reason cuDF couldn't introduce the `col` expression syntax as well, though.
Sure, if you're happy to have both, then that sounds fine.

I don't think that having an API which makes certain operations look possible, but which raise in some implementations, makes for a particularly good user experience - but it seems that people are OK with that. We also don't solve the whole "when to materialise?" issue with the status quo, but people don't seem to consider that a real issue.

Closing then - I'm skeptical about the current API's usability and usefulness, but I may well be wrong, so let's just wait for people to try using it and see if they report issues.
Apologies for the delayed reply! I had some very busy days! I created a PR with an example of how we envision using lazy ONNX-based data frames to trace inference logic in sklearn-like pipelines. Concerning the previous discussion, I hope that that example may illustrate the benefits of an API that can signal that a function stays lazy.

I think a hint would go a long way, since it would make a semver-breaking change quite obvious.

I'm afraid our library is not yet open source, but I hope that will change in the next couple of months!
Just a brief comment, since it's really tricky to collect (pun intended) and express all my thoughts. First, this is a hard and complex topic, and I think closing this PR was a bit abrupt and probably premature. There's a desire to move ahead, and perhaps some time constraints to get to a "ready to adopt" state, but this is the single most difficult design topic so far, so we should be ready to spend a bit more time here. I don't want to hit "reopen" now, but we should discuss and disentangle things here. Leaving this unresolved and then putting it in some of the library-specific implementations or the compat layer isn't going to help us in the long run.

There are at least 3 different topics here:

My point of view is that we can resolve (1) and (2) quickly, and then with those distractions gone it may be easier to focus on (3).
Thanks for your comments.

How do you propose to resolve (2) quickly? The use-case I'm particularly interested in addressing is the same one Christian brought up - you need to materialise data for fitting an ML model.
None of those are actually in this API. What is happening there is that you're leaving dataframe-land and converting to another kind of object entirely. All of this is fully compatible with keeping the dataframe API itself fully lazy.
Would it be acceptable to add something along these lines to the standard?
A lot of thought and effort is going into this, so I hope people will be open to at least considering it. No worries if it's rejected, but I hope that the rejection can come with an alternative solution to the problems with the status quo.
Why?
The current API makes it look like certain operations are possible. However, for some implementations (e.g. dataframes as frontends to SQL backends, polars lazyframes), the following would either have to raise, or trigger an implicit join (a "performance footgun"):
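(The "problematic call" in question, as shown earlier in this thread:)

```python
df1: DataFrame
df2: DataFrame
# problematic call:
df1.filter(df1.get_column_by_name('a') > df2.get_column_by_name('b'))
```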
The problem is that the current API makes it look like it should run fine.
What's the suggested solution?
The basic idea is that, instead of only having `DataFrame` and `Column`, there will now be 4 classes:

- `DataFrame`
- `Column`
- `PermissiveFrame`
- `PermissiveColumn`

DataFrame
Completely agnostic to execution details (so, some implementations may choose to use lazy execution for it).
Initialisation:

- `namespace.dataframe_from_dict`
- `namespace.dataframe_from_2d_array`
- `df.__dataframe_consortium_standard__()`

Individual standalone columns cannot be extracted. All operations must be done on the dataframe level, using `Column`s to refer to individual columns.

Column
Can be used to create new columns in a dataframe, update a dataframe's column(s), or filter a dataframe's rows. Examples:
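For instance (a sketch using `assign`/`filter`/`rename` as they appear elsewhere in this PR; the column names are illustrative):

```python
df = df.assign([(col('a') + col('b')).rename('a_plus_b')])  # create a new column
df = df.filter(col('a') > col('b') * 2)                     # filter rows
```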
Formally, a Column is lazy and is evaluated within the context of a DataFrame.
The idea is that an expression-based call should behave a bit like its pandas desugaring (see the sketch below).
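(A sketch reconstructed from the lambda-desugaring example earlier in this thread:)

```python
# expression syntax
df.filter(col('a') > col('b'))

# behaves a bit like (pandas syntax):
df.loc[lambda df: df.loc[:, 'a'] > df.loc[:, 'b']]
```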
This has two benefits; in particular, it resolves the "problematic call" from the "Why?" section, as that call would become impossible to write.
Initialisation:

- `namespace.col`
- functions returning `Column`s (`namespace.any_rowwise`, `namespace.all_rowwise`, `namespace.sorted_indices`, `namespace.unique_indices`)

Supports all functions on `PermissiveColumn`, but with the following differences:

- reductions (e.g. `col('a').mean()`) return a length-1 `Column` rather than a Scalar
- no `to_array_object`
- no `.dtype`, as Columns are free-standing. If you want to know the dtype of a `DataFrame`'s column, use `DataFrame.schema[column_name]`
- no `name` - instead, there are:
  - `root_names`: the column names from the dataframe to consider when creating the output column
  - `output_name`: the name of the resulting column

Note that `Column.len` is supported, which (lazily) evaluates the number of rows.

PermissiveFrame
Like `DataFrame`, but must be able to support eager evaluation when evaluating `to_array_object`. Other operations can stay lazy - if it wishes - but it needs to be able to evaluate eagerly when required.

Initialisation: can only be done via `DataFrame.collect()`.

Supports everything in `DataFrame` (except `collect`), but also:

- `to_array_object` (to produce a 2D ndarray)
- `relax` (to go back to `DataFrame`, which is totally agnostic to the execution mode and so might be lazy in some implementations)
- `get_column_by_name` (to create a single `PermissiveColumn`)

I've documented that `collect` should be called as little and as late as possible.

PermissiveColumn
PermissiveFrame.get_column_by_name
ser.__column_consortium_standard__()
namespace.column_from_sequence
namespace.column_from_1d_array
Just like
Column
, but:Column.mean()
) return scalarsto_array_object
(for creating a 1D array).dtype
propertyname
, and noroot_names
noroutput_name
Assorted
shape
I've removed
shape
as well to discouragedf.collect().shape
. It's not needed anyway:len(df.column_names)
namespace.col('a').len()
If anyone has a use-case for which this is really necessary, we can always add it in again later (adding extra functionality is backwards-compatible)
Column | PermissiveColumn
A
PermissiveColumn
can be thought of as a trivialColumn
. For example (pandas syntax)df.loc[pd.Series([True, False, True])]
can be thought of as the following Column:and so functions which accepts
Column
s (like.assign
/.filter
) should also acceptPermissiveColumn
s.A current goal of the Standard is to have the API be completely independent of execution details. This proposal would slightly relax that goal, so that only
DataFrame
andColumn
would be completely independent of execution detailsExamples
I've included some examples in the PR, which I think would generally be good to include in the repo anyway to demonstrate how this is meant to be used.
1. center dataset based on mean from training set
Function just transforms a `DataFrame`, so everything can happen at that level. No need for `collect`.
Function accepts Series, so
PermissiveColumn
needs using here3. Split dataset and train model
In order to train an ML model, conversion to ndarray will have to take place. So,
DataFrame.collect
is needed, but it only needs to be called once, after the more expensive calculations have been doneFAQ
Why
namespace.col
instead ofDataFrame.col
?I've stated in #244 that I'd like the Standard to feel familiar. Multiple libraries (pyspark, polars, ibis, siuba) already have something
namspace.col
-like, and so this would feel familiar to their users.If someone wants the dtype of a column of a
DataFrame
, they can dodf.schema[column_name]
(if #248 is accepted)Why does PermissiveFrame need
namespace.col
? Isn't that just a lazy thing?It can work for eager dataframes as well. If
plant_statistics
is anPermissiveFrame
, you can writeinstead of
and maybe there'd be some scope for limited and basic query optimisation in libraries which up until now have been completely eager (like pandas)
Can I try this out?
By the time I mark this as ready-for-review, I'm hoping it will be possible to try out data-apis/dataframe-api-compat#13
Why do we need
collect
at all? Can't everything be agnostic to execution details?Please see example 3
Why do we need
PermissiveColumn
at all?Please see example 2
See Also
Some related discussions happened at: