
Support converting logical plan to/from substrait #7404

Closed
universalmind303 opened this issue Mar 7, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@universalmind303
Collaborator

Problem description

Substrait is becoming the de facto cross-language spec for sharing logical plans across tools.

It would be nice if we could convert a logical plan to Substrait. Substrait supports both JSON and protobuf representations of the plan, so it would be ideal to support both of them as well.

lf = pl.scan_csv('./path/to/csv')

# serialize the logical plan to substrait
json_plan = lf.to_substrait_json()
proto_plan = lf.to_substrait()

# deserialize back to a polars lazyframe
lf = pl.from_substrait_json(json_plan)
lf = pl.from_substrait(proto_plan)
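
Since the plan is a protobuf message, the two representations can already be converted into one another via the standard protobuf JSON mapping. A minimal sketch, assuming the substrait Python package exposes the generated Plan message as substrait.proto.Plan:

from google.protobuf import json_format
from substrait.proto import Plan

def substrait_json_to_proto(json_plan: str) -> bytes:
    # Parse the JSON representation into a Plan message, then serialize.
    return json_format.Parse(json_plan, Plan()).SerializeToString()

def substrait_proto_to_json(proto_plan: bytes) -> str:
    # Decode protobuf bytes back into a Plan message, then emit JSON.
    plan = Plan()
    plan.ParseFromString(proto_plan)
    return json_format.MessageToJson(plan)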
@universalmind303 added the enhancement label Mar 7, 2023
@lucazanna

Hi @universalmind303,

My understanding is that this would allow people to write a query once, for example with the Polars API, then run the query with any engine they choose.

So if I need to run a Polars query on the Spark engine in the future, I could do so without manually translating any code (assuming Spark SQL supports Substrait).

Is that what this will allow?

@universalmind303
Collaborator Author

Yes, that is correct. You could go both ways with it too: build your query with any Substrait-compatible engine and run it via Polars.
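
To illustrate the other direction, a hedged sketch that builds a plan with an existing Substrait producer (ibis-substrait; the SubstraitCompiler usage below is an assumption about that package) and hands it to the proposed, not-yet-existing pl.from_substrait:

import ibis
import polars as pl
from ibis_substrait.compiler.core import SubstraitCompiler

# Define a query against a named table using the Ibis API.
t = ibis.table([("a", "int64"), ("b", "string")], name="t")
expr = t.filter(t.a > 10)

# Compile the Ibis expression to a Substrait protobuf plan.
proto_plan = SubstraitCompiler().compile(expr).SerializeToString()

# Hypothetical: run the plan on the Polars engine (proposed API above).
lf = pl.from_substrait(proto_plan)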

@lucazanna

lucazanna commented Mar 9, 2023

Pyarrow seems to have support for running Substrait plans: https://arrow.apache.org/docs/dev/python/api/substrait.html.
Building on top of it might help the development (at least on the Python side), but it would not help with Rust.
I'm also not sure how pyarrow's compute performance compares to Polars'; I imagine it would not be the same.
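
For reference, a minimal sketch of that pyarrow route, assuming proto_plan holds protobuf-encoded plan bytes whose read relation references a table named "t":

import pyarrow as pa
import pyarrow.substrait as pa_substrait

def table_provider(names, schema):
    # Resolve named tables referenced by the plan to in-memory Arrow data.
    if names == ["t"]:
        return pa.table({"a": [1, 2, 3]})
    raise KeyError(f"unknown table: {names}")

# run_query executes the serialized plan and returns a RecordBatchReader.
reader = pa_substrait.run_query(proto_plan, table_provider=table_provider)
result = reader.read_all()  # an ordinary pyarrow.Table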

That discussion aside: through my data consulting work I see lots of companies that would benefit from Polars but are hesitant to rewrite their code from Pandas to Polars, because that would take time.
Substrait could be an easy way to start with Polars and see the performance benefits, then move gradually to the more expressive Polars API over time.

Or the other way around: start with Polars on one machine, and switch the query engine to Spark if they need distributed processing, while keeping the expressive Polars API.

@sorhawell
Contributor

The py-polars scan_arrow_dataset (which produces literal Python pyarrow code and evaluates it in order to do predicate pushdown) and the deltalake integration (which depends on delta-rs and its dedicated Python interface) are currently difficult for me to reproduce and maintain in R-polars.

A rust-polars Substrait consumer/producer could make it easier and more maintainable to provide data connections from an xyz-polars spin-off to an xyz-query-engine.

I wonder what an implementation would look like in theory? :)

@orlp
Collaborator

orlp commented Apr 11, 2024

Polars is not a relational API, nor a relational engine; it is instead based on 'dataframe algebra', which is a superset of relational algebra. The crucial difference is that in relational algebra a table is a fundamentally unordered bag of tuples, whereas in a dataframe the rows have an explicit order. I do understand that Substrait has some basic properties regarding ordering (which is already a departure from relational algebra), but not to the extent Polars has in its current API and future roadmap.

For example, the following simple examples have no direct equivalent in relational algebra (but it can get much more complicated when nesting window functions, groupings, etc):

df.select(pl.col.a + pl.col.a.reverse())      # add a column to its own reversal, elementwise
df.select(pl.col.a.slice(0, pl.col.b.max()))  # slice length derived from another column's aggregate
df.select(pl.col.a.forward_fill())            # fill nulls with the previous value; order-dependent
df.select(pl.col.a.take_every(7).sum())       # keep every 7th row by position, then sum

Potentially in the (far) future we can support running Substrait plans on the Polars engine, but the reverse is unlikely to ever happen.

@orlp closed this as not planned Apr 11, 2024
@westonpace

Disclaimer: I'm on the Substrait SMC, was pointed here, and figured I'd chime in, so obviously I'm going to be biased :)

I don't think the Substrait project has any problem with extensions to support dataframe algebra or constructs for more sophisticated ordering. There are other backends/consumers that would probably appreciate this as well. For example, DataFusion consumes Substrait today, and some users use DataFusion for time series applications (which are streaming, order-dependent operations).

Dataframe ordering is not that hard to represent. The main difference between dataframe ordering and what is in most SQL engines is that it's an "implicit ordering", meaning it is not based on any column in the dataframe. You can often approximate dataframe ordering by attaching a row_id to the incoming data (though I think we'd be open to extending Substrait to avoid this by allowing the definition of implicit orderings).
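
A minimal sketch of that row_id approximation, in Polars itself (the rewrite is only illustrative; with_row_index is the current Polars API for materializing row numbers):

import polars as pl

df = pl.DataFrame({"a": [10, 20, 30]})

# Make the implicit row order explicit as a column.
df_with_order = df.with_row_index("row_id")  # row_id: 0, 1, 2

# An order-dependent operation like pl.col.a.reverse() can now be
# expressed over an unordered bag of tuples as a sort on row_id.
reversed_a = df_with_order.sort("row_id", descending=True).select("a")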

> For example, the following simple examples have no direct equivalent in relational algebra (but it can get much more complicated when nesting window functions, groupings, etc):

Are you talking about the eager API or the lazy API? I'm not sure your examples are valid for the lazy API.

For the eager API it's pretty simple. Each operation is a plan.

# df.select(pl.col.a + pl.col.a.reverse())
df2 = SELECT df.a FROM df ORDER BY row_id DESC
SELECT df.a + df2.a FROM df JOIN df2 ON df.row_id = df2.row_id

# df.select(pl.col.a.slice(0, pl.col.b.max()))
end = SELECT MAX_BY(row_id, b) FROM df
SELECT a FROM df WHERE row_id < end

# df.select(pl.col.a.forward_fill())
# forward fill can be implemented as a window function but, given
# most query engines haven't moved on to sophisticated window
# function implementations, you would probably be better off with
# a custom operator
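#
# One possible window-function formulation (a sketch; IGNORE NULLS
# support varies by engine):
#
# SELECT LAST_VALUE(a IGNORE NULLS) OVER
#   (ORDER BY row_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
# FROM df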

# df.select(pl.col.a.take_every(7).sum())
SELECT SUM(df.a) FROM df WHERE row_id % 7 == 0

Also, I would caution that this is a problem you will have to face at some point if you are committed to out-of-core processing, streaming APIs, or a lazy frame API. Note that df.select(pl.col.a + pl.col.a.reverse()) is going to be hard to implement in all of those situations.

I'm not saying that Substrait is the answer, or that it is going to be easy. However, I think rejecting substrait "because order" is a little premature.
