Support converting logical plan to/from substrait #7404
Comments
Hi @universalmind303, my understanding is that this will allow people to write a query once, for example with the Polars API. So if I need to run that query on the Spark engine in the future, I can do so without manually translating any code (provided Spark SQL supports Substrait). Is that what this will allow?
Yes, that is correct. You could go both ways with it too: build your query with any Substrait-compatible engine and run it via Polars.
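To make that round trip concrete, here is a purely hypothetical sketch; neither `to_substrait` nor `from_substrait` exists in Polars, and the names are placeholders for whatever API this issue would add:

```python
import polars as pl

# Build a plan once with the Polars lazy API.
lf = (
    pl.scan_parquet("sales.parquet")
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum())
)

# Producer direction (hypothetical, no such API exists today):
# plan_bytes = lf.to_substrait()   # hand these bytes to Spark, DataFusion, ...

# Consumer direction (hypothetical):
# lf2 = pl.LazyFrame.from_substrait(plan_bytes)
# df = lf2.collect()               # run a foreign plan on the Polars engine
```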
PyArrow seems to have support for running Substrait plans: https://arrow.apache.org/docs/dev/python/api/substrait.html. That discussion aside, in my data consulting work I see lots of companies that would benefit from Polars but are hesitant to rewrite their code from Pandas to Polars because of the time that would require. Or the other way around: start with Polars on one machine and switch the query engine to Spark when they need distributed processing, while keeping the expressive Polars API.
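For reference, a small sketch of the PyArrow consumer side mentioned above. The plan bytes are assumed to come from some Substrait producer, and `table_provider` is PyArrow's hook for resolving named tables; check the linked docs for the exact signature on your PyArrow version:

```python
import pyarrow as pa
import pyarrow.substrait as pa_substrait

def table_provider(names, schema):
    # Resolve a named table referenced by the plan; here a tiny in-memory table.
    return pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# `plan_bytes` would be a serialized substrait.Plan protobuf produced elsewhere.
# reader = pa_substrait.run_query(plan_bytes, table_provider=table_provider)
# result = reader.read_all()   # pyarrow.Table with the query result
```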
The py-polars scan_arrow_dataset (which produces literal Python pyarrow code and evaluates it in order to do predicate pushdown) and the Delta Lake integration (which depends on delta-rs and its dedicated Python interface) are currently difficult for me to reproduce and maintain in R-polars. A rust-polars Substrait consumer/producer could make it easier and more maintainable to provide data connections from any xyz-polars spin-off to any xyz query engine. I wonder what an implementation would look like in theory? :)
Polars is not a relational API, nor a relational engine; it is instead based on a 'dataframe algebra', which is a superset of relational algebra. The crucial difference is that in relational algebra a table is fundamentally an unordered bag of tuples, whereas a dataframe has an explicit row order. I do understand that Substrait has some basic properties regarding ordering (which is already a departure from relational algebra), but not to the extent Polars has in its current API and future roadmap. For example, the following simple examples have no direct equivalent in relational algebra (and it can get much more complicated when nesting window functions, groupings, etc.):

```python
df.select(pl.col.a + pl.col.a.reverse())
df.select(pl.col.a.slice(0, pl.col.b.max()))
df.select(pl.col.a.forward_fill())
df.select(pl.col.a.take_every(7).sum())
```

Potentially in the (far) future we can support running Substrait plans on the Polars engine, but the reverse is unlikely to ever happen.
Disclaimer: I'm on the Substrait SMC and was pointed here, so I figured I'd chime in. Obviously I'm going to be biased :) I don't think the Substrait project has any problem with extensions to support dataframe algebra or constructs for more sophisticated ordering. There are other backends/consumers that would probably appreciate this as well. For example, DataFusion consumes Substrait today, and some users use DataFusion for time series applications (which are streaming, order-dependent operations). Dataframe ordering is not that hard to represent. The main difference between dataframe ordering and what is in most SQL engines is that it's an "implicit ordering", meaning it is not based on any column in the dataframe. You can often approximate dataframe ordering by attaching something like a row_id to the incoming data (though I think we'd be open to extending Substrait to avoid this by allowing for the definition of implicit orderings).
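As a concrete illustration of the row_id workaround, a minimal sketch in current Polars (older versions spell `with_row_index` as `with_row_count`): once the implicit order is materialized as a column, order-dependent operations like `reverse` become ordinary sorts that a relational plan can express.

```python
import polars as pl

df = pl.DataFrame({"a": [10, 20, 30, 40]})

# Materialize the implicit row order as a real column.
indexed = df.with_row_index("row_id")

# df.select(pl.col("a") + pl.col("a").reverse()) can now be phrased without an
# implicit order: "reverse" becomes "sort a by row_id descending".
out = indexed.select(
    (pl.col("a") + pl.col("a").sort_by("row_id", descending=True)).alias("a_plus_rev")
)
print(out)  # 50, 50, 50, 50
```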
Are you talking about the eager API or the lazy API? I'm not sure your examples are valid for the lazy API. For the eager API it's pretty simple: each operation is a plan.
Also, I would caution that this is a problem you will have to face at some point if you are committed to out-of-core processing, streaming APIs, or a lazy frame API. Note that I'm not saying that Substrait is the answer, or that it is going to be easy. However, I think rejecting Substrait "because order" is a little premature.
Problem description
Substrait is becoming the de facto cross-language spec for sharing logical plans across tools.
It would be nice if we could convert a logical plan to and from Substrait. Substrait supports both JSON and protobuf representations of the plan, so it would be ideal to support both of them as well.
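For context, a Substrait plan is a protobuf message, so the JSON form is just the standard protobuf JSON encoding of the same message. A sketch using the substrait-python bindings; the import path of the generated Plan class is an assumption and may differ between package versions:

```python
from google.protobuf import json_format

# Assumed import path; check the substrait-python package for the exact
# location of the generated Plan message on your version.
from substrait.gen.proto.plan_pb2 import Plan

plan = Plan()  # in practice this would be produced from a Polars logical plan

proto_bytes = plan.SerializeToString()       # compact binary protobuf form
json_str = json_format.MessageToJson(plan)   # equivalent JSON form

# Round trip back from either representation:
plan_from_bytes = Plan.FromString(proto_bytes)
plan_from_json = json_format.Parse(json_str, Plan())
```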