
Support converting logical plan to/from substrait #7404

Closed
universalmind303 opened this issue Mar 7, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@universalmind303
Collaborator

Problem description

Substrait is becoming the de facto cross-language spec for sharing logical plans across tools.

It would be nice if we could convert a logical plan to Substrait. Substrait supports both JSON and protobuf representations of the plan, so it would be ideal to support both of them as well.

lf = pl.scan_csv('./path/to/csv')

# serialize the logical plan to substrait
json_plan = lf.to_substrait_json()
proto_plan = lf.to_substrait()

# deserialize back to a polars lazyframe
lf = pl.from_substrait_json(json_plan)
lf = pl.from_substrait(proto_plan)
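
Since the plan is a protobuf message, the two representations can already be converted into one another via the standard protobuf JSON mapping. A minimal sketch, assuming the substrait Python package exposes the generated Plan message as substrait.proto.Plan:

from google.protobuf import json_format
from substrait.proto import Plan

def substrait_json_to_proto(json_plan: str) -> bytes:
    # Parse the JSON representation into a Plan message, then serialize.
    return json_format.Parse(json_plan, Plan()).SerializeToString()

def substrait_proto_to_json(proto_plan: bytes) -> str:
    # Decode protobuf bytes back into a Plan message, then emit JSON.
    plan = Plan()
    plan.ParseFromString(proto_plan)
    return json_format.MessageToJson(plan)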
@universalmind303 added the enhancement label Mar 7, 2023
@lucazanna

Hi @universalmind303,

My understanding is that this would allow people to write a query once, for example with the Polars API, then run the query with any engine they choose.

So if I need to run a Polars query on the Spark engine in the future, I could do so without manually translating any code (assuming Spark SQL supports Substrait).

Is that what this will allow?

@universalmind303
Collaborator Author

Yes, that is correct. You could go both ways with it too: build your query with any Substrait-compatible engine and run it via Polars.
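
To illustrate the other direction, a hedged sketch that builds a plan with an existing Substrait producer (ibis-substrait; the SubstraitCompiler usage below is an assumption about that package) and hands it to the proposed, not-yet-existing pl.from_substrait:

import ibis
import polars as pl
from ibis_substrait.compiler.core import SubstraitCompiler

# Define a query against a named table using the Ibis API.
t = ibis.table([("a", "int64"), ("b", "string")], name="t")
expr = t.filter(t.a > 10)

# Compile the Ibis expression to a Substrait protobuf plan.
proto_plan = SubstraitCompiler().compile(expr).SerializeToString()

# Hypothetical: run the plan on the Polars engine (proposed API above).
lf = pl.from_substrait(proto_plan)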

@lucazanna

lucazanna commented Mar 9, 2023

Pyarrow seems to have support for running Substrait plans: https://arrow.apache.org/docs/dev/python/api/substrait.html.
Building on top of it might help the development (at least on the Python side), but it would not help with Rust.
I'm also not sure how pyarrow's compute performance compares to Polars'; I imagine it would not be the same.
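
For reference, a minimal sketch of that pyarrow route, assuming proto_plan holds protobuf-encoded plan bytes whose read relation references a table named "t":

import pyarrow as pa
import pyarrow.substrait as pa_substrait

def table_provider(names, schema):
    # Resolve named tables referenced by the plan to in-memory Arrow data.
    if names == ["t"]:
        return pa.table({"a": [1, 2, 3]})
    raise KeyError(f"unknown table: {names}")

# run_query executes the serialized plan and returns a RecordBatchReader.
reader = pa_substrait.run_query(proto_plan, table_provider=table_provider)
result = reader.read_all()  # an ordinary pyarrow.Table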

That discussion aside: through my data consulting work I see lots of companies that would benefit from Polars but are hesitant to rewrite their code from Pandas to Polars, because that would take time.
Substrait could be an easy way to start with Polars and see the performance benefits, then move gradually to the more expressive Polars API over time.

Or the other way around: start with Polars on one machine, and switch the query engine to Spark if they need distributed processing, while keeping the expressive Polars API.

@sorhawell
Contributor

The py-polars scan_arrow_dataset (which produces literal Python pyarrow code and evaluates it in order to do predicate pushdown) and the deltalake integration (which depends on delta-rs and its dedicated Python interface) are currently difficult for me to reproduce and maintain in R-polars.

A rust-polars Substrait consumer/producer could make it easier and more maintainable to provide data connections from an xyz-polars spin-off to an xyz-query-engine.

I wonder what an implementation would look like in theory? :)

@orlp
Collaborator

orlp commented Apr 11, 2024

Polars is not a relational API, nor a relational engine; it is instead based on 'dataframe algebra', which is a superset of relational algebra. The crucial difference is that in relational algebra a table is a fundamentally unordered bag of tuples, whereas in a dataframe the rows have an explicit order. I do understand that Substrait has some basic properties regarding ordering (which is already a departure from relational algebra), but not to the extent Polars has in its current API and future roadmap.

For example, the following simple examples have no direct equivalent in relational algebra (but it can get much more complicated when nesting window functions, groupings, etc):

df.select(pl.col.a + pl.col.a.reverse())      # add a column to its own reversal, elementwise
df.select(pl.col.a.slice(0, pl.col.b.max()))  # slice length derived from another column's aggregate
df.select(pl.col.a.forward_fill())            # fill nulls with the previous value; order-dependent
df.select(pl.col.a.take_every(7).sum())       # keep every 7th row by position, then sum

Potentially in the (far) future we can support running Substrait plans on the Polars engine, but the reverse is unlikely to ever happen.

@orlp closed this as not planned Apr 11, 2024
@westonpace

Disclaimer: I'm on the Substrait SMC, was pointed here, and figured I'd chime in, so obviously I'm going to be biased :)

I don't think the Substrait project has any problem with extensions to support dataframe algebra or constructs for more sophisticated ordering. There are other backends/consumers that would probably appreciate this as well. For example, DataFusion consumes Substrait today, and some users use DataFusion for time series applications (which are streaming, order-dependent operations).

Dataframe ordering is not that hard to represent. The main difference between dataframe ordering and what is in most SQL engines is that it's an "implicit ordering", meaning it is not based on any column in the dataframe. You can often approximate dataframe ordering by attaching a row_id to the incoming data (though I think we'd be open to extending Substrait to avoid this by allowing the definition of implicit orderings).
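
A minimal sketch of that row_id approximation, in Polars itself (the rewrite is only illustrative; with_row_index is the current Polars API for materializing row numbers):

import polars as pl

df = pl.DataFrame({"a": [10, 20, 30]})

# Make the implicit row order explicit as a column.
df_with_order = df.with_row_index("row_id")  # row_id: 0, 1, 2

# An order-dependent operation like pl.col.a.reverse() can now be
# expressed over an unordered bag of tuples as a sort on row_id.
reversed_a = df_with_order.sort("row_id", descending=True).select("a")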

> For example, the following simple examples have no direct equivalent in relational algebra (but it can get much more complicated when nesting window functions, groupings, etc):

Are you talking about the eager API or the lazy API? I'm not sure your examples are valid for the lazy API.

For the eager API it's pretty simple. Each operation is a plan.

# df.select(pl.col.a + pl.col.a.reverse())
df2 = SELECT df.a FROM df ORDER BY row_id DESC
SELECT df.a + df2.a FROM df JOIN df2 ON df.row_id = df2.row_id

# df.select(pl.col.a.slice(0, pl.col.b.max()))
end = SELECT MAX_BY(row_id, b) FROM df
SELECT a FROM df WHERE row_id < end

# df.select(pl.col.a.forward_fill())
# forward fill can be implemented as a window function but, given
# most query engines haven't moved on to sophisticated window
# function implementations, you would probably be better off with
# a custom operator
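#
# One possible window-function formulation (a sketch; IGNORE NULLS
# support varies by engine):
#
# SELECT LAST_VALUE(a IGNORE NULLS) OVER
#   (ORDER BY row_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
# FROM df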

# df.select(pl.col.a.take_every(7).sum())
SELECT SUM(df.a) FROM df WHERE row_id % 7 == 0

Also, I would caution that this is a problem you will have to face at some point if you are committed to out-of-core processing, streaming APIs, or a lazy frame API. Note that df.select(pl.col.a + pl.col.a.reverse()) is going to be hard to implement in all of those situations.

I'm not saying that Substrait is the answer, or that it is going to be easy. However, I think rejecting substrait "because order" is a little premature.
