Would e-graphs be useful for query optimization + rewrite-rules in query engines? #189

GavinRay97 · 2022-06-19T20:07:55Z

GavinRay97
Jun 19, 2022

As the title implies, I'm curious whether this would be a good usecase.
I don't know anything about e-graphs (or maths, for that matter) but naively it seems well-suited to the task.

I'm a bit more familiar with DB/query engine stuff, where I've never heard of this technique being used.
Curious why I've not run across it before.

Thank you =)

pavpanchekha · 2022-06-20T09:43:36Z

pavpanchekha
Jun 20, 2022

Potentially... One thing you need in a query planner that you'd need to find a way to integrate with the e-graph is statistics about various tables. So for example, a classic query planner optimization is ordering joins. E-graphs can help represent the space of possible joins, but to pick the best one you need to know about the size of individual tables. That would have to be added somehow.

0 replies

mwillsey · 2022-06-29T20:49:09Z

mwillsey
Jun 29, 2022
Maintainer

E-graphs are actually quite similar to the "memo table" in the CASCADES framework for query optimization, which is used many modern SQL engines. There are some differences: CASCADES is smarter/more specialized in how they do optimization, where as equality saturation is more general and does "breadth-first" exploration and then extraction. However, they are data structures for representing many equivalent ways to write a program.

0 replies

djziegler · 2022-08-22T12:50:14Z

djziegler
Aug 22, 2022

Hi Matthew: I am also writing a SQL optimizer using egg - here is an example of what I have ended up with so far: Given this source query: SELECT name, toy_name, weight FROM pet JOIN pet_toy USING (pet_id) JOIN toy USING (toy_id) WHERE toy.weight+7 = 99 AND pet.width = 7 I use ZetaSQL (a super excellent open source SQL analyzer from google) to transform to a typed, resolved, normalized SQL AST. Then I transform the AST into this egg: (project (outcols @2 @7 @8) computedcols (filter (outcols @1 @2 @3 @4 @5 @6 @7 @8) (and (= (+ @8 7) 99) (= @3 7)) (join (outcols @1 @2 @3 @4 @5 @6 @7 @8) (= @5 @6) // join expr (join (outcols @1 @2 @3 @4 @5) (= @1 @4) // Join expr (scan (outcols @1 @2 @3) pet) // Left input to join (scan (outcols @4 @5) pet_toy)) // Right input to join (scan (outcols @6 @7 @8) toy)))) There are no vararg things for anything requiring optimization (just for informational things like 'outcols'). Here is the optimization output: (project (outcols @2 @7 @8) computedcols (hashjoin (outcols @1 @2 @3 @4 @5 @6 @7 @8) @5 // left key expr @6 // right key expr (indexeqscan (outcols @6 @7 @8) toy @8 92) // Index scan of table 'toy' where Col @8 = 92 (hashjoin (outcols @1 @2 @3 @4 @5) @1 // left key expr @4 // right key expr (scan (outcols @4 @5) pet_toy) (indexeqscan (outcols @1 @2 @3) pet @3 7)))) The @/outcols stuff works like this: Leaf nodes can introduce cols ***@***.***), and these are accessible up the tree as long as they are propagated up through an (outcols) decl on the node. Introduced outcols ids are unique - for example if there are two scans of 'pet', they will introduce different out col ids. Arranging things in this way is a critical trick for optimization performance - it allows expressions to be moved between nodes without having to rewrite the embedded variable references for the new context. The 'outcols' thing (among other purposes) allows us to quickly check which @cols are bound by each source node. (this outcols style is also native to ZetaSQL). There are also @cols with a downward scope that are introduced for parameterized subquery expressions. Here are the filter transformation rules: rw!("filter-combine-with-and"; "(filter ?cols ?c1 (filter ?cols ?c2 ?r1))" => "(filter ?cols (and ?c1 ?c2) ?r1)"), rw!("filter-split-and"; "(filter ?cols (and ?c1 ?c2) ?r1)" => "(filter ?cols ?c1 (filter ?cols ?c2 ?r1))"), rw!("filter-true"; "(filter ?cols true ?r1)" => "(?r1)"), rw!("filter-combine-with-scan-1"; "(filter ?cols (= ?filterkey ?v) (scan ?cols ?table))" => "(indexeqscan ?cols ?table ?filterkey ?v) if expr_is_col_ref(var("?filterkey")) if expr_is_const(var("?v"))), // is_const is too strict. rw!("push-filter-though-join"; "(filter ?cols ?filtercond (join ?cols ?joincond ?left ?right))" => { WithRelopColsApplier { relop_var: var("?left"), relop_cols_var: var("?relopcols"), pattern: "(join ?cols ?joincond (filter ?relopcols ?filtercond ?left) ?right)".parse().unwrap(), }} if expr_bound_by_cols(var("?filtercond"), var("?left"))), It has been a real pleasure doing this with egg, and the analysis stuff has worked out very well. I may still hit a wall - there could easily be something I am not considering (this is my first SQL optimizer), and I presently running all my optimizations to saturation - I will have do do something more sophisticated and SQL specific at some point to keep giving good results without boiling the seas while exploring all possible join orders. - Regards David Ziegler

…

On Sun, Aug 21, 2022 at 4:19 PM Matthew Pope ***@***.***> wrote: From #195 <#195> : Could you try a cons-list representation I have seen the light: rewrite!( "IN_LIST / SELECT_LIST flatten"; "(ON (= (IN_LIST ?in-list-element ?in-list-rest) (SELECT_LIST ?select-list-element ?select-list-rest)))" => "(ON (AND (= ?select-list-element ?in-list-element) (= ?select-list-rest ?in-list-rest)))" ), — Reply to this email directly, view it on GitHub <#189 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJ4EXW6XWOY46IXVPN3ZYLV2KFLLANCNFSM5ZG4CG7Q> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

2 replies

GavinRay97 Aug 22, 2022
Author

This is really cool, thank you for sharing

GeorgeOctavian Jan 23, 2023

Hi @djziegler - very cool project! Would you mind sharing the operator cost function used during extraction? I am assuming you have also implemented logical to physical rewrites? Many thanks

djziegler · 2022-08-22T16:18:22Z

djziegler
Aug 22, 2022

Hello Matthew: As you know, SQL is crazy complicated. Your input language to egg seems pretty close to the SQL source, so a lot of this complexity is exposed to the egg optimization rules. I have in the past attempted to write a SQL frontend, and while I could get a toy SQL frontend working pretty easily, I quickly realized that to make something that was spec compliant was a *HUGE* job, way more than I could reasonably do as a solo developer. Even something as seemingly simple as variable scoping is crazy complicated. So, I am using ZetaSQL to do the SQL analysis. It transforms a SQL query into a typed, resolved, simplified AST. In my world there is a small number of relational algebra operators + normal expressions. So, for example, by the time I see a join, there is no 'in' clause - I get a join type, a left scan relnode, a right relnode and a join expr (which would consist of a bunch of and's in your in case). This is what the whole AST looks like: https://github.com/google/zetasql/blob/master/docs/resolved_ast.md Thus, I am not really doing a SQL optimizer - I am doing an optimizer for the much simplified, normalized, cleaned up, language that ZetaSQL tranforms SQL into. (ZetaSQL is about 350K lines of code just to do this!) So, instead of the varargs IN clause, that you represent as a 'cons', I have a join expr, that is a tree of 'and' exprs - achieving a similar effect. My egg equality rules then rewrite these ands into all the possible orders (making checking whether a join condition is top-level dependent on a particular variable much easier). Anyway, the optimization you propose seems to be easy to do on top of the form I am using - but I will see if I am missing something when I do it for real! I am not opposed to using 'cons' cells if I have to - it just hasn't come up yet. I am indeed a bit scared that I will discover that I am missing something critical as I try to pass all the compliance tests (and do all the important optimizations), but I see no way but to plow forward and hope for the best. (Once my project is a bit further along, I will release it under Apache, if you are interested) - Regards David Ziegler

…

On Mon, Aug 22, 2022 at 11:15 AM Matthew Pope ***@***.***> wrote: I think that makes sense generally, but part of planning/optimizing can include logical rewrite ontop of physical rewrites. Here is a more complete example to show what I mean. An IN clause with a subquery like: SELECT iFROM my_tblWHERE i IN (SELECT y FROM your_tbl); Can often be re-written to a LEFT OUTER JOIN to more easily optimize the scan. SELECT my_tbl.iFROM your_tblLEFT OUTER JOIN my_tbl i = y; So if there were two columns in the IN clause like: SELECT i FROM my_tblWHERE (i, j) IN (SELECT y, z FROM your_tbl); Then the expected join would have two equality predicates SELECT my_tbl.i, my_tbl.jFROM your_tblLEFT OUTER JOIN my_tbl i = y AND j = z; I think this is where things get tricky, as the output_cols of the subquery now have to be matched upon the columns present in the IN clause, to reduce to a boolean + equality expr. In this case, variable length can be an issue. In a previous answer it was stated that consing should be considered. I think this makes sense generally, consider the following expression and simplified rule (SELECT (CONS i (CONS j nil)) (FROM test1) (WHERE (IN (CONS i (CONS j nil)) (SELECT (CONS a (CONS b nil)) (FROM test2))))) rewrite!( "IN to LOJ"; "(SELECT ?select-list (FROM ?from-outer) (WHERE (IN ?in-list (SELECT ?select-list-inner (FROM ?from-inner)))))" => "(LEFT_OUTER_JOIN ?from-outer (ON (= ?in-list ?select-list-inner)) <- CONS / CONS equality (SELECT ?select-list (FROM ?from-inner)))" ), This can reduce a subquery to a LOJ. The next step would be cons / cons optimization to compound equality, which is difficult in my eyes, see this issue here: #199 <#199> which clears some of my questions up and gives paths forward. If a single (select_columns c1 c2 c3) element was used here then (as I understand it) no single rule would be able to transform it into a compound equality in the form of (AND (= a b) (AND (= c d) (=e f))) — Reply to this email directly, view it on GitHub <#189 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJ4EXWVMDVWD3JSJ5ZG4GDV2OKRDANCNFSM5ZG4CG7Q> . You are receiving this because you commented.Message ID: ***@***.***>

1 reply

mwillsey Aug 22, 2022
Maintainer

This looks like a very exciting project! I'm eager to hear the details whenever you get to a point where you can share them :)

remysucre · 2022-08-22T16:50:38Z

remysucre
Aug 22, 2022

Agree with @djziegler - directly rewriting SQL is hopeless; the way to go is to pick a simple intermediate representation and rewrite over that. The de facto IR for SQL is relational algebra, however there are countless flavors and extensions of relational algebra. ZetaSQL looks reasonable; I've also looked at cockroachDB's IR, and they already built a Cascade optimizer for it. For Rust, there's Data Fusion which seems to provide interfaces for physical plans. Self-advertising a bit, I'm a strong proponent of the Datalog-style abstraction, and have written a simple optimizer using it (See Table 2, Fig. 1 and Fig. 2).

3 replies

remysucre Aug 23, 2022

SQL represented as a lisp-y relational algebra

That's exactly what I'm getting at - you need to somehow convert SQL to some version of relational algebra. Orca's transformation engine also works over their version of relational algebra, defined as operators.

GavinRay97 Aug 23, 2022
Author

While we're mentioning tools in this realm, I can recommend Apache Calcite, which is an all-dialect SQL parser + optimizer + planner:

https://github.com/apache/calcite

You can use it just to parse a query to AST, (optionally optimize/simplify it), and then dump it as JSON if you like 👍

remysucre Aug 23, 2022

Calcite is great! It'd be fantastic to replace Calcite's optimization passes with equality saturation.

djziegler · 2023-01-24T13:15:36Z

djziegler
Jan 24, 2023

Hello George: Yes I have implemented logical to physical rewrites - but I am focussed on getting queries running end-to-end before I work on a quality cost function (I am using AstSize for now). If you are implementing one, please let me know how it goes! - Best Regards David

…

On Mon, Jan 23, 2023 at 10:49 AM GeorgeOctavian ***@***.***> wrote: Hi @djziegler <https://github.com/djziegler> - very cool project! Would you mind sharing the operator cost function used during extraction? I am assuming you have also implemented logical to physical rewrites? Many thanks — Reply to this email directly, view it on GitHub <#189 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJ4EXR3QQMSMNIPYJKBOH3WT2SAJANCNFSM5ZG4CG7Q> . You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

GeorgeOctavian Jan 24, 2023

Hi David,

I am working on a very similar project. We can connect via email: [email protected]

Thanks,
George

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Would e-graphs be useful for query optimization + rewrite-rules in query engines? #189

{{title}}

Replies: 6 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Would e-graphs be useful for query optimization + rewrite-rules in query engines? #189

Replies: 6 comments · 7 replies

mwillsey Jun 29, 2022 Maintainer

GavinRay97 Aug 22, 2022 Author

mwillsey Aug 22, 2022 Maintainer

GavinRay97 Aug 23, 2022 Author

Replies: 6 comments 7 replies

mwillsey
Jun 29, 2022
Maintainer

GavinRay97 Aug 22, 2022
Author

mwillsey Aug 22, 2022
Maintainer

GavinRay97 Aug 23, 2022
Author