Would e-graphs be useful for query optimization + rewrite-rules in query engines? #189
Replies: 6 comments 7 replies
-
Potentially... One thing a query planner needs, and that you'd have to find a way to integrate with the e-graph, is statistics about the various tables. For example, a classic query-planner optimization is join ordering: e-graphs can represent the space of possible join orders, but to pick the best one you need to know the sizes of the individual tables. That would have to be added somehow.
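For a sense of what "added somehow" could look like: egg lets extraction use a custom cost function, so one crude option is to thread catalog statistics through that. The sketch below is illustrative only (SymbolLang, a made-up "join" operator name, made-up row counts, and a deliberately naive product-of-inputs estimate):
use egg::*;
use std::collections::HashMap;

// Hypothetical catalog statistics: rows per table, keyed by table name.
struct StatsCost {
    row_counts: HashMap<String, f64>,
}

impl CostFunction<SymbolLang> for StatsCost {
    type Cost = f64;
    fn cost<C>(&mut self, enode: &SymbolLang, mut costs: C) -> f64
    where
        C: FnMut(Id) -> f64,
    {
        match enode.op.as_str() {
            // deliberately crude estimate: a join costs the product of its inputs
            "join" => enode.children.iter().map(|&c| costs(c)).product(),
            // a leaf naming a known table costs its row count;
            // everything else just adds up its children
            op => match self.row_counts.get(op) {
                Some(&rows) if enode.is_leaf() => rows,
                _ => 1.0 + enode.children.iter().map(|&c| costs(c)).sum::<f64>(),
            },
        }
    }
}

// usage, assuming an e-graph and root produced by equality saturation:
// let extractor = Extractor::new(&egraph, StatsCost { row_counts });
// let (best_cost, best_join_order) = extractor.find_best(root);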
-
E-graphs are actually quite similar to the "memo table" in the CASCADES framework for query optimization, which is used in many modern SQL engines. There are some differences: CASCADES is smarter/more specialized in how it does optimization, whereas equality saturation is more general, doing "breadth-first" exploration and then extraction. Still, both are data structures for representing many equivalent ways to write a program.
-
Hi Matthew:
I am also writing a SQL optimizer using egg - here is an example of what I
have ended up with so far:
Given this source query:
SELECT name, toy_name, weight FROM pet JOIN pet_toy USING (pet_id) JOIN toy
USING (toy_id) WHERE toy.weight+7 = 99 AND pet.width = 7
I use ZetaSQL (a super excellent open-source SQL analyzer from Google) to
transform it into a typed, resolved, normalized SQL AST. Then I transform the
AST into this egg expression:
(project
(outcols @2 @7 @8)
computedcols
(filter
(outcols @1 @2 @3 @4 @5 @6 @7 @8)
(and (= (+ @8 7) 99) (= @3 7))
(join
(outcols @1 @2 @3 @4 @5 @6 @7 @8)
(= @5 @6) // join expr
(join
(outcols @1 @2 @3 @4 @5)
(= @1 @4) // Join expr
(scan (outcols @1 @2 @3) pet) // Left input to join
(scan (outcols @4 @5) pet_toy)) // Right input to join
(scan (outcols @6 @7 @8) toy))))
There are no vararg things for anything requiring optimization (just for
informational things like 'outcols').
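In case the shape of the language is unclear, a declaration along the following lines is all egg needs. This is a simplified sketch, not my real definition - the arities are just read off the plans above:
use egg::*;

define_language! {
    enum RelPlan {
        "project" = Project([Id; 3]),          // (project outcols computedcols input)
        "filter" = Filter([Id; 3]),            // (filter outcols predicate input)
        "join" = Join([Id; 4]),                // (join outcols joinexpr left right)
        "hashjoin" = HashJoin([Id; 5]),        // (hashjoin outcols lkey rkey left right)
        "scan" = Scan([Id; 2]),                // (scan outcols table)
        "indexeqscan" = IndexEqScan([Id; 4]),  // (indexeqscan outcols table keycol value)
        "outcols" = OutCols(Box<[Id]>),        // variadic, informational only
        "and" = And([Id; 2]),
        "=" = Eq([Id; 2]),
        "+" = Add([Id; 2]),
        Num(i64),
        Symbol(Symbol),                        // column refs like @2, table names, etc.
    }
}

// the printed plans then parse straight into a RecExpr, e.g.:
// let plan: RecExpr<RelPlan> = "(scan (outcols @1 @2 @3) pet)".parse().unwrap();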
Here is the optimization output:
(project
(outcols @2 @7 @8)
computedcols
(hashjoin
(outcols @1 @2 @3 @4 @5 @6 @7 @8)
@5 // left key expr
@6 // right key expr
(indexeqscan (outcols @6 @7 @8) toy @8 92) // Index scan of table 'toy' where col @8 = 92
(hashjoin
(outcols @1 @2 @3 @4 @5)
@1 // left key expr
@4 // right key expr
(scan (outcols @4 @5) pet_toy)
(indexeqscan (outcols @1 @2 @3) pet @3 7))))
The @/outcols stuff works like this:
Leaf nodes can introduce cols (the @N ids), and these are accessible up the
tree as long as they are propagated up through an (outcols) decl on the node.
Introduced outcol ids are unique - for example, if there are two scans of
'pet', they will introduce different outcol ids.
Arranging things in this way is a critical trick for optimization
performance - it allows expressions to be moved between nodes without
having to rewrite the embedded variable references for the new context.
The 'outcols' thing (among other purposes) allows us to quickly check
which @cols are bound by each source node. (This outcols style is also
native to ZetaSQL.)
There are also @cols with a downward scope that are introduced for
parameterized subquery expressions.
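In egg terms, the natural home for this bookkeeping is an e-class analysis that records which @cols each e-class references. Here is a simplified sketch, modeled on the free-variable analysis in egg's lambda example and written over SymbolLang rather than my real language (which differs in the details):
use egg::*;
use std::collections::HashSet;

#[derive(Default)]
struct ColRefs;

impl Analysis<SymbolLang> for ColRefs {
    // every e-class carries the set of @col symbols it references
    type Data = HashSet<Symbol>;

    fn make(egraph: &EGraph<SymbolLang, Self>, enode: &SymbolLang) -> Self::Data {
        let mut refs: HashSet<Symbol> = enode
            .children
            .iter()
            .flat_map(|&c| egraph[c].data.iter().copied())
            .collect();
        // leaves whose name starts with '@' are column references
        if enode.is_leaf() && enode.op.as_str().starts_with('@') {
            refs.insert(enode.op);
        }
        refs
    }

    fn merge(&mut self, to: &mut Self::Data, from: Self::Data) -> DidMerge {
        // keep the union; reporting "both sides changed" is conservative but safe
        let grew = from.iter().any(|s| !to.contains(s));
        to.extend(from);
        DidMerge(grew, true)
    }
}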
Here are the filter transformation rules:
rw!("filter-combine-with-and";
"(filter ?cols ?c1 (filter ?cols ?c2 ?r1))" =>
"(filter ?cols (and ?c1 ?c2) ?r1)"),
rw!("filter-split-and";
"(filter ?cols (and ?c1 ?c2) ?r1)" =>
"(filter ?cols ?c1 (filter ?cols ?c2 ?r1))"),
rw!("filter-true";
"(filter ?cols true ?r1)" =>
"(?r1)"),
rw!("filter-combine-with-scan-1";
"(filter ?cols (= ?filterkey ?v) (scan ?cols ?table))" =>
"(indexeqscan ?cols ?table ?filterkey ?v)
if expr_is_col_ref(var("?filterkey"))
if expr_is_const(var("?v"))), // is_const is too strict.
rw!("push-filter-though-join";
"(filter ?cols ?filtercond
(join ?cols ?joincond ?left ?right))"
=>
{ WithRelopColsApplier {
relop_var: var("?left"),
relop_cols_var: var("?relopcols"),
pattern: "(join ?cols ?joincond
(filter ?relopcols ?filtercond ?left)
?right)".parse().unwrap(),
}}
if expr_bound_by_cols(var("?filtercond"), var("?left"))),
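The conditions above are just closures of the shape egg expects (Fn(&mut EGraph, Id, &Subst) -> bool). A simplified stand-in for expr_is_col_ref, reusing the ColRefs sketch from earlier, might look roughly like this - again, not my real helper:
use egg::*;

// Not my real helper - a simplified stand-in over SymbolLang. The matched
// e-class counts as a column ref if it contains a leaf whose name starts
// with '@'.
fn expr_is_col_ref(
    v: Var,
) -> impl Fn(&mut EGraph<SymbolLang, ColRefs>, Id, &Subst) -> bool {
    move |egraph, _root, subst| {
        egraph[subst[v]]
            .nodes
            .iter()
            .any(|n| n.is_leaf() && n.op.as_str().starts_with('@'))
    }
}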
It has been a real pleasure doing this with egg, and the analysis stuff has
worked out very well. I may still hit a wall - there could easily be
something I am not considering (this is my first SQL optimizer), and I am
presently running all my optimizations to saturation - I will have to do
something more sophisticated and SQL-specific at some point to keep giving
good results without boiling the seas while exploring all possible join
orders.
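The blunt instrument here is egg's Runner limits - an iteration cap, a node cap, and a time budget - which at least bound the search even before anything SQL-specific exists. Roughly (sketched over SymbolLang with no analysis, which is not my real setup):
use egg::*;

// whichever limit trips first - iterations, e-graph size, or wall-clock
// time - stops the search well short of full saturation
fn run_with_limits(
    start_plan: &RecExpr<SymbolLang>,
    rules: &[Rewrite<SymbolLang, ()>],
) -> Runner<SymbolLang, ()> {
    Runner::default()
        .with_expr(start_plan)
        .with_iter_limit(10)
        .with_node_limit(50_000)
        .with_time_limit(std::time::Duration::from_secs(5))
        .run(rules)
}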
- Regards
David Ziegler
On Sun, Aug 21, 2022 at 4:19 PM Matthew Pope wrote:
From #195:
Could you try a cons-list representation
I have seen the light:
rewrite!(
"IN_LIST / SELECT_LIST flatten";
"(ON (= (IN_LIST ?in-list-element ?in-list-rest) (SELECT_LIST ?select-list-element ?select-list-rest)))" =>
"(ON (AND (= ?select-list-element ?in-list-element) (= ?select-list-rest ?in-list-rest)))"
),
-
Hello Matthew:
As you know, SQL is crazy complicated. Your input language to egg seems
pretty close to the SQL source, so a lot of this complexity is exposed to
the egg optimization rules.
I have in the past attempted to write a SQL frontend, and while I could get
a toy SQL frontend working pretty easily, I quickly realized that to make
something that was spec compliant was a *HUGE* job, way more than I could
reasonably do as a solo developer. Even something as seemingly simple as
variable scoping is crazy complicated.
So, I am using ZetaSQL to do the SQL analysis. It transforms a SQL query
into a typed, resolved, simplified AST. In my world there is a small
number of relational algebra operators + normal expressions. So, for
example, by the time I see a join, there is no IN clause - I get a join
type, a left relnode, a right relnode, and a join expr (which in your IN
case would consist of a bunch of and's).
This is what the whole AST looks like:
https://github.com/google/zetasql/blob/master/docs/resolved_ast.md
Thus, I am not really doing a SQL optimizer - I am doing an optimizer for
the much simplified, normalized, cleaned-up language that ZetaSQL
transforms SQL into. (ZetaSQL is about 350K lines of code just to do this!)
So, instead of the varargs IN clause that you represent as a 'cons', I
have a join expr that is a tree of 'and' exprs - achieving a similar
effect. My egg equality rules then rewrite these ands into all the
possible orders, which makes it much easier to check whether a join
condition depends on a particular variable at the top level.
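Concretely, that means something along the lines of the usual commute/associate pair - shown from memory, so treat it as a sketch rather than my exact rules:
rw!("and-commute"; "(and ?a ?b)" => "(and ?b ?a)"),
rw!("and-assoc"; "(and ?a (and ?b ?c))" => "(and (and ?a ?b) ?c)"),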
Anyway, the optimization you propose seems to be easy to do on top of the
form I am using - but I will see if I am missing something when I do it for
real!
I am not opposed to using 'cons' cells if I have to - it just hasn't come
up yet. I am indeed a bit scared that I will discover that I am missing
something critical as I try to pass all the compliance tests (and do all
the important optimizations), but I see no way but to plow forward and hope
for the best. (Once my project is a bit further along, I will release it
under Apache, if you are interested)
- Regards
David Ziegler
On Mon, Aug 22, 2022 at 11:15 AM Matthew Pope wrote:
I think that makes sense generally, but part of planning/optimizing can
include logical rewrites on top of physical rewrites.
Here is a more complete example to show what I mean. An IN clause with a
subquery like:
SELECT i FROM my_tbl WHERE i IN (SELECT y FROM your_tbl);
Can often be rewritten to a LEFT OUTER JOIN to more easily optimize the
scan.
SELECT my_tbl.i FROM your_tbl LEFT OUTER JOIN my_tbl ON i = y;
So if there were two columns in the IN clause like:
SELECT i FROM my_tbl WHERE (i, j) IN (SELECT y, z FROM your_tbl);
Then the expected join would have two equality predicates:
SELECT my_tbl.i, my_tbl.j FROM your_tbl LEFT OUTER JOIN my_tbl ON i = y AND j = z;
I think this is where things get tricky, as the output_cols of the
subquery now have to be matched against the columns present in the IN
clause to reduce to a boolean + equality expr. In this case, variable
length can be an issue. In a previous answer it was stated that consing
should be considered. I think this makes sense generally; consider the
following expression and simplified rule:
(SELECT
(CONS i (CONS j nil))
(FROM test1)
(WHERE (IN (CONS i (CONS j nil))
(SELECT
(CONS a (CONS b nil))
(FROM test2)))))
rewrite!(
"IN to LOJ";
"(SELECT ?select-list (FROM ?from-outer) (WHERE (IN ?in-list (SELECT ?select-list-inner (FROM ?from-inner)))))" =>
"(LEFT_OUTER_JOIN ?from-outer (ON (= ?in-list ?select-list-inner)) <- CONS / CONS equality (SELECT ?select-list (FROM ?from-inner)))"
),
This can reduce a subquery to a LOJ. The next step would be cons / cons
optimization to compound equality, which is difficult in my eyes; see
#199, which clears some of my questions up and gives paths forward. If a
single (select_columns c1 c2 c3) element were used here then (as I
understand it) no single rule would be able to transform it into a
compound equality of the form
(AND (= a b) (AND (= c d) (= e f)))
-
Agree with @djziegler - directly rewriting SQL is hopeless; the way to go is to pick a simple intermediate representation and rewrite over that. The de facto IR for SQL is relational algebra, though there are countless flavors and extensions of it. ZetaSQL looks reasonable; I've also looked at CockroachDB's IR, and they have already built a Cascades optimizer for it. For Rust, there's DataFusion, which seems to provide interfaces for physical plans. Self-advertising a bit: I'm a strong proponent of the Datalog-style abstraction and have written a simple optimizer using it (see Table 2, Fig. 1, and Fig. 2).
-
Hello George:
Yes, I have implemented logical-to-physical rewrites - but I am focussed on
getting queries running end-to-end before I work on a quality cost function
(I am using AstSize for now).
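The extraction itself is just the stock egg call, roughly (sketched over SymbolLang here):
use egg::*;

// the AstSize stopgap: after the rules have run, take the smallest plan
// in the root e-class
fn best_plan(runner: &Runner<SymbolLang, ()>) -> RecExpr<SymbolLang> {
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (_cost, best) = extractor.find_best(runner.roots[0]);
    best
}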
If you are implementing one, please let me know how it goes!
- Best Regards
David
On Mon, Jan 23, 2023 at 10:49 AM GeorgeOctavian wrote:
Hi @djziegler - very cool project! Would
you mind sharing the operator cost function used during extraction? I am
assuming you have also implemented logical to physical rewrites? Many thanks
-
As the title implies, I'm curious whether this would be a good use case.
I don't know anything about e-graphs (or maths, for that matter) but naively it seems well-suited to the task.
I'm a bit more familiar with DB/query engine stuff, where I've never heard of this technique being used.
Curious why I've not run across it before.
Thank you =)