RFCS: SQL query planning #19135

Merged

Conversation

petermattis
Collaborator

I put metaphorical pen-to-paper this weekend and sketched out the
high-level modules for a SQL optimizer. This overlaps with Raphael's SQL
changes document (#18977), but has a more singular focus on SQL
optimization. I consider the documents complementary.

@petermattis petermattis requested a review from a team as a code owner October 9, 2017 18:04
@cockroach-teamcity
Member

This change is Reviewable

@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 0ee7a09 to 3c1dee4 Compare October 9, 2017 18:04
@petermattis petermattis requested review from knz and RaduBerinde October 9, 2017 18:04
@petermattis
Collaborator Author

Cc @albler

@RaduBerinde
Member

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 129 at r1 (raw file):

beneficial. This is the phase where correlated subqueries are
decorrelated and predicate push down occurs. As an example of the
later, consider the query:

[nit] latter


docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Memo is a data structure for maintaining a forest of query
plans. Conceptually, the memo is composed of a set of equivalency
groups where each group is a logical node in the query plan. Each

One important question here is if nodes that are equivalent in terms of results but have different physical properties (e.g. ordering) are "equivalent" nodes. If yes, this caveat of "equivalent" needs to be cleared up. This also introduces some difficulties during the search: if nodes were truly equivalent, we could always use the lowest-cost node in each MEMO group (e.g. when applying transformations). But if we can exploit different orderings we may need to go through multiple nodes per MEMO group.


docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Memo is a data structure for maintaining a forest of query
plans. Conceptually, the memo is composed of a set of equivalency
groups where each group is a logical node in the query plan. Each

I think this description overloads "logical node" with a couple of different meanings (or needs to be clarified). Perhaps "each group corresponds to a logical node in the query plan" and "each group is represented by ..".

It would also help if we had a clearer definition of what a physical node is and how it differs from a logical node.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 3c1dee4 to 66343e3 Compare October 9, 2017 19:03
@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 129 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] latter

Done.


docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Previously, RaduBerinde wrote…

One important question here is if nodes that are equivalent in terms of results but have different physical properties (e.g. ordering) are "equivalent" nodes. If yes, this caveat of "equivalent" needs to be cleared up. This also introduces some difficulties during the search: if nodes were truly equivalent, we could always use the lowest-cost node in each MEMO group (e.g. when applying transformations). But if we can exploit different orderings we may need to go through multiple nodes per MEMO group.

I believe the equivalency groups are based on the logical properties. For example, the same equivalency group will hold scan A as well as the various possible table and index scans, despite those scans providing different physical properties. I'm mildly fuzzy on how this is utilized. My fuzzy understanding is that nodes in the memo sometimes point to equivalency groups and sometimes point directly to other nodes.

The Memo structure deserves its own RFC and, prior to that, more experimentation.
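
To make the question concrete, here is a minimal sketch, in Python for brevity. Everything in it (`MemoExpr`, `Group`, `best`, the index names, and the costs) is hypothetical and not taken from the RFC. It assumes a group holds logically equivalent expressions and that each expression records the ordering it provides as a physical property:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MemoExpr:
    op: str                # hypothetical operator label
    cost: float            # made-up cost, for illustration only
    ordering: tuple = ()   # columns the output is sorted by (a physical property)


@dataclass
class Group:
    """One equivalency group: expressions that produce the same logical result."""
    exprs: list = field(default_factory=list)

    def best(self, required=()):
        """Cheapest expression whose ordering starts with `required`, or None."""
        required = tuple(required)
        ok = [e for e in self.exprs if e.ordering[:len(required)] == required]
        return min(ok, key=lambda e: e.cost, default=None)


# "scan a" as one group holding a primary scan and a secondary index scan.
scan_a = Group([
    MemoExpr("table-scan a@primary", cost=100.0, ordering=("k",)),
    MemoExpr("index-scan a@v_idx", cost=140.0, ordering=("v",)),
])
```

With these made-up numbers, `scan_a.best()` picks the primary scan, while `scan_a.best(("v",))` has to fall back to the costlier index scan. That is why a single lowest-cost expression per group is not enough once orderings can be exploited.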


docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Previously, RaduBerinde wrote…

I think this description overloads "logical node" with a couple of different meanings (or needs to be clarified). Perhaps "each group corresponds to a logical node in the query plan" and "each group is represented by ..".

It would also help if we had a clearer definition of what a physical node is and how it differs from a logical node.

Adjusted the language here. I agree I was overloading "logical node".


Comments from Reviewable

@knz
Contributor

knz commented Oct 9, 2017

I understand this document! Yay 🎆


Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

# Motivation

SQL optimization is concerned with transforming a SQL query into a

Technically, transforming a SQL query plan into a better plan.


docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

maintains a list of filters and projections applied at the node.

Variable numbering involves assigning every base attribute and

Define "attribute".


docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

representation allows fast determination of compatibility between
expression nodes and is utilized during rewrites and transformations
to determine the legality of such operations.

Example needed here. Please clarify the bitmap story on SELECT v, k FROM kv UNION ALL SELECT k, v FROM kv - v and k have the same index on both sides but a vector representation would cause them to be misaligned. How do you solve this?


docs/RFCS/20171008_sql_optimizer.md, line 157 at r2 (raw file):

groups where each group is a node in the query plan. Each equivalency
group contains one or more equivalent logical or physical nodes. For
example, one equivalancy group might contain both "a JOIN b" and "b

nit: equivalency


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

number of query plans considered. But it may be useful to use it
during prep in order to avoid a conversion from the query trees used
by prep to the memo structure.

You should be talking about pruning in this section or below in search.
I miss an overarching story over how these things interact, where e.g. the reader would learn that the memo starts empty, and that Search is driving the optimization, by populating the memo using Rewrites and Pruning it at the same time using Cost.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 66343e3 to 661477b Compare October 9, 2017 20:35
@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, knz (kena) wrote…

Technically, transforming a SQL query plan into a better plan.

Well, it does output a physical query plan while the input might not be directly executable (i.e. not a plan at all).


docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

Previously, knz (kena) wrote…

Define "attribute".

I've removed usage of the term attribute. It is used in literature I've been reading, but we use the term column or variable.


docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, knz (kena) wrote…

Example needed here. Please clarify the bitmap story on SELECT v, k FROM kv UNION ALL SELECT k, v FROM kv - v and k have the same index on both sides but a vector representation would cause them to be misaligned. How do you solve this?

That's a good question. My thinking is muddled. I need to work through a couple of examples. For the purposes of this RFC I'm going to wave my hands wildly and note that a full RFC on only this topic is merited.


docs/RFCS/20171008_sql_optimizer.md, line 157 at r2 (raw file):

Previously, knz (kena) wrote…

nit: equivalency

Done.


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

You should be talking about pruning in this section or below in search.
I miss an overarching story over how these things interact, where e.g. the reader would learn that the memo starts empty, and that Search is driving the optimization, by populating the memo using Rewrites and Pruning it at the same time using Cost.

Roger. Added a paragraph in Search. Note that Rewrite is a separate phase that occurs before Search.


Comments from Reviewable

@knz
Contributor

knz commented Oct 9, 2017

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Well, it does output a physical query plan while the input might not be directly executable (i.e. not a plan at all).

This is a nit really, but the very notion of optimization implies that the optimization logic as a whole can be disabled and the rest still be functional.
Presented like you did, the optimization is not really an optimization.

I think if you want to make a fuller picture you can rename the entire RFC as "SQL Query planning" and then outline that the iteration of rewrite and search constitutes what we can call "optimization".


docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I've removed usage of the term attribute. It is used in literature I've been reading, but we use the term column or variable.

Ack.


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Roger. Added a paragraph in Search. Note that Rewrite is a separate phase that occurs before Search.

I think you skipped a beat in the music. Search and Rewrite are coroutines. You can't pre-populate all the alternatives in Rewrite upfront, there are simply too many (hundreds even with just the few rewrite rules we know of already, more realistically thousands as you mentioned already in writing). Instead Search guides the generation of alternatives, each generated by the application of Rewrite, by avoiding the use of rewrite rules in some cases, and discarding previously rewritten alternatives in other cases. Rewrite does not precede Search, it is subjugated to it.


Comments from Reviewable

@petermattis
Collaborator Author

Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, knz (kena) wrote…

This is a nit really, but the very notion of optimization implies that the optimization logic as a whole can be disabled and the rest still be functional.
Presented like you did, the optimization is not really an optimization.

I think if you want to make a fuller picture you can rename the entire RFC as "SQL Query planning" and then outline that the iteration of rewrite and search constitutes what we can call "optimization".

Ok, I've adjusted per this suggestion.


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

I think you skipped a beat in the music. Search and Rewrite are coroutines. You can't pre-populate all the alternatives in Rewrite upfront, there are simply too many (hundreds even with just the few rewrite rules we know of already, more realistically thousands as you mentioned already in writing). Instead Search guides the generation of alternatives, each generated by the application of Rewrite, by avoiding the use of rewrite rules in some cases, and discarding previously rewritten alternatives in other cases. Rewrite does not precede Search, it is subjugated to it.

No beat skipped, we're at odds on terminology. My understanding is that between Prep and Search, there is a second phase named Rewrite where unconditional transformations are performed. These unconditional transformations are not costed or explored, but always applied as they are always beneficial. De-correlation and predicate push-down are the two transformations I'm aware of that fall into this category. I need to go back and look at the papers to see if there are other transformations to include here.

Search iteratively applies transforms, costs the resulting plans, and prunes (or ignores) high-cost plans. So, in my usage of the terminology (which is trying to match our recent learnings), Rewrite is independent of Search, though both phases apply transforms.

To reiterate, my understanding of the distinction between Rewrite and Search is that Rewrite doesn't bother to keep the alternatives around, or even to cost them, because the transformations it applies always produce better plans. It is an open question as to whether Rewrite should operate on top of Memo. It certainly doesn't require it.
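
A toy sketch of that distinction, in Python for brevity. The tuple-based expression format, the row counts, the 10% selectivity, and the cost formula are all assumptions made up for illustration; none of them come from the RFC. `rewrite` applies predicate push-down unconditionally and keeps no alternatives, while `search` generates alternatives (here only join commutativity), costs them, and keeps the cheapest:

```python
ROWS = {"a": 1000, "b": 10}  # made-up table sizes


def rows(e):
    """Estimated row count of an expression (toy model)."""
    if e[0] == "scan":
        return ROWS[e[1]]
    if e[0] == "select":
        return rows(e[2]) // 10          # assumed 10% selectivity
    return rows(e[1]) * rows(e[2])       # join: cross-product estimate


def tables(e):
    """Set of base tables referenced below an expression."""
    if e[0] == "scan":
        return {e[1]}
    if e[0] == "select":
        return tables(e[2])
    return tables(e[1]) | tables(e[2])


def rewrite(e):
    """Cost-agnostic phase: always push a select below a join."""
    if e[0] == "scan":
        return e
    if e[0] == "join":
        return ("join", rewrite(e[1]), rewrite(e[2]))
    pred, child = e[1], rewrite(e[2])    # pred is (table, text)
    if child[0] == "join":
        _, left, right = child
        if pred[0] in tables(left):
            return ("join", rewrite(("select", pred, left)), right)
        return ("join", left, rewrite(("select", pred, right)))
    return ("select", pred, child)


def cost(e):
    """Toy cost: a join is assumed to build on its right input at double cost."""
    if e[0] == "scan":
        return rows(e)
    if e[0] == "select":
        return cost(e[2]) + rows(e[2])
    return cost(e[1]) + cost(e[2]) + rows(e[1]) + 2 * rows(e[2])


def search(e):
    """Cost-based phase: enumerate join orders and keep the cheapest."""
    if e[0] == "scan":
        return e
    if e[0] == "select":
        return ("select", e[1], search(e[2]))
    left, right = search(e[1]), search(e[2])
    return min([("join", left, right), ("join", right, left)], key=cost)


query = ("select", ("a", "a.x > 1"), ("join", ("scan", "b"), ("scan", "a")))
plan = search(rewrite(query))
```

In this example `rewrite` moves the filter onto the scan of `a` without consulting `cost`, and `search` then flips the join so that the smaller input `b` ends up on the assumed build side.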


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 661477b to 41bc813 Compare October 10, 2017 00:41
@knz
Contributor

knz commented Oct 10, 2017

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

No beat skipped, we're at odds on terminology. My understanding is that between Prep and Search, there is a second phase named Rewrite where unconditional transformations are performed. These unconditional transformations are not costed or explored, but always applied as they are always beneficial. De-correlation and predicate push-down are the two transformations I'm aware of that fall into this category. I need to go back and look at the papers to see if there are other transformations to include here.

Search iteratively applies transforms, costs the resulting plans, and prunes (or ignores) high-cost plans. So, in my usage of the terminology (which is trying to match our recent learnings), Rewrite is independent of Search, though both phases apply transforms.

To reiterate, my understanding of the distinction between Rewrite and Search is that Rewrite doesn't bother to keep the alternatives around, or even to cost them, because the transformations it applies always produce better plans. It is an open question as to whether Rewrite should operate on top of Memo. It certainly doesn't require it.

Okay maybe the fact that both rewrite and search use rewrite rules, which we'll call "transforms", should be outlined in the text.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 41bc813 to afbfaf0 Compare October 10, 2017 12:32
@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

Okay maybe the fact that both rewrite and search use rewrite rules, which we'll call "transforms", should be outlined in the text.

Ok. Reworded the first paragraph of the Rewrite section to make this clear. Small clarification: the transform rules used by Rewrite are not the same as those used by Search. I think there will be some overlap, but most of the transforms used by Search will not be used by Rewrite.


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That's a good question. My thinking is muddled. I need to work through a couple of examples. For the purposes of this RFC I'm going to wave my hands wildly and note that a full RFC on only this topic is merited.

My thinking is still muddled, but slightly clearer. We want to model the expression nodes using the relational algebra operators. Each node defines a relation where a relation is a set of attribute names (i.e. column names). In your UNION ALL example, this would look like:

unionAll [v, k]      -> in=[0,1] out=[0,1]
  project [v, k]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]
  project [k, v]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]

The part I'm still muddled about is what happens if we perform a selection on the union:

select [k > 1]         -> in=[0,1] out=[0,1]
  unionAll [v, k]      -> in=[0,1] out=[0,1]
    project [v, k]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]
    project [k, v]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]

Now it looks like we can push the selection through the union. I think what is missing here is a rename operator (union is interesting because the columns it outputs are named by the first relation, not the second):

select [k > 1]           -> in=[0,1] out=[0,1]
  unionAll [v, k]        -> in=[0,1] out=[0,1]
    project [v, k]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
    rename [k->v, v->k]  -> in=[0,1] out=[2,3]
      project [k, v]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Now if we want to push the selection down through the union, we have to substitute v for k when we push it through the rename:

unionAll [v, k]          -> in=[0,1] out=[0,1]
  project [v, k]         -> in=[0,1] out=[0,1]
    select [k > 1]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
  rename [k->v, v->k]    -> in=[0,1] out=[2,3]
    project [k, v]       -> in=[0,1] out=[0,1]
      select [v > 1]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Once again, I'm going to wave my hands wildly. I see the general outline of how this would work, but the devil is in the details and those are still obscure. Getting those details right will require a full RFC and lots of experimentation. Let's move this discussion to a better forum (e.g. https://github.com/petermattis/opttoy).

PS Apologies for falling back on the relation/attribute terminology which might be confusing.
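
The substitution step can be sketched in a few lines of Python. This is an illustration only: the function name and the list-of-column-names representation are hypothetical, and it assumes the filter references a single column and that the union names its output after the first input, as noted above:

```python
def push_filter_through_union(column, branches):
    """Return, per branch, the branch-local column a pushed-down filter must use.

    `branches` lists each input's projected column names in output order.
    The union binds inputs by position, so the filter column is resolved to a
    position in the first input and then read back from every input.
    """
    output_names = branches[0]
    pos = output_names.index(column)
    return [branch[pos] for branch in branches]


# SELECT v, k FROM kv UNION ALL SELECT k, v FROM kv
branches = [["v", "k"], ["k", "v"]]
pushed = push_filter_through_union("k", branches)
```

For a filter on `k` above the union, the first input keeps `k` while the second input must filter on `v`, which is the renaming the trees above are trying to capture.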


Comments from Reviewable

@knz
Contributor

knz commented Oct 10, 2017

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

My thinking is still muddled, but slightly clearer. We want to model the expression nodes using the relational algebra operators. Each node defines a relation where a relation is a set of attribute names (i.e. column names). In your UNION ALL example, this would look like:

unionAll [v, k]      -> in=[0,1] out=[0,1]
  project [v, k]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]
  project [k, v]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]

The part I'm still muddled about is what happens if we perform a selection on the union:

select [k > 1]         -> in=[0,1] out=[0,1]
  unionAll [v, k]      -> in=[0,1] out=[0,1]
    project [v, k]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]
    project [k, v]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]

Now it looks like we can push the selection through the union. I think what is missing here is a rename operator (union is interesting because the columns it outputs are named by the first relation, not the second):

select [k > 1]           -> in=[0,1] out=[0,1]
  unionAll [v, k]        -> in=[0,1] out=[0,1]
    project [v, k]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
    rename [k->v, v->k]  -> in=[0,1] out=[2,3]
      project [k, v]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Now if we want to push the selection down through the union, we have to substitute v for k when we push it through the rename:

unionAll [v, k]          -> in=[0,1] out=[0,1]
  project [v, k]         -> in=[0,1] out=[0,1]
    select [k > 1]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
  rename [k->v, v->k]    -> in=[0,1] out=[2,3]
    project [k, v]       -> in=[0,1] out=[0,1]
      select [v > 1]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Once again, I'm going to wave my hands wildly. I see the general outline of how this would work, but the devil is in the details and those are still obscure. Getting those details right will require a full RFC and lots of experimentation. Let's move this discussion to a better forum (e.g. https://github.com/petermattis/opttoy).

PS Apologies for falling back on the relation/attribute terminology which might be confusing.

Solution here I think:
petermattis/opttoy#10 (comment)


Comments from Reviewable

@a-robinson
Contributor

:lgtm: This description matches my understanding from the sessions


Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@knz
Contributor

knz commented Oct 18, 2017

I added the glossary of terms as we discussed.

Also identified properties as a module, and created a dedicated section.

PTAL

@tbg
Member

tbg commented Oct 18, 2017

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@petermattis
Collaborator Author

The additions look great.


Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


docs/RFCS/20171008_sql_optimizer.md, line 67 at r5 (raw file):

       v
   .---------. - done every EXECUTE to capture placeholder values / timestamps
   | Rewrite | - includes always-good simplifications, eg. predicate push-down

How do you feel about the term "cost-agnostic transformations"? This allows us to distinguish them from "cost-based transformations".


docs/RFCS/20171008_sql_optimizer.md, line 110 at r5 (raw file):

- [**cardinality**](#stats)
- [**condition** in transformations](#search)
- [**decorrelating**](#rewrite)

Perhaps "a.k.a. unnesting"


docs/RFCS/20171008_sql_optimizer.md, line 133 at r5 (raw file):

- [**top-down** and **bottom-up** search strategies](#search)
- [**transformation** of expressions](#rewrite)
- [**unnesting**](#rewrite)

Perhaps "a.k.a. decorrelating".


docs/RFCS/20171008_sql_optimizer.md, line 181 at r5 (raw file):

The Prep phase also starts computing *logical properties*, such as the
input and output variables of each (sub-)expression and its functional

Perhaps s/its/various/g, otherwise this gives me the impression that we'll be tracking all functional dependencies and that functional dependencies are a single thing.


docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

variables that are necessary and sufficient to compute the
expression's result. This will be later used to derive more properties
(e.g. ordering) by using the edges of the functional dependency graph.

While correct that the functional dependencies form a graph, I haven't found that attribute to be useful so far. Have you?


docs/RFCS/20171008_sql_optimizer.md, line 220 at r5 (raw file):

Rewrite is the phase where correlated subqueries are *decorrelated*,
*unnesting* and *predicate push down* occurs,

I believe join elimination also occurs during Rewrite.


docs/RFCS/20171008_sql_optimizer.md, line 301 at r5 (raw file):

the [section below](#properties).

During Search, m-expressions might get enumerated in order to

I'm finding this sentence a bit awkward due to might and execute.

During search, m-expressions in the memo are walked over and progressively transformed creating new m-expressions in order to generate alternative plans.


docs/RFCS/20171008_sql_optimizer.md, line 526 at r5 (raw file):

basic logical transformation is join order enumeration (e.g. `a JOIN
b` -> `b JOIN a`).
The transformations that enumerate alternate plans that are *algebraically

Did you intend for there to be a blank line before this line? I've noticed a few instances of odd spacing and line wrapping in the additions.


Comments from Reviewable

@knz
Contributor

knz commented Oct 19, 2017

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


docs/RFCS/20171008_sql_optimizer.md, line 67 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

How do you feel about the term "cost-agnostic transformations"? This allows us to distinguish them from "cost-based transformations".

👍 - updated


docs/RFCS/20171008_sql_optimizer.md, line 110 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps "a.k.a. unnesting"

Done.


docs/RFCS/20171008_sql_optimizer.md, line 133 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps "a.k.a. decorrelating".

Done.


docs/RFCS/20171008_sql_optimizer.md, line 181 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps s/its/various/g, otherwise this gives me the impression that we'll be tracking all functional dependencies and that functional dependencies are a single thing.

Done.


docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

While correct that the functional dependencies form a graph, I haven't found that attribute to be useful so far. Have you?

The fact it is a graph is not used directly in the code; however it is a graph, where the vertices are the variables and the edges the "dependency info" that the code does compute. So in memory you end up having graph vertices and edges. If it quacks like a duck...

I think it is useful for the human reader that the prose explanation points to the graph and says "look this is really what's happening here".


docs/RFCS/20171008_sql_optimizer.md, line 220 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe join elimination also occurs during Rewrite.

Done.


docs/RFCS/20171008_sql_optimizer.md, line 301 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm finding this sentence a bit awkward due to might and execute.

During search, m-expressions in the memo are walked over and progressively transformed creating new m-expressions in order to generate alternative plans.

Done.


docs/RFCS/20171008_sql_optimizer.md, line 526 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you intend for there to be a blank line before this line? I've noticed a few instances of odd spacing and line wrapping in the additions.

In my patch I tried as much as possible to not re-justify paragraphs so that the line diff would be minimal.
Now I see that it doesn't really matter for reviewable, so I'll reflow.


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions.


docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, knz (kena) wrote…

The fact it is a graph is not used directly in the code; however it is a graph, where the vertices are the variables and the edges the "dependency info" that the code does compute. So in memory you end up having graph vertices and edges. If it quacks like a duck...

I think it is useful for the human reader that the prose explanation points to the graph and says "look this is really what's happening here".

My point is that mentioning that the functional dependencies are a graph provides no benefit to me. Perhaps for some readers, but I'd rather call out what the functional dependencies are. Also, the first sentence implies that the input variables are the functional dependencies, but there are other dependencies that we'll be maintaining. For example, we'll likely be tracking the "keys" for each expression by propagating the null-ability of input variables and "candidate keys" from input expressions.

Concretely, here is my suggestion:

The functional dependencies for an expression are constraints between two sets of columns. Specific examples of functional dependencies are the projections, where 1 or more input variables determine an output variable, and "keys" which are a set of columns where no two rows output by the expression are equal after projection on to that set (e.g. a unique index for a table where all of the columns are NOT NULL). Conceptually, the functional dependencies form a graph, though they are not represented as such in code.
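
As a rough illustration of that definition, here is a small Python sketch. The helper names and the example dependencies are hypothetical, and the key test assumes set semantics (no duplicate rows). It treats each dependency as a pair of column sets and derives keys by computing the closure of a candidate set:

```python
def closure(cols, fds):
    """All columns determined by `cols` under the dependencies `fds`.

    Each dependency is a pair (lhs, rhs) of column tuples meaning lhs -> rhs.
    """
    result = set(cols)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_key(cols, all_cols, fds):
    """`cols` is a key if it determines every output column."""
    return closure(cols, fds) >= set(all_cols)


# Table kv(k PRIMARY KEY, v) with a projected column s = k + v:
# the unique index gives k -> v, and the projection gives (k, v) -> s.
fds = [(("k",), ("v",)), (("k", "v"), ("s",))]
all_cols = ("k", "v", "s")
```

Here `("k",)` is a key because its closure covers all three columns, while `("v",)` is not, since nothing is determined by `v` alone.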


Comments from Reviewable

@knz
Contributor

knz commented Oct 19, 2017

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

My point is that mentioning that the functional dependencies are a graph provides no benefit to me. Perhaps for some readers, but I'd rather call out what the functional dependencies are. Also, the first sentence implies that the input variables are the functional dependencies, but there are other dependencies that we'll be maintaining. For example, we'll likely be tracking the "keys" for each expression by propagating the null-ability of input variables and "candidate keys" from input expressions.

Concretely, here is my suggestion:

The functional dependencies for an expression are constraints between two sets of columns. Specific examples of functional dependencies are the projections, where 1 or more input variables determine an output variable, and "keys" which are a set of columns where no two rows output by the expression are equal after projection on to that set (e.g. a unique index for a table where all of the columns are NOT NULL). Conceptually, the functional dependencies form a graph, though they are not represented as such in code.

Done.


Comments from Reviewable

@knz
Contributor

knz commented Oct 20, 2017

I have collated the 3 documents in this PR as discussed.

@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions.


docs/RFCS/sql_query_planning.md, line 777 at r16 (raw file):

Previously, knz (kena) wrote…

ACK, then you can copy-paste that answer in the text too.

Done.


docs/RFCS/sql_query_planning.md, line 1031 at r19 (raw file):

Previously, knz (kena) wrote…
  • OFFSET
  • CREATE TABLE ... AS... (an INSERT in disguise)
  • subqueries in scalar context:
    • (select ... order by ...) = (select ... order by ...)
    • (a, b, c, d) = (select x order by ...)
    • array((select ... order by ...))
  • window functions (maybe?)

Regarding other physical props to capture: index "hints" (really: constraints), ORDER BY INDEX / ORDER BY PRIMARY KEY (a CockroachDB extension), join hints (when we have them)

I'm not sure some of these are problematic. For example, comparing two subqueries in scalar context presumably requires that the subqueries return a single row. Regardless, we don't need to be exhaustive here in my opinion. I've added some more text here.


docs/RFCS/sql_query_planning.md, line 1024 at r20 (raw file):

Previously, knz (kena) wrote…
  • None of the window function semantics are covered by the proposed approach yet.

Done.



@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from 46e1afb to 15e024d Compare December 6, 2017 00:18
@petermattis
Collaborator Author

@knz The latest commit removes the sql_plan_properties.md and data_structures_for_logical_planning.md documents. I think I included the relevant bits from those documents. Let me know if I missed anything that should be retained.


Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, all commit checks successful.



@rytaft
Collaborator

rytaft commented Dec 6, 2017

LGTM. Thanks for documenting all of this!


Reviewed 1 of 2 files at r11, 1 of 3 files at r21.
Review status: all files reviewed at latest revision, 22 unresolved discussions, all commit checks successful.


docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

the expression tree will use to process all results and modelling how
data flows through the expression tree. [Table statistics](#stats) are
used to power cardinality estimates of base relations which in term

Maybe update this section to include our new understanding about passing histograms up the query plan?


docs/RFCS/sql_query_planning.md, line 921 at r21 (raw file):

indexed by the root operator of their pattern. Transformations are
further categorized as exploration and implementation and divided
amongst the search stages best on generality and expected benefit.

best on generality -> based on generality
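
For readers following the quoted passage, a minimal Go sketch of a rule registry indexed by the root operator of each pattern, with rules tagged as exploration or implementation. The type names and the rule names (`CommuteJoin`, `HashJoin`, `MergeJoin`, `PushSelectIntoJoin`) are illustrative assumptions, not the RFC's actual design, and the division among search stages is left out.

```go
package main

import "fmt"

// Operator identifies the root operator of a transformation's pattern.
type Operator int

const (
	OpSelect Operator = iota
	OpInnerJoin
)

// Kind separates exploration rules from implementation rules.
type Kind int

const (
	Exploration Kind = iota
	Implementation
)

// Transformation is a placeholder for a rule; a real rule would also carry
// a pattern to match and a function that produces the new expression.
type Transformation struct {
	Name string
	Root Operator
	Kind Kind
}

// Registry indexes transformations by the root operator of their pattern.
type Registry struct {
	byRoot map[Operator][]Transformation
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byRoot: map[Operator][]Transformation{}}
}

// Register files a rule under the root operator of its pattern.
func (r *Registry) Register(t Transformation) {
	r.byRoot[t.Root] = append(r.byRoot[t.Root], t)
}

// Lookup returns the rules of the given kind whose pattern is rooted at op.
func (r *Registry) Lookup(op Operator, kind Kind) []Transformation {
	var out []Transformation
	for _, t := range r.byRoot[op] {
		if t.Kind == kind {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	r := NewRegistry()
	// Hypothetical rule names, used only for illustration.
	r.Register(Transformation{Name: "CommuteJoin", Root: OpInnerJoin, Kind: Exploration})
	r.Register(Transformation{Name: "HashJoin", Root: OpInnerJoin, Kind: Implementation})
	r.Register(Transformation{Name: "MergeJoin", Root: OpInnerJoin, Kind: Implementation})
	r.Register(Transformation{Name: "PushSelectIntoJoin", Root: OpSelect, Kind: Exploration})

	for _, t := range r.Lookup(OpInnerJoin, Implementation) {
		fmt.Println(t.Name)
	}
}
```

Indexing by root operator means the search only has to consider the rules that could possibly match the expression at hand, instead of testing every registered pattern.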



@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 22 unresolved discussions.


docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

Previously, rytaft wrote…

Maybe update this section to include our new understanding about passing histograms up the query plan?

Good idea. I've added a sentence about propagating histograms up through the intermediate nodes. More detail than that is beyond my knowledge. Let me know if you have something additional to add.
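
As a rough illustration of what propagating a histogram through an intermediate node could look like, here is a small Go sketch that applies a filter `col < v` to a histogram and returns the histogram of the filter's output. The bucket layout, the uniform-distribution assumption within a bucket, and the names `Bucket`, `Histogram` and `FilterLess` are assumptions made for this sketch, not details from the RFC.

```go
package main

import "fmt"

// Bucket counts the rows whose column value falls in [Lower, Upper).
type Bucket struct {
	Lower, Upper float64
	Rows         float64
}

// Histogram is an ordered list of non-overlapping buckets.
type Histogram []Bucket

// RowCount is the cardinality estimate implied by the histogram.
func (h Histogram) RowCount() float64 {
	total := 0.0
	for _, b := range h {
		total += b.Rows
	}
	return total
}

// FilterLess returns the histogram of the rows that satisfy col < v.
// Values are assumed to be uniformly distributed within each bucket, so a
// bucket that straddles v keeps a proportional share of its rows.
func (h Histogram) FilterLess(v float64) Histogram {
	var out Histogram
	for _, b := range h {
		switch {
		case v >= b.Upper:
			// The whole bucket passes the filter.
			out = append(out, b)
		case v > b.Lower:
			// Only the part of the bucket below v passes.
			frac := (v - b.Lower) / (b.Upper - b.Lower)
			out = append(out, Bucket{Lower: b.Lower, Upper: v, Rows: b.Rows * frac})
		}
	}
	return out
}

func main() {
	base := Histogram{
		{Lower: 0, Upper: 10, Rows: 100},
		{Lower: 10, Upper: 20, Rows: 50},
	}
	filtered := base.FilterLess(15)
	fmt.Println(base.RowCount(), filtered.RowCount())
}
```

Because the output is itself a histogram, a parent node (for example a join on the same column) can estimate its own cardinality from it instead of falling back to a fixed selectivity guess.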



@rytaft
Collaborator

rytaft commented Dec 6, 2017

Reviewed 1 of 2 files at r11, 1 of 1 files at r22.
Review status: all files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Good idea. I've added a sentence about propagating histograms up through the intermediate nodes. More detail than that is beyond my knowledge. Let me know if you have something additional to add.

LGTM



RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 7, 2017
This change brings in a subset of
https://github.com/petermattis/opttoy/tree/master/v3

This change introduces:

 - the expr tree: cascades-style optimizers operate on expression
   trees which can represent both scalar and relational expressions;
   this is a departure from the way we represent expressions and
   statements (sem/tree) so we need a new tree structure.

 - scalar operators: initially, we focus only on scalar expressions.

 - building an expr tree from a sem/tree.TypedExpr.

 - opt version of logic tests

See the RFC in cockroachdb#19135 for more context on the optimizer.

This is the first step of an initial project related to the optimizer:
generating index constraints from scalar expressions. This will be a
rewrite of the current index constraint generation code (which has
many problems, see cockroachdb#6346).  Roughly, the existing
`makeIndexConstraints` will call into the optimizer with a `TypedExpr`
and the optimizer will return index constraints.

Release note: None
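
The commit message above describes an expression tree that holds both scalar and relational operators, and index constraints derived from scalar expressions. The following Go sketch shows one way those two ideas might fit together. It is not the code from the commit or from opttoy: the operator set, the `Expr` and `Span` types, and the `Constrain` function are all hypothetical, and only `AND`, `var > const` and `var < const` over integer constants are handled.

```go
package main

import "fmt"

// Op covers relational and scalar operators in a single enumeration.
type Op int

const (
	OpScan     Op = iota // relational; Private holds the table name
	OpSelect             // relational; Children are [input, filter]
	OpVariable           // scalar; Private holds the column name
	OpConst              // scalar; Private holds an int value
	OpGT                 // scalar; Children are [left, right]
	OpLT                 // scalar; Children are [left, right]
	OpAnd                // scalar; Children are the conjuncts
)

// Expr is one node type for both scalar and relational expressions.
type Expr struct {
	Op       Op
	Children []*Expr
	Private  interface{}
}

// IsRelational reports whether the node produces rows rather than a value.
func (e *Expr) IsRelational() bool { return e.Op == OpScan || e.Op == OpSelect }

// Span is an open interval (Low, High) on a column; nil means unbounded.
type Span struct{ Low, High *int }

// Constrain narrows s using the conjuncts of filter that compare col with a
// constant. Unrecognized expressions leave s unchanged, which is safe as
// long as the original filter is still applied to the scanned rows.
func Constrain(filter *Expr, col string, s Span) Span {
	switch filter.Op {
	case OpAnd:
		for _, c := range filter.Children {
			s = Constrain(c, col, s)
		}
	case OpGT, OpLT:
		l, r := filter.Children[0], filter.Children[1]
		if l.Op != OpVariable || l.Private != col || r.Op != OpConst {
			return s
		}
		v := r.Private.(int)
		if filter.Op == OpGT && (s.Low == nil || v > *s.Low) {
			s.Low = &v
		}
		if filter.Op == OpLT && (s.High == nil || v < *s.High) {
			s.High = &v
		}
	}
	return s
}

func variable(name string) *Expr { return &Expr{Op: OpVariable, Private: name} }
func constant(v int) *Expr       { return &Expr{Op: OpConst, Private: v} }
func binary(op Op, l, r *Expr) *Expr {
	return &Expr{Op: op, Children: []*Expr{l, r}}
}

func main() {
	// SELECT * FROM t WHERE a > 1 AND a < 5
	filter := binary(OpAnd,
		binary(OpGT, variable("a"), constant(1)),
		binary(OpLT, variable("a"), constant(5)))
	sel := &Expr{Op: OpSelect, Children: []*Expr{{Op: OpScan, Private: "t"}, filter}}

	s := Constrain(filter, "a", Span{})
	fmt.Println(sel.IsRelational(), filter.IsRelational(), *s.Low, *s.High)
}
```

Using one node type for both kinds of expression lets the same traversal code walk from a relational operator into its filter, which is what makes constraint generation from a scalar expression a natural first use of the tree.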
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 8, 2017
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 9, 2017
@petermattis petermattis changed the title RFCS: SQL optimizer outline RFCS: SQL query planning Dec 11, 2017
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 11, 2017
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 11, 2017
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 12, 2017
@petermattis petermattis force-pushed the pmattis/sql-optimizer-outline branch from fcb79fc to c29d5ec Compare December 13, 2017 12:19
High-level modules of next generation SQL query planning including a
full-featured optimizer.
8 participants