RFCS: SQL query planning #19135

petermattis · 2017-10-09T18:04:07Z

I put metaphorical pen-to-paper this weekend and sketched out the
high-level modules for a SQL optimizer. This overlaps with Raphael's SQL
changes document (#18977), but has a more singular focus on SQL
optimization. I consider the documents complementary.

cockroach-teamcity · 2017-10-09T18:04:19Z

This change is

petermattis · 2017-10-09T18:06:13Z

Cc @albler

RaduBerinde · 2017-10-09T18:52:13Z

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 129 at r1 (raw file):

beneficial. This is the phase where correlated subqueries are
decorrelated and predicate push down occurs. As an example of the
later, consider the query:

[nit] latter

docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Memo is a data structure for maintaining a forest of query
plans. Conceptually, the memo is composed of a set of equivalency
groups where each group is a logical node in the query plan. Each

One important question here is if nodes that are equivalent in terms of results but have different physical properties (e.g. ordering) are "equivalent" nodes. If yes, this caveat of "equivalent" needs to be cleared up. This also introduces some difficulties during the search: if nodes were truly equivalent, we could always use the lowest-cost node in each MEMO group (e.g. when applying transformations). But if we can exploit different orderings we may need to go through multiple nodes per MEMO group.

docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Memo is a data structure for maintaining a forest of query
plans. Conceptually, the memo is composed of a set of equivalency
groups where each group is a logical node in the query plan. Each

I think this description overloads "logical node" with a couple of different meanings (or needs to be clarified). Perhaps "each group corresponds to a logical node in the query plan" and "each group is represented by ..".

It would also help if we had a clearer definition of what a physical node is and how it differs from a logical node.

Comments from Reviewable

petermattis · 2017-10-09T19:03:47Z

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 129 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] latter

Done.

docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Previously, RaduBerinde wrote…

One important question here is if nodes that are equivalent in terms of results but have different physical properties (e.g. ordering) are "equivalent" nodes. If yes, this caveat of "equivalent" needs to be cleared up. This also introduces some difficulties during the search: if nodes were truly equivalent, we could always use the lowest-cost node in each MEMO group (e.g. when applying transformations). But if we can exploit different orderings we may need to go through multiple nodes per MEMO group.

I believe the equivalency groups are based on the logical properties. For example, the same equivalency group will hold scan A as well as the various possible table and index scans, despite those scans providing different physical properties. I'm mildly fuzzy on how this is utilized. My fuzzy understanding is that nodes in the memo sometimes point to equivalency groups and sometimes point directly to other nodes.

The Memo structure deserves its own RFC and, prior to that, more experimentation.
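To make the question about "equivalent" nodes concrete, here is a minimal sketch of one possible reading: a group holds expressions that are logically equivalent, and the cheapest member is chosen per required ordering rather than once per group. This is not the RFC's design; `MemoGroup`, `PhysicalExpr`, the index name `idx_v`, and the cost numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PhysicalExpr:
    name: str                       # hypothetical label, e.g. "scan kv@primary"
    cost: float                     # made-up cost units
    ordering: Tuple[str, ...] = ()  # columns the output is sorted on


@dataclass
class MemoGroup:
    """All expressions here return the same rows (same logical properties)."""
    exprs: List[PhysicalExpr] = field(default_factory=list)

    def best(self, required: Tuple[str, ...] = ()) -> Optional[PhysicalExpr]:
        # Keep only expressions whose ordering starts with the required columns.
        ok = [e for e in self.exprs if e.ordering[:len(required)] == required]
        return min(ok, key=lambda e: e.cost, default=None)


group = MemoGroup([
    PhysicalExpr("scan kv@primary", cost=100.0, ordering=("k",)),
    PhysicalExpr("scan kv@idx_v", cost=120.0, ordering=("v",)),
])
cheapest = group.best()             # no ordering required
sorted_on_v = group.best(("v",))    # parent needs rows ordered by v
```

With these assumed costs, the primary scan wins when no ordering is required, but the more expensive index scan is the only candidate when the parent needs rows ordered by `v`. That is why a single lowest-cost node per group may not be enough.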

docs/RFCS/20171008_sql_optimizer.md, line 155 at r1 (raw file):

Previously, RaduBerinde wrote…

I think this description overloads "logical node" with a couple of different meanings (or needs to be clarified). Perhaps "each group corresponds to a logical node in the query plan" and "each group is represented by ..".

It would also help if we had a clearer definition of what a physical node is and how it differs from a logical node.

Adjusted the language here. I agree I was overloading "logical node".

knz · 2017-10-09T19:50:07Z

I understand this document! Yay 🎆

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

# Motivation

SQL optimization is concerned with transforming a SQL query into a

Technically, transforming a SQL query plan into a better plan.

docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

maintains a list of filters and projections applied at the node.

Variable numbering involves assigning every base attribute and

Define "attribute".

docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

representation allows fast determination of compatibility between
expression nodes and is utilized during rewrites and transformations
to determine the legality of such operations.

Example needed here. Please clarify the bitmap story on SELECT v, k FROM kv UNION ALL SELECT k, v FROM kv - v and k have the same index on both sides but a vector representation would cause them to be misaligned. How do you solve this?
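For readers unfamiliar with the bitmap idea in the quoted RFC text, the following is a minimal sketch of column sets stored as integer bitmaps, with a subset test used to decide whether a filter may be applied at a node. The column numbering (k=0, v=1 for `kv`; 2 and 3 for a second, made-up table `ab`) and the helper names are assumptions for illustration. The sketch does not answer the UNION ALL question raised here.

```python
def col_set(*indexes):
    """Build a bitmap with one bit set per column index."""
    bits = 0
    for i in indexes:
        bits |= 1 << i
    return bits


def is_subset(needed, available):
    """True when every column in `needed` is also in `available`."""
    return needed & ~available == 0


kv_out = col_set(0, 1)        # scan kv produces columns 0 (k) and 1 (v)
ab_out = col_set(2, 3)        # scan ab produces columns 2 and 3 (hypothetical)
filter_needs = col_set(0)     # the predicate k > 1 reads only column 0

can_push_to_kv = is_subset(filter_needs, kv_out)
can_push_to_ab = is_subset(filter_needs, ab_out)
```

Here the filter can be evaluated directly above the `kv` scan but not above the `ab` scan, and the check costs a couple of bitwise operations.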

docs/RFCS/20171008_sql_optimizer.md, line 157 at r2 (raw file):

groups where each group is a node in the query plan. Each equivalency
group contains one or more equivalent logical or physical nodes. For
example, one equivalancy group might contain both "a JOIN b" and "b

nit: equivalency

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

number of query plans considered. But it may be useful to use it
during prep in order to avoid a conversion from the query trees used
by prep to the memo structure.

You should be talking about pruning in this section or below in search.
I miss an overarching story over how these things interact, where e.g. the reader would learn that the memo starts empty, and that Search is driving the optimization, by populating the memo using Rewrites and Pruning it at the same time using Cost.

petermattis · 2017-10-09T20:36:00Z

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, knz (kena) wrote…

Technically, transforming a SQL query plan into a better plan.

Well, it does output a physical query plan while the input might not be directly executable (i.e. not a plan at all).

docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

Previously, knz (kena) wrote…

Define "attribute".

I've removed usage of the term attribute. It is used in literature I've been reading, but we use the term column or variable.

docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, knz (kena) wrote…

Example needed here. Please clarify the bitmap story on SELECT v, k FROM kv UNION ALL SELECT k, v FROM kv - v and k have the same index on both sides but a vector representation would cause them to be misaligned. How do you solve this?

That's a good question. My thinking is muddled. I need to work through a couple of examples. For the purposes of this RFC I'm going to wave my hands wildly and note that a full RFC on only this topic is merited.

docs/RFCS/20171008_sql_optimizer.md, line 157 at r2 (raw file):

Previously, knz (kena) wrote…

nit: equivalency

Done.

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

You should be talking about pruning in this section or below in search.
I miss an overarching story over how these things interact, where e.g. the reader would learn that the memo starts empty, and that Search is driving the optimization, by populating the memo using Rewrites and Pruning it at the same time using Cost.

Roger. Added a paragraph in Search. Note that Rewrite is a separate phase that occurs before Search.

knz · 2017-10-09T20:48:30Z

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.

docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Well, it does output a physical query plan while the input might not be directly executable (i.e. not a plan at all).

This is a nit really, but the very notion of optimization implies that the optimization logic as a whole can be disabled and the rest still be functional.
Presented like you did, the optimization is not really an optimization.

I think if you want to make a fuller picture you can rename the entire RFC as "SQL Query planning" and then outline that the iteration of rewrite and search constitutes what we can call "optimization".

docs/RFCS/20171008_sql_optimizer.md, line 115 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I've removed usage of the term attribute. It is used in literature I've been reading, but we use the term column or variable.

Ack.

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Roger. Added a paragraph in Search. Note that Rewrite is a separate phase that occurs before Search.

I think you skipped a beat in the music. Search and Rewrite are coroutines. You can't pre-populate all the alternatives in Rewrite upfront, there are simply too many (hundreds even with just the few rewrite rules we know of already, more realistically thousands as you mentioned already in writing). Instead Search guides the generation of alternatives, each generated by the application of Rewrite, by avoiding the use of rewrite rules in some cases, and discarding previously rewritten alternatives in other cases. Rewrite does not precede Search, it is subjugated to it.

petermattis · 2017-10-10T00:33:58Z

Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 15 at r2 (raw file):

Previously, knz (kena) wrote…

This is a nit really, but the very notion of optimization implies that the optimization logic as a whole can be disabled and the rest still be functional.
Presented like you did, the optimization is not really an optimization.

I think if you want to make a fuller picture you can rename the entire RFC as "SQL Query planning" and then outline that the iteration of rewrite and search constitutes what we can call "optimization".

Ok, I've adjusted per this suggestion.

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

I think you skipped a beat in the music. Search and Rewrite are coroutines. You can't pre-populate all the alternatives in Rewrite upfront, there are simply too many (hundreds even with just the few rewrite rules we know of already, more realistically thousands as you mentioned already in writing). Instead Search guides the generation of alternatives, each generated by the application of Rewrite, by avoiding the use of rewrite rules in some cases, and discarding previously rewritten alternatives in other cases. Rewrite does not precede Search, it is subjugated to it.

No beat skipped, we're at odds on terminology. My understanding is that between Prep and Search, there is a second phase named Rewrite where unconditional transformations are performed. These unconditional transformations are not costed or explored, but always applied as they are always beneficial. De-correlation and predicate push-down are the two transformations I'm aware of that fall into this category. I need to go back and look at the papers to see if there are other transformations to include here.

Search iteratively applies transforms, costs the resulting plans, and prunes (or ignores) high-cost plans. So, in my usage of the terminology (which is trying to match our recent learnings), Rewrite is independent of Search, though both phases apply transforms.

To reiterate, my understanding of the distinction between Rewrite and Search is that Rewrite doesn't bother to keep the alternatives around, or even to cost them, because the transformations it applies always produce better plans. It is an open question as to whether Rewrite should operate on top of Memo. It certainly doesn't require it.
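The distinction can be sketched as follows, under heavy simplification. Plans are nested tuples, `rewrite` applies predicate push-down unconditionally and keeps no alternatives, and `search` enumerates alternatives (only join commutation here) and keeps the cheapest. The table sizes, the 10% selectivity, and the cost formula are made-up assumptions, not anything from the RFC.

```python
ROWS = {"a": 1000, "b": 10}   # hypothetical table cardinalities


def rows(plan):
    op = plan[0]
    if op == "scan":
        return ROWS[plan[1]]
    if op == "select":
        return rows(plan[2]) * 0.1            # assumed 10% selectivity
    if op == "join":
        return max(rows(plan[1]), rows(plan[2]))
    raise ValueError(op)


def cost(plan):
    op = plan[0]
    if op == "scan":
        return rows(plan)
    if op == "select":
        return cost(plan[2]) + rows(plan[2])
    if op == "join":
        # Assumption: the right input is hashed at twice the per-row cost.
        return cost(plan[1]) + cost(plan[2]) + rows(plan[1]) + 2 * rows(plan[2])
    raise ValueError(op)


def rewrite(plan):
    """Cost-agnostic: push a select on one table below a join, always."""
    if plan[0] == "select" and plan[2][0] == "join":
        table, (_, left, right) = plan[1], plan[2]
        if left == ("scan", table):
            return ("join", ("select", table, left), right)
        if right == ("scan", table):
            return ("join", left, ("select", table, right))
    return plan


def search(plan):
    """Cost-based: generate alternatives and keep the cheapest one."""
    alternatives = [plan]
    if plan[0] == "join":
        alternatives.append(("join", plan[2], plan[1]))
    return min(alternatives, key=cost)


query = ("select", "a", ("join", ("scan", "b"), ("scan", "a")))
rewritten = rewrite(query)
best = search(rewritten)
```

In this toy setup `rewrite` never consults `cost`, while `search` only keeps the commuted join because the assumed cost model says it is cheaper.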

knz · 2017-10-10T09:11:57Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

No beat skipped, we're at odds on terminology. My understanding is that between Prep and Search, there is a second phase named Rewrite where unconditional transformations are performed. These unconditional transformations are not costed or explored, but always applied as they are always beneficial. De-correlation and predicate push-down are the two transformations I'm aware of that fall into this category. I need to go back and look at the papers to see if there are other transformations to include here.

Search iteratively applies transforms, costs the resulting plans, and prunes (or ignores) high-cost plans. So, in my usage of the terminology (which is trying to match our recent learnings), Rewrite is independent of Search, though both phases apply transforms.

To reiterate, my understanding of the distinction between Rewrite and Search is that Rewrite doesn't bother to keep the alternatives around, or even to cost them, because the transformations it applies always produce better plans. It is an open question as to whether Rewrite should operate on top of Memo. It certainly doesn't require it.

Okay maybe the fact that both rewrite and search use rewrite rules, which we'll call "transforms", should be outlined in the text.

petermattis · 2017-10-10T12:35:43Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.

docs/RFCS/20171008_sql_optimizer.md, line 240 at r2 (raw file):

Previously, knz (kena) wrote…

Okay maybe the fact that both rewrite and search use rewrite rules, which we'll call "transforms", should be outlined in the text.

Ok. Reworded the first paragraph of the Rewrite section to make this clear. Small clarification: the transform rules used by Rewrite are not the same as those used by Search. I think there will be some overlap, but most of the transforms used by Search will not be used by Rewrite.

petermattis · 2017-10-10T14:09:56Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That's a good question. My thinking is muddled. I need to work through a couple of examples. For the purposes of this RFC I'm going to wave my hands wildly and note that a full RFC on only this topic is merited.

My thinking is still muddled, but slightly clearer. We want to model the expression nodes using the relational algebra operators. Each node defines a relation where a relation is a set of attribute names (i.e. column names). In your UNION ALL example, this would look like:

unionAll [v, k]      -> in=[0,1] out=[0,1]
  project [v, k]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]
  project [k, v]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]

The part I'm still muddled about is what happens if we perform a selection on the union:

select [k > 1]         -> in=[0,1] out=[0,1]
  unionAll [v, k]      -> in=[0,1] out=[0,1]
    project [v, k]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]
    project [k, v]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]

Now it looks like we can push the selection through the union. I think what is missing here is a rename operator (union is interesting because the columns it outputs are named by the first relation, not the second):

select [k > 1]           -> in=[0,1] out=[0,1]
  unionAll [v, k]        -> in=[0,1] out=[0,1]
    project [v, k]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
    rename [k->v, v->k]  -> in=[0,1] out=[2,3]
      project [k, v]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Now if we want to push the selection down through the union, we have to substitute k = v when we push it through the rename.

unionAll [v, k]            -> in=[0,1] out=[0,1]
  project [v, k]           -> in=[0,1] out=[0,1]
    select [k > 1]         -> in=[0,1] out=[0,1]
      scan [kv (k, v)]     ->          out=[0,1]
    rename [k->v, v->k]    -> in=[0,1] out=[2,3]
      project [k, v]       -> in=[0,1] out=[0,1]
        select [k > 1]     -> in=[0,1] out=[0,1]
          scan [kv (k, v)] ->          out=[0,1]

Once again, I'm going to wave my hands wildly. I see the general outline of how this would work, but the devil is in the details and those are still obscure. Getting those details right will require a full RFC and lots of experimentation. Let's move this discussion to a better forum (e.g. https://github.com/petermattis/opttoy).

PS Apologies for falling back on the relation/attribute terminology which might be confusing.
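As a rough illustration of the substitution step, the sketch below pushes a single-column predicate below a rename by mapping the column name back through the rename. It assumes a rename is just a mapping from input name to output name and a predicate is a `(column, operator, constant)` triple; both representations and the function name are hypothetical.

```python
def push_through_rename(pred, rename):
    """Re-express `pred` (over the rename's output names) over its input names.

    `rename` maps input column name -> output column name.
    """
    inverse = {out: inp for inp, out in rename.items()}
    col, op, const = pred
    # Columns the rename does not touch keep their name.
    return (inverse.get(col, col), op, const)


rename = {"k": "v", "v": "k"}          # rename [k->v, v->k]
above = ("k", ">", 1)                  # select [k > 1] above the rename
below = push_through_rename(above, rename)
```

Under these assumptions the predicate below the rename refers to `v`, which matches the positional view of the query: the union's second output column comes from `v` in the second branch.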

knz · 2017-10-10T14:15:29Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

docs/RFCS/20171008_sql_optimizer.md, line 121 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

My thinking is still muddled, but slightly clearer. We want to model the expression nodes using the relational algebra operators. Each node defines a relation where a relation is a set of attribute names (i.e. column names). In your UNION ALL example, this would look like:

unionAll [v, k]      -> in=[0,1] out=[0,1]
  project [v, k]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]
  project [k, v]     -> in=[0,1] out=[0,1]
    scan [kv (k, v)] ->          out=[0,1]

The part I'm still muddled about is what happens if we perform a selection on the union:

select [k > 1]         -> in=[0,1] out=[0,1]
  unionAll [v, k]      -> in=[0,1] out=[0,1]
    project [v, k]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]
    project [k, v]     -> in=[0,1] out=[0,1]
      scan [kv (k, v)] ->          out=[0,1]

Now it looks like we can push the selection through the union. I think what is missing here is a rename operator (union is interesting because the columns it outputs are named by the first relation, not the second):

select [k > 1]           -> in=[0,1] out=[0,1]
  unionAll [v, k]        -> in=[0,1] out=[0,1]
    project [v, k]       -> in=[0,1] out=[0,1]
      scan [kv (k, v)]   ->          out=[0,1]
    rename [k->v, v->k]  -> in=[0,1] out=[2,3]
      project [k, v]     -> in=[0,1] out=[0,1]
        scan [kv (k, v)] ->          out=[0,1]

Now if we want to push the selection down through the union, we have to substitute k = v when we push it through the rename.

unionAll [v, k]            -> in=[0,1] out=[0,1]
  project [v, k]           -> in=[0,1] out=[0,1]
    select [k > 1]         -> in=[0,1] out=[0,1]
      scan [kv (k, v)]     ->          out=[0,1]
    rename [k->v, v->k]    -> in=[0,1] out=[2,3]
      project [k, v]       -> in=[0,1] out=[0,1]
        select [k > 1]     -> in=[0,1] out=[0,1]
          scan [kv (k, v)] ->          out=[0,1]

Once again, I'm going to wave my hands wildly. I see the general outline of how this would work, but the devil is in the details and those are still obscure. Getting those details right will require a full RFC and lots of experimentation. Let's move this discussion to a better forum (e.g. https://github.com/petermattis/opttoy).

PS Apologies for falling back on the relation/attribute terminology which might be confusing.

Solution here I think:
petermattis/opttoy#10 (comment)

a-robinson · 2017-10-10T18:19:21Z

This description matches my understanding from the sessions

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

knz · 2017-10-18T13:55:19Z

I added the glossary of terms as we discussed.

Also identified properties as a module, and created a dedicated section.

PTAL

tbg · 2017-10-18T16:17:39Z

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

petermattis · 2017-10-18T16:59:23Z

The additions look great.

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

docs/RFCS/20171008_sql_optimizer.md, line 67 at r5 (raw file):

       v
   .---------. - done every EXECUTE to capture placeholder values / timestamps
   | Rewrite | - includes always-good simplifications, eg. predicate push-down

How do you feel about the term "cost-agnostic transformations"? This allows us to distinguish them from "cost-based transformations".

docs/RFCS/20171008_sql_optimizer.md, line 110 at r5 (raw file):

- [**cardinality**](#stats)
- [**condition** in transformations](#search)
- [**decorrelating**](#rewrite)

Perhaps "a.k.a. unnesting"

docs/RFCS/20171008_sql_optimizer.md, line 133 at r5 (raw file):

- [**top-down** and **bottom-up** search strategies](#search)
- [**transformation** of expressions](#rewrite)
- [**unnesting**](#rewrite)

Perhaps "a.k.a. decorrelating".

docs/RFCS/20171008_sql_optimizer.md, line 181 at r5 (raw file):

The Prep phase also starts computing *logical properties*, such as the
input and output variables of each (sub-)expression and its functional

Perhaps s/its/various/g, otherwise this gives me the impression that we'll be tracking all functional dependencies and that functional dependencies are a single thing.

docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

variables that are necessary and sufficient to compute the
expression's result. This will be later used to derive more properties
(e.g. ordering) by using the edges of the functional dependency graph.

While correct that the functional dependencies form a graph, I haven't found that attribute to be useful so far. Have you?

docs/RFCS/20171008_sql_optimizer.md, line 220 at r5 (raw file):

Rewrite is the phase where correlated subqueries are *decorrelated*,
*unnesting* and *predicate push down* occurs,

I believe join elimination also occurs during Rewrite.

docs/RFCS/20171008_sql_optimizer.md, line 301 at r5 (raw file):

the [section below](#properties).

During Search, m-expressions might get enumerated in order to

I'm finding this sentence a bit awkward due to might and execute.

During search, m-expressions in the memo are walked over and progressively transformed creating new m-expressions in order to generate alternative plans.
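The suggested sentence can be pictured with a small sketch: walk the m-expressions in a toy memo and apply join commutativity, adding a new m-expression only when it is not already present. The memo layout (a dict from group id to a set of m-expressions whose children are group ids) and the name `MExpr` are assumptions made for this example.

```python
from collections import namedtuple

# An m-expression: an operator plus the ids of its child groups.
MExpr = namedtuple("MExpr", ["op", "children"])

memo = {
    0: {MExpr("scan a", ())},
    1: {MExpr("scan b", ())},
    2: {MExpr("join", (0, 1))},
}


def commute_joins(memo):
    """One exploration pass; returns how many m-expressions were added."""
    added = 0
    for group in memo.values():
        for m in list(group):                 # copy: the set grows as we go
            if m.op == "join":
                alt = MExpr("join", m.children[::-1])
                if alt not in group:          # the memo de-duplicates
                    group.add(alt)
                    added += 1
    return added


first_pass = commute_joins(memo)
second_pass = commute_joins(memo)
```

The first pass adds `b JOIN a` to the join group; the second pass finds nothing new, which is the fixpoint an exploration loop would stop at.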

docs/RFCS/20171008_sql_optimizer.md, line 526 at r5 (raw file):

basic logical transformation is join order enumeration (e.g. `a JOIN
b` -> `b JOIN a`).
The transformations that enumerate alternate plans that are *algebraically

Did you intend for there to be a blank line before this line? I've noticed a few instances of odd spacing and line wrapping in the additions.

knz · 2017-10-19T12:48:48Z

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.

docs/RFCS/20171008_sql_optimizer.md, line 67 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

How do you feel about the term "cost-agnostic transformations"? This allows us to distinguish them from "cost-based transformations".

👍 - updated

docs/RFCS/20171008_sql_optimizer.md, line 110 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps "a.k.a. unnesting"

Done.

docs/RFCS/20171008_sql_optimizer.md, line 133 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps "a.k.a. decorrelating".

Done.

docs/RFCS/20171008_sql_optimizer.md, line 181 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps s/its/various/g, otherwise this gives me the impression that we'll be tracking all functional dependencies and that functional dependencies are a single thing.

Done.

docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

While correct that the functional dependencies form a graph, I haven't found that attribute to be useful so far. Have you?

The fact it is a graph is not used directly in the code; however it is a graph, where the vertices are the variables and the edges the "dependency info" that the code does compute. So in memory you end up having graph vertices and edges. If it quacks like a duck...

I think it is useful for the human that the prosaic explanation points to the graph and says "look this is really what's happening here".

docs/RFCS/20171008_sql_optimizer.md, line 220 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe join elimination also occurs during Rewrite.

Done.

docs/RFCS/20171008_sql_optimizer.md, line 301 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm finding this sentence a bit awkward due to might and execute.

During search, m-expressions in the memo are walked over and progressively transformed creating new m-expressions in order to generate alternative plans.

Done.

docs/RFCS/20171008_sql_optimizer.md, line 526 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you intend for there to be a blank line before this line? I've noticed a few instances of odd spacing and line wrapping in the additions.

In my patch I tried as much as possible to not re-justify paragraphs so that the line diff would be minimal.
Now I see that it doesn't really matter for reviewable, so I'll reflow.

petermattis · 2017-10-19T13:48:59Z

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions.

docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, knz (kena) wrote…

The fact it is a graph is not used directly in the code; however it is a graph, where the vertices are the variables and the edges the "dependency info" that the code does compute. So in memory you end up having graph vertices and edges. If it quacks like a duck...

I think it is useful for the human that the prosaic explanation points to the graph and says "look this is really what's happening here".

My point is that mentioning that the functional dependencies are a graph provides no benefit to me. Perhaps for some readers, but I'd rather call out what the functional dependencies are. Also, the first sentence implies that the input variables are the functional dependencies, but there are other dependencies that we'll be maintaining. For example, we'll likely be tracking the "keys" for each expression by propagating the nullability of input variables and "candidate keys" from input expressions.

Concretely, here is my suggestion:

The functional dependencies for an expression are constraints between two sets of columns. Specific examples of functional dependencies are the projections, where 1 or more input variables determine an output variable, and "keys" which are a set of columns where no two rows output by the expression are equal after projection on to that set (e.g. a unique index for a table where all of the columns are NOT NULL). Conceptually, the functional dependencies form a graph, though they are not represented as such in code.
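One standard way to work with such dependencies is the attribute-closure computation from relational theory, sketched below. It treats each dependency as a pair of column sets and calls a column set a key when its closure covers every output column. The example columns (`k`, `v`, and a projected column `s`) are made up, and the key test assumes the expression outputs no duplicate rows.

```python
def closure(cols, fds):
    """All columns determined by `cols` under `fds`.

    `fds` is a list of (determinant, dependent) pairs of column sets.
    """
    result = set(cols)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_key(cols, all_cols, fds):
    """True when `cols` determines every column of the expression."""
    return closure(cols, fds) >= set(all_cols)


all_cols = {"k", "v", "s"}
fds = [
    ({"k"}, {"v"}),         # k is a unique, NOT NULL index on kv
    ({"k", "v"}, {"s"}),    # s is a projection computed from k and v
]
k_is_key = is_key({"k"}, all_cols, fds)
v_is_key = is_key({"v"}, all_cols, fds)
```

Following the dependencies edge by edge in `closure` is the graph walk mentioned in this thread, even though no explicit graph structure is built.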

knz · 2017-10-19T16:03:38Z

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

docs/RFCS/20171008_sql_optimizer.md, line 187 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

My point is that mentioning that the functional dependencies are a graph provides no benefit to me. Perhaps for some readers, but I'd rather call out what the functional dependencies are. Also, the first sentence implies that the input variables are the functional dependencies, but there are other dependencies that we'll be maintaining. For example, we'll likely be tracking the "keys" for each expression by propagating the nullability of input variables and "candidate keys" from input expressions.

Concretely, here is my suggestion:

The functional dependencies for an expression are constraints between two sets of columns. Specific examples of functional dependencies are the projections, where 1 or more input variables determine an output variable, and "keys" which are a set of columns where no two rows output by the expression are equal after projection on to that set (e.g. a unique index for a table where all of the columns are NOT NULL). Conceptually, the functional dependencies form a graph, though they are not represented as such in code.

Done.

knz · 2017-10-20T16:05:50Z

I have collated the 3 documents in this PR as discussed.

petermattis · 2017-12-05T23:48:30Z

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions.

docs/RFCS/sql_query_planning.md, line 777 at r16 (raw file):

Previously, knz (kena) wrote…

ACK, then you can copy-paste that answer in the text too.

Done.

docs/RFCS/sql_query_planning.md, line 1031 at r19 (raw file):

Previously, knz (kena) wrote…

OFFSET

CREATE TABLE ... AS... (an INSERT in disguise)

subqueries in scalar context:

(select ... order by ...) = (select ... order by ...)

(a, b, c, d) = (select x order by ...)

array((select ... order by ...))

window functions (maybe?)

Regarding other physical props to capture: index "hints" (really: constraints), ORDER BY INDEX / ORDER BY PRIMARY KEY (a CockroachDB extension), join hints (when we have them)

I'm not sure some of these are problematic. For example, comparing two subqueries in scalar context presumably requires that the subqueries return a single row. Regardless, we don't need to be exhaustive here in my opinion. I've added some more text here.

docs/RFCS/sql_query_planning.md, line 1024 at r20 (raw file):

Previously, knz (kena) wrote…

None of the window function semantics are covered by the proposed approach yet.

Done.

petermattis · 2017-12-06T01:46:05Z

@knz The latest commit removes the sql_plan_properties.md and data_structures_for_logical_planning.md documents. I think I included the relevant bits from those documents. Let me know if I missed anything that should be retained.

Review status: 0 of 1 files reviewed at latest revision, 20 unresolved discussions, all commit checks successful.

rytaft · 2017-12-06T21:26:10Z

Thanks for documenting all of this!

Reviewed 1 of 2 files at r11, 1 of 3 files at r21.
Review status: all files reviewed at latest revision, 22 unresolved discussions, all commit checks successful.

docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

the expression tree will use to process all results and modelling how
data flows through the expression tree. [Table statistics](#stats) are
used to power cardinality estimates of base relations which in term

Maybe update this section to include our new understanding about passing histograms up the query plan?

docs/RFCS/sql_query_planning.md, line 921 at r21 (raw file):

indexed by the root operator of their pattern. Transformations are
further categorized as exploration and implementation and divided
amongst the search stages best on generality and expected benefit.

best on generality -> based on generality

petermattis · 2017-12-06T22:27:43Z

Review status: 0 of 1 files reviewed at latest revision, 22 unresolved discussions.

docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

Previously, rytaft wrote…

Maybe update this section to include our new understanding about passing histograms up the query plan?

Good idea. I've added a sentence about propagating histograms up through the intermediate nodes. More detail than that is beyond my knowledge. Let me know if you have something additional to add.


rytaft · 2017-12-06T23:38:56Z

Reviewed 1 of 2 files at r11, 1 of 1 files at r22.
Review status: all files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.

docs/RFCS/sql_query_planning.md, line 802 at r21 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Good idea. I've added a sentence about propagating histograms up through the intermediate nodes. More detail than that is beyond my knowledge. Let me know if you have something additional to add.

LGTM


This change brings in a subset of https://github.com/petermattis/opttoy/tree/master/v3

This change introduces:
- the expr tree: cascades-style optimizers operate on expression trees which can represent both scalar and relational expressions; this is a departure from the way we represent expressions and statements (sem/tree) so we need a new tree structure.
- scalar operators: initially, we focus only on scalar expressions.
- building an expr tree from a sem/tree.TypedExpr.
- opt version of logic tests

See the RFC in cockroachdb#19135 for more context on the optimizer.

This is the first step of an initial project related to the optimizer: generating index constraints from scalar expressions. This will be a rewrite of the current index constraint generation code (which has many problems, see cockroachdb#6346). Roughly, the existing `makeIndexConstraints` will call into the optimizer with a `TypedExpr` and the optimizer will return index constraints.

Release note: None
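
To make the "expr tree" idea in the commit message above more concrete, here is a minimal Go sketch of a single node type that can hold both scalar and relational operators. It is an illustration only: the names `operator`, `expr`, `isScalar`, and the operator constants are hypothetical and are not taken from the actual `opt` package or from opttoy, and the tree is built by hand rather than from a `sem/tree.TypedExpr`.

```go
package main

import (
	"fmt"
	"strings"
)

// operator identifies the kind of an expression node. These names are
// hypothetical and do not mirror the real optimizer code.
type operator int

const (
	variableOp operator = iota // scalar: column reference
	constOp                    // scalar: literal value
	eqOp                       // scalar: equality comparison
	scanOp                     // relational: table scan
	selectOp                   // relational: filter over an input
)

var opNames = map[operator]string{
	variableOp: "variable", constOp: "const", eqOp: "eq",
	scanOp: "scan", selectOp: "select",
}

// expr is one node type shared by scalar and relational operators.
type expr struct {
	op       operator
	children []*expr
	private  interface{} // operator-specific payload, e.g. a column name
}

// isScalar reports whether the node is a scalar operator. It relies on
// the scalar constants being declared before the relational ones.
func (e *expr) isScalar() bool { return e.op < scanOp }

// String renders the tree with one node per line, indented by depth.
func (e *expr) String() string {
	var sb strings.Builder
	e.format(&sb, 0)
	return sb.String()
}

func (e *expr) format(sb *strings.Builder, depth int) {
	sb.WriteString(strings.Repeat("  ", depth))
	sb.WriteString(opNames[e.op])
	if e.private != nil {
		fmt.Fprintf(sb, " (%v)", e.private)
	}
	sb.WriteByte('\n')
	for _, c := range e.children {
		c.format(sb, depth+1)
	}
}

func main() {
	// Hand-built tree for: SELECT * FROM t WHERE a = 1
	filter := &expr{op: eqOp, children: []*expr{
		{op: variableOp, private: "a"},
		{op: constOp, private: 1},
	}}
	root := &expr{op: selectOp, children: []*expr{
		{op: scanOp, private: "t"},
		filter,
	}}
	fmt.Print(root)
}
```

Using one node type for both kinds of operator means a relational node such as the filter can hold its scalar predicate as an ordinary child, so the same traversal code walks the whole tree.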

High-level modules of next generation SQL query planning including a full-featured optimizer.

petermattis requested a review from a team as a code owner October 9, 2017 18:04

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 0ee7a09 to 3c1dee4 Compare October 9, 2017 18:04

petermattis requested review from knz and RaduBerinde October 9, 2017 18:04

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 3c1dee4 to 66343e3 Compare October 9, 2017 19:03

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 66343e3 to 661477b Compare October 9, 2017 20:35

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 661477b to 41bc813 Compare October 10, 2017 00:41

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 41bc813 to afbfaf0 Compare October 10, 2017 12:32

knz mentioned this pull request Oct 18, 2017

sql: WINDOW expressions and GROUP BY interact messily with the selectNode constructor #12482

Closed

knz mentioned this pull request Oct 19, 2017

rfcs: complementary RFC to inventory the plan properties #19366

Closed

petermattis force-pushed the pmattis/sql-optimizer-outline branch from 46e1afb to 15e024d Compare December 6, 2017 00:18

RaduBerinde mentioned this pull request Dec 7, 2017

opt: introduce optimizer expression trees and build them from TypedExprs #20557

Merged

petermattis changed the title ~~RFCS: SQL optimizer outline~~ RFCS: SQL query planning Dec 11, 2017

petermattis force-pushed the pmattis/sql-optimizer-outline branch from fcb79fc to c29d5ec Compare December 13, 2017 12:19

RFCS: SQL query planning

684d868

High-level modules of next generation SQL query planning including a full-featured optimizer.

petermattis force-pushed the pmattis/sql-optimizer-outline branch from c29d5ec to 684d868 Compare December 13, 2017 12:19

petermattis merged commit 01a8865 into cockroachdb:master Dec 13, 2017

petermattis deleted the pmattis/sql-optimizer-outline branch December 13, 2017 14:46
