RFC [WIP] an intermediate representation for SQL #10055

Closed
wants to merge 1 commit into from


knz
Contributor

@knz knz commented Oct 18, 2016

Reaction to last week's meeting (with @irfansharif @arjunravinarayan @petermattis @RaduBerinde @andreimatei)

@eisenstatdavid I'd like you to help me shape this up before we submit it for discussion to a larger group. Some special questions for you:

  • choice of HLL, would something else read better to the uneducated reader?
  • how much more than this do we need to be able to define tree transformations in the HLL and have this translated to Go automatically?
  • do you recall the algorithm to check whether a pattern match is exhaustive, or a reference to such an algorithm?


@eisenstatdavid

  1. OCaml is fine.
  2. Not much more? Rather than define a toy language and write the transformations operationally though, maybe we should implement a term rewriting system and have the compiler write the traversals for us.
  3. Determining whether a pattern match is exhaustive is coNP-hard via the obvious reduction from CNF-SAT. Seems like people use this in practice, which also helps compiling the pattern matches efficiently: http://stackoverflow.com/q/7883023 .
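To make the reduction in item 3 concrete, here is a minimal sketch in Go. It assumes the scrutinee is a tuple of booleans and that patterns are flat (true, false, or wildcard per position); `Pattern`, `Exhaustive`, and `ClauseToPattern` are names invented for this illustration. Each CNF clause becomes the pattern that matches exactly the assignments falsifying it, so the match is exhaustive precisely when the formula is unsatisfiable. The checker is brute force, which is what the hardness result leads one to expect in the worst case.

```go
package main

import "fmt"

// Cell is one position of a pattern over a tuple of booleans.
type Cell int

const (
	Wild  Cell = iota // matches both true and false
	True              // matches only true
	False             // matches only false
)

// Pattern has one Cell per tuple position.
type Pattern []Cell

// matches reports whether p accepts the tuple v.
func (p Pattern) matches(v []bool) bool {
	for i, c := range p {
		if (c == True && !v[i]) || (c == False && v[i]) {
			return false
		}
	}
	return true
}

// Exhaustive reports whether every tuple of n booleans is matched by at
// least one pattern. When it is not, it also returns one unmatched tuple.
// It enumerates all 2^n tuples, so it is a demonstration only.
func Exhaustive(n int, pats []Pattern) (bool, []bool) {
	for bits := 0; bits < 1<<n; bits++ {
		v := make([]bool, n)
		for i := range v {
			v[i] = bits&(1<<i) != 0
		}
		covered := false
		for _, p := range pats {
			if p.matches(v) {
				covered = true
				break
			}
		}
		if !covered {
			return false, v
		}
	}
	return true, nil
}

// ClauseToPattern turns a CNF clause into the pattern that matches exactly
// the assignments falsifying it. A literal is +i for x_i and -i for NOT x_i
// (1-based). Assumes no clause contains both a variable and its negation.
func ClauseToPattern(n int, clause []int) Pattern {
	p := make(Pattern, n) // zero value is Wild
	for _, lit := range clause {
		if lit > 0 {
			p[lit-1] = False
		} else {
			p[-lit-1] = True
		}
	}
	return p
}

func main() {
	// (x1 OR x2) AND (NOT x1 OR x2) AND (NOT x2) is unsatisfiable,
	// so the three derived patterns cover every assignment.
	cnf := [][]int{{1, 2}, {-1, 2}, {-2}}
	var pats []Pattern
	for _, c := range cnf {
		pats = append(pats, ClauseToPattern(2, c))
	}
	ok, _ := Exhaustive(2, pats)
	fmt.Println("exhaustive:", ok) // exhaustive: true
}
```

Practical compilers avoid the enumeration by working on a decision tree or pattern matrix, which is fast on typical inputs even though the worst case stays exponential.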

Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@knz
Contributor Author

knz commented Oct 19, 2016

Thanks for your input. Do you know of anything close to term rewriting systems / libraries available in Go already?
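For a sense of scale, the core of a term rewriting system is small. The following is a minimal sketch, not an existing library: `Term`, `Rule`, `T`, and `Rewrite` are hypothetical names, patterns are assumed to be linear (each variable appears once on the left-hand side), and no attempt is made to guarantee termination. The point it illustrates is that the traversal is written once in `Rewrite`, while each transformation is just data.

```go
package main

import (
	"fmt"
	"strings"
)

// Term is a uniform tree node: an operator name plus children.
// In rules, a leaf whose Op starts with '?' is a pattern variable.
type Term struct {
	Op   string
	Args []*Term
}

// T is a small constructor helper.
func T(op string, args ...*Term) *Term { return &Term{Op: op, Args: args} }

// Rule rewrites any term matching LHS into RHS.
// Every variable used in RHS must also occur in LHS.
type Rule struct{ LHS, RHS *Term }

func isVar(t *Term) bool { return strings.HasPrefix(t.Op, "?") }

// match binds the variables of pat against t, recording them in env.
// Patterns are assumed linear, so a variable is never bound twice.
func match(pat, t *Term, env map[string]*Term) bool {
	if isVar(pat) {
		env[pat.Op] = t
		return true
	}
	if pat.Op != t.Op || len(pat.Args) != len(t.Args) {
		return false
	}
	for i := range pat.Args {
		if !match(pat.Args[i], t.Args[i], env) {
			return false
		}
	}
	return true
}

// subst instantiates a rule's right-hand side with the bindings in env.
func subst(t *Term, env map[string]*Term) *Term {
	if isVar(t) {
		return env[t.Op]
	}
	args := make([]*Term, len(t.Args))
	for i, a := range t.Args {
		args[i] = subst(a, env)
	}
	return &Term{Op: t.Op, Args: args}
}

// Rewrite normalizes t bottom-up and re-normalizes after each rule that
// fires. It terminates only if the rule set itself is terminating.
func Rewrite(t *Term, rules []Rule) *Term {
	args := make([]*Term, len(t.Args))
	for i, a := range t.Args {
		args[i] = Rewrite(a, rules)
	}
	t = &Term{Op: t.Op, Args: args}
	for _, r := range rules {
		env := map[string]*Term{}
		if match(r.LHS, t, env) {
			return Rewrite(subst(r.RHS, env), rules)
		}
	}
	return t
}

// String renders a term as Op(arg, ...).
func (t *Term) String() string {
	if len(t.Args) == 0 {
		return t.Op
	}
	parts := make([]string, len(t.Args))
	for i, a := range t.Args {
		parts[i] = a.String()
	}
	return t.Op + "(" + strings.Join(parts, ", ") + ")"
}

func main() {
	rules := []Rule{
		{T("Gt", T("?a"), T("?b")), T("Lt", T("?b"), T("?a"))},
		{T("Not", T("Not", T("?x"))), T("?x")},
	}
	in := T("Not", T("Not", T("Gt", T("x"), T("y"))))
	fmt.Println(Rewrite(in, rules)) // Lt(y, x)
}
```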

@andreimatei
Contributor

I'd love more discussion about why we need to write the IR definition in ML, versus writing it in Go.
Sounds like you're not suggesting we actually write transformations in ML, are you? Then what's the advantage of using this other language and generating Go structs? What's the code that'd be hard to write in Go? Is it the serialization? Or the matchers? (for the matchers, I commented elsewhere that I think we can write a generic one).


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


docs/RFCS/sql-ir.md, line 15 at r1 (raw file):

# Motivation

Primary motivations:

how about an extra one - to allow the distsql logical planner to recognize subtrees that it can compile? Or more generally, to allow different compilation "backends".


docs/RFCS/sql-ir.md, line 17 at r1 (raw file):

Primary motivations:

- to encode the query of non-materialized views.

can you please expand this bullet?


docs/RFCS/sql-ir.md, line 19 at r1 (raw file):

- to encode the query of non-materialized views.
- to enable more straightforward definitions for query optimization algorithms.
- to transfer filter expressions across the wire when spawning distSQL processors.

is this one a legitimate need? I think we can transfer the filter expressions just fine today, as the string from the current Expression Node


docs/RFCS/sql-ir.md, line 23 at r1 (raw file):

Side concerns that influence the design:

- robustness against function renames across cockroachdb versions.

I'm not sure I follow these constraints... Is this assuming that we're gonna save IR to use later/on a different node? Or is this about saving views?


docs/RFCS/sql-ir.md, line 117 at r1 (raw file):

```go
oexpr := match(iexpr,
  "BinOp(Gt, a, b)",              func(a, b Expr) Expr { return BinOp{Lt, b, a} },
```

on the left hand side you could also have structures (struct literals), not strings.
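A minimal sketch of what struct-literal patterns could look like, reusing the `BinOp`, `Gt`, `Lt`, and `Expr` names from the quoted snippet. The `Var` placeholder type, the `Col` leaf, and the `match` and `swapGt` functions are assumptions made for this illustration and are not part of the RFC.

```go
package main

import "fmt"

// Expr is any expression node in this sketch.
type Expr interface{}

// Op identifies a binary operator.
type Op int

const (
	Gt Op = iota
	Lt
)

// BinOp is a binary operation node.
type BinOp struct {
	Op   Op
	L, R Expr
}

// Col is a column reference leaf.
type Col struct{ Name string }

// Var is a hypothetical placeholder that may appear inside a pattern
// literal; it matches any subtree and records it under Name.
type Var struct{ Name string }

// match compares a pattern (an ordinary struct literal, possibly
// containing Var placeholders) against an expression.
func match(pat, e Expr, env map[string]Expr) bool {
	switch p := pat.(type) {
	case Var:
		env[p.Name] = e
		return true
	case BinOp:
		b, ok := e.(BinOp)
		return ok && p.Op == b.Op && match(p.L, b.L, env) && match(p.R, b.R, env)
	case Col:
		c, ok := e.(Col)
		return ok && c == p
	}
	return false
}

// swapGt rewrites a > b into b < a and leaves other expressions alone.
func swapGt(e Expr) Expr {
	// The pattern is a struct literal rather than the string "BinOp(Gt, a, b)".
	pattern := BinOp{Gt, Var{"a"}, Var{"b"}}
	env := map[string]Expr{}
	if match(pattern, e, env) {
		return BinOp{Lt, env["b"], env["a"]}
	}
	return e
}

func main() {
	fmt.Printf("%+v\n", swapGt(BinOp{Gt, Col{"x"}, Col{"y"}}))
}
```

Compared with string patterns, a misspelled constructor or operator name here is a compile-time error rather than a failure when the pattern string is parsed.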


docs/RFCS/sql-ir.md, line 128 at r1 (raw file):

var matchers = map[string]matcher{}

func match1(iexpr Expr, actionfn func(...interface{})) {

I'm not sure why/if we need to generate these specialized matchers/"actionfns". Can't we write a generic matching algorithm?
Is the concern performance? Cause I'd argue we should then first be working on prepared statements and caching this stuff.


Comments from Reviewable

@RaduBerinde
Member

I also don't quite understand why we need to start with a new language. I thought that the idea is to start with the current parser nodes and add more as necessary.

In particular, I don't see the benefit for expressions. We already have all the machinery to use, analyze, normalize (and even a simple way to serialize/deserialize) SQL expressions, why maintain two different languages/grammars/etc? It would mean that whenever we want to add some feature and support something we would need to mess with both.

In the IR, how do you express that you are using a certain index? The IR needs to allow representing this right? (In fact, I can't even find how the general idea of reading from a table is represented, IOW what IR "node" would correspond to the current scanNode? Looks like tref is only used for insert/update/delete)

@knz knz force-pushed the rfc-sql-ir branch 4 times, most recently from 8b8717b to 06718b9 Compare November 11, 2016 18:49
@knz
Contributor Author

knz commented Nov 11, 2016

Guys I have completed the 2nd iteration, the text is rather .. different now. PTAL.

Note it's probably not complete yet, we likely want to be a bit more specific as to what we want as attributes in the IR nodes in this text (i.e. before they are implemented), but I'd like your opinion about the overall shape of the work before we dive into specifics.

cc @cuongdo @arjunravinarayan @irfansharif @petermattis @nvanbenschoten @a-robinson

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 429 at r2 (raw file):

   - regular join, with annotations:
     - join predicate: cross/on/using/natural/equality (also see issue #10630)
   - which algorithm to use

This doc mixes tabs and spaces in a few places; I don't know what the markdown renderer will do with that. Replace all tabs with spaces.


docs/RFCS/sql-ir.md, line 476 at r2 (raw file):

1. The IR nodes must be appropriately *named*, at least in documentation,
   so as to provide a common language to drive further discussions and serve
   as references in code TODOs. (FIXME: perhaps this RFC should do just this?)

Yeah, I think the RFC is the appropriate place to define names.


Comments from Reviewable

@RaduBerinde
Member

Great stuff, thanks for all the additions! Overall shape/direction is great!


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 437 at r2 (raw file):

   - show
   - explain
   - index join (this is really a hybrid)

Note that the "join by lookup" method could be used for other joins, not just index join. I agree it's somewhat of a hybrid though.


Comments from Reviewable

@andreimatei
Contributor

:lgtm:

What I'd like a bit more detail on is how exactly the "backends" are going to interact with the IR tree. Are we going to pass the IR to DistSQL, which then has an opportunity to do more transformations? And then DistSQL is going to delegate the parts that it doesn't want to deal with to another "backend" that does further transformations and then produces planNodes?


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 117 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

on the left hand side you could also have structures (struct literals), not strings.

I still maintain my comment

docs/RFCS/sql-ir.md, line 14 at r2 (raw file):

Summarized implementation strategy:
1. identify transforms (we're not doing great on this yet)

nit: this doesn't render on new lines for some reason. Maybe you need an empty line above 1.


docs/RFCS/sql-ir.md, line 29 at r2 (raw file):

- to encode the query of non-materialized views.
- to provide robustness against db/table/column/view renames in a lightweight fashion.
- (hopefully) to version built-in function and virtual table references.

There's multiple references to this versioning throughout the document, but it's not very clear (at least to me) what the point is. Might be worth explaining a bit. Or dropping.
Is it about using prepared statements /views between versions of the server? If so, would we really maintain multiple versions of descriptors/functions, as opposed to "recompiling" everything for every new version?


docs/RFCS/sql-ir.md, line 380 at r2 (raw file):

    - strength reduction (e.g. `a LIKE 'foo%'` to `a >= 'foo' AND a < 'fop'`)
    - comparison simplification
    - **[new]** normalization to disjunctive normal form (an OR of ANDs)

you might want to clarify that this helps with index selection
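The strength-reduction bullet quoted above also bears on index selection, since a range predicate can be turned into an index span. A minimal sketch of the `LIKE 'foo%'` case follows. `LikePrefixRange` is a hypothetical name, and the sketch assumes byte-wise string ordering (no collation) and patterns with no escape characters or wildcards other than a single trailing `%`.

```go
package main

import (
	"fmt"
	"strings"
)

// LikePrefixRange converts a pattern of the form `prefix%` into the
// half-open range [lo, hi) such that `a LIKE pattern` is equivalent to
// `a >= lo AND a < hi` under byte-wise comparison. ok is false when the
// pattern has any other shape or no finite upper bound exists.
func LikePrefixRange(pattern string) (lo, hi string, ok bool) {
	if !strings.HasSuffix(pattern, "%") {
		return "", "", false
	}
	prefix := strings.TrimSuffix(pattern, "%")
	// Reject empty prefixes, inner wildcards, and escapes.
	if prefix == "" || strings.ContainsAny(prefix, "%_\\") {
		return "", "", false
	}
	// Upper bound: drop trailing 0xFF bytes, then increment the last byte.
	b := []byte(prefix)
	for len(b) > 0 && b[len(b)-1] == 0xFF {
		b = b[:len(b)-1]
	}
	if len(b) == 0 {
		return "", "", false
	}
	b[len(b)-1]++
	return prefix, string(b), true
}

func main() {
	lo, hi, ok := LikePrefixRange("foo%")
	fmt.Println(lo, hi, ok) // foo fop true
}
```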


docs/RFCS/sql-ir.md, line 381 at r2 (raw file):

    - comparison simplification
    - **[new]** normalization to disjunctive normal form (an OR of ANDs)
    - partial evaluation of disjunctions and conjunctions

what does this mean?
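The thread does not record the answer. One plausible reading, offered here only as an assumption, is folding of AND/OR nodes when one operand is already a known constant. A minimal sketch with hypothetical node types (`Const`, `Ref`, `And`, `Or`):

```go
package main

import "fmt"

// Expr is any boolean expression node in this sketch.
type Expr interface{}

// Const is a known boolean value.
type Const bool

// Ref is a reference to a boolean column or sub-expression of unknown value.
type Ref string

// And and Or are binary connectives.
type And struct{ L, R Expr }
type Or struct{ L, R Expr }

// Simplify folds connectives whose result is decided by a constant operand.
// The four identities used also hold under SQL's three-valued logic:
// x AND false = false, x AND true = x, x OR true = true, x OR false = x.
func Simplify(e Expr) Expr {
	switch n := e.(type) {
	case And:
		l, r := Simplify(n.L), Simplify(n.R)
		if l == Const(false) || r == Const(false) {
			return Const(false)
		}
		if l == Const(true) {
			return r
		}
		if r == Const(true) {
			return l
		}
		return And{l, r}
	case Or:
		l, r := Simplify(n.L), Simplify(n.R)
		if l == Const(true) || r == Const(true) {
			return Const(true)
		}
		if l == Const(false) {
			return r
		}
		if r == Const(false) {
			return l
		}
		return Or{l, r}
	}
	return e
}

func main() {
	// (a AND false) OR b simplifies to b.
	fmt.Println(Simplify(Or{And{Ref("a"), Const(false)}, Ref("b")})) // b
}
```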


docs/RFCS/sql-ir.md, line 402 at r2 (raw file):

   2. **[new]** ordering propagation (this can simplify nodes) 

   3. **[new]** ordering relaxation, where every ordering that is not

can you give an example of this?


docs/RFCS/sql-ir.md, line 551 at r2 (raw file):

- SQL's "IN" has two IR encodings, depending on whether one of the operands is a subquery.

The bulk of what makes SQL special is the IR for "table expressions".

"table expressions" are "2. logical nodes that act as data sources" from your taxonomy above?


docs/RFCS/sql-ir.md, line 557 at r2 (raw file):

- the IR for table expressions reifies the operators from relational
  algebra: Join, Filter, Render, Sort, etc.
- Insert, Update, Delete, Explain and Show are also table expressions in the IR.

for INSERT/UPDATE/DELETE, you might want to explain that a COUNT operation gets magically placed above the table expression if RETURNING is not used.


docs/RFCS/sql-ir.md, line 639 at r2 (raw file):

- Expression tree can contain nodes with Go values (especially
  timestamps, dates). What to do with those with respect to

@RaduBerinde what does DistSQL do with them today?


docs/RFCS/sql-ir.md, line 914 at r2 (raw file):

(* DoAndCount(T) means run the clause T and return the count of its results. *)
(* For example INSERT/UPDATE/DELETE is initially encoded as DoAndCount(<Clause>). *)
(* DoAndCount(T) then disappears during normalization into Do(GroupAgg(T, *, COUNT( * ))). *)

I wonder how we'll actually run these DoAndCounts... Seems to me like, with any backend, we wouldn't actually put a COUNT node, but instead we'd directly configure the inserter to return the count. And maybe DistSQL would put a summer processor on top. How would that fit? Would there be a further "normalization" step, possibly dependent on the backend, that would remove this COUNT?


Comments from Reviewable

@eisenstatdavid

:lgtm:

but is this IR essentially going to serve as the abstract syntax tree too? The phases of the compiler that check for errors (name resolution, type checking) need to operate on something close to the surface syntax to give good error messages, but subsequent phases are simpler if there's a desugaring step that reduces the number of constructs that they must consider.


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


Comments from Reviewable

@knz
Contributor Author

knz commented Nov 15, 2016

@andreimatei: I have added a paragraph in the summary to explain the interaction.

@eisenstatdavid: I have added a paragraph to clarify that pretty-printing back to SQL will only be possible up to semantic analysis and that we may lose that opportunity after that, and instead we can keep track of the text location that the IR node originates from in the input for error reporting.


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 15 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how about an extra one - to allow the distsql logical planner to recognize subtrees that it can compile? Or more generally, to allow different compilation "backends".

Done.

docs/RFCS/sql-ir.md, line 17 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can you please expand this bullet?

Done.

docs/RFCS/sql-ir.md, line 19 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is this one a legitimate need? I think we can transfer the filter expressions just fine today, as the string from the current Expression Node

Clarified in text.

docs/RFCS/sql-ir.md, line 23 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not sure I follow these constraints... Is this assuming that we're gonna save IR to use later/on a different node? Or is this about saving views?

We discussed that in a meeting.

docs/RFCS/sql-ir.md, line 117 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I still maintain my comment

I think I want to revisit this section after we get some code working.

docs/RFCS/sql-ir.md, line 14 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this doesn't render on new lines for some reason. Maybe you need an empty line above 1.

Done.

docs/RFCS/sql-ir.md, line 29 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

There's multiple references to this versioning throughout the document, but it's not very clear (at least to me) what the point is. Might be worth explaining a bit. Or dropping.
Is it about using prepared statements /views between versions of the server? If so, would we really maintain multiple versions of descriptors/functions, as opposed to "recompiling" everything for every new version?

Clarified in meeting.

docs/RFCS/sql-ir.md, line 380 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you might want to clarify that this helps with index selection

Done.

docs/RFCS/sql-ir.md, line 381 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what does this mean?

Done.

docs/RFCS/sql-ir.md, line 402 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can you give an example of this?

Done.

docs/RFCS/sql-ir.md, line 429 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This doc mixes tabs and spaces in a few places; I don't know what the markdown renderer will do with that. Replace all tabs with spaces.

Done.

docs/RFCS/sql-ir.md, line 437 at r2 (raw file):

Previously, RaduBerinde wrote…

Note that the "join by lookup" method could be used for other joins, not just index join. I agree it's somewhat of a hybrid though.

Clarified.

docs/RFCS/sql-ir.md, line 551 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"table expressions" are "2. logical nodes that act as data sources" from your taxonomy above?

Indeed!

docs/RFCS/sql-ir.md, line 557 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

for INSERT/UPDATE/DELETE, you might want to explain that a COUNT operation gets magically placed above the table expression if RETURNING is not used.

Done.

docs/RFCS/sql-ir.md, line 914 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wonder how we'll actually run these DoAndCounts... Seems to me like, with any backend, we wouldn't actually put a COUNT node, but instead we'd directly configure the inserter to return the count. And maybe DistSQL would put a summer processor on top. How would that fit? Would there be a further "normalization" step, possibly dependent on the backend, that would remove this COUNT?

I'd hope we can optimize this a bit, since count is distributive and commutative it is easy to parallelize!
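The parallelization argument can be sketched as follows: each partition reports a local count, and the final result is the sum of the partial counts, which is the role of the "summer processor" mentioned in the question. This is an illustration only; `parallelCount` is a hypothetical name, and slices of integers stand in for the rows a node would actually mutate.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelCount counts rows per partition concurrently and then sums the
// partial counts. Because COUNT is distributive, the sum of the partial
// counts equals the count over all rows, in whatever order they complete.
func parallelCount(partitions [][]int) int {
	partial := make([]int, len(partitions))
	var wg sync.WaitGroup
	for i, rows := range partitions {
		wg.Add(1)
		go func(i int, rows []int) {
			defer wg.Done()
			// Stand-in for "run the mutation here and count affected rows".
			partial[i] = len(rows)
		}(i, rows)
	}
	wg.Wait()

	// The summing step: merge partial counts into the final result.
	total := 0
	for _, c := range partial {
		total += c
	}
	return total
}

func main() {
	fmt.Println(parallelCount([][]int{{1, 2, 3}, {4}, {}})) // 4
}
```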

Comments from Reviewable

@a-robinson
Contributor

@knz it doesn't look like you pushed the changes you made in response to everyone's comments?


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


Comments from Reviewable

@knz
Contributor Author

knz commented Nov 16, 2016

Oh indeed, my bad. Done.

@a-robinson
Contributor

:lgtm: Thanks for writing this all up!


Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 51 at r3 (raw file):

- The IR needs not be pretty-printable back to valid SQL. In the few
  cases where something needs to produce SQL syntax back to a user
  (e.g. SHOW CREATE), the IR node can keep a copy of the string in

I understand the importance of keeping things simple and completely agree with this decision, but things like showing a view definition to the user after a bunch of the tables/columns it uses have been renamed will unfortunately require more logic than just keeping a copy of the original string.


Comments from Reviewable

@knz
Contributor Author

knz commented Nov 17, 2016

Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 51 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I understand the importance of keeping things simple and completely agree with this decision, but things like showing a view definition to the user after a bunch of the tables/columns it uses have been renamed will unfortunately require more logic than just keeping a copy of the original string.

Will it? Why not simply store a copy of the IR tree at the end of the semantic check phase in the view descriptor, at a point where it's still pretty-printable? Then complete the IR processing every time the view is loaded (we could possibly introduce a cache)

Comments from Reviewable

@a-robinson
Contributor

Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


docs/RFCS/sql-ir.md, line 51 at r3 (raw file):

Previously, knz (kena) wrote…

Will it? Why not simply store a copy of the IR tree at the end of the semantic check phase in the view descriptor, at a point where it's still pretty-printable? Then complete the IR processing every time the view is loaded (we could possibly introduce a cache)

That would be great if it's an option. I was just taking the point "The IR needs not be pretty-printable back to valid SQL" at face value :)

Comments from Reviewable

@knz knz force-pushed the rfc-sql-ir branch 4 times, most recently from 396617f to 589ae68 Compare December 1, 2016 01:02
@RaduBerinde
Member

RaduBerinde commented Dec 2, 2016

Reminder for the discussion in #11736 (comment) about "ghost columns participating in ordering"

@knz knz added the do-not-merge label Apr 25, 2017
@knz knz requested a review from a team as a code owner September 4, 2017 16:16
@knz
Contributor Author

knz commented Sep 4, 2017

I had another iteration on this PR after recent developments:

  1. I have a rather simple and elegant solution for the problem "Go data in IR" - will send out separate RFC which will need implementing before this IR merges.
  2. I realized thanks to our optimizer training that my view on pattern matching was naive and inadequate. I simply removed that section and pattern matching will probably receive better treatment in a separate RFC.
  3. I also realized that the IR for abstract syntax, which describes the user input of a query, is necessarily different from the IR for logical plans. Trying to merge them together as this PR was initially attempting makes the picture murky and hard to read. I will have another iteration to separate them further and probably split this RFC in two.

The big next steps at this point are:

  • propose+implement solution to point 1 above
  • iterate on preliminary list of IR node names, to complete this RFC
  • merge this PR, then continue work in subsequent RFCs.

Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions.


docs/RFCS/sql-ir.md, line 128 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not sure why/if we need to generate these specialized matchers/"actionfns". Can't we write a generic matching algorithm?
Is the concern performance? Cause I'd argue we should then first be working on prepared statements and caching this stuff.

Removed this section. Will address in a later / separate RFC.


docs/RFCS/sql-ir.md, line 476 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, I think the RFC is the appropriate place to define names.

Done.

Will need to iterate a little bit before this merges though.


docs/RFCS/sql-ir.md, line 639 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

@RaduBerinde what does DistSQL do with them today?

I removed this section - I have a solution, will write a separate RFC.


docs/RFCS/sql-ir.md, line 51 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That would be great if it's an option. I was just taking the point "The IR needs not be pretty-printable back to valid SQL" at face value :)

I added a qualifier sub-clause in the sentence.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.


docs/RFCS/sql-ir.md, line 389 at r4 (raw file):

      - etc. (this extracts the logic currently present in planNode constructors)
      - **at the end of this stage, a statement is valid and can
        be run**

If we're going by some of the terminology we've learned recently, I think the above would all be considered "prep".


docs/RFCS/sql-ir.md, line 530 at r4 (raw file):

   - accessor methods.
   - walk (visitor) recursion.
   - match methods meant to ease pattern matching (see below for details)

I'm still fuzzy on how the pattern matching would ease the writing of transformations. That is, I can imagine this at a high level, but the specifics are where the devil is. I think it would be worthwhile, perhaps in a companion RFC, to give an example of 2 transformations in the current code and how they would be simplified. Two good examples would be expression normalization and propagating filters across inner joins.


Comments from Reviewable

@petermattis
Collaborator

This RFC has been in draft form for quite a while. An over-arching concern I have is that it presents justification for one approach while not listing any alternatives. And I don't have a concrete grasp of whether the proposed IR nodes will be a material improvement for the types of transformations we want to perform during SQL optimization. I think it is worth revisiting this RFC in light of some of the knowledge we've gained recently.


Review status: 0 of 1 files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


docs/RFCS/sql-ir.md, line 238 at r4 (raw file):

sizeable language processing component.

## General implementation direction: auto-generate the code

As my understanding of the needs of this IR has become clearer, I'm not sure if auto-generated code is necessary, at least for the IR node structure. As an alternative, I think a single node structure used for both relational and scalar expressions can satisfy our needs and enable easy transformations. What seems to be important for the IR structure is to be able to easily determine and update logical properties about a node, such as the input and output variables. This could be achieved by making the IR nodes implement an interface (inputVars(), outputVars(), etc), or we can pull those logical properties into a node structure and have the node-specific data attached via some hook field.
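A minimal sketch of the second option, a single node structure with the logical properties pulled into it and a hook field for node-specific data. All names (`expr`, `props`, `private`, `providedBy`, the operator constants) and the exact meaning given to input and output variables are assumptions made for illustration, not a proposed design.

```go
package main

import "fmt"

// operator enumerates node kinds; relational and scalar operators would
// share this one space in a single-node-structure design.
type operator int

const (
	scanOp operator = iota
	selectOp
)

// props groups the logical properties a transformation wants to query.
type props struct {
	inputVars  []string // variables the node needs from its inputs
	outputVars []string // variables the node provides to its parent
}

// expr is the single node structure used for every operator.
type expr struct {
	op       operator
	children []*expr
	props    props
	private  interface{} // hook for node-specific data, e.g. a table name
}

// providedBy reports whether every needed variable is among those available.
// A transformation could use it to check that a filter can sit on an input.
func providedBy(need, have []string) bool {
	set := make(map[string]bool, len(have))
	for _, v := range have {
		set[v] = true
	}
	for _, v := range need {
		if !set[v] {
			return false
		}
	}
	return true
}

func main() {
	scan := &expr{
		op:      scanOp,
		props:   props{outputVars: []string{"a", "b"}},
		private: "t", // table name carried in the hook field
	}
	filter := &expr{
		op:       selectOp,
		children: []*expr{scan},
		props:    props{inputVars: []string{"a"}},
	}
	fmt.Println(providedBy(filter.props.inputVars, scan.props.outputVars)) // true
}
```

The interface-based option would instead expose `inputVars()` and `outputVars()` as methods on each node type; the struct form keeps property access uniform at the cost of a type assertion whenever `private` is read.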

There is still an exploration to be made about whether transformations should be represented in a higher-level language or directly in Go. The latter certainly gives the most flexibility, but perhaps at the expense of expressiveness.


docs/RFCS/sql-ir.md, line 361 at r4 (raw file):

   the nodes already defined, so they can be used as drop-in
   replacement into the existing code (i.e. preserve struct and
   attribute names)

This approach to using the IR sounds like it is trying to replace the AST nodes. An alternative would be to leave the AST nodes in place and provide a transformation from the AST to the IR. The advantage of this approach is that we can choose when we perform the transformation. For example, we can leave all of the existing name resolution, type checking and normalization code in place for the time being and use the IR for higher-level transformations and query planning. Over time, we can move the conversion from the AST to the IR closer to the raw AST.


Comments from Reviewable

@knz
Contributor Author

knz commented Jan 25, 2018

Superseded by #19135.

@knz knz closed this Jan 25, 2018
@knz knz deleted the rfc-sql-ir branch April 27, 2018 18:52