opttoy2: construction of unified expression tree #10

petermattis · 2017-10-07T15:31:46Z

The single expression tree contains both relational operators (scan,
join, etc) and scalar operators (and, or, plus, minus, variables, etc).

petermattis · 2017-10-07T15:34:01Z

This mostly does what I was describing yesterday: it parses queries with the Cockroach SQL parser and then converts the result into a unified expression tree. Join conditions are not handled yet.

~ go run *.go
CREATE TABLE a (x INT, y INT)
CREATE TABLE b (x INT, z INT)
SELECT a.y FROM a
scanOp (a (x, y)) [in= out=1]
  project:
    variableOp (a.y) [in=1 out=1]

SELECT * FROM b WHERE b.z > 10
scanOp (b (x, z)) [in= out=0-1]
  project:
    variableOp (*) [in=0-1 out=0-1]
  filter:
    scalarOp (>) [in=1 out=1]
      inputs:
        variableOp (b.z) [in=1 out=1]
        scalarOp (const) (10)

SELECT a.y, b.x FROM a, b WHERE (a.x > 7) AND (b.z = 3)
joinOp [in=0-3 out=1-2]
  project:
    variableOp (a.y) [in=1 out=1]
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (AND) [in=0,3 out=0,3]
      inputs:
        scalarOp (>) [in=0 out=0]
          inputs:
            variableOp (a.x) [in=0 out=0]
            scalarOp (const) (7)
        scalarOp (=) [in=3 out=3]
          inputs:
            variableOp (b.z) [in=3 out=3]
            scalarOp (const) (3)
  inputs:
    scanOp (a (x, y)) [in= out=0-1]
    scanOp (b (x, z)) [in= out=2-3]
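
A minimal sketch of what a node in such a unified tree could look like, assuming a single `expr` struct shared by relational and scalar operators. The field names, the `operator` constants, and the helpers `walk`, `countOps`, and `exampleTree` are illustrative assumptions, not the code in opttoy.go. It rebuilds the `SELECT * FROM b WHERE b.z > 10` tree from the output above and shows that one traversal can visit both kinds of operator.

```go
package main

import "fmt"

// bitmap is a set of variable indexes: bit i set means variable i is present.
type bitmap uint64

type operator int

const (
	scanOp operator = iota
	joinOp
	variableOp
	scalarOp
)

// isRelational reports whether the operator produces rows rather than a value.
func (o operator) isRelational() bool { return o == scanOp || o == joinOp }

// expr is one node of the unified tree. The layout is hypothetical.
type expr struct {
	op          operator
	label       string  // table, column, or scalar operator name
	inputs      []*expr // child expressions
	projections []*expr // scalar expressions computed by a relational node
	filters     []*expr // scalar predicates applied by a relational node
	inputVars   bitmap
	outputVars  bitmap
}

// isConstant reports whether a scalar expression reads no variables.
func (e *expr) isConstant() bool { return !e.op.isRelational() && e.inputVars == 0 }

// walk visits e and every expression reachable from it.
func walk(e *expr, visit func(*expr)) {
	visit(e)
	for _, group := range [][]*expr{e.projections, e.filters, e.inputs} {
		for _, child := range group {
			walk(child, visit)
		}
	}
}

// countOps returns the number of relational and scalar nodes under e.
func countOps(e *expr) (relational, scalar int) {
	walk(e, func(n *expr) {
		if n.op.isRelational() {
			relational++
		} else {
			scalar++
		}
	})
	return relational, scalar
}

// exampleTree builds SELECT * FROM b WHERE b.z > 10 with b.x = 0 and b.z = 1.
func exampleTree() *expr {
	bz := &expr{op: variableOp, label: "b.z", inputVars: 1 << 1, outputVars: 1 << 1}
	ten := &expr{op: scalarOp, label: "const 10"}
	gt := &expr{op: scalarOp, label: ">", inputs: []*expr{bz, ten}, inputVars: 1 << 1, outputVars: 1 << 1}
	star := &expr{op: variableOp, label: "*", inputVars: 3, outputVars: 3} // 3 = variables 0 and 1
	return &expr{
		op:          scanOp,
		label:       "b",
		projections: []*expr{star},
		filters:     []*expr{gt},
		outputVars:  3,
	}
}

func main() {
	relational, scalar := countOps(exampleTree())
	fmt.Println("relational:", relational, "scalar:", scalar)
}
```

Keeping projections and filters as child slices on the relational node mirrors the `project:` and `filter:` sections in the printed trees.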

petermattis · 2017-10-07T21:19:32Z

Added basic predicate push down and propagation of predicates across join conditions. I'd be willing to bet I'm missing complexities (e.g. I'm only considering inner joins).

CREATE TABLE a (x INT, y INT)
CREATE TABLE b (x INT, z INT)
SELECT b.x FROM a, b WHERE (a.x > 7) AND (b.z = 3)
joinOp [in=0,2-3 out=2]
  project:
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (>) [in=0]
      inputs:
        variableOp (a.x) [in=0 out=0]
        scalarOp (const) (7)
    scalarOp (=) [in=3]
      inputs:
        variableOp (b.z) [in=3 out=3]
        scalarOp (const) (3)
    scalarOp (=) [in=0,2]
      inputs:
        variableOp (a.x) [in=0 out=0]
        variableOp (b.x) [in=2 out=2]
  inputs:
    scanOp (a (x, y)) [out=0-1]
    scanOp (b (x, z)) [out=2-3]

PREDICATE PUSH DOWN:
joinOp [in=0,2-3 out=2]
  project:
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (=) [in=0,2]
      inputs:
        variableOp (a.x) [in=0 out=0]
        variableOp (b.x) [in=2 out=2]
  inputs:
    scanOp (a (x, y)) [out=0-1]
      filter:
        scalarOp (>) [in=0]
          inputs:
            variableOp (a.x) [in=0 out=0]
            scalarOp (const) (7)
    scanOp (b (x, z)) [out=2-3]
      filter:
        scalarOp (>) [in=2]
          inputs:
            variableOp (b.x) [in=2 out=2]
            scalarOp (const) (7)
        scalarOp (=) [in=3]
          inputs:
            variableOp (b.z) [in=3 out=3]
            scalarOp (const) (3)

Cc @knz, @radu. Take a look at expr.pushDownFilters. This experimentation came about from discussions with @albler about how to structure expression manipulation code, but the actual code and bugs are all mine.
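
The push down step shown above can be sketched as follows, assuming that a filter may move into a join input whenever every variable it reads is produced by that input. The types `rel` and `filter` and the function `pushDownSketch` are hypothetical names for illustration; this is not `expr.pushDownFilters`, it only considers inner joins, and it leaves out the propagation step that derives `b.x > 7` from `a.x > 7` and `a.x = b.x`.

```go
package main

import "fmt"

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

// subsetOf reports whether every variable in b is also in other.
func (b bitmap) subsetOf(other bitmap) bool { return b&other == b }

// filter is a predicate together with the variables it reads.
type filter struct {
	desc      string
	inputVars bitmap
}

// rel is a relational node (scan or inner join) in this sketch.
type rel struct {
	name       string
	outputVars bitmap
	filters    []filter
	inputs     []*rel
}

// pushDownSketch moves each filter of e into the first input that produces
// all the variables the filter reads. Filters that span several inputs stay
// on e. The function then recurses into the inputs.
func pushDownSketch(e *rel) {
	var kept []filter
	for _, f := range e.filters {
		pushed := false
		for _, in := range e.inputs {
			if f.inputVars.subsetOf(in.outputVars) {
				in.filters = append(in.filters, f)
				pushed = true
				break
			}
		}
		if !pushed {
			kept = append(kept, f)
		}
	}
	e.filters = kept
	for _, in := range e.inputs {
		pushDownSketch(in)
	}
}

// exampleJoin mirrors SELECT b.x FROM a, b WHERE (a.x > 7) AND (b.z = 3)
// plus the join condition a.x = b.x, using the numbering from the output above.
func exampleJoin() *rel {
	a := &rel{name: "a", outputVars: 0x3} // variables 0-1
	b := &rel{name: "b", outputVars: 0xc} // variables 2-3
	return &rel{
		name:       "join",
		outputVars: 0xf,
		inputs:     []*rel{a, b},
		filters: []filter{
			{desc: "a.x > 7", inputVars: 1 << 0},
			{desc: "b.z = 3", inputVars: 1 << 3},
			{desc: "a.x = b.x", inputVars: 1<<0 | 1<<2},
		},
	}
}

func main() {
	j := exampleJoin()
	pushDownSketch(j)
	fmt.Println("join:", len(j.filters), "a:", len(j.inputs[0].filters), "b:", len(j.inputs[1].filters))
}
```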

petermattis · 2017-10-07T21:22:24Z

PS I realize this is a sizable chunk of code and the comments are sparse. Happy to talk through the mechanisms involved.

petermattis · 2017-10-09T12:34:19Z

Cc @RaduBerinde (just realized you're not @radu).

RaduBerinde · 2017-10-09T14:44:32Z

opttoy2/opttoy.go

+	return (filter.inputVars & e.inputVars) == filter.inputVars
+}
+
+func buildEquivalencyMap(filters []*expr) map[bitmap]*expr {


This map could use some comments. We are mapping sets of variables to equivalent expressions?

It seems off that we are treating (a, b) = (c, d) differently than a = c AND b = d (the equivalency map is different between these two). If we have filters a = e, b = f, and (c, d) = (a, b) (with one input providing a, b and another providing c, d, e, f), we won't be able to infer (c, d) = (e, f).

It seems that the right thing to do here is to create equivalency groups between variables (e.g. UnionFind) and then try to find a variable in each group that is in filter.inputVars.
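
A sketch of that suggestion, assuming variables are numbered a=0 through f=5 and that every equality filter is decomposed into pairwise `union` calls. The `unionFind` type and the `equivalentIn` helper are illustrative names, not code from the PR. With the filters a = e, b = f, and (c, d) = (a, b), the groups become {a, c, e} and {b, d, f}, so (c, d) = (e, f) falls out.

```go
package main

import "fmt"

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

func (b bitmap) has(i int) bool { return b&(bitmap(1)<<uint(i)) != 0 }

// unionFind tracks equivalency groups over variable indexes 0..n-1.
type unionFind struct{ parent []int }

func newUnionFind(n int) *unionFind {
	u := &unionFind{parent: make([]int, n)}
	for i := range u.parent {
		u.parent[i] = i
	}
	return u
}

// find returns the representative of x's group, halving paths as it goes.
func (u *unionFind) find(x int) int {
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]]
		x = u.parent[x]
	}
	return x
}

// union merges the groups of x and y.
func (u *unionFind) union(x, y int) { u.parent[u.find(x)] = u.find(y) }

// equivalentIn returns the lowest-numbered variable that is equivalent to v
// and present in available, or false if the group has no such member.
func (u *unionFind) equivalentIn(v int, available bitmap) (int, bool) {
	root := u.find(v)
	for i := range u.parent {
		if available.has(i) && u.find(i) == root {
			return i, true
		}
	}
	return 0, false
}

// exampleGroups encodes a = e, b = f, (c, d) = (a, b) with a=0 ... f=5.
func exampleGroups() *unionFind {
	u := newUnionFind(6)
	u.union(0, 4) // a = e
	u.union(1, 5) // b = f
	u.union(2, 0) // c = a
	u.union(3, 1) // d = b
	return u
}

func main() {
	u := exampleGroups()
	rightInput := bitmap(0x3c) // the input providing c, d, e, f (variables 2-5)
	v, ok := u.equivalentIn(0, rightInput)
	fmt.Println("a maps to variable", v, ok)
	fmt.Println("c = e:", u.find(2) == u.find(4), "d = f:", u.find(3) == u.find(5))
}
```

Because groups are built per variable, the tuple form and the pairwise form of an equality produce the same groups.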

This is only mapping single variables to equivalent variables. Yes, I don't think it is as general as it could be. The physicalProps stuff you've added recently is definitely more sophisticated.

@albler tells me that functional dependencies should be part of the per-node logical properties. I'm not quite sure how to achieve this yet in the structure I have in this PR. Currently, functional dependencies are implicit in the filters. Should each expr node contain an equivalency group map? (Perhaps lazily initialized).

Something I wanted to get your opinion on is if you find the manipulations in this PR more readable than those present in the existing Cockroach code base. Certainly, the code here is a toy and incomplete so the comparison isn't exactly apples-to-apples. My big take-away so far is that tracking input and output vars per-node makes some of the logic much clearer, but perhaps I'm enamored with my own code.

Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).

if you find the manipulations in this PR more readable than those present in the existing Cockroach code base.

Absolutely. This seems much nicer and easier to understand. Though there may be downsides I'm not seeing right now.

One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b somewhere in there). Would these show up as special variables in our structures?

Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).

In my discussions with Alberto, he's indicated that we should separate the logical and physical properties. The physical properties only arise when we get to physical nodes such as index scans. Though, I'm not sure if there is a downside to including the physical properties on all nodes. Probably worth experimenting with this.

Absolutely. This seems much nicer and easier to understand. Though there may be downsides I'm not seeing right now.

Great. I'm planning to extend this toy a bit more to see if we can capture some of the downsides. Do you have particular transformations that seem complex in the current code? Or a transformation that you'd like to see an example of? One item I'm working on right now is simple decorrelation as that has been a dream for a long time.

One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b somewhere in there). Would these show up as special variables in our structures?

Yes, that's exactly what happens. See opttoy.go:579. The way this works is that variableOp nodes are defined as passing through their input vars to their output vars, but other scalar expressions swallow the output vars. So an expression such as a+1 in a render expression (i.e. a projection) will have outputVars == 0 and we'll know that we need to number that expression.
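
The rule described here can be sketched like this, assuming a variable passes its index through to its output while any other scalar expression starts with an empty output set. The `scalar` type and the helpers `variable`, `compound`, and `numberProjections` are hypothetical and are not taken from opttoy.go.

```go
package main

import "fmt"

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

// scalar is a scalar expression with the variables it reads and produces.
type scalar struct {
	label      string
	inputVars  bitmap
	outputVars bitmap
}

// variable passes its input variable through to its output.
func variable(label string, idx uint) *scalar {
	v := bitmap(1) << idx
	return &scalar{label: label, inputVars: v, outputVars: v}
}

// compound reads the union of its inputs' variables and swallows the
// output variables, so outputVars stays empty.
func compound(label string, inputs ...*scalar) *scalar {
	s := &scalar{label: label}
	for _, in := range inputs {
		s.inputVars |= in.inputVars
	}
	return s
}

// numberProjections assigns a fresh index, starting at next, to every
// projection without an output variable. It returns the next unused index.
func numberProjections(projections []*scalar, next uint) uint {
	for _, p := range projections {
		if p.outputVars == 0 {
			p.outputVars = bitmap(1) << next
			next++
		}
	}
	return next
}

func main() {
	a := variable("a", 0)
	sum := compound("a+1", a, compound("1"))
	next := numberProjections([]*scalar{a, sum}, 2)
	fmt.Println("a+1 reads", sum.inputVars, "produces", sum.outputVars, "next index", next)
}
```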

I'm not sure if my handling of variableOp is abstracted enough. Currently such expressions only refer to columns, but I think they need to refer to input variables.

I don't understand the logical vs physical properties distinction, can you point me to some resource? Are the logical properties those that are true regardless of how we run that node (e.g. equivalency between equality columns in a join) while physical properties depend on the specific way a node is run (e.g. ordering)?

Yes, decorrelation would be a big one. I don't think any of the transformations we are currently doing would be difficult to handle in this model.

Yes, that's pretty much the distinction between logical and physical properties as I understand it, though I think I'd restate that as physical properties depend on the specific node.

One reason for keeping them separate is the Memo structure. The logical properties are the same for all nodes in the same equivalency group while the physical properties are different per node. This can be represented in code by pulling the logical properties out into an equivalencyGroup structure and have each memoExpr node hold a pointer to the equivalencyGroup it is part of. I believe this may also facilitate finding the equivalency group for a node, but I'm still hand-wavy on how that will work.
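
As a sketch of that layout, assuming logical properties live on the group and physical properties on each member: the types `logicalProps` and `physicalProps`, their fields, and the `add` method are placeholders chosen for illustration. Only the `equivalencyGroup` and `memoExpr` names come from the comment above.

```go
package main

import "fmt"

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

// logicalProps hold what is true however the expression is executed.
// The single field is a placeholder.
type logicalProps struct {
	outputVars bitmap
}

// physicalProps depend on the specific node, for example its ordering.
type physicalProps struct {
	ordering []int // variable indexes the output is sorted on
}

// equivalencyGroup owns the logical properties shared by its members.
type equivalencyGroup struct {
	logical logicalProps
	exprs   []*memoExpr
}

// memoExpr is one alternative within a group, with its own physical properties.
type memoExpr struct {
	op       string
	group    *equivalencyGroup
	physical physicalProps
}

// add registers a new alternative in the group and returns it.
func (g *equivalencyGroup) add(op string, physical physicalProps) *memoExpr {
	e := &memoExpr{op: op, group: g, physical: physical}
	g.exprs = append(g.exprs, e)
	return e
}

func main() {
	g := &equivalencyGroup{logical: logicalProps{outputVars: 0x3}}
	tableScan := g.add("table scan", physicalProps{})
	indexScan := g.add("index scan", physicalProps{ordering: []int{1}})
	fmt.Println("same group:", tableScan.group == indexScan.group)
	fmt.Println("orderings:", tableScan.physical.ordering, indexScan.physical.ordering)
}
```

The back pointer from `memoExpr` to its group gives each node direct access to the shared logical properties.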

There is an overlap between logical and physical properties here that is confusing me right now. Some operators such as order by impose physical properties on their output which implies a different equivalency group.

Yes, that last part confuses me as well.

Are there any other things besides ordering that are physical properties? I can't think of anything.

Is unique a physical property? For a scan operation, seems like we know that based on whether there is a unique index on the set of columns, not whether we actually use that index.

@albler can you weigh in?

Even though we may learn about it from a physical aspect of the plan, it's still a logical property in that it's true regardless of how we run it.

Note that "keys" fall under the umbrella of functional dependencies.

knz · 2017-10-10T14:05:32Z

opttoy2/opttoy.go

+//   -- @1 -> y
+//
+// This is akin to the way parser.IndexedVar works except that we're taking
+// care to make the indexes globally unique. Because each of the relational


"to make the indexes unique across the entire statement"

knz · 2017-10-10T14:12:02Z

opttoy2/opttoy.go

+// intersection.
+//
+// For scalar expressions the input variables bitmap allows an easy
+// determination of whether the expression is constant (the bitmap is empty)


New paragraph:
"""
The bitmap determines which scalars are used at each level of a logical plan but it does not determine the order in which the values are presented in memory, e.g. as a result row. For this each expression must also carry, next to the bitmap, an (optional) reordering array which maps the positions in the result row to the indexes in the bitmap. For example, the two queries SELECT k,v FROM kv and SELECT v,k FROM kv have the same bitmap, but the first reorders @1 -> idx 0, @2 -> idx 1 whereas the second reorders @1 -> idx 1, @2 -> idx 0. A subsequent RFC is to determine whether a single array is sufficient for this (indexed by the output column position) or whether both backward and forward associations must be maintained.
"""

I think with this we do not need a separate rename stage.
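
A sketch of the proposed reordering array, assuming it is indexed by output column position (one of the two options the paragraph leaves open). The `projection` type and its `row` and `positions` methods are hypothetical. The example uses the paragraph's two queries with k as @1 and v as @2.

```go
package main

import "fmt"

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

// projection pairs the set of variables with their order in the result row.
type projection struct {
	vars  bitmap
	order []int // order[pos] is the variable index shown at row position pos
}

// row lays out the given variable values in result-row order.
func (p projection) row(values map[int]string) []string {
	out := make([]string, len(p.order))
	for pos, v := range p.order {
		out[pos] = values[v]
	}
	return out
}

// positions returns the inverse association: variable index to row position.
func (p projection) positions() map[int]int {
	m := make(map[int]int, len(p.order))
	for pos, v := range p.order {
		m[v] = pos
	}
	return m
}

func main() {
	values := map[int]string{1: "k", 2: "v"}
	kv := projection{vars: 0x6, order: []int{1, 2}} // SELECT k,v FROM kv
	vk := projection{vars: 0x6, order: []int{2, 1}} // SELECT v,k FROM kv
	fmt.Println(kv.row(values), vk.row(values), "same bitmap:", kv.vars == vk.vars)
}
```

In this sketch the backward association is derived from the forward array on demand, which is one way to avoid maintaining both.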

Interesting. This is effectively a rename operator, though baked into every node as selection and projection are. I think you might be correct that this is necessary. I've added your paragraph as a TODO, sketching out this idea.

During the initial construction of the expression this rename will indeed be present at every stage, because that's what SQL allows.

But perhaps we do not need it embedded in every stage, and instead have the initial construction "weave" it along during the recursion and resolve it at every stage, keeping the rename only for the end.

Although this is possible, I predict that the construction of the final physical plans will need to compute a rename at every stage where the entries in the bitmap are not contiguous and starting with position 0; so as to compact the data structure that represents the values in memory. But I would agree it's too early to say.
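
The compaction mentioned here can be sketched as a mapping from each set bit to a dense position, assuming positions are assigned in ascending index order. The `compact` function is a hypothetical helper, not part of the PR.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitmap is a set of variable indexes, one bit per index.
type bitmap uint64

// compact maps each variable index in b to a dense position 0..n-1,
// assigned in ascending index order.
func compact(b bitmap) map[int]int {
	m := make(map[int]int, bits.OnesCount64(uint64(b)))
	pos := 0
	// rest &= rest - 1 clears the lowest set bit on each iteration.
	for rest := uint64(b); rest != 0; rest &= rest - 1 {
		m[bits.TrailingZeros64(rest)] = pos
		pos++
	}
	return m
}

func main() {
	// Variables 0, 2, and 3, as in the joinOp [in=0,2-3] example.
	m := compact(1<<0 | 1<<2 | 1<<3)
	fmt.Println(m[0], m[2], m[3])
}
```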

knz · 2017-10-10T14:14:00Z

opttoy2/opttoy.go

+//
+//   SELECT @0 FROM a WHERE @1 > 0
+//   -- @0 -> x
+//   -- @1 -> y


According to the code below I understand there should be a 3rd expression @3 -> @1 > 0

i.e. all predicates also get indexes in the bitmap, for otherwise we can't group them together.

Currently in this code, predicates do not get indexes in the bitmap. A scalar expression only gets an index if it is used as a projection. I'm not following your concern about not being able to group them together. You've probably stumbled onto a case where the current code is inadequate.

petermattis · 2017-10-11T00:54:30Z

I'm going to merge this PR. Feel free to leave additional comments which I can address in follow-on PRs.

petermattis requested a review from a user October 7, 2017 15:31

petermattis force-pushed the pmattis/opttoy2 branch 5 times, most recently from 50a350d to 13dff7f Compare October 7, 2017 21:06

petermattis force-pushed the pmattis/opttoy2 branch from 13dff7f to 3fed4b7 Compare October 7, 2017 21:19

petermattis force-pushed the pmattis/opttoy2 branch 13 times, most recently from 058d3dc to 3512bcf Compare October 9, 2017 01:09

RaduBerinde reviewed Oct 9, 2017

petermattis force-pushed the pmattis/opttoy2 branch from 3512bcf to d6bc2c1 Compare October 9, 2017 20:05

knz reviewed Oct 10, 2017

knz mentioned this pull request Oct 10, 2017

RFCS: SQL query planning cockroachdb/cockroach#19135

petermattis force-pushed the pmattis/opttoy2 branch from d6bc2c1 to eb736b5 Compare October 10, 2017 14:28

petermattis merged commit e90193a into master Oct 11, 2017

petermattis deleted the pmattis/opttoy2 branch October 11, 2017 01:00
