Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opttoy2: construction of unified expression tree #10

Merged
merged 1 commit into from
Oct 11, 2017

Conversation

petermattis
Copy link
Owner

The single expression tree contains both relational operators (scan,
join, etc) and scalar operators (and, or, plus, minus, variables, etc).

@petermattis petermattis requested a review from a user October 7, 2017 15:31
@petermattis
Copy link
Owner Author

petermattis commented Oct 7, 2017

This mostly does what I was describing yesterday, using the Cockroach sql parser and then converting it into a unified expression tree. Not yet done is any handling of join conditions.

~ go run *.go
CREATE TABLE a (x INT, y INT)
CREATE TABLE b (x INT, z INT)
SELECT a.y FROM a
scanOp (a (x, y)) [in= out=1]
  project:
    variableOp (a.y) [in=1 out=1]

SELECT * FROM b WHERE b.z > 10
scanOp (b (x, z)) [in= out=0-1]
  project:
    variableOp (*) [in=0-1 out=0-1]
  filter:
    scalarOp (>) [in=1 out=1]
      inputs:
        variableOp (b.z) [in=1 out=1]
        scalarOp (const) (10)

SELECT a.y, b.x FROM a, b WHERE (a.x > 7) AND (b.z = 3)
joinOp [in=0-3 out=1-2]
  project:
    variableOp (a.y) [in=1 out=1]
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (AND) [in=0,3 out=0,3]
      inputs:
        scalarOp (>) [in=0 out=0]
          inputs:
            variableOp (a.x) [in=0 out=0]
            scalarOp (const) (7)
        scalarOp (=) [in=3 out=3]
          inputs:
            variableOp (b.z) [in=3 out=3]
            scalarOp (const) (3)
  inputs:
    scanOp (a (x, y)) [in= out=0-1]
    scanOp (b (x, z)) [in= out=2-3]

@petermattis petermattis force-pushed the pmattis/opttoy2 branch 5 times, most recently from 50a350d to 13dff7f Compare October 7, 2017 21:06
@petermattis
Copy link
Owner Author

Added basic predicate push down and propagation of predicates across join conditions. I'd be willing to bet I'm missing complexities (e.g. I'm only considering inner joins).

CREATE TABLE a (x INT, y INT)
CREATE TABLE b (x INT, z INT)
SELECT b.x FROM a, b WHERE (a.x > 7) AND (b.z = 3)
joinOp [in=0,2-3 out=2]
  project:
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (>) [in=0]
      inputs:
        variableOp (a.x) [in=0 out=0]
        scalarOp (const) (7)
    scalarOp (=) [in=3]
      inputs:
        variableOp (b.z) [in=3 out=3]
        scalarOp (const) (3)
    scalarOp (=) [in=0,2]
      inputs:
        variableOp (a.x) [in=0 out=0]
        variableOp (b.x) [in=2 out=2]
  inputs:
    scanOp (a (x, y)) [out=0-1]
    scanOp (b (x, z)) [out=2-3]

PREDICATE PUSH DOWN:
joinOp [in=0,2-3 out=2]
  project:
    variableOp (b.x) [in=2 out=2]
  filter:
    scalarOp (=) [in=0,2]
      inputs:
        variableOp (a.x) [in=0 out=0]
        variableOp (b.x) [in=2 out=2]
  inputs:
    scanOp (a (x, y)) [out=0-1]
      filter:
        scalarOp (>) [in=0]
          inputs:
            variableOp (a.x) [in=0 out=0]
            scalarOp (const) (7)
    scanOp (b (x, z)) [out=2-3]
      filter:
        scalarOp (>) [in=2]
          inputs:
            variableOp (b.x) [in=2 out=2]
            scalarOp (const) (7)
        scalarOp (=) [in=3]
          inputs:
            variableOp (b.z) [in=3 out=3]
            scalarOp (const) (3)

Cc @knz, @radu. Take a look at expr.pushDownFilters. This experimentation came about from discussions with @albler about how to structure expression manipulation code, but the actual code and bugs are all mine.

@petermattis
Copy link
Owner Author

PS I realize this is a sizable chunk of code and the comments are sparse. Happy to talk through the mechanisms involved.

@petermattis petermattis force-pushed the pmattis/opttoy2 branch 13 times, most recently from 058d3dc to 3512bcf Compare October 9, 2017 01:09
@petermattis
Copy link
Owner Author

Cc @RaduBerinde (just realized you're not @radu).

return (filter.inputVars & e.inputVars) == filter.inputVars
}

func buildEquivalencyMap(filters []*expr) map[bitmap]*expr {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map could use some comments. We are mapping sets of variables to equivalent expressions?

It seems off that we are treating (a, b) = (c, d) differently than a = b AND b = c (the equivalency map is different between these two). If we have filters a = e, b = f, and (c, d) = (a, b) (with one input providing a, b and another providing c, d,e,f), we won't be able to infer (c, d) = (e, f)

It seems that the right thing to do here is to create equivalency groups between variables (e.g. UnionFind) and then trying to find a variable in each group that is in filter.inputVars

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only mapping single variables to equivalent variables. Yes, I don't think it is as general as it could be. The physicalProps stuff you've added recently is definitely more sophisticated.

@albler tells me that functional dependencies should be part of the per-node logical properties. I'm not quite sure how to achieve this yet in the structure I have in this PR. Currently, functional dependencies are implicit in the filters. Should each expr node contain an equivalency group map? (Perhaps lazily initialized).

Something I wanted to get your opinion on is if you find the manipulations in this PR more readable than those present in the existing Cockroach code base. Certainly, the code here is a toy and incomplete so the comparison isn't exactly apples-to-apples. My big take-away so far is that tracking input and output vars per-node makes some of the logic much clearer, but perhaps I'm enamored with my own code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).

if you find the manipulations in this PR more readable than those present in the existing Cockroach code base.

Absolutely. This seems much nicer and easy to understand. Though there may be downsides I'm not seeing right now.

One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b somewhere in there). Would these show up as special variables in our structures?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).

In my discussions with Alberto, he's indicated that we should separate the logical and physical properties. The physical properties only arise when we get to physical nodes such as index scans. Though, I'm not sure if there is a downside to including the physical properties on all nodes. Probably worth experimenting with this.

Absolutely. This seems much nicer and easy to understand. Though there may be downsides I'm not seeing right now.

Great. I'm planning to extend this toy a bit more to see if we can capture some of the downsides. Do you have particular transformations that seem complex in the current code? Or a transformation that you'd like to see an example of? One item I'm working on right now is simple decorrelation as that has been a dream for a long time.

One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b somewhere in there). Would these show up as special variables in our structures?

Yes, that's exactly what happens. See opttoy.go:579. The way this works is that variableOp nodes are defined as passing through their input vars to their output vars, but other scalar expressions swallow the output vars. So an expression such as a+1 in a render expression (i.e. a projection) will have outputVars == 0 and we'll know that we need to number that expression.

I'm not sure if my handling of variableOp is abstracted enough. Currently such expressions only refer to columns, but I think they need to refer to input variables.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the logical vs physical properties distinction, can you point me to some resource? Are the logical properties those that are true regardless of how we run that node (e.g. equivalency between equality columns in a join) while physical properties depend on the specific way a node is run (e.g. ordering)?

Yes, decorrelation would be a big one. I don't think any of the transformations we are currently doing would be difficult to handle in this model.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's pretty much the distinction between logical and physical properties as I understand it, though I think I'd restate that as physical properties depend on the specific node.

One reason for keeping them separate is the Memo structure. The logical properties are the same for all nodes in the same equivalency group while the physical properties are different per node. This can be represented in code by pulling the logical properties out into an equivalencyGroup structure and have each memoExpr node hold a pointer to the equivalencyGroup it is part of. I believe this may also facilitate finding the equivalency group for a node, but I'm still hand-wavy on how that will work.

There is an overlap between logical and physical properties here that is confusing me right now. Some operators such as order by impose physical properties on their output which implies a different equivalency group.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that lat part confuses me as well.

Are there any other things besides ordering that are physical properties? I can't think of anything.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is unique a physical property? For a scan operation, seems like we know that based on whether there is a unique index on the set of columns, not whether we actually use that index.

@albler can you weigh in?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we may learn about it from a physical aspect of the plan, it's still a logical property in that it's true regardless of how we run it.

Note that "keys" fall under the umbrella of functional dependencies.

// -- @1 -> y
//
// This is akin to the way parser.IndexedVar works except that we're taking
// care to make the indexes globally unique. Because each of the relational
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to make the indexes unique across the entire statement"

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// intersection.
//
// For scalar expressions the input variables bitmap allows an easy
// determination of whether the expression is constant (the bitmap is empty)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New paragraph:
"""
The bitmap determines which scalars are used at each level of a logical plan but it does not determine the order in which the values are presented in memory, e.g. as a result row. For this each expression must also carry, next to the bitmap, an (optional) reordering array which maps the positions in the result row to the indexes in the bitmap. For example, the two queries SELECT k,v FROM kv and SELECT v,k FROM kv have the same bitmap, but the first reorders @1 -> idx 0, @2 -> idx 1 whereas the second reorders @1 -> idx 1, @2 -> idx 0. A subsequent RFC is to determine whether a single array is sufficient for this (indexed by the output column position) or whether both backward and forward associations must be maintained.
"""

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with this we do not need a separate rename stage.

Copy link
Owner Author

@petermattis petermattis Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. This is effectively a rename operator, though baked into every node as selection and projection are. I think you might be correct that this is necessary. I've added your paragraph as a TODO, sketching out this idea.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the initial construction of the expression this rename will be indeed present at every stage, because that's what SQL allows.

But perhaps we do not need it embedded in every stage, and instead have the initial construction "weave" it along during the recursion and resolve it at every stage, keeping the rename only for the end.

Although this is possible, I predict that the construction of the final physical plans will need to compute a rename at every stage where the entries in the bitmap are not contiguous and starting with position 0; so as to compact the data structure that represents the values in memory. But I would agree it's too early to say.

//
// SELECT @0 FROM a WHERE @1 > 0
// -- @0 -> x
// -- @1 -> y
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the code below I understand there should be a 3rd expression @3 -> @1 > 0

i.e. all predicates also get indexes in the bitmap, for otherwise we can't group them together.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently in this code, predicates do not get indexes in the bitmap. A scalar expression only gets an index if it is used as a projection. I'm not following your concern about not being able to group them together. You've probably stumbled on to a case where the current code is inadequate.

The single expression tree contains both relational operators (scan,
join, etc) and scalar operators (and, or, plus, minus, variables, etc).
@petermattis
Copy link
Owner Author

I'm going to merge this PR. Feel free to leave additional comments which I can address in follow-on PRs.

@petermattis petermattis merged commit e90193a into master Oct 11, 2017
@petermattis petermattis deleted the pmattis/opttoy2 branch October 11, 2017 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants