-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opttoy2: construction of unified expression tree #10
Conversation
This mostly does what I was describing yesterday, using the Cockroach sql parser and then converting it into a unified expression tree. Not yet done is any handling of join conditions.
|
50a350d
to
13dff7f
Compare
Added basic predicate push down and propagation of predicates across join conditions. I'd be willing to bet I'm missing complexities (e.g. I'm only considering inner joins).
Cc @knz, @radu. Take a look at |
13dff7f
to
3fed4b7
Compare
PS I realize this is a sizable chunk of code and the comments are sparse. Happy to talk through the mechanisms involved. |
058d3dc
to
3512bcf
Compare
Cc @RaduBerinde (just realized you're not |
return (filter.inputVars & e.inputVars) == filter.inputVars | ||
} | ||
|
||
func buildEquivalencyMap(filters []*expr) map[bitmap]*expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This map could use some comments. We are mapping sets of variables to equivalent expressions?
It seems off that we are treating (a, b) = (c, d)
differently than a = b AND b = c
(the equivalency map is different between these two). If we have filters a = e
, b = f
, and (c, d) = (a, b)
(with one input providing a, b
and another providing c, d,e,f
), we won't be able to infer (c, d) = (e, f)
It seems that the right thing to do here is to create equivalency groups between variables (e.g. UnionFind) and then trying to find a variable in each group that is in filter.inputVars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only mapping single variables to equivalent variables. Yes, I don't think it is as general as it could be. The physicalProps
stuff you've added recently is definitely more sophisticated.
@albler tells me that functional dependencies should be part of the per-node logical properties. I'm not quite sure how to achieve this yet in the structure I have in this PR. Currently, functional dependencies are implicit in the filters. Should each expr
node contain an equivalency group map? (Perhaps lazily initialized).
Something I wanted to get your opinion on is if you find the manipulations in this PR more readable than those present in the existing Cockroach code base. Certainly, the code here is a toy and incomplete so the comparison isn't exactly apples-to-apples. My big take-away so far is that tracking input and output vars per-node makes some of the logic much clearer, but perhaps I'm enamored with my own code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).
if you find the manipulations in this PR more readable than those present in the existing Cockroach code base.
Absolutely. This seems much nicer and easy to understand. Though there may be downsides I'm not seeing right now.
One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b
somewhere in there). Would these show up as special variables in our structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely any node corresponding to a relational operator should have a set of physical properties (which includes equivalency groups).
In my discussions with Alberto, he's indicated that we should separate the logical and physical properties. The physical properties only arise when we get to physical nodes such as index scans. Though, I'm not sure if there is a downside to including the physical properties on all nodes. Probably worth experimenting with this.
Absolutely. This seems much nicer and easy to understand. Though there may be downsides I'm not seeing right now.
Great. I'm planning to extend this toy a bit more to see if we can capture some of the downsides. Do you have particular transformations that seem complex in the current code? Or a transformation that you'd like to see an example of? One item I'm working on right now is simple decorrelation as that has been a dream for a long time.
One thing I'm not quite clear how it will work is when we have rendered expressions (e.g. an a+b somewhere in there). Would these show up as special variables in our structures?
Yes, that's exactly what happens. See opttoy.go:579
. The way this works is that variableOp
nodes are defined as passing through their input vars to their output vars, but other scalar expressions swallow the output vars. So an expression such as a+1
in a render expression (i.e. a projection) will have outputVars == 0
and we'll know that we need to number that expression.
I'm not sure if my handling of variableOp
is abstracted enough. Currently such expressions only refer to columns, but I think they need to refer to input variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the logical vs physical properties distinction, can you point me to some resource? Are the logical properties those that are true regardless of how we run that node (e.g. equivalency between equality columns in a join) while physical properties depend on the specific way a node is run (e.g. ordering)?
Yes, decorrelation would be a big one. I don't think any of the transformations we are currently doing would be difficult to handle in this model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's pretty much the distinction between logical and physical properties as I understand it, though I think I'd restate that as physical properties depend on the specific node.
One reason for keeping them separate is the Memo structure. The logical properties are the same for all nodes in the same equivalency group while the physical properties are different per node. This can be represented in code by pulling the logical properties out into an equivalencyGroup
structure and have each memoExpr
node hold a pointer to the equivalencyGroup
it is part of. I believe this may also facilitate finding the equivalency group for a node, but I'm still hand-wavy on how that will work.
There is an overlap between logical and physical properties here that is confusing me right now. Some operators such as order by
impose physical properties on their output which implies a different equivalency group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that lat part confuses me as well.
Are there any other things besides ordering that are physical properties? I can't think of anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is unique
a physical property? For a scan operation, seems like we know that based on whether there is a unique index on the set of columns, not whether we actually use that index.
@albler can you weigh in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though we may learn about it from a physical aspect of the plan, it's still a logical property in that it's true regardless of how we run it.
Note that "keys" fall under the umbrella of functional dependencies.
3512bcf
to
d6bc2c1
Compare
opttoy2/opttoy.go
Outdated
// -- @1 -> y | ||
// | ||
// This is akin to the way parser.IndexedVar works except that we're taking | ||
// care to make the indexes globally unique. Because each of the relational |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to make the indexes unique across the entire statement"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
opttoy2/opttoy.go
Outdated
// intersection. | ||
// | ||
// For scalar expressions the input variables bitmap allows an easy | ||
// determination of whether the expression is constant (the bitmap is empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New paragraph:
"""
The bitmap determines which scalars are used at each level of a logical plan but it does not determine the order in which the values are presented in memory, e.g. as a result row. For this each expression must also carry, next to the bitmap, an (optional) reordering array which maps the positions in the result row to the indexes in the bitmap. For example, the two queries SELECT k,v FROM kv
and SELECT v,k FROM kv
have the same bitmap, but the first reorders @1 -> idx 0, @2 -> idx 1
whereas the second reorders @1 -> idx 1, @2 -> idx 0
. A subsequent RFC is to determine whether a single array is sufficient for this (indexed by the output column position) or whether both backward and forward associations must be maintained.
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with this we do not need a separate rename stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. This is effectively a rename operator, though baked into every node as selection and projection are. I think you might be correct that this is necessary. I've added your paragraph as a TODO, sketching out this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the initial construction of the expression this rename will be indeed present at every stage, because that's what SQL allows.
But perhaps we do not need it embedded in every stage, and instead have the initial construction "weave" it along during the recursion and resolve it at every stage, keeping the rename only for the end.
Although this is possible, I predict that the construction of the final physical plans will need to compute a rename at every stage where the entries in the bitmap are not contiguous and starting with position 0; so as to compact the data structure that represents the values in memory. But I would agree it's too early to say.
// | ||
// SELECT @0 FROM a WHERE @1 > 0 | ||
// -- @0 -> x | ||
// -- @1 -> y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the code below I understand there should be a 3rd expression @3 -> @1 > 0
i.e. all predicates also get indexes in the bitmap, for otherwise we can't group them together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently in this code, predicates do not get indexes in the bitmap. A scalar expression only gets an index if it is used as a projection. I'm not following your concern about not being able to group them together. You've probably stumbled on to a case where the current code is inadequate.
The single expression tree contains both relational operators (scan, join, etc) and scalar operators (and, or, plus, minus, variables, etc).
d6bc2c1
to
eb736b5
Compare
I'm going to merge this PR. Feel free to leave additional comments which I can address in follow-on PRs. |
The single expression tree contains both relational operators (scan,
join, etc) and scalar operators (and, or, plus, minus, variables, etc).