sql, distsql: planning for interleave joins between ancestor and descendant #19853

richardwu · 2017-11-06T22:27:12Z

For two tables parent and child

CREATE TABLE parent (pid1 INT, pid2 INT, v INT, PRIMARY KEY (pid1, pid2));

CREATE TABLE child (pid1 INT, pid2 INT, cid1 INT, v INT, PRIMARY KEY (pid1, pid2, cid1)) INTERLEAVE IN PARENT parent (pid1, pid2);

A query on the full interleave prefix

SELECT * FROM parent JOIN child USING (pid1, pid2)

uses the InterleaveReaderJoiner.

A query on just a subset of the interleave prefix will simply default back to a MergeJoiner.

SELECT * FROM parent JOIN child USING (pid1)

Fixes #18948

cockroach-teamcity · 2017-11-06T22:27:21Z

This change is

RaduBerinde · 2017-11-07T20:39:35Z

Great stuff overall!

Review status: 0 of 27 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

pkg/sql/distsql_join.go, line 28 at r2 (raw file):

)

func (dsp *DistSQLPlanner) createPlanForJoin(

You should move the existing code in a separate commit so (hopefully) we don't lose history information (plus this makes it hard to review what actually changed). Or just keep it in the same file for now.

pkg/sql/distsql_join.go, line 475 at r2 (raw file):

}

func fixInterleavePartitions(n *joinNode, partitions []spanPartition) ([]spanPartition, error) {

this definitely needs tests

pkg/sql/distsql_join.go, line 481 at r2 (raw file):

	}

	// Given a span with the start and end keys:

This makes my head spin so I have to give it some more careful thought. But I'm confused why it's enough to just modify the EndKeys. Don't we need to make sure we don't scan the same rows in another partition? Seems like in this example, the next span that starts at /parent/1/42/#/child/4 would now need to start at /parent/1/43

pkg/sql/distsql_join.go, line 489 at r2 (raw file):

	// EndKey:   /parent/1/43
	// To do this, we must truncate up to the prefix /parent/1/42 to invoke
	// encoding.PrefixEnd().

where are we invoking PrefixEnd?

pkg/sql/distsql_join.go, line 494 at r2 (raw file):

	// and note that the number of times we need to invoke encoding.PeekLength()
	// (for each value in the key) to figure out the total length is
	//   3 * count(interleave ancestors) + sum(shared_prefix_len) - 1

I don't get the 3. There can be multiple columns in each ancestor "section". The key is actually `<1st-tableid>/<1st-indexid>/<1st-index-column-1>/<1st-index-column-2>.../#/<2nd-tableid>/<2nd-indexid>/<2nd-index-column-1>.../#/...

pkg/sql/distsql_join.go, line 526 at r2 (raw file):

					break
				}
				len, err := encoding.PeekLength(endKey)

[nit] don't overload len

pkg/sql/join.go, line 861 at r2 (raw file):

}

func (n *joinNode) computeJoinHint() joinHint {

Why is it helpful to compute this separately? We could check this stuff directly during distsql planning (we are already re-checking them..) I would change createPlanForInterleaveJoin to tryCreatePlanForInterleaveJoin and return an ok flag.

pkg/sql/distsqlrun/flow_diagram.go, line 160 at r2 (raw file):

func (irj *InterleaveReaderJoinerSpec) summary() (string, []string) {
	details := make([]string, 1, 2)
	details[0] = fmt.Sprintf(

We need to include more details on the post-processing of each table.

Comments from Reviewable

richardwu · 2017-11-08T04:55:30Z

Added some unit tests to some helper methods. Will be adding some query-level logic tests in the next patch.

Review status: 0 of 31 files reviewed at latest revision, 8 unresolved discussions.

pkg/sql/distsql_join.go, line 28 at r2 (raw file):

Previously, RaduBerinde wrote…

You should move the existing code in a separate commit so (hopefully) we don't lose history information (plus this makes it hard to review what actually changed). Or just keep it in the same file for now.

I moved this back to distsql_physical_planner.go.

pkg/sql/distsql_join.go, line 475 at r2 (raw file):

Previously, RaduBerinde wrote…

this definitely needs tests

Added some tests, let me know what you think.

pkg/sql/distsql_join.go, line 481 at r2 (raw file):

Previously, RaduBerinde wrote…

This makes my head spin so I have to give it some more careful thought. But I'm confused why it's enough to just modify the EndKeys. Don't we need to make sure we don't scan the same rows in another partition? Seems like in this example, the next span that starts at /parent/1/42/#/child/4 would now need to start at /parent/1/43

You're right, I've updated the comment with a better example and added logic that "cascades" the "fixed" end key.

pkg/sql/distsql_join.go, line 489 at r2 (raw file):

Previously, RaduBerinde wrote…

where are we invoking PrefixEnd?

Good catch, guess I missed this the first iteration. Added some tests and made this part more correct.

pkg/sql/distsql_join.go, line 494 at r2 (raw file):

Previously, RaduBerinde wrote…

I don't get the 3. There can be multiple columns in each ancestor "section". The key is actually `<1st-tableid>/<1st-indexid>/<1st-index-column-1>/<1st-index-column-2>.../#/<2nd-tableid>/<2nd-indexid>/<2nd-index-column-1>.../#/...

The 3 is for the tableid, indexid and interleave sentinel #.

The actual # of columns in each ancestor section is given by each SharedPrefixLen in the InterleaveDescriptor (hence the sum over all of them).

The -1 is for the last interleave sentinel (of the ancestor) which we do not want to keep.

I updated the comment to be more clear.

pkg/sql/distsql_join.go, line 526 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] don't overload len

Good catch!

pkg/sql/join.go, line 861 at r2 (raw file):

Previously, RaduBerinde wrote…

Why is it helpful to compute this separately? We could check this stuff directly during distsql planning (we are already re-checking them..) I would change createPlanForInterleaveJoin to tryCreatePlanForInterleaveJoin and return an ok flag.

The motive for computing this ahead of time was possibly re-using the hint in the non-distributed execution engine. Looking back, this might be a bit premature and it might just be fine to keep all this logic in DistSQL for now.

pkg/sql/distsqlrun/flow_diagram.go, line 160 at r2 (raw file):

Previously, RaduBerinde wrote…

We need to include more details on the post-processing of each table.

Good point, updated this. Here's a preview of what it looks like for a simple

SELECT * FROM parent JOIN child USING (pid1, pid2)

https://cockroachdb.github.io/distsqlplan/decode.html?eJyMkM9KNDEQxO_fU3zUSdk-bEZRCAi5jgdX9ipziJN2NpBNhqRHlGXeXWZyWP8geKzqX3dRfUJMjh_skQv0ExQ6wphTz6WkvFgVaN0b9Jbg4zjJYneEPmWGPkG8BIZGG4VzYPvKe7aO833ykTMIjsX6sN4X-xxYYcnwR5vfzWgzRwFhN4n-bxSZhswVqJLNJ7I_-OC-g2SuQQj8IhdGbcg0m8u77IfDWf7cuCFzi24mpEnObYrYgaHVTH9vvOcyplj4S8ffLm_njsBu4PrVkqbc82NO_RpT5W7dWw3HRepUVdHGOpq7-d9HAAAA___5_Yri

Comments from Reviewable

RaduBerinde · 2017-11-08T15:48:48Z

Review status: 0 of 31 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.

pkg/roachpb/combine_spans.go, line 105 at r4 (raw file):

}

// IntersectSpans returns the logical intersection of all spans represented as

Not sure how this semantic is useful. Intersection of two sets is not the same with intersecting all elements in the sets. We want to intersect two lists of spans (one for each table), I don't see why we would intersect spans from the same table E.g. if the join has a WHERE k IN [1,10,20], we would have spans '1,10,20` and the intersection would be empty.

pkg/sql/distsql_join.go, line 494 at r2 (raw file):

Previously, richardwu (Richard Wu) wrote…

The 3 is for the tableid, indexid and interleave sentinel #.

The actual # of columns in each ancestor section is given by each SharedPrefixLen in the InterleaveDescriptor (hence the sum over all of them).

The -1 is for the last interleave sentinel (of the ancestor) which we do not want to keep.

I updated the comment to be more clear.

Makes sense, thanks!

There is an asymmetry in that we only fix up EndKeys and it's not clear why that's correct. We should add a comment explaining why we don't need to fix the start keys (I believe that the reason is that the child rows always come before the corresponding parent rows?)

pkg/sql/distsql_join.go, line 372 at r4 (raw file):

			// a prefix up to the ancestor and then some.
			if !earlyTermination && len(endKey) > 0 {
				span.EndKey = span.EndKey[:prefixLen].PrefixEnd()

It's easy to reason about this function if the space covered by the spans does not change (i.e. we just move pieces between partitions). But it feels wrong that we may be extending spans.

What if we have: parent table key K1, child table key K2 and the join has WHERE K1=1 AND K2=2. We would have the span /1/#/2 - /1/#-3. This function would fix this up to /1/#/2 - /2 right? But that's no longer correct (without extra filtering) - we would return things like 1,3, 1,4, etc.

I still believe that this entire thing would be cleaner and less error-prone if we populate the interleave reader with two sets of spans, for each table. Then the reader goes through both at the same time (as if it's merging them) but it knows which rows to ignore from each side. fixInterleavePartitions would "align" the split points between the two sides but it would not change the spans that are covered.

pkg/sql/distsqlrun/flow_diagram.go, line 160 at r2 (raw file):

Previously, richardwu (Richard Wu) wrote…

Good point, updated this. Here's a preview of what it looks like for a simple
SELECT * FROM parent JOIN child USING (pid1, pid2)
https://cockroachdb.github.io/distsqlplan/decode.html?eJyMkM9KNDEQxO_fU3zUSdk-bEZRCAi5jgdX9ipziJN2NpBNhqRHlGXeXWZyWP8geKzqX3dRfUJMjh_skQv0ExQ6wphTz6WkvFgVaN0b9Jbg4zjJYneEPmWGPkG8BIZGG4VzYPvKe7aO833ykTMIjsX6sN4X-xxYYcnwR5vfzWgzRwFhN4n-bxSZhswVqJLNJ7I_-OC-g2SuQQj8IhdGbcg0m8u77IfDWf7cuCFzi24mpEnObYrYgaHVTH9vvOcyplj4S8ffLm_njsBu4PrVkqbc82NO_RpT5W7dWw3HRepUVdHGOpq7-d9HAAAA___5_Yri

We should prepend a "Left/Right" prefix so things are more readable.

pkg/sql/distsqlrun/flow_diagram.go, line 29 at r4 (raw file):

cockroach_sql_sqlbase1

Why this alias?

Comments from Reviewable

richardwu · 2017-11-08T17:29:35Z

pkg/sql/distsql_join.go, line 372 at r4 (raw file):

Previously, RaduBerinde wrote…

It's easy to reason about this function if the space covered by the spans does not change (i.e. we just move pieces between partitions). But it feels wrong that we may be extending spans.

What if we have: parent table key K1, child table key K2 and the join has WHERE K1=1 AND K2=2. We would have the span /1/#/2 - /1/#-3. This function would fix this up to /1/#/2 - /2 right? But that's no longer correct (without extra filtering) - we would return things like 1,3, 1,4, etc.

I still believe that this entire thing would be cleaner and less error-prone if we populate the interleave reader with two sets of spans, for each table. Then the reader goes through both at the same time (as if it's merging them) but it knows which rows to ignore from each side. fixInterleavePartitions would "align" the split points between the two sides but it would not change the spans that are covered.

I see your point, good catch.

I'm beginning to lean towards the idea of pushing down 2 sets of spans for each table to the processor. It'll give us the flexibility to not have to push down origFilter for outer joins (since we can check which rows to ignore from either table using their spans).

Now that I think about it, an intersection of spans from the parent/child tables is not quite correct for the spans generated from WHERE K1 = 1 AND K2 = 2

parent: /1 - /2
child: /1/#/2 - /1/#/3

this would incorrectly yield the intersection /1/#/2 - /1/#/3, missing the K1 = 1 parent row.

As for aligning split points: I am unsure if there's a cleaner way to do this in order to allow each InterleaveReaderJoiner to be fully self-contained on its node.

Even with the two set of spans, one per table, we'll need to align the split points of each table in a similar fashion (technically, we only need to align the split points of the child's set of spans since the parent).

Based on the following assumptions:

Each partition contains disjoint spans (otherwise, partitionSpans would have merged the adjacent spans)

The revised algorithm for adjusting the end key would be the following (for the child table):

Unwrap all spans from every partition and sort them. From assumption tidy up some correctness issues reported by go vet #1, no two consecutive spans will be part of the same partition.
For each span, check the EndKey (initialKey)
- If is contains the "child" part of the key (e.g. /1/#/3)
  - Compute the "fixed" EndKey - call this fixedKey (/1/#/2)
  - (2a) If there are no subsequent spans, we cannot adjust this fixed key (since we do not want to extend our domain).
  - There are subsequent spans, we check their EndKey. Either a subsequent span contains fixedKey or the last span in the set has an EndKey < fixedKey (lastKey).
    - (2b) If the former, we can set initialKey = fixedKey and adjust the start Key of the span that contains fixedKey to fixedKey. We remove any spans in between since they are now contained in our initial span.
    - (2c) If the latter, we adjust our initialKey = lastKey. All subsequent spans are contained in our initial span so we remove them.
Re-partition the spans based on their node.
If there is a node without children rows, we can remove the node's partition.

Here's an example where we have a filter WHERE K1 >= 1 AND K1 < 4 AND K2 = 2. This should give us the spans

parent: /1 - /4
child: /1/#/2 - /3/#/3

Suppose the range split at /2/#/42. This would give us the partitions

                  node 1                  node 2
parent:       /1 - /2/#/42        |     /2/#/42 - /4
child:        /1/#/2 - /2/#/42    |     /2/#/42 - /3/#/3

Since parent row K1 = 2 was read on node 1, we want all K1 = 2 child rows to also be read on node 1. Thus when we begin fixing the child partition, we notice that the EndKey /2/#/42 needs to be fixed to /3, which is permitted since /3 is contained in the subsequent span /2/#/42 - /3/#/3 (case 2b). We thus end up with

                  node 1               node 2
parent:       /1 - /2/#/42      |     /2/#/42 - /4
child:        /1/#/2 - /3       |     /3 - /3/#/3

In some cases, fixing the span may "consume" the other partition i.e. in the case WHERE K1 = 1 AND K2 >= 2 AND K2 < 4 and a split at /1/#/3

                  node 1               node 2
parent:       /1 - /1/#/3      |     /1/#/3- /2
child:        /1/#/2- /1/#/3   |     /1/#/3 - /1/#/4

When we try to fix the EndKey of the first child span (/1/#/3) to /2, we note that the subsequent span's EndKey is <= /2 (case 2c). We thus set EndKey = /1/#/4 which leaves

                  node 1               node 2
parent:       /1 - /1/#/3      |     /1/#/3- /2
child:        /1/#/2- /1/#/4   |

and since there are no children rows being scanned on node 2, we can completely remove that partition (step 4).

Comments from Reviewable

RaduBerinde · 2017-11-08T17:53:16Z

pkg/sql/distsql_join.go, line 372 at r4 (raw file):

Previously, richardwu (Richard Wu) wrote…

I see your point, good catch.

I'm beginning to lean towards the idea of pushing down 2 sets of spans for each table to the processor. It'll give us the flexibility to not have to push down origFilter for outer joins (since we can check which rows to ignore from either table using their spans).

Now that I think about it, an intersection of spans from the parent/child tables is not quite correct for the spans generated from WHERE K1 = 1 AND K2 = 2
parent: /1 - /2
child: /1/#/2 - /1/#/3
this would incorrectly yield the intersection /1/#/2 - /1/#/3, missing the K1 = 1 parent row.

As for aligning split points: I am unsure if there's a cleaner way to do this in order to allow each InterleaveReaderJoiner to be fully self-contained on its node.

Even with the two set of spans, one per table, we'll need to align the split points of each table in a similar fashion (technically, we only need to align the split points of the child's set of spans since the parent).

Based on the following assumptions:

Each partition contains disjoint spans (otherwise, partitionSpans would have merged the adjacent spans)

The revised algorithm for adjusting the end key would be the following (for the child table):

Unwrap all spans from every partition and sort them. From assumption tidy up some correctness issues reported by go vet #1, no two consecutive spans will be part of the same partition.

For each span, check the EndKey (initialKey)

If is contains the "child" part of the key (e.g. /1/#/3)

Compute the "fixed" EndKey - call this fixedKey (/1/#/2)

(2a) If there are no subsequent spans, we cannot adjust this fixed key (since we do not want to extend our domain).

There are subsequent spans, we check their EndKey. Either a subsequent span contains fixedKey or the last span in the set has an EndKey < fixedKey (lastKey).

(2b) If the former, we can set initialKey = fixedKey and adjust the start Key of the span that contains fixedKey to fixedKey. We remove any spans in between since they are now contained in our initial span.

(2c) If the latter, we adjust our initialKey = lastKey. All subsequent spans are contained in our initial span so we remove them.

Re-partition the spans based on their node.

If there is a node without children rows, we can remove the node's partition.

Here's an example where we have a filter WHERE K1 >= 1 AND K1 < 4 AND K2 = 2. This should give us the spans
parent: /1 - /4
child: /1/#/2 - /3/#/3
Suppose the range split at /2/#/42. This would give us the partitions
                  node 1                  node 2
parent:       /1 - /2/#/42        |     /2/#/42 - /4
child:        /1/#/2 - /2/#/42    |     /2/#/42 - /3/#/3
Since parent row K1 = 2 was read on node 1, we want all K1 = 2 child rows to also be read on node 1. Thus when we begin fixing the child partition, we notice that the EndKey /2/#/42 needs to be fixed to /3, which is permitted since /3 is contained in the subsequent span /2/#/42 - /3/#/3 (case 2b). We thus end up with
                  node 1               node 2
parent:       /1 - /2/#/42      |     /2/#/42 - /4
child:        /1/#/2 - /3       |     /3 - /3/#/3
In some cases, fixing the span may "consume" the other partition i.e. in the case WHERE K1 = 1 AND K2 >= 2 AND K2 < 4 and a split at /1/#/3
                  node 1               node 2
parent:       /1 - /1/#/3      |     /1/#/3- /2
child:        /1/#/2- /1/#/3   |     /1/#/3 - /1/#/4
When we try to fix the EndKey of the first child span (/1/#/3) to /2, we note that the subsequent span's EndKey is <= /2 (case 2c). We thus set EndKey = /1/#/4 which leaves
                  node 1               node 2
parent:       /1 - /1/#/3      |     /1/#/3- /2
child:        /1/#/2- /1/#/4   |    
and since there are no children rows being scanned on node 2, we can completely remove that partition (step 4).

I wonder if it wouldn't be easier to add this logic to partitionSpans. The idea would be that in the Seek loop we would pretend that the end of the range is higher as necessary (perhaps this logic can be provided through a callback so the interleave-specific stuff doesn't live here). It would require a bit of hacking to make the loop still work correctly, but seems easier than what you're proposing (it reuses the existing logic of partitioning spans etc). We would only do this for the child table.

Comments from Reviewable

richardwu · 2017-11-09T21:17:30Z

Review status: 0 of 40 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.

pkg/roachpb/merge_spans.go, line 105 at r4 (raw file):

Previously, RaduBerinde wrote…

Not sure how this semantic is useful. Intersection of two sets is not the same with intersecting all elements in the sets. We want to intersect two lists of spans (one for each table), I don't see why we would intersect spans from the same table E.g. if the join has a WHERE k IN [1,10,20], we would have spans '1,10,20` and the intersection would be empty.

Removed since this is no longer necessary (nor correct).

pkg/sql/distsql_join.go, line 494 at r2 (raw file):

Previously, RaduBerinde wrote…

Makes sense, thanks!

There is an asymmetry in that we only fix up EndKeys and it's not clear why that's correct. We should add a comment explaining why we don't need to fix the start keys (I believe that the reason is that the child rows always come before the corresponding parent rows?)

That's correct: child rows always come after parent rows, so even if we read child rows first it will not be joined to anything and outputted.

Beyond inner joins, correctness is a factor so I've taken your suggestion to move the "fixing" (pushing) EndKey logic to partitionSpans. Once an EndKey is pushed, it will define the start Key of the next span so there is no need to fix the start key. I've added a comment addressing this above pushToNextAncestorFn.

pkg/sql/distsql_join.go, line 372 at r4 (raw file):

Previously, RaduBerinde wrote…

I wonder if it wouldn't be easier to add this logic to partitionSpans. The idea would be that in the Seek loop we would pretend that the end of the range is higher as necessary (perhaps this logic can be provided through a callback so the interleave-specific stuff doesn't live here). It would require a bit of hacking to make the loop still work correctly, but seems easier than what you're proposing (it reuses the existing logic of partitioning spans etc). We would only do this for the child table.

As discussed offline, end keys can now be pushed on a per-input span basis in partitionSpans. This of course doesn't address the case where we have multiple spans under the same parent ID PK1 such as in the spans generated by PK1 = 1 AND PK2 IN (5, 7, 9)

/1/#/5 - /1/#/6
/1/#/7 - /1/#/8
/1/#/9 - /1/#/10

pkg/sql/distsqlrun/flow_diagram.go, line 160 at r2 (raw file):

Previously, RaduBerinde wrote…

We should prepend a "Left/Right" prefix so things are more readable.

I made it so that this only outputs info for the first two tables (left and right). Once we have more than 2 tables being joined, we can update this.

pkg/sql/distsqlrun/flow_diagram.go, line 29 at r4 (raw file):

Previously, RaduBerinde wrote…

cockroach_sql_sqlbase1

Why this alias?

Oops, fixed.

Comments from Reviewable

richardwu · 2017-11-12T07:42:17Z

So as discussed offline, I've introduced the concept of "join intervals" for ancestor spans in order to correspond each descendant span to its corresponding ancestor span.

Take a look at alignPartitionSpans and the associated unit tests in distsql_join_test.go.

Review status: 0 of 40 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.

Comments from Reviewable

richardwu · 2017-11-16T00:59:27Z

I've added a decent number of logic tests that should cover all edge cases I can think of.

Surprisingly only 1 bug surfaced which is both relieving but borderline unusual.

Also updated docstrings of the new functions/concepts as per @RaduBerinde's input.

One slight deviation from the RFC: parent-grandchild joins (or grandgrandchild, etc.) are supported now.

Also ran a few benchmarks which show that this performs much better than interleave joins before and comparable to regular joins in a shallow interleave hierarchy.

Ready for a full-on review!

RaduBerinde · 2017-11-28T21:22:19Z

Great work! Would be great to get another pair of eyes on this though.

Reviewed 1 of 17 files at r1, 3 of 21 files at r3, 17 of 34 files at r5, 18 of 18 files at r6.
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

pkg/sql/distsql_join.go, line 101 at r6 (raw file):

		joinType = distsqlrun.JoinType_INNER
	default:
		panic("can only plan inner joins with interleaved joins")

This should be an error (I know it shouldn't happen, but if we have a bug and it does, an error is preferable)

pkg/sql/distsql_join.go, line 531 at r6 (raw file):

// joined with parent rows in the span.
//
// To illustrate, we'll use some examples of parent spans (/table/index omitted

I think this section should be moved to joinSpans (it has comments that refer to these examples).

pkg/sql/distsql_join.go, line 541 at r6 (raw file):

//      rows corresponding to 1, 2 (note that /3/#/1 comes after all the parent
//      rows with 3 but before all corresponding child rows).
//	The join span is: /1 - /4.

[nit] alignment

pkg/sql/distsql_join.go, line 562 at r6 (raw file):

// If there is overlap with some parent join span, there exist "some" child
// keys in the span that need to be mapped to the parent span. The sections of
// the child span that do not overlap do not belong need to be split off and

extra "do not belong"

pkg/sql/distsql_join.go, line 580 at r6 (raw file):

	// mapAndSplit takes a childSpan and finds the parentJoinSpan that has
	// the parent row(s) the child row(s) are suppose to join to.

with which the child row(s) are supposed to join

pkg/sql/distsql_join.go, line 584 at r6 (raw file):

	// parentJoinSpan.
	// It splits off the non-overlapping parts and appends them to
	// the passed in nonOverlaps slice for recursive application.

repeated application (it's not really recursive)

pkg/sql/distsql_join.go, line 586 at r6 (raw file):

	// the passed in nonOverlaps slice for recursive application.
	mapAndSplit := func(curNodeID roachpb.NodeID, childSpan roachpb.Span, nonOverlaps roachpb.Spans) roachpb.Spans {
		for _, parentPart := range joinSpans {

Please add a TODO here to investigate making this more efficient (we keep looking for overlaps from the beginning every time). Perhaps pre-sorting the join spans would help.

pkg/sql/distsql_join.go, line 596 at r6 (raw file):

					// Check non-overlapping region
					// before start key.
					//	|----parentSpan----...

parentJoinSpan (same below)

pkg/sql/distsql_join.go, line 597 at r6 (raw file):

					// before start key.
					//	|----parentSpan----...
					//  |----childSpan----...

these start in the same place, this probably needs to be indented a bit

pkg/sql/distsql_join.go, line 644 at r6 (raw file):

				// Copy out the first span in spansLeft to
				// mapAndSplit.
				spanToMap := spansLeft[0]

[nit] can't we use spansLeft[len(spansLeft)-1] instead of swapping?

pkg/sql/distsql_join_test.go, line 295 at r6 (raw file):

		// right).
		for i := 0; i < 2; i++ {
			// Run every permutation of the equality columns (just

Nice!

pkg/sql/distsql_physical_planner.go, line 2045 at r6 (raw file):

		p.Processors = append(p.Processors, proc)
	}

Very nice cleanups in this file!

pkg/sql/distsqlrun/flow_diagram.go, line 441 at r6 (raw file):

	}

	// We want to sort these edges to ensure determinism in our DistSQL outputs.

It's nice to sort them, but how was this non-deterministic before? It seems that for the same spec, we would always be adding them in the same order.

Comments from Reviewable

richardwu · 2017-11-29T22:41:51Z

Review status: 3 of 18 files reviewed at latest revision, 13 unresolved discussions.

pkg/sql/distsqlrun/flow_diagram.go, line 441 at r6 (raw file):

Previously, RaduBerinde wrote…

It's nice to sort them, but how was this non-deterministic before? It seems that for the same spec, we would always be adding them in the same order.

After much digging around I realized I was using a map to keep track of unique node IDs after aligning the partitioned spans when creating an interleaved join. Removed this sorting since distsql plans should still be deterministic.

pkg/sql/distsql_join.go, line 101 at r6 (raw file):

Previously, RaduBerinde wrote…

This should be an error (I know it shouldn't happen, but if we have a bug and it does, an error is preferable)

Done.

pkg/sql/distsql_join.go, line 531 at r6 (raw file):

Previously, RaduBerinde wrote…

I think this section should be moved to joinSpans (it has comments that refer to these examples).

Done.

pkg/sql/distsql_join.go, line 541 at r6 (raw file):

Previously, RaduBerinde wrote…

[nit] alignment

Done.

pkg/sql/distsql_join.go, line 562 at r6 (raw file):

Previously, RaduBerinde wrote…

extra "do not belong"

Good catch, thanks!

pkg/sql/distsql_join.go, line 580 at r6 (raw file):

Previously, RaduBerinde wrote…

with which the child row(s) are supposed to join

Done.

pkg/sql/distsql_join.go, line 584 at r6 (raw file):

Previously, RaduBerinde wrote…

repeated application (it's not really recursive)

Good point.

pkg/sql/distsql_join.go, line 586 at r6 (raw file):

Previously, RaduBerinde wrote…

Please add a TODO here to investigate making this more efficient (we keep looking for overlaps from the beginning every time). Perhaps pre-sorting the join spans would help.

Done.

pkg/sql/distsql_join.go, line 596 at r6 (raw file):

Previously, RaduBerinde wrote…

parentJoinSpan (same below)

Done.

pkg/sql/distsql_join.go, line 597 at r6 (raw file):

Previously, RaduBerinde wrote…

these start in the same place, this probably needs to be indented a bit

This should be fixed now 🤞

pkg/sql/distsql_join.go, line 644 at r6 (raw file):

Previously, RaduBerinde wrote…

[nit] can't we use spansLeft[len(spansLeft)-1] instead of swapping?

Good point, I guess it doesn't really matter which end we take from.

Comments from Reviewable

rjnn · 2017-11-30T21:18:06Z

Reviewed 2 of 21 files at r3, 3 of 34 files at r5, 2 of 18 files at r6, 2 of 22 files at r7, 13 of 14 files at r8.
Review status: 17 of 18 files reviewed at latest revision, 28 unresolved discussions, all commit checks successful.

pkg/roachpb/data.go, line 1237 at r7 (raw file):

}

// Overlaps returns whether the two spans overlap.

Can remove this comment, it is pretty obvious.

pkg/roachpb/data.go, line 1239 at r7 (raw file):

// Overlaps returns whether the two spans overlap.
func (s Span) Overlaps(o Span) bool {
	// Spans must be valid.

This one as well. Pretty clear what's going on.

pkg/roachpb/data.go, line 1256 at r7 (raw file):

// Contains returns whether the receiver contains the given span.
func (s Span) Contains(o Span) bool {
	// Spans must be valid.

Remove.

pkg/roachpb/data.go, line 1295 at r7 (raw file):

}

// SplitOnKey returns two spans where the left span has EndKey

I don't mean to pick on your code comments, so I will highlight that this comment is outstanding, and the kind of comment that should be there - something that adds value beyond the surface meaning of the function name.

pkg/sql/distsql_physical_planner.go, line 2045 at r6 (raw file):

Previously, RaduBerinde wrote…

Very nice cleanups in this file!

Seconded!

pkg/sql/distsql_plan_join.go, line 35 at r8 (raw file):

	"sql.distsql.interleaved_joins.enabled",
	"if set we plan interleaved table joins instead of merge joins when possible",
	true,

I'm not sure about the default true setting. @RaduBerinde, @petermattis ?

pkg/sql/distsql_plan_join.go, line 69 at r8 (raw file):

		// We don't really need to initialize a full-on plan to
		// retrieve the metadata for each table reader, but this turns
		// out to be very useful to compute ordering and remapping the

super nitty nit (sorry): s/remapping/to remap

pkg/sql/distsql_plan_join.go, line 75 at r8 (raw file):

			return physicalPlan{}, false, err
		}
		// Doesn't matter which processor we choose since the metadata

since the metadata...?

pkg/sql/distsql_plan_join.go, line 80 at r8 (raw file):

		ordering := distsqlOrdering(n.mergeJoinOrdering, eqCols)

		// for TableReader is independent of node/processor instance.

hmm, please unsplit the comment, or explicitly demarcate it with "first comment..." "...second comment".

pkg/sql/distsql_plan_join.go, line 90 at r8 (raw file):

		}

		if tr.LimitHint >= math.MaxInt64-totalLimitHint {

Why? Unclear to me why this limit is being finicked like this.

pkg/sql/distsql_plan_join.go, line 432 at r8 (raw file):

	ancestor, descendant := n.interleavedNodes()
	if ancestor == nil || descendant == nil {
		panic("cannot compute maximalJoinPrefix since there is no interleaved relation")

you should return an error, not panic.
if you're going to enforce this criterion, i would prefer that the arguments to maximalJoinPrefix be the ancestor and descendent explicitly, so that the error is more obvious. the caller can do the unpacking that you are doing 2 lines above.

pkg/sql/distsql_plan_join.go, line 480 at r8 (raw file):

	nAncestors := 0
	sharedPrefixLen := 0
	for _, descAncestor := range descendant.index.Interleave.Ancestors {

@RaduBerinde is this correct? I think so, but my brain is melting.

pkg/sql/join.go, line 873 at r8 (raw file):

	for _, descAncestor := range descendant.index.Interleave.Ancestors {
		if descAncestor.TableID == ancestor.desc.ID && descAncestor.IndexID == ancestor.index.ID {
			hasInterleaveRelation = true

nit: put the return right here, you don't need hasInterleaveRelation.

pkg/sql/distsqlrun/flow_diagram.go, line 183 at r8 (raw file):

		details = append(details, indexDetails(table.IndexIdx, &table.Desc)...)
		// Post process (filters, projections, renderExprs, limits/offsets)
		details = append(details, table.Post.summary()...)

can you amend this to display an optional prefix, instead of just the generic "Out: " prefix? It would be nice if you could say "left out" and "right out" as right now the InterleavedJoin summary has three out columns, which confused me for a while.

pkg/sql/logictest/testdata/logic_test/interleaved_join, line 3 at r8 (raw file):

# LogicTest: 5node-distsql

# The following tables form the interleaved hierarchy:

This testing strategy is outstanding.

pkg/sql/distsql_join.go, line 44 at r6 (raw file):

	}

	// We know they are scan nodes from useInterleaveJoin.

I'd add a check above, and an explicit error return, to alleviate potential future panics.

Comments from Reviewable

RaduBerinde · 2017-11-30T22:22:19Z

Review status: 17 of 18 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

pkg/roachpb/data.go, line 1237 at r7 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Can remove this comment, it is pretty obvious.

linter says no! (it's exported)

pkg/sql/distsql_plan_join.go, line 35 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

I'm not sure about the default true setting. @RaduBerinde, @petermattis ?

Hm, good question.. We don't have many tests exercising joins between interleaved tables (e.g. none of the "production" workloads do it).

pkg/sql/distsql_plan_join.go, line 480 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

@RaduBerinde is this correct? I think so, but my brain is melting.

I think so too..

Comments from Reviewable

richardwu · 2017-11-30T23:03:29Z

Review status: 13 of 18 files reviewed at latest revision, 17 unresolved discussions.

pkg/roachpb/data.go, line 1237 at r7 (raw file):

Previously, RaduBerinde wrote…

linter says no! (it's exported)

I've updated it to be more precise and non-trivial.

pkg/roachpb/data.go, line 1239 at r7 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

This one as well. Pretty clear what's going on.

Good point.

pkg/roachpb/data.go, line 1295 at r7 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

I don't mean to pick on your code comments, so I will highlight that this comment is outstanding, and the kind of comment that should be there - something that adds value beyond the surface meaning of the function name.

Gracias

pkg/sql/distsql_plan_join.go, line 35 at r8 (raw file):

Previously, RaduBerinde wrote…

Hm, good question.. We don't have many tests exercising joins between interleaved tables (e.g. none of the "production" workloads do it).

So the new "interleaved joins" are always better (from benchmarking on a GCE cluster) than regular merge joins for interleaved tables.

This follows from the fact that a regular merge join for two interleaved tables always reads the entire interleaved hierarchy (but concurrently in two processors: one for the left and one for the right table). An interleaved join also reads the entire interleaved hierarchy once so at worst it does the same amount of work scanning.

The merging part always needs to co-locate rows that are to be joined, regardless of merge vs interleaved join. So even if interleaved joins end up scanning rows for either table on a different node, merge joins would have to do the same when streaming from the reader to the joiner. At best, interleaved joins avoid the pseudo-randomness of hashing the equality columns in order to co-locate rows for merging that merge joins have to do. At worst, interleaved joins will generate the same amount of RPC traffic.

We could leave this false as a "beta feature" initially, but when do we decide that it's acceptance-ready? I can't think of any downside in defaulting this to true (the test suite should cover correctness if that's a concern).

pkg/sql/distsql_plan_join.go, line 69 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

super nitty nit (sorry): s/remapping/to remap

Oops, how about "... very useful for computing ordering and remapping..."?

pkg/sql/distsql_plan_join.go, line 75 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

since the metadata...?

Oops (see below).

pkg/sql/distsql_plan_join.go, line 80 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

hmm, please unsplit the comment, or explicitly demarcate it with "first comment..." "...second comment".

I must have accidentally spliced the comment from after above. Fixed.

pkg/sql/distsql_plan_join.go, line 90 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Why? Unclear to me why this limit is being finicked like this.

Added a comment for clarity.

pkg/sql/distsql_plan_join.go, line 432 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

you should return an error, not panic.

if you're going to enforce this criterion, i would prefer that the arguments to maximalJoinPrefix be the ancestor and descendent explicitly, so that the error is more obvious. the caller can do the unpacking that you are doing 2 lines above.

Ah good point, done!

pkg/sql/join.go, line 873 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

nit: put the return right here, you don't need hasInterleaveRelation.

Much cleaner 👍

pkg/sql/distsqlrun/flow_diagram.go, line 183 at r8 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

can you amend this to display an optional prefix, instead of just the generic "Out: " prefix? It would be nice if you could say "left out" and "right out" as right now the InterleavedJoin summary has three out columns, which confused me for a while.

How's this? https://cockroachdb.github.io/distsqlplan/decode.html?eJzskkFLOzEQxe__T1Hm9JdOYbPVy4KQa0Va6VX2EDbTNrBNlklWFNnvLkks7lYtendPSd77zcw-5hWs07RWR_JQPYIAhBIQllAjdOwa8t5xlLJxpZ-hKhCM7fqQn4MJLUEFjjUxaUDQFJRpUz0p5lAPNULjmKD6cK_dwnUTbz0guD68160RfFB7gqoccNRbjHp_UXZlA3FL6om2pDTxnTOWeDrTIn73tAvpAPE_zVHxi-wUkw0xgqjONn2oZlKgjIEk79bsD5-o5mBaHaGkjiiUyxOYxziRLe3CfynmV7cckXQEhAl6jfIGvotETCIp_yI5j6S4HMmWfOespx_tXxHXl_Se8rp713NDD-ya1CZfN4lLD5p8yOoyX1Y2S3HAMSwuwsUEFudw-Su4Hv69BQAA___qqTc2

pkg/sql/distsql_join.go, line 44 at r6 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

I'd add a check above, and an explicit error return, to alleviate potential future panics.

Done.

Comments from Reviewable

RaduBerinde · 2017-12-01T13:41:43Z

Review status: 3 of 18 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.

pkg/sql/distsql_plan_join.go, line 35 at r8 (raw file):

Previously, richardwu (Richard Wu) wrote…

So the new "interleaved joins" are always better (from benchmarking on a GCE cluster) than regular merge joins for interleaved tables.

This follows from the fact that a regular merge join for two interleaved tables always reads the entire interleaved hierarchy (but concurrently in two processors: one for the left and one for the right table). An interleaved join also reads the entire interleaved hierarchy once so at worst it does the same amount of work scanning.

The merging part always needs to co-locate rows that are to be joined, regardless of merge vs interleaved join. So even if interleaved joins end up scanning rows for either table on a different node, merge joins would have to do the same when streaming from the reader to the joiner. At best, interleaved joins avoid the pseudo-randomness of hashing the equality columns in order to co-locate rows for merging that merge joins have to do. At worst, interleaved joins will generate the same amount of RPC traffic.

We could leave this false as a "beta feature" initially, but when do we decide that it's acceptance-ready? I can't think of any downside in defaulting this to true (the test suite should cover correctness if that's a concern).

I think the concern is correctness / regressions. But having it off for a period doesn't really help. I'm ok with enabling it.

pkg/sql/distsqlrun/flow_diagram.go, line 183 at r8 (raw file):

Previously, richardwu (Richard Wu) wrote…

How's this? https://cockroachdb.github.io/distsqlplan/decode.html?eJzskkFLOzEQxe__T1Hm9JdOYbPVy4KQa0Va6VX2EDbTNrBNlklWFNnvLkks7lYtendPSd77zcw-5hWs07RWR_JQPYIAhBIQllAjdOwa8t5xlLJxpZ-hKhCM7fqQn4MJLUEFjjUxaUDQFJRpUz0p5lAPNULjmKD6cK_dwnUTbz0guD68160RfFB7gqoccNRbjHp_UXZlA3FL6om2pDTxnTOWeDrTIn73tAvpAPE_zVHxi-wUkw0xgqjONn2oZlKgjIEk79bsD5-o5mBaHaGkjiiUyxOYxziRLe3CfynmV7cckXQEhAl6jfIGvotETCIp_yI5j6S4HMmWfOespx_tXxHXl_Se8rp713NDD-ya1CZfN4lLD5p8yOoyX1Y2S3HAMSwuwsUEFudw-Su4Hv69BQAA___qqTc2

O_o this is nice! We probably don't need the "Left" and "Right" with the new "-----" delimiters, it's fine either way though.

Comments from Reviewable

rjnn · 2017-12-04T20:44:48Z

Still ! Thanks for addressing all the issues!

Review status: 17 of 18 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

pkg/sql/distsql_plan_join.go, line 35 at r8 (raw file):

Previously, RaduBerinde wrote…

I think the concern is correctness / regressions. But having it off for a period doesn't really help. I'm ok with enabling it.

I'm fine with leaving it default on. You're right that having it off doesn't really help, and we have plenty of time before release. This only makes getting the interleaved benchmark shipped and running nightly more important, though.

pkg/sql/distsql_plan_join.go, line 69 at r8 (raw file):

Previously, richardwu (Richard Wu) wrote…

Oops, how about "... very useful for computing ordering and remapping..."?

Fine by me!

pkg/sql/distsqlrun/flow_diagram.go, line 183 at r8 (raw file):

Previously, RaduBerinde wrote…

O_o this is nice! We probably don't need the "Left" and "Right" with the new "-----" delimiters, it's fine either way though.

Looks good to me, as does Radu's suggestion.

Comments from Reviewable

richardwu · 2017-12-08T16:30:59Z

So as I was reducing down the sizes of the logic test tables (since they were a bit too big for testlogicrace), I realized I forgot to add test cases for swapping the positions of parent-child such that the child table is on the left and the parent table is on the right (previously it assumed the parent was always on the left, doh).

I added the corrective code into interleaved_reader_joiner.go (look at r9-r10) and test cases into logic_test/interleaved_join. I apologize for the messed up revisions, it seems I accidentally "reverted" r9 and r10 at some point.

Review status: 2 of 19 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.

Comments from Reviewable

Release notes: none

descendant Release notes: Performance improvement: equality joins on the entire interleave prefix between parent and (not necessarily direct) child interleaved tables are faster now.

rjnn · 2017-12-11T19:01:27Z

Reviewed 1 of 34 files at r5, 1 of 16 files at r9, 16 of 16 files at r10.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.

Comments from Reviewable

rjnn · 2017-12-11T19:05:06Z

🎉

The first iteration (and goal of this RFC) of full interleaved prefix joins on a parent-child table was officially completed when #19853 was merged into master. The outstanding general cases outlined in the RFC can be incrementally introduced with or without additional RFCs. Namely: 1. Multi-table joins 2. Prefix and subset joins (#20661) 3. Sibling and common ancestor joins Avoiding splits in between interleaved children rows (or rather, encouraging splits right before a root parent table) is still outstanding.

richardwu requested review from andreimatei, rjnn, RaduBerinde and a team November 6, 2017 22:27

richardwu force-pushed the interleave-join-planning-extravaganza branch 2 times, most recently from f51713a to 23588ec Compare November 8, 2017 04:51

richardwu force-pushed the interleave-join-planning-extravaganza branch from 23588ec to 62481c2 Compare November 9, 2017 20:25

richardwu force-pushed the interleave-join-planning-extravaganza branch 3 times, most recently from 85ffb54 to 5a10a3a Compare November 12, 2017 07:31

richardwu force-pushed the interleave-join-planning-extravaganza branch 5 times, most recently from 953cad7 to eba2a85 Compare November 16, 2017 00:51

richardwu changed the title ~~[WIP] sql, distsql: planning for interleave joins between ancestor and descendant~~ sql, distsql: planning for interleave joins between ancestor and descendant Nov 16, 2017

richardwu force-pushed the interleave-join-planning-extravaganza branch 4 times, most recently from 682d31a to bd2fa78 Compare November 29, 2017 22:40

richardwu force-pushed the interleave-join-planning-extravaganza branch from bd2fa78 to 162775d Compare November 29, 2017 23:28

richardwu force-pushed the interleave-join-planning-extravaganza branch from 162775d to def5409 Compare November 30, 2017 23:02

richardwu force-pushed the interleave-join-planning-extravaganza branch from def5409 to 4b5bac3 Compare November 30, 2017 23:26

richardwu force-pushed the interleave-join-planning-extravaganza branch from 4b5bac3 to 162775d Compare December 1, 2017 23:06

richardwu force-pushed the interleave-join-planning-extravaganza branch 4 times, most recently from 2b31323 to e4bfc7c Compare December 6, 2017 16:48

roachpb: valid method for spans to check if they're logically valid

3edc5f7

Release notes: none

richardwu mentioned this pull request Dec 9, 2017

benchmark: investigate interleaved table join performance #20586

Closed

sql, distsql: planning for interleaved joins between ancestor and

69165ab

descendant Release notes: Performance improvement: equality joins on the entire interleave prefix between parent and (not necessarily direct) child interleaved tables are faster now.

richardwu force-pushed the interleave-join-planning-extravaganza branch from e4bfc7c to 69165ab Compare December 10, 2017 06:14

richardwu merged commit ca5eaaf into cockroachdb:master Dec 11, 2017

richardwu deleted the interleave-join-planning-extravaganza branch December 11, 2017 19:02

richardwu mentioned this pull request Dec 12, 2017

rfcs: move status of interleaved table join RFC to complete #20662

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql, distsql: planning for interleave joins between ancestor and descendant #19853

sql, distsql: planning for interleave joins between ancestor and descendant #19853

richardwu commented Nov 6, 2017 •

edited

Loading

cockroach-teamcity commented Nov 6, 2017

RaduBerinde commented Nov 7, 2017

richardwu commented Nov 8, 2017

RaduBerinde commented Nov 8, 2017

richardwu commented Nov 8, 2017

RaduBerinde commented Nov 8, 2017

richardwu commented Nov 9, 2017

richardwu commented Nov 12, 2017

richardwu commented Nov 16, 2017

RaduBerinde commented Nov 28, 2017

richardwu commented Nov 29, 2017

rjnn commented Nov 30, 2017

RaduBerinde commented Nov 30, 2017

richardwu commented Nov 30, 2017

RaduBerinde commented Dec 1, 2017

rjnn commented Dec 4, 2017

richardwu commented Dec 8, 2017

rjnn commented Dec 11, 2017

rjnn commented Dec 11, 2017

sql, distsql: planning for interleave joins between ancestor and descendant #19853

sql, distsql: planning for interleave joins between ancestor and descendant #19853

Conversation

richardwu commented Nov 6, 2017 • edited Loading

cockroach-teamcity commented Nov 6, 2017

RaduBerinde commented Nov 7, 2017

richardwu commented Nov 8, 2017

RaduBerinde commented Nov 8, 2017

richardwu commented Nov 8, 2017

RaduBerinde commented Nov 8, 2017

richardwu commented Nov 9, 2017

richardwu commented Nov 12, 2017

richardwu commented Nov 16, 2017

RaduBerinde commented Nov 28, 2017

richardwu commented Nov 29, 2017

rjnn commented Nov 30, 2017

RaduBerinde commented Nov 30, 2017

richardwu commented Nov 30, 2017

RaduBerinde commented Dec 1, 2017

rjnn commented Dec 4, 2017

richardwu commented Dec 8, 2017

rjnn commented Dec 11, 2017

rjnn commented Dec 11, 2017

richardwu commented Nov 6, 2017 •

edited

Loading