distsql: plan and execute distributed joins over interleaved tables #18948

rjnn · 2017-10-02T17:59:43Z

"Natural" joins over interleaved tables (i.e. joins on the same shared prefix of columns that defines the parent interleaving relationship in CREATE TABLE ... INTERLEAVE IN PARENT) can be optimized to use a single scan over the keyspace, rather than the two scans that result from planning the join naively. This should significantly improve the performance of these joins, and also provide a stronger justification for using the interleaved tables feature.

Doing this in DistSQL requires adding a new InterleavedTableReader, akin to TableReader, which performs the underlying joint scan and computes the join in a streaming fashion, emitting the join result without requiring two scans. It also seems possible to use this for joins on a strict prefix of the shared prefix in the interleaving relationship, although I have not thought through the design. Finally, we need to detect during planning when a JOIN has this property, and use the new processor in distsqlPhysicallPlanner.createPlanForJoin.
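
To make the single-scan idea concrete, here is a minimal sketch in Go of a streaming inner join over one pass of an interleaved keyspace. It is an illustration only, not the proposed InterleavedTableReader: the `row` and `joined` types and the `interleavedJoin` function are hypothetical names, and the sketch assumes the scan returns each parent row immediately followed by its child rows in key order.

```go
package main

import "fmt"

// row is a hypothetical decoded row from one joint scan over the
// interleaved keyspace (not a real CockroachDB type).
type row struct {
	isParent bool   // true for a parent-table row, false for a child-table row
	prefix   int    // value of the shared interleave prefix (e.g. the parent's primary key)
	payload  string // remaining columns, collapsed to a string for brevity
}

// joined is one output row of the inner join.
type joined struct {
	prefix int
	parent string
	child  string
}

// interleavedJoin computes an inner join on the shared prefix in a single
// pass. It assumes rows arrive in key order, so every child row follows
// the parent row it is interleaved under.
func interleavedJoin(scan []row) []joined {
	var out []joined
	var cur *row // most recently seen parent row, if any
	for i := range scan {
		r := &scan[i]
		if r.isParent {
			cur = r
			continue
		}
		// Emit only when the child matches the current parent; a child
		// whose parent row is missing is skipped (inner-join semantics).
		if cur != nil && cur.prefix == r.prefix {
			out = append(out, joined{prefix: r.prefix, parent: cur.payload, child: r.payload})
		}
	}
	return out
}

func main() {
	scan := []row{
		{isParent: true, prefix: 1, payload: "customer-1"},
		{isParent: false, prefix: 1, payload: "order-a"},
		{isParent: false, prefix: 1, payload: "order-b"},
		{isParent: true, prefix: 2, payload: "customer-2"},
		{isParent: false, prefix: 3, payload: "orphan-order"},
	}
	for _, j := range interleavedJoin(scan) {
		fmt.Println(j.prefix, j.parent, j.child)
	}
}
```

The only state kept between rows is the current parent, which is what lets the join stream its output instead of buffering either side.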

This project is complex enough that it requires an RFC before starting.

cc @jordanlewis if you have any insights into whether there is anything else we need to be aware of when scoping this feature out that would affect TPC-C performance with interleaved tables.

cc @andreimatei

jordanlewis · 2017-10-02T18:13:53Z

We might also want to explore the analogous optimization for joins across sibling interleaves in the RFC. Joins across sibling interleaves might be mildly more efficient with a single scan as well, since they're laid out next to each other in the keyspace. The effect won't be as dramatic, though, and it may well be pointless since in DistSQL I think we prefer to get several input streams in parallel - whereas in this case we'd have to wait until the first interleave was read in its entirety before seeing the first key from the second interleave.

Actually, that might be a problem for the InterleavedTableReader as well - any join processor will need to be able to handle its input keys in any order from either side. In other words you won't be able to control which side you get the next key from.

rjnn added this to the 1.2 milestone Oct 2, 2017

rjnn assigned richardwu Oct 2, 2017

richardwu mentioned this issue Oct 6, 2017

RFC: interleaved table joins #19028

Merged

richardwu mentioned this issue Nov 30, 2017

sql, distsql: planning for interleave joins between ancestor and descendant #19853

Merged

richardwu closed this as completed in #19853 Dec 11, 2017
