Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql: plan and execute distributed joins over interleaved tables #18948

Closed
rjnn opened this issue Oct 2, 2017 · 1 comment · Fixed by #19853
Closed

distsql: plan and execute distributed joins over interleaved tables #18948

rjnn opened this issue Oct 2, 2017 · 1 comment · Fixed by #19853
Assignees
Milestone

Comments

@rjnn
Copy link
Contributor

rjnn commented Oct 2, 2017

"Natural" joins over interleaved tables (i.e. joins that are joined over the same shared prefix of columns that is used in the parent interleaving relationship during CREATE TABLE INTERLEAVE INTO PARENT) can be optimized into using a single scan over the keyspace, rather than the two scans that would result in naively planning the join. This should significantly improve performance of these joins, and also provide a stronger justification for using the interleaved tables feature.

Doing this in DistSQL requires that we add a new InterleavedTableReader, akin to TableReader, which can perform the underlying joint scan and perform the join in a streaming fashion, outputting the result of the join without requiring two scans. It also seems possible to use this on joins that are a strict prefix of the shared prefix in the interleaving relationship, although I have not thought through the design. Finally, we need to detect when a JOIN has this property in the planning process, and use the new processor in distsqlPhysicallPlanner.createPlanForJoin.

This project is complex enough that it requires an RFC before starting.

cc @jordanlewis if you have any insights into whether there is anything else we need to be aware of when scoping this feature out that would affect TPC-C performance with interleaved tables.

cc @andreimatei

@rjnn rjnn added this to the 1.2 milestone Oct 2, 2017
@jordanlewis
Copy link
Member

We might also want to explore the analogous optimization for joins across sibling interleaves in the RFC. Joins across sibling interleaves might be mildly more efficient with a single scan as well, since they're laid out next to each other in the keyspace. The effect won't be as dramatic, though, and it may well be pointless since in DistSQL I think we prefer to get several input streams in parallel - whereas in this case we'd have to wait until the first interleave was read in its entirety before seeing the first key from the second interleave.

Actually, that might be a problem for the InterleavedTableReader as well - any join processor will need to be able to handle its input keys in any order from either side. In other words you won't be able to control which side you get the next key from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants