"Natural" joins over interleaved tables (i.e. joins on the same shared prefix of columns that defines the parent interleaving relationship in `CREATE TABLE ... INTERLEAVE IN PARENT`) can be optimized to use a single scan over the keyspace, rather than the two scans that result from planning the join naively. This should significantly improve the performance of these joins, and also provide a stronger justification for using the interleaved tables feature.
Doing this in DistSQL requires that we add a new `InterleavedTableReader`, akin to `TableReader`, which can perform the underlying joint scan and perform the join in a streaming fashion, outputting the result of the join without requiring two scans. It also seems possible to use this for joins on a strict prefix of the shared prefix in the interleaving relationship, although I have not thought through the design. Finally, we need to detect when a JOIN has this property in the planning process, and use the new processor in `distsqlPhysicallPlanner.createPlanForJoin`.
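
To make the single-scan idea concrete, here is a minimal sketch in Go. It is not the proposed `InterleavedTableReader`; the `Row` and `Joined` types and the `InterleavedJoin` function are hypothetical names invented for illustration. It assumes the scan yields already-decoded rows in key order, so each parent row is immediately followed by the child rows interleaved under it, and it performs an inner join on the shared prefix in one pass.

```go
package main

import "fmt"

// Row is a hypothetical decoded row from one scan over the interleaved
// keyspace. Prefix stands in for the shared interleave prefix columns.
type Row struct {
	Table  string // "parent" or "child" (assumed labels)
	Prefix int
	Val    string
}

// Joined is one output row of the inner join on the shared prefix.
type Joined struct {
	Prefix int
	Parent string
	Child  string
}

// InterleavedJoin joins parent and child rows in a single pass.
// Assumption: rows arrive in key order, so every child row follows the
// parent row it is interleaved under. Only the most recent parent row
// needs to be remembered, which keeps the join streaming.
func InterleavedJoin(rows []Row) []Joined {
	var out []Joined
	var cur *Row // most recently seen parent row, if any
	for i := range rows {
		r := &rows[i]
		if r.Table == "parent" {
			cur = r
			continue
		}
		// Child row: emit only if it belongs to the current parent.
		// A child whose parent row is absent is skipped (inner join).
		if cur != nil && cur.Prefix == r.Prefix {
			out = append(out, Joined{Prefix: r.Prefix, Parent: cur.Val, Child: r.Val})
		}
	}
	return out
}

func main() {
	rows := []Row{
		{Table: "parent", Prefix: 1, Val: "customer-1"},
		{Table: "child", Prefix: 1, Val: "order-a"},
		{Table: "child", Prefix: 1, Val: "order-b"},
		{Table: "parent", Prefix: 2, Val: "customer-2"},
		{Table: "parent", Prefix: 3, Val: "customer-3"},
		{Table: "child", Prefix: 3, Val: "order-c"},
	}
	for _, j := range InterleavedJoin(rows) {
		fmt.Printf("%d %s %s\n", j.Prefix, j.Parent, j.Child)
	}
}
```

The sketch holds only one parent row at a time, which is what lets the join stream. A real processor would also have to decode keys, handle scans split across ranges and nodes, and support other join types, none of which is covered here.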
This project is complex enough that it requires an RFC before starting.
cc @jordanlewis if you have any insights into whether there is anything else we need to be aware of when scoping this feature out that would affect TPC-C performance with interleaved tables.
We might also want to explore the analogous optimization for joins across sibling interleaves in the RFC. Joins across sibling interleaves might be mildly more efficient with a single scan as well, since they're laid out next to each other in the keyspace. The effect won't be as dramatic, though, and it may well be pointless since in DistSQL I think we prefer to get several input streams in parallel - whereas in this case we'd have to wait until the first interleave was read in its entirety before seeing the first key from the second interleave.
Actually, that might be a problem for the InterleavedTableReader as well - any join processor will need to be able to handle its input keys in any order from either side. In other words you won't be able to control which side you get the next key from.
cc @andreimatei