-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join reordering stats11 speedup #605
base: join-reordering-stats11
Are you sure you want to change the base?
Conversation
The join_distribution_type property has three option: "repartitioned", "replicated", and "automatic". This property replaces the "distributed_joins" property. "Repartitioned" has the same behavior as distribtued_joins=true. "Replicated" is equivalent to distributed_joins=false. "Automatic" will use stats to evaluate whether partitioned or replicated is better, or if no information is available, it will choose partitioned. The default value for join_distribution_type is "repartitioned" .
Create a new property called join_reordering_strategy with the options ELIMINATE_CROSS_JOINS, COST_BASED and NONE. ELIMINATE_CROSS_JOINS is equivalent to the previous reorder_joins=true. COST_BASED will used join enumeration to make a cost-based decision of join order. NONE will maintain the syntactic join order, and is equivalent to the previous reorder_joins=false.
Allowing the LocalQueryRunner to estimate costs using a fake node count allows unit tests to consider network costs and different cluster configurations.
Change ExpressionUtils.binaryExpression to return TRUE on an empty list
Add a rule to enumerate join order possibilities for a join graph and choose the least cost option. This does a minimal form of cross join elimination, by only partitioning nodes into groups that have at least one edge between them, which eliminates some unnecessary cross joins from consideration. It also means that necessary cross joins will always be executed as late as possible in the plan (which may be worse).
Results run on my development vm BenchmarkReorderJoinsConnectedGraph: BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins ELIMINATE_CROSS_JOINS 2 avgt 30 54.610 ± 4.236 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins ELIMINATE_CROSS_JOINS 4 avgt 30 153.794 ± 9.075 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins ELIMINATE_CROSS_JOINS 6 avgt 30 326.410 ± 19.912 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins ELIMINATE_CROSS_JOINS 8 avgt 30 578.028 ± 33.308 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins ELIMINATE_CROSS_JOINS 10 avgt 30 955.494 ± 44.523 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins COST_BASED 2 avgt 30 54.844 ± 4.256 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins COST_BASED 4 avgt 30 161.164 ± 11.008 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins COST_BASED 6 avgt 30 440.007 ± 28.903 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins COST_BASED 8 avgt 30 2491.240 ± 72.341 ms/op BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins COST_BASED 10 avgt 30 24026.603 ± 886.696 ms/opa BencharkReorderJoinsLinearGraph: BenchmarkReorderJoinsLinearQuery.benchmarkReorderJoins ELIMINATE_CROSS_JOINS avgt 30 944.179 ± 42.406 ms/op BenchmarkReorderJoinsLinearQuery.benchmarkReorderJoins COST_BASED avgt 30 1329.194 ± 71.704 ms/op
Previously there was a lot of map copying (through ImmutableMap.copyOf(...) and new HashMap(...)) which was significantly impacting stats code performance. HashTreePMap is much better for cases where individual entries of base map are modified which is common case in stats code.
Not all CostCalculators are thread safe.
Previously JoinEnumerator#setJoinNodeProperties was recomputing join stats for each alternative of join even though actual stats didn't change. Now join stats are memoized by node id. This reduced enumeration time by a factor of 2.
@@ -60,7 +59,7 @@ public void testResolvesGroupReferenceNode() | |||
PlanNode plan = node(source); | |||
Memo memo = new Memo(idAllocator, plan); | |||
|
|||
MemoBasedLookup lookup = new MemoBasedLookup(memo, new NodeCountingStatsCalculator(), new CostCalculatorUsingExchanges(1)); | |||
Lookup lookup = Lookup.from(memo::resolve, new NodeCountingStatsCalculator(), new CostCalculatorUsingExchanges(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these tests still useful since there's no more MemoBasedLookup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably caching calculator tests would be more useful. I will remove this tests for now.
implements StatsCalculator | ||
{ | ||
private final StatsCalculator statsCalculator; | ||
private final Map<PlanNodeId, PlanNodeStatsEstimate> stats = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here about why it caches by node id.
844fe2d
to
c3d1a7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments, more offline
<dependency> | ||
<groupId>org.pcollections</groupId> | ||
<artifactId>pcollections</artifactId> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot add something to pom.xml which is not used in the code. mvn validate
will fail in such case.
return this; | ||
} | ||
|
||
public Builder removeSymbolStatistics(Symbol symbol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having this method make me think that this whole Builder
is not needed any more as you can just simply operate on SymbolStatsEstimate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but you need to construct initial PlanNodeStatsEstimate
for which the builder seems to be useful
@@ -102,7 +102,7 @@ public PlanNodeStatsEstimate getStats(PlanNode node, Session session, Map<Symbol | |||
@Override | |||
public PlanNodeCostEstimate getCumulativeCost(PlanNode node, Session session, Map<Symbol, Type> types) | |||
{ | |||
return costCalculator.calculateCumulativeCost(node, this, session, types); | |||
return costCalculator.calculateCumulativeCost(resolve(node), this, session, types); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not look related
Waiting for reorder joins to pass tests then I will merge into reorder joins branch |
// Cross joins can't filter symbols as part of the join | ||
// If we're doing a cross join, use all output symbols from the inputs and add a project node | ||
// on top | ||
List<Symbol> joinOutputSymbols = sortedOutputSymbols; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer needed since we don't allow cross joins
Optional.empty(), | ||
Optional.empty())); | ||
|
||
if (!joinOutputSymbols.equals(sortedOutputSymbols)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's needed either
@@ -111,6 +111,9 @@ public PlanNode visitDelete(DeleteNode node, RewriteContext<Void> context) | |||
|
|||
private JoinNode.DistributionType getTargetJoinDistributionType(JoinNode node) | |||
{ | |||
if (node.getDistributionType().isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unrelated change to this commit. Make a (small) commit out of it.
@@ -347,7 +347,7 @@ public PlanOptimizers( | |||
stats, | |||
statsCalculator, | |||
estimatedExchangesCostCalculator, | |||
ImmutableSet.of(new ReorderJoins(costComparator)) | |||
ImmutableSet.of(new ReorderJoins(costComparator, statsCalculator, costCalculator)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
estimatedExchangesCostCalculator
should be used here
Using dedicated cache for join stats