opt, colexec: take advantage of partial ordering in topk sorter #69724

rharding6373 · 2021-09-02T01:46:55Z

Today topKSorter must process all input rows before returning the top K rows according to its specified ordering. In cases where some rows are already ordered, we could take advantage of the ordering to reduce the number of rows that need to be processed.

Take for example the following query and table with an index on a:

SELECT * FROM table ORDER BY a, b LIMIT 2

a | b

1 | 5
2 | 3
2 | 1
3 | 3
5 | 3

Given an index scan on a to provide a's ordering, topk only needs to process 3 rows in order to guarantee that it has found the top K rows. Once it finishes processing the third row [2, 1], all subsequent rows have higher values of a than the top 2 rows found so far, and therefore cannot be in the top 2 rows.

This change would involve modifying topKSorter in the vectorized exec engine to chunk the input according to ordered columns and stop processing once K candidates have been found, as well as changes to the optimizer to make sure partial orderings are propagated properly and that it is costed appropriately.

Previously, topKSorter had to process all input rows before returning the top K rows according to its specified ordering. If a subset of the input rows were already ordered, topKSorter would still iterate over the entire input. However, if the input was partially ordered, topKSorter could potentially stop iterating early, since after it has found K candidates it is guaranteed not to find any better top candidates. For example, take the following query and table with an index on a: ``` a | b ----+---- 1 | 5 2 | 3 2 | 1 3 | 3 5 | 3 SELECT * FROM t ORDER BY a, b LIMIT 2 ``` Given an index scan on a to provide a's ordering, topk only needs to process 3 rows in order to guarantee that it has found the top K rows. Once it finishes processing the third row [2, 1], all subsequent rows have higher values of a than the top 2 rows found so far, and therefore cannot be in the top 2 rows. This change modifies the vectorized engine's TopKSorter signature to include a partial ordering. The TopKSorter chunks the input according to the sorted columns and processes each chunk with its existing heap algorithm. Row comparison in the heap is also optimized so that tuples in the same chunk only compare non-sorted columns. At the end of each chunk, if K rows are in the heap, TopKSorter emits the rows and stops execution. A later commit, once merged with top K optimizer and distsql changes, will adjust the cost model for top K to reflect this change. Informs cockroachdb#69724 Release note: None

Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.

73459: opt: take advantage of partial ordering in topk sorter r=rharding6373 a=rharding6373 Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: #69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns. Co-authored-by: rharding6373 <[email protected]>

Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.

rharding6373 added C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team labels Sep 2, 2021

rharding6373 self-assigned this Sep 2, 2021

rharding6373 mentioned this issue Dec 3, 2021

opt: take advantage of partial ordering in topk sorter #73459

Merged

craig bot closed this as completed in 48dd550 Dec 17, 2021

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt, colexec: take advantage of partial ordering in topk sorter #69724

opt, colexec: take advantage of partial ordering in topk sorter #69724

rharding6373 commented Sep 2, 2021

opt, colexec: take advantage of partial ordering in topk sorter #69724

opt, colexec: take advantage of partial ordering in topk sorter #69724

Comments

rharding6373 commented Sep 2, 2021

a | b