-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt, colexec: take advantage of partial ordering in topk sorter #69724
Labels
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries
SQL Queries Team
Comments
rharding6373
added
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries
SQL Queries Team
labels
Sep 2, 2021
rharding6373
added a commit
to rharding6373/cockroach
that referenced
this issue
Sep 21, 2021
Previously, topKSorter had to process all input rows before returning the top K rows according to its specified ordering. If a subset of the input rows were already ordered, topKSorter would still iterate over the entire input. However, if the input was partially ordered, topKSorter could potentially stop iterating early, since after it has found K candidates it is guaranteed not to find any better top candidates. For example, take the following query and table with an index on a: ``` a | b ----+---- 1 | 5 2 | 3 2 | 1 3 | 3 5 | 3 SELECT * FROM t ORDER BY a, b LIMIT 2 ``` Given an index scan on a to provide a's ordering, topk only needs to process 3 rows in order to guarantee that it has found the top K rows. Once it finishes processing the third row [2, 1], all subsequent rows have higher values of a than the top 2 rows found so far, and therefore cannot be in the top 2 rows. This change modifies the vectorized engine's TopKSorter signature to include a partial ordering. The TopKSorter chunks the input according to the sorted columns and processes each chunk with its existing heap algorithm. Row comparison in the heap is also optimized so that tuples in the same chunk only compare non-sorted columns. At the end of each chunk, if K rows are in the heap, TopKSorter emits the rows and stops execution. A later commit, once merged with top K optimizer and distsql changes, will adjust the cost model for top K to reflect this change. Informs cockroachdb#69724 Release note: None
rharding6373
added a commit
to rharding6373/cockroach
that referenced
this issue
Dec 3, 2021
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
rharding6373
added a commit
to rharding6373/cockroach
that referenced
this issue
Dec 16, 2021
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
rharding6373
added a commit
to rharding6373/cockroach
that referenced
this issue
Dec 16, 2021
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
rharding6373
added a commit
to rharding6373/cockroach
that referenced
this issue
Dec 17, 2021
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
craig bot
pushed a commit
that referenced
this issue
Dec 17, 2021
73459: opt: take advantage of partial ordering in topk sorter r=rharding6373 a=rharding6373 Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: #69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns. Co-authored-by: rharding6373 <[email protected]>
gustasva
pushed a commit
to gustasva/cockroach
that referenced
this issue
Jan 4, 2022
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries
SQL Queries Team
Today topKSorter must process all input rows before returning the top K rows according to its specified ordering. In cases where some rows are already ordered, we could take advantage of the ordering to reduce the number of rows that need to be processed.
Take for example the following query and table with an index on a:
SELECT * FROM table ORDER BY a, b LIMIT 2
a | b
1 | 5
2 | 3
2 | 1
3 | 3
5 | 3
Given an index scan on a to provide a's ordering, topk only needs to process 3 rows in order to guarantee that it has found the top K rows. Once it finishes processing the third row [2, 1], all subsequent rows have higher values of a than the top 2 rows found so far, and therefore cannot be in the top 2 rows.
This change would involve modifying topKSorter in the vectorized exec engine to chunk the input according to ordered columns and stop processing once K candidates have been found, as well as changes to the optimizer to make sure partial orderings are propagated properly and that it is costed appropriately.
The text was updated successfully, but these errors were encountered: