Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt, colexec: take advantage of partial ordering in topk sorter #69724

Closed
rharding6373 opened this issue Sep 2, 2021 · 0 comments · Fixed by #73459
Closed

opt, colexec: take advantage of partial ordering in topk sorter #69724

rharding6373 opened this issue Sep 2, 2021 · 0 comments · Fixed by #73459
Assignees
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team

Comments

@rharding6373
Copy link
Collaborator

Today topKSorter must process all input rows before returning the top K rows according to its specified ordering. In cases where some rows are already ordered, we could take advantage of the ordering to reduce the number of rows that need to be processed.

Take for example the following query and table with an index on a:

SELECT * FROM table ORDER BY a, b LIMIT 2

a | b

1 | 5
2 | 3
2 | 1
3 | 3
5 | 3

Given an index scan on a to provide a's ordering, topk only needs to process 3 rows in order to guarantee that it has found the top K rows. Once it finishes processing the third row [2, 1], all subsequent rows have higher values of a than the top 2 rows found so far, and therefore cannot be in the top 2 rows.

This change would involve modifying topKSorter in the vectorized exec engine to chunk the input according to ordered columns and stop processing once K candidates have been found, as well as changes to the optimizer to make sure partial orderings are propagated properly and that it is costed appropriately.

@rharding6373 rharding6373 added C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team labels Sep 2, 2021
@rharding6373 rharding6373 self-assigned this Sep 2, 2021
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 21, 2021
Previously, topKSorter had to process all input rows before returning
the top K rows according to its specified ordering. If a subset of the
input rows were already ordered, topKSorter would still iterate over the
entire input.

However, if the input was partially ordered, topKSorter could
potentially stop iterating early, since after it has found K candidates
it is guaranteed not to find any better top candidates.

For example, take the following query and table with an index on a:

```
  a | b
----+----
  1 | 5
  2 | 3
  2 | 1
  3 | 3
  5 | 3

SELECT * FROM t ORDER BY a, b LIMIT 2
```

Given an index scan on a to provide a's ordering, topk only needs to
process 3 rows in order to guarantee that it has found the top K rows.
Once it finishes processing the third row [2, 1], all subsequent rows
have higher values of a than the top 2 rows found so far, and
therefore cannot be in the top 2 rows.

This change modifies the vectorized engine's TopKSorter signature to include
a partial ordering. The TopKSorter chunks the input according to the
sorted columns and processes each chunk with its existing heap
algorithm. Row comparison in the heap is also optimized so that tuples
in the same chunk only compare non-sorted columns. At the end
of each chunk, if K rows are in the heap, TopKSorter emits the rows and
stops execution.

A later commit, once merged with top K optimizer and distsql changes,
will adjust the cost model for top K to reflect this change.

Informs cockroachdb#69724

Release note: None
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Dec 3, 2021
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Dec 16, 2021
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Dec 16, 2021
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Dec 17, 2021
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
craig bot pushed a commit that referenced this issue Dec 17, 2021
73459: opt: take advantage of partial ordering in topk sorter r=rharding6373 a=rharding6373

Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: #69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.

Co-authored-by: rharding6373 <[email protected]>
@craig craig bot closed this as completed in 48dd550 Dec 17, 2021
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant