-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt: take advantage of partial ordering in topk sorter #73459
opt: take advantage of partial ordering in topk sorter #73459
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably grok'd about %50 of this but it looks good and seems thoroughly tested. Nice work!
Reviewed 8 of 19 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rytaft)
pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):
// expressions that were specified by previous calls to various add methods. // It is similar to Build, but does not add the expression to the memo group. func (b *indexScanBuilder) BuildExpr() (output memo.RelExpr) {
Seems like a high amount of code duplication, there's got to be a way to refactor Build and BuildExpr to address this, I thought about it for a bit and came up empty though, so not a blocker just thought I'd mention it.
pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):
if pkCols.Empty() { pkCols = c.PrimaryKeyCols(sp.Table) }
Why not just do this in the outer function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Sorry it took me SO long to review
Reviewed 19 of 19 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)
pkg/sql/opt/ops/relational.opt, line 1056 at r1 (raw file):
# required ordering, TopK can take advantage of optimizations. TopK can be used # to substitute a Limit that requires its input to be ordered and performs best # when the input is not already ordered. TopK scans the input, storing the K
nit: not already ordered -> not already fully ordered ?
pkg/sql/opt/ordering/topk.go, line 43 at r1 (raw file):
func TopKColOrdering(topk *memo.TopKExpr, required *props.OrderingChoice) opt.Ordering { print("required cols: ", required.ColSet().String(), "\n") print("ordered cols: ", topk.Ordering.ColSet().String(), "\n")
leftover debug stuff
pkg/sql/opt/ordering/topk.go, line 46 at r1 (raw file):
ordering := make(opt.Ordering, len(required.Columns)) for i, rc := range required.Columns { cols := rc.Group.Intersection(topk.Ordering.ColSet())
Is there a reason you're not using OrderingChoice.CommonPrefix
? Maybe I don't fully understand what this function is doing...
pkg/sql/opt/ordering/topk.go, line 53 at r1 (raw file):
return ordering[:i] } print("Col ", i, ": MakeOrderingCol ", rc.Group.String(), "\n")
more debug
pkg/sql/opt/xform/coster.go, line 1581 at r1 (raw file):
// topKInputLimitHint calculates an appropriate limit hint for the input // to a Top K expression when there.
looks like this sentence got cut off
pkg/sql/opt/xform/coster.go, line 1589 at r1 (raw file):
} orderedCols := topk.PartialOrdering.ColSet() if orderedCols.Len() == 0 {
better to just do if topk.PartialOrdering.Any()
pkg/sql/opt/xform/coster.go, line 1592 at r1 (raw file):
return inputRowCount } orderedStats, ok := topk.Relational().Stats.ColStats.Lookup(orderedCols)
can you make a call to countSegments
to avoid some of this duplication?
pkg/sql/opt/xform/coster.go, line 1600 at r1 (raw file):
} expectedSegments := math.Ceil(K/orderedStats.DistinctCount) * orderedStats.DistinctCount
can you explain this formula? I'm a bit confused
pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Seems like a high amount of code duplication, there's got to be a way to refactor Build and BuildExpr to address this, I thought about it for a bit and came up empty though, so not a blocker just thought I'd mention it.
+1 although I also don't have any specific ideas. The interface to construct a new memo group v. add to an existing one is definitely a bit clunky.
I do think this function could have a better name, though. Maybe BuildNew
to indicate that it's building a new memo group? Not sure what would be best.
pkg/sql/opt/xform/limit_funcs.go, line 288 at r1 (raw file):
// the index. sb.AddIndexJoin(sp.Cols) input := sb.BuildExpr()
Why is this using different logic from GenerateLimitedGroupByScans
? Why do you need this new BuildExpr
function?
pkg/sql/opt/xform/limit_funcs.go, line 291 at r1 (raw file):
// Use the overlapping indexes and required ordering. newPrivate := *tp for i, cnt := 0, index.KeyColumnCount(); i < cnt; i++ {
You might want to use the function ScanPrivateCanProvide
instead of duplicating the logic here. But also, maybe you don't even need this? GenerateLimitedGroupByScans
doesn't update any ordering fields. And seems like GeneratePartialOrderTopK
is already doing everything you need to set the partial ordering.
pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):
// f is a function that allows the caller to impose additional constraints on // columns that are considered optional, and should return true if the column // is optional.
Can you give the function a name that supports this? Maybe isOptional
?
pkg/sql/opt/xform/limit_funcs.go, line 336 at r1 (raw file):
newOrd.FromOrderingWithOptCols(o, opt.ColSet{}) // Simplify the ordering according to the input's FDs. Note that this is not
nit: we usually add a blank line above big comments like this
pkg/sql/opt/xform/limit_funcs.go, line 347 at r1 (raw file):
// GeneratePartialOrderTopK generates TopK expressions with more specific orderings // based on the interesting orderings property. This enables the optimizer to // explore TopK with partially ordered input columns according.
sentence seems cut off
pkg/sql/opt/xform/limit_funcs.go, line 362 at r1 (raw file):
newPrivate := *private newPrivate.PartialOrdering = newOrd
is it possible that PartialOrdering will be the same as the required ordering (i.e., private.Ordering)? That would indicate that it would be better to just use a Limit. But maybe we should just keep it simple and leave it to the coster to figure out.
pkg/sql/opt/xform/rules/limit.opt, line 132 at r1 (raw file):
# missing from the index. This differs from GenerateIndexScans, which does not # generate index joins for non-covering indexes. It also regenerates the Top K # expression with the partial ordering of index columns.
As I mentioned above, seems like you shouldn't need to update the topK partial ordering here
pkg/sql/opt/xform/rules/limit.opt, line 139 at r1 (raw file):
# being already partially ordered, and allows TopK to pass on a limit hint to # its input expressions so that they may not have to process the entire input # rows, either.
nit: "process the entire input rows" -> "process all input rows"
pkg/sql/opt/xform/testdata/coster/limit, line 65 at r1 (raw file):
└── 1000 opt
nit: add a comment to explain what this is showing
pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):
---- # Columns are covered by GenerateIndexScans.
I don't understand this comment
pkg/sql/opt/xform/testdata/rules/limit, line 1930 at r1 (raw file):
│ │ ├── best: (index-join G10="[ordering: +1,+1]" defg,cols=(1-4)) │ │ └── cost: 7404.89 │ ├── [ordering: +1,+1] [limit hint: 19.12]
This ordering looks wrong. We don't usually allow ordering by the same column twice
055d7b0
to
c61fbe6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs! Addressed comments and fixed some bugs. PTAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @rharding6373, and @rytaft)
pkg/sql/opt/ordering/topk.go, line 43 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
leftover debug stuff
Done.
pkg/sql/opt/ordering/topk.go, line 53 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
more debug
Done.
pkg/sql/opt/xform/coster.go, line 1581 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
looks like this sentence got cut off
Done.
pkg/sql/opt/xform/coster.go, line 1589 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
better to just do
if topk.PartialOrdering.Any()
Done.
pkg/sql/opt/xform/coster.go, line 1592 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
can you make a call to
countSegments
to avoid some of this duplication?
I made a common helper function.
pkg/sql/opt/xform/coster.go, line 1600 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
can you explain this formula? I'm a bit confused
Added a comment, and also reexamining this made me realize there was a bug. This calculation should be using inputRowCount / distinctCount
, not distinctCount
.
pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
+1 although I also don't have any specific ideas. The interface to construct a new memo group v. add to an existing one is definitely a bit clunky.
I do think this function could have a better name, though. Maybe
BuildNew
to indicate that it's building a new memo group? Not sure what would be best.
I tried to refactor using the new function and switching based on op type to add the right kind of expression to the memo group, but it turns out the memo interface is more unwieldy than I thought and that's not possible. Since ConstructScan
type functions intern the expression, you can't add the same expression to a group using the typical API. Added a TODO.
Changed the function name to BuildNewExpr
, but I'm not sure that's any more satisfactory, either.
pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Why not just do this in the outer function?
Good observation. Done.
pkg/sql/opt/xform/limit_funcs.go, line 288 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Why is this using different logic from
GenerateLimitedGroupByScans
? Why do you need this newBuildExpr
function?
I made the new function in part because there were now at least 2 Generate functions that could use it (and code duplication outside of index_scan_builder
seems even worse than code duplication inside it), but didn't refactor GenerateLimitedGroupByScans
in this PR initially. Refactored it now.
pkg/sql/opt/xform/limit_funcs.go, line 291 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
You might want to use the function
ScanPrivateCanProvide
instead of duplicating the logic here. But also, maybe you don't even need this?GenerateLimitedGroupByScans
doesn't update any ordering fields. And seems likeGeneratePartialOrderTopK
is already doing everything you need to set the partial ordering.
Yes, I think you're right. I've removed this piece of logic.
pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Can you give the function a name that supports this? Maybe
isOptional
?
Done.
pkg/sql/opt/xform/limit_funcs.go, line 347 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
sentence seems cut off
Done.
pkg/sql/opt/xform/limit_funcs.go, line 362 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
is it possible that PartialOrdering will be the same as the required ordering (i.e., private.Ordering)? That would indicate that it would be better to just use a Limit. But maybe we should just keep it simple and leave it to the coster to figure out.
Yes, I don't think I've seen it chosen, but it seems like we should keep to the partial order promise of this function and reduce the memo size.
pkg/sql/opt/xform/rules/limit.opt, line 132 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
As I mentioned above, seems like you shouldn't need to update the topK partial ordering here
Done.
pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I don't understand this comment
Updated the comment
pkg/sql/opt/xform/testdata/rules/limit, line 1930 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
This ordering looks wrong. We don't usually allow ordering by the same column twice
There was a bug in the logic removed from GenerateLimitedTopKScans
. Thanks for catching this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)
pkg/sql/opt/xform/coster.go, line 1264 at r2 (raw file):
} func getOrderingColStats(
This function could use a comment
pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):
// In order to find the top K rows of a partially sorted input, we estimate // the number of rows we'll need to ingest by rounding up the nearest multiple // of the number of rows per distinct values to K.
nit: this sentence is a bit hard to follow. Maybe add some more words or add an example?
pkg/sql/opt/xform/coster.go, line 1605 at r2 (raw file):
// the number of rows we'll need to ingest by rounding up the nearest multiple // of the number of rows per distinct values to K. expectedRows := math.Ceil(K/(inputRowCount/orderedStats.DistinctCount)) * (inputRowCount / orderedStats.DistinctCount)
For clarity, maybe it would help to define another variable rowsPerDistinctVal := inputRowCount/orderedStats.DistinctCount
?
pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Good observation. Done.
It probably doesn't matter too much here, but just FYI, I'm guessing that the reason it was done this way in other places is to try to minimize the amount of unnecessary logic that needs to run if there are no secondary indexes. We've tried hard to keep the "match" logic for each rule as efficient as possible, so that more expensive logic that might perform allocations only runs if the rule is definitely going to apply.
pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Done.
The comment needs to be updated
pkg/sql/opt/xform/limit_funcs.go, line 301 at r2 (raw file):
input memo.RelExpr, isOptional func(id opt.ColumnID) bool, ) (props.OrderingChoice, bool, bool) {
nit: you could give these return params names to help clarify
pkg/sql/opt/xform/limit_funcs.go, line 351 at r2 (raw file):
return false }) if !found || fullPrefix {
Maybe add a comment to explain why you also return if fullPrefix is true
pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Updated the comment
Looks like the plan is different now, though
c61fbe6
to
834de68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rytaft)
pkg/sql/opt/ordering/topk.go, line 46 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Is there a reason you're not using
OrderingChoice.CommonPrefix
? Maybe I don't fully understand what this function is doing...
Didn't end up using this function, so I've now removed it.
pkg/sql/opt/xform/coster.go, line 1264 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
This function could use a comment
Done.
pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: this sentence is a bit hard to follow. Maybe add some more words or add an example?
I added a couple of examples and a more thorough explanation. PTAL and let me know if it needs more clarification.
pkg/sql/opt/xform/coster.go, line 1605 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
For clarity, maybe it would help to define another variable
rowsPerDistinctVal := inputRowCount/orderedStats.DistinctCount
?
Thank you for the suggestion!
pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
It probably doesn't matter too much here, but just FYI, I'm guessing that the reason it was done this way in other places is to try to minimize the amount of unnecessary logic that needs to run if there are no secondary indexes. We've tried hard to keep the "match" logic for each rule as efficient as possible, so that more expensive logic that might perform allocations only runs if the rule is definitely going to apply.
Ack, thanks. I reverted for consistency.
pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
The comment needs to be updated
Done.
pkg/sql/opt/xform/limit_funcs.go, line 301 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: you could give these return params names to help clarify
Done.
pkg/sql/opt/xform/limit_funcs.go, line 351 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Maybe add a comment to explain why you also return if fullPrefix is true
Done.
pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Looks like the plan is different now, though
The plan cost changed enough when I fixed the cost model :-(. Changed this to a memo test instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 8 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)
pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
I added a couple of examples and a more thorough explanation. PTAL and let me know if it needs more clarification.
Yes very helpful - thanks!
834de68
to
c614e21
Compare
Recent improvements to TopK in colexec allow TopK to stop execution early and emit its output if its sort columns were partially ordered in the input rows. This change modifies the optimizer so that it can find lower cost TopK plans. This change adds two new exploration rules: `GenerateLimitedTopKScans` and `GeneratePartialOrderTopK`. The first rule is similar to `GenerateLimitedGroupByScans` in that it looks for secondary indexes that could provide a partial ordering and adds the secondary index scan and an index join to get the rest of the columns to the memo. This allows us to explore cases of partially ordered inputs (via the index scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy` in that it uses interesting orderings to find partial orderings. The cost model is also updated to reflect the new estimated limit on the number of rows TopK needs to process to find the top K rows. The limit is propagated to TopK's child expressions as a limit hint. Fixes: cockroachdb#69724 Release note (sql change): Improves cost model for TopK expressions if the input to TopK can be partially ordered by its sort columns.
c614e21
to
48dd550
Compare
Canceled. |
Thanks for the careful reviews! Fixed some merge conflicts. bors r+ |
Build succeeded: |
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.
This change adds two new exploration rules:
GenerateLimitedTopKScans
and
GeneratePartialOrderTopK
. The first rule is similar toGenerateLimitedGroupByScans
in that it looks for secondary indexesthat could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to
GenerateStreamingGroupBy
in that it uses interesting orderings to find partial orderings.
The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.
Fixes: #69724
Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.