Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: take advantage of partial ordering in topk sorter #73459

Merged
merged 1 commit into from
Dec 17, 2021

Conversation

rharding6373
Copy link
Collaborator

Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: GenerateLimitedTopKScans
and GeneratePartialOrderTopK. The first rule is similar to
GenerateLimitedGroupByScans in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to GenerateStreamingGroupBy
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: #69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.

@rharding6373 rharding6373 requested a review from a team as a code owner December 3, 2021 22:55
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably grok'd about %50 of this but it looks good and seems thoroughly tested. Nice work!

Reviewed 8 of 19 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rytaft)


pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):

// expressions that were specified by previous calls to various add methods.
// It is similar to Build, but does not add the expression to the memo group.
func (b *indexScanBuilder) BuildExpr() (output memo.RelExpr) {

Seems like a high amount of code duplication, there's got to be a way to refactor Build and BuildExpr to address this, I thought about it for a bit and came up empty though, so not a blocker just thought I'd mention it.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):

		if pkCols.Empty() {
			pkCols = c.PrimaryKeyCols(sp.Table)
		}

Why not just do this in the outer function?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Sorry it took me SO long to review

Reviewed 19 of 19 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)


pkg/sql/opt/ops/relational.opt, line 1056 at r1 (raw file):

# required ordering, TopK can take advantage of optimizations. TopK can be used
# to substitute a Limit that requires its input to be ordered and performs best
# when the input is not already ordered. TopK scans the input, storing the K

nit: not already ordered -> not already fully ordered ?


pkg/sql/opt/ordering/topk.go, line 43 at r1 (raw file):

func TopKColOrdering(topk *memo.TopKExpr, required *props.OrderingChoice) opt.Ordering {
	print("required cols: ", required.ColSet().String(), "\n")
	print("ordered cols: ", topk.Ordering.ColSet().String(), "\n")

leftover debug stuff


pkg/sql/opt/ordering/topk.go, line 46 at r1 (raw file):

	ordering := make(opt.Ordering, len(required.Columns))
	for i, rc := range required.Columns {
		cols := rc.Group.Intersection(topk.Ordering.ColSet())

Is there a reason you're not using OrderingChoice.CommonPrefix? Maybe I don't fully understand what this function is doing...


pkg/sql/opt/ordering/topk.go, line 53 at r1 (raw file):

			return ordering[:i]
		}
		print("Col ", i, ": MakeOrderingCol ", rc.Group.String(), "\n")

more debug


pkg/sql/opt/xform/coster.go, line 1581 at r1 (raw file):

// topKInputLimitHint calculates an appropriate limit hint for the input
// to a Top K expression when there.

looks like this sentence got cut off


pkg/sql/opt/xform/coster.go, line 1589 at r1 (raw file):

	}
	orderedCols := topk.PartialOrdering.ColSet()
	if orderedCols.Len() == 0 {

better to just do if topk.PartialOrdering.Any()


pkg/sql/opt/xform/coster.go, line 1592 at r1 (raw file):

		return inputRowCount
	}
	orderedStats, ok := topk.Relational().Stats.ColStats.Lookup(orderedCols)

can you make a call to countSegments to avoid some of this duplication?


pkg/sql/opt/xform/coster.go, line 1600 at r1 (raw file):

	}

	expectedSegments := math.Ceil(K/orderedStats.DistinctCount) * orderedStats.DistinctCount

can you explain this formula? I'm a bit confused


pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Seems like a high amount of code duplication, there's got to be a way to refactor Build and BuildExpr to address this, I thought about it for a bit and came up empty though, so not a blocker just thought I'd mention it.

+1 although I also don't have any specific ideas. The interface to construct a new memo group v. add to an existing one is definitely a bit clunky.

I do think this function could have a better name, though. Maybe BuildNew to indicate that it's building a new memo group? Not sure what would be best.


pkg/sql/opt/xform/limit_funcs.go, line 288 at r1 (raw file):

		// the index.
		sb.AddIndexJoin(sp.Cols)
		input := sb.BuildExpr()

Why is this using different logic from GenerateLimitedGroupByScans? Why do you need this new BuildExpr function?


pkg/sql/opt/xform/limit_funcs.go, line 291 at r1 (raw file):

		// Use the overlapping indexes and required ordering.
		newPrivate := *tp
		for i, cnt := 0, index.KeyColumnCount(); i < cnt; i++ {

You might want to use the function ScanPrivateCanProvide instead of duplicating the logic here. But also, maybe you don't even need this? GenerateLimitedGroupByScans doesn't update any ordering fields. And seems like GeneratePartialOrderTopK is already doing everything you need to set the partial ordering.


pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):

// f is a function that allows the caller to impose additional constraints on
// columns that are considered optional, and should return true if the column
// is optional.

Can you give the function a name that supports this? Maybe isOptional?


pkg/sql/opt/xform/limit_funcs.go, line 336 at r1 (raw file):

	newOrd.FromOrderingWithOptCols(o, opt.ColSet{})
	// Simplify the ordering according to the input's FDs. Note that this is not

nit: we usually add a blank line above big comments like this


pkg/sql/opt/xform/limit_funcs.go, line 347 at r1 (raw file):

// GeneratePartialOrderTopK generates TopK expressions with more specific orderings
// based on the interesting orderings property. This enables the optimizer to
// explore TopK with partially ordered input columns according.

sentence seems cut off


pkg/sql/opt/xform/limit_funcs.go, line 362 at r1 (raw file):

		newPrivate := *private
		newPrivate.PartialOrdering = newOrd

is it possible that PartialOrdering will be the same as the required ordering (i.e., private.Ordering)? That would indicate that it would be better to just use a Limit. But maybe we should just keep it simple and leave it to the coster to figure out.


pkg/sql/opt/xform/rules/limit.opt, line 132 at r1 (raw file):

# missing from the index. This differs from GenerateIndexScans, which does not
# generate index joins for non-covering indexes. It also regenerates the Top K
# expression with the partial ordering of index columns.

As I mentioned above, seems like you shouldn't need to update the topK partial ordering here


pkg/sql/opt/xform/rules/limit.opt, line 139 at r1 (raw file):

# being already partially ordered, and allows TopK to pass on a limit hint to
# its input expressions so that they may not have to process the entire input
# rows, either.

nit: "process the entire input rows" -> "process all input rows"


pkg/sql/opt/xform/testdata/coster/limit, line 65 at r1 (raw file):

 └── 1000

opt

nit: add a comment to explain what this is showing


pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):

----

# Columns are covered by GenerateIndexScans.

I don't understand this comment


pkg/sql/opt/xform/testdata/rules/limit, line 1930 at r1 (raw file):

 │    │    ├── best: (index-join G10="[ordering: +1,+1]" defg,cols=(1-4))
 │    │    └── cost: 7404.89
 │    ├── [ordering: +1,+1] [limit hint: 19.12]

This ordering looks wrong. We don't usually allow ordering by the same column twice

@rharding6373 rharding6373 force-pushed the topk_partial_order_69724 branch 2 times, most recently from 055d7b0 to c61fbe6 Compare December 16, 2021 01:59
Copy link
Collaborator Author

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! Addressed comments and fixed some bugs. PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @rharding6373, and @rytaft)


pkg/sql/opt/ordering/topk.go, line 43 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

leftover debug stuff

Done.


pkg/sql/opt/ordering/topk.go, line 53 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

more debug

Done.


pkg/sql/opt/xform/coster.go, line 1581 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

looks like this sentence got cut off

Done.


pkg/sql/opt/xform/coster.go, line 1589 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

better to just do if topk.PartialOrdering.Any()

Done.


pkg/sql/opt/xform/coster.go, line 1592 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

can you make a call to countSegments to avoid some of this duplication?

I made a common helper function.


pkg/sql/opt/xform/coster.go, line 1600 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

can you explain this formula? I'm a bit confused

Added a comment, and also reexamining this made me realize there was a bug. This calculation should be using inputRowCount / distinctCount, not distinctCount.


pkg/sql/opt/xform/index_scan_builder.go, line 181 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

+1 although I also don't have any specific ideas. The interface to construct a new memo group v. add to an existing one is definitely a bit clunky.

I do think this function could have a better name, though. Maybe BuildNew to indicate that it's building a new memo group? Not sure what would be best.

I tried to refactor using the new function and switching based on op type to add the right kind of expression to the memo group, but it turns out the memo interface is more unwieldy than I thought and that's not possible. Since ConstructScan type functions intern the expression, you can't add the same expression to a group using the typical API. Added a TODO.

Changed the function name to BuildNewExpr, but I'm not sure that's any more satisfactory, either.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Why not just do this in the outer function?

Good observation. Done.


pkg/sql/opt/xform/limit_funcs.go, line 288 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Why is this using different logic from GenerateLimitedGroupByScans? Why do you need this new BuildExpr function?

I made the new function in part because there were now at least 2 Generate functions that could use it (and code duplication outside of index_scan_builder seems even worse than code duplication inside it), but didn't refactor GenerateLimitedGroupByScans in this PR initially. Refactored it now.


pkg/sql/opt/xform/limit_funcs.go, line 291 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You might want to use the function ScanPrivateCanProvide instead of duplicating the logic here. But also, maybe you don't even need this? GenerateLimitedGroupByScans doesn't update any ordering fields. And seems like GeneratePartialOrderTopK is already doing everything you need to set the partial ordering.

Yes, I think you're right. I've removed this piece of logic.


pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can you give the function a name that supports this? Maybe isOptional?

Done.


pkg/sql/opt/xform/limit_funcs.go, line 347 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

sentence seems cut off

Done.


pkg/sql/opt/xform/limit_funcs.go, line 362 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

is it possible that PartialOrdering will be the same as the required ordering (i.e., private.Ordering)? That would indicate that it would be better to just use a Limit. But maybe we should just keep it simple and leave it to the coster to figure out.

Yes, I don't think I've seen it chosen, but it seems like we should keep to the partial order promise of this function and reduce the memo size.


pkg/sql/opt/xform/rules/limit.opt, line 132 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

As I mentioned above, seems like you shouldn't need to update the topK partial ordering here

Done.


pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't understand this comment

Updated the comment


pkg/sql/opt/xform/testdata/rules/limit, line 1930 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This ordering looks wrong. We don't usually allow ordering by the same column twice

There was a bug in the logic removed from GenerateLimitedTopKScans. Thanks for catching this!

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)


pkg/sql/opt/xform/coster.go, line 1264 at r2 (raw file):

}

func getOrderingColStats(

This function could use a comment


pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):

	// In order to find the top K rows of a partially sorted input, we estimate
	// the number of rows we'll need to ingest by rounding up the nearest multiple
	// of the number of rows per distinct values to K.

nit: this sentence is a bit hard to follow. Maybe add some more words or add an example?


pkg/sql/opt/xform/coster.go, line 1605 at r2 (raw file):

	// the number of rows we'll need to ingest by rounding up the nearest multiple
	// of the number of rows per distinct values to K.
	expectedRows := math.Ceil(K/(inputRowCount/orderedStats.DistinctCount)) * (inputRowCount / orderedStats.DistinctCount)

For clarity, maybe it would help to define another variable rowsPerDistinctVal := inputRowCount/orderedStats.DistinctCount?


pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Good observation. Done.

It probably doesn't matter too much here, but just FYI, I'm guessing that the reason it was done this way in other places is to try to minimize the amount of unnecessary logic that needs to run if there are no secondary indexes. We've tried hard to keep the "match" logic for each rule as efficient as possible, so that more expensive logic that might perform allocations only runs if the rule is definitely going to apply.


pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Done.

The comment needs to be updated


pkg/sql/opt/xform/limit_funcs.go, line 301 at r2 (raw file):

	input memo.RelExpr,
	isOptional func(id opt.ColumnID) bool,
) (props.OrderingChoice, bool, bool) {

nit: you could give these return params names to help clarify


pkg/sql/opt/xform/limit_funcs.go, line 351 at r2 (raw file):

			return false
		})
		if !found || fullPrefix {

Maybe add a comment to explain why you also return if fullPrefix is true


pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Updated the comment

Looks like the plan is different now, though

@rharding6373 rharding6373 force-pushed the topk_partial_order_69724 branch from c61fbe6 to 834de68 Compare December 16, 2021 21:21
Copy link
Collaborator Author

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @rytaft)


pkg/sql/opt/ordering/topk.go, line 46 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is there a reason you're not using OrderingChoice.CommonPrefix? Maybe I don't fully understand what this function is doing...

Didn't end up using this function, so I've now removed it.


pkg/sql/opt/xform/coster.go, line 1264 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This function could use a comment

Done.


pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: this sentence is a bit hard to follow. Maybe add some more words or add an example?

I added a couple of examples and a more thorough explanation. PTAL and let me know if it needs more clarification.


pkg/sql/opt/xform/coster.go, line 1605 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

For clarity, maybe it would help to define another variable rowsPerDistinctVal := inputRowCount/orderedStats.DistinctCount?

Thank you for the suggestion!


pkg/sql/opt/xform/limit_funcs.go, line 263 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It probably doesn't matter too much here, but just FYI, I'm guessing that the reason it was done this way in other places is to try to minimize the amount of unnecessary logic that needs to run if there are no secondary indexes. We've tried hard to keep the "match" logic for each rule as efficient as possible, so that more expensive logic that might perform allocations only runs if the rule is definitely going to apply.

Ack, thanks. I reverted for consistency.


pkg/sql/opt/xform/limit_funcs.go, line 306 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

The comment needs to be updated

Done.


pkg/sql/opt/xform/limit_funcs.go, line 301 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: you could give these return params names to help clarify

Done.


pkg/sql/opt/xform/limit_funcs.go, line 351 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Maybe add a comment to explain why you also return if fullPrefix is true

Done.


pkg/sql/opt/xform/testdata/rules/limit, line 1811 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Looks like the plan is different now, though

The plan cost changed enough when I fixed the cost model :-(. Changed this to a memo test instead.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice work!

Reviewed 8 of 8 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)


pkg/sql/opt/xform/coster.go, line 1604 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I added a couple of examples and a more thorough explanation. PTAL and let me know if it needs more clarification.

Yes very helpful - thanks!

@rharding6373 rharding6373 force-pushed the topk_partial_order_69724 branch from 834de68 to c614e21 Compare December 17, 2021 19:08
Recent improvements to TopK in colexec allow TopK to stop execution
early and emit its output if its sort columns were partially ordered in
the input rows. This change modifies the optimizer so that it can
find lower cost TopK plans.

This change adds two new exploration rules: `GenerateLimitedTopKScans`
and `GeneratePartialOrderTopK`. The first rule is similar to
`GenerateLimitedGroupByScans` in that it looks for secondary indexes
that could provide a partial ordering and adds the secondary index scan
and an index join to get the rest of the columns to the memo. This
allows us to explore cases of partially ordered inputs (via the index
scan) to TopK. The second rule is similar to `GenerateStreamingGroupBy`
in that it uses interesting orderings to find partial orderings.

The cost model is also updated to reflect the new estimated limit on the
number of rows TopK needs to process to find the top K rows. The limit
is propagated to TopK's child expressions as a limit hint.

Fixes: cockroachdb#69724

Release note (sql change): Improves cost model for TopK expressions if the
input to TopK can be partially ordered by its sort columns.
@rharding6373 rharding6373 force-pushed the topk_partial_order_69724 branch from c614e21 to 48dd550 Compare December 17, 2021 20:21
@craig
Copy link
Contributor

craig bot commented Dec 17, 2021

Canceled.

@rharding6373
Copy link
Collaborator Author

Thanks for the careful reviews! Fixed some merge conflicts.

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 17, 2021

Build succeeded:

@craig craig bot merged commit 889498d into cockroachdb:master Dec 17, 2021
@rharding6373 rharding6373 deleted the topk_partial_order_69724 branch December 23, 2021 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt, colexec: take advantage of partial ordering in topk sorter
4 participants