colmem: improve the behavior of ResetMaybeReallocate #81535
Conversation
Previously, `Allocator.ResetMaybeReallocate` would always grow the capacity of the batch until `coldata.BatchSize()` (unless the memory limit has been exceeded). This behavior allows us to have batches with dynamic size when we don't know how many rows we need to process (for example, in the `cFetcher` we start out with 1 row, then grow it exponentially - 2, 4, 8, etc). However, in some cases we know exactly how many rows we want to include in the batch, so that behavior can result in us re-allocating a batch needlessly when the old batch already has enough capacity. This commit improves the situation by adding a knob to indicate that if the desired capacity is satisfied by the old batch, then it should not be re-allocated. All callers have been audited accordingly. Release note: None
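As a rough illustration of the knob's intent (the `nextCapacity` helper and `batchSize` constant are made up here, and this sketch ignores the memory-limit dimension that the real `Allocator.ResetMaybeReallocate` also considers):

```go
package main

import "fmt"

const batchSize = 1024 // stand-in for coldata.BatchSize()

// nextCapacity sketches the capacity decision described in the commit
// message. oldCap == 0 means there is no old batch yet. Illustrative only,
// not the real colmem API.
func nextCapacity(oldCap, minDesired int, desiredCapacitySufficient bool) (newCap int, reuseOld bool) {
	if minDesired > batchSize {
		minDesired = batchSize
	}
	if oldCap == 0 {
		// No old batch: allocate exactly the desired capacity.
		return minDesired, false
	}
	if oldCap >= batchSize {
		// Already at maximum capacity: nothing to grow.
		return oldCap, true
	}
	if desiredCapacitySufficient && oldCap >= minDesired {
		// New behavior: the old batch already satisfies the desired
		// capacity, so don't reallocate.
		return oldCap, true
	}
	// Old behavior: double, then ratchet up to the desired capacity.
	newCap = oldCap * 2
	if newCap > batchSize {
		newCap = batchSize
	}
	if minDesired > newCap {
		newCap = minDesired
	}
	return newCap, false
}

func main() {
	// With the knob set, asking for 5 rows while holding a 10-row batch
	// reuses the old batch instead of allocating a 20-row one.
	c, reused := nextCapacity(10, 5, true)
	fmt.Println(c, reused) // 10 true
}
```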
Nice improvement! I have some questions about the surrounding code.
Reviewed 15 of 28 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
pkg/sql/colexec/colexecjoin/crossjoiner.go
line 112 at r1 (raw file):
true, /* desiredCapacitySufficient */ )
if willEmit > c.output.Capacity() {
Will this condition ever trigger? It doesn't seem like it would now, but I'm not sure it would have before this change, either, since willEmit is the min desired capacity.
pkg/sql/colexec/colexecutils/deselector.go
line 62 at r1 (raw file):
// so we don't use a memory limit here. It is up to the wrapped operator to
// limit the size of batches based on the memory footprint.
const maxBatchMemSize = math.MaxInt64
Do you know why maxBatchMemSize is a MaxInt64? It seems high, and, looking at ResetMaybeReallocate, if this is the max batch size then we will never reuse a batch, since batches seem to be capped at coldata.BatchSize().
Also, when desiredCapacitySufficient is true, does it really matter what the max batch size is?
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rharding6373, and @yuzefovich)
pkg/sql/colmem/allocator.go
line 193 at r1 (raw file):
if desiredCapacitySufficient && oldBatch.Capacity() >= minDesiredCapacity { useOldBatch = true }
Why does this logic need `desiredCapacitySufficient`? Why can't it simply compare against `minDesiredCapacity`?
Code quote:
if desiredCapacitySufficient && oldBatch.Capacity() >= minDesiredCapacity {
useOldBatch = true
}
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
pkg/sql/colexec/colexecjoin/crossjoiner.go
line 112 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Will this condition ever trigger? It doesn't seem like it would now, but I'm not sure it would have before this change, either, since willEmit is the min desired capacity.
`willEmit` here can be arbitrarily large (meaning it can exceed `coldata.BatchSize()`), and `ResetMaybeReallocate` will not "honor" the "min desired capacity" and will reduce it to `coldata.BatchSize()`. This happens at the very top of the `ResetMaybeReallocate` function. In other words, although we ask for `willEmit` capacity, a smaller batch might be returned, so we have to reduce `willEmit` accordingly.
pkg/sql/colexec/colexecutils/deselector.go
line 62 at r1 (raw file):
There is a comment right above - `deselectorOp` relies on its input to produce reasonably sized batches, so if the input batch is of a small enough size when the selection vector is applied to it, then the "deselected" batch will also be of a small enough size (since the "deselected" batch is not larger in its footprint than the original batch with a selection vector).
> Also, when desiredCapacitySufficient is true, does it really matter what the max batch size is?
Yeah, it does matter, since these two knobs operate in "different dimensions". Imagine a scenario where we have a batch with 10 very wide rows, so its footprint exceeds `maxBatchMemSize`; then even if we ask for a capacity of 20 rows in `ResetMaybeReallocate`, we should still get the original batch with a capacity of 10, because the rows are wide and we don't want to exceed the memory target even more, regardless of the value of the `desiredCapacitySufficient` knob.
On the other hand, if our old batch has a capacity of 10, then asking for a capacity of 5 when `desiredCapacitySufficient` is `true` should give us the same old batch with a capacity of 10, regardless of the max batch mem size.
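The two "dimensions" can be sketched as follows (a hypothetical `shouldReuse` helper, not the real colmem code; the real method also handles allocating the replacement batch):

```go
package main

import "fmt"

// shouldReuse sketches the priority between the two knobs: the memory
// footprint check comes first, and only then does the capacity knob matter.
// Illustrative names, not the real colmem API.
func shouldReuse(oldFootprint, maxBatchMemSize int64, oldCap, minDesired int, desiredCapacitySufficient bool) bool {
	if oldFootprint >= maxBatchMemSize {
		// The old batch already exceeds the memory target (e.g. 10 very
		// wide rows): reuse it regardless of the desired capacity.
		return true
	}
	// Within the memory target: reuse only if the caller says the old
	// capacity is sufficient and it actually is.
	return desiredCapacitySufficient && oldCap >= minDesired
}

func main() {
	// Wide rows: footprint over the limit, so asking for 20 rows still
	// reuses the 10-row batch.
	fmt.Println(shouldReuse(1<<30, 1<<20, 10, 20, false)) // true
	// Narrow rows: capacity 10 satisfies a desired capacity of 5 when the
	// knob is set, no matter how large maxBatchMemSize is.
	fmt.Println(shouldReuse(100, 1<<20, 10, 5, true)) // true
}
```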
pkg/sql/colmem/allocator.go
line 193 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Why does this logic need
desiredCapacitySufficient
? Why can't it simply compare againstminDesiredCapacity
?
`ResetMaybeReallocate` with `desiredCapacitySufficient == false` (the only previous behavior) is our "dynamic batch size" mechanism. The idea is that some callers don't know how many rows they will add into the batch (e.g. the `cFetcher` in the absence of limits and limit hints), so they start out small (usually with a batch of 1 row), and `ResetMaybeReallocate` tries to find the best capacity by growing it exponentially.
Going back to the `cFetcher` example: imagine we have 1m rows in the table and we're performing a full table scan. The `cFetcher` will return a batch with 1 row first, then with 2, with 4, etc., until it gets to `coldata.BatchSize()` on the eleventh batch, from which point it will only be reusing that maximum batch. This ensures that we get to the maximum batch quickly. At the same time, such a strategy works well when the table is small - if we're performing a full scan of a table with 1 row, it'd be inefficient to allocate a batch for `coldata.BatchSize()` rows right away.
The dynamic sizing behavior provided by `Allocator.ResetMaybeReallocate` is our heuristic-based answer to these concerns. However, in some cases the callers do know how many rows they will add into the batch, and this PR makes the behavior in such a scenario more efficient.
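For concreteness, the doubling schedule can be checked with a few lines of Go (assuming the default `coldata.BatchSize()` of 1024, which is what makes the maximum land on the eleventh batch):

```go
package main

import "fmt"

func main() {
	// Simulate a caller starting at a 1-row batch and doubling each time,
	// as the cFetcher does in the absence of limits and limit hints.
	const batchSize = 1024 // default coldata.BatchSize()
	capacity, batches := 1, 1
	for capacity < batchSize {
		capacity *= 2
		batches++
	}
	fmt.Println(capacity, batches) // 1024 11
}
```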
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
pkg/sql/colexec/colexecjoin/crossjoiner.go
line 112 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
👍
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rharding6373, and @yuzefovich)
pkg/sql/colmem/allocator.go
line 193 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Making sure I understand: so there are two modes: (a) doubling if below `coldata.BatchSize()` or (b) ensuring at least `minDesiredCapacity`. In mode (a) `minDesiredCapacity` only matters for the first call with `oldBatch == nil`. Is that right?
Are we ever going to call this with a dynamic value for `desiredCapacitySufficient`? If we're always going to hardcode either true or false, I think it would be simpler for readers (and simpler for compiler inlining) to use two functions instead of sharing code. Then we can give them descriptive names like "doubleEachTime" and "ensureAtLeastThisLarge".
The `cFetcher` example is confusing me. All other callers with `desiredCapacitySufficient == false` use `minDesiredCapacity == 1`. The `cFetcher`, on the other hand, seems to know what it's doing with `minDesiredCapacity` and yet it still uses the (a) mode. Why?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
pkg/sql/colmem/allocator.go
line 193 at r1 (raw file):
> so there are two modes: (a) doubling if below `coldata.BatchSize()` or (b) ensuring at least `minDesiredCapacity`. In mode (a) `minDesiredCapacity` only matters for the first call with `oldBatch == nil`. Is that right?

I think it's not quite right, but let me spell things out.
Goals for `ResetMaybeReallocate` in terms of capacity allocation, in priority order:
1. make batches not exceed memory limits (the `maxBatchMemSize` parameter)
2. grow the capacity of the batches exponentially, until the maximum
3. try to provide a capacity of at least `minDesiredCapacity`, subject to the memory limit though.

We achieve this with the following approach:
a) if there is no old batch, since we don't know the footprint of the rows that will be added to the batch, we just allocate a new batch of `minDesiredCapacity` (which is clamped to the `[1, coldata.BatchSize()]` range)
b) if there is an old batch, then:
b1) if the old batch has already exceeded the memory limit, we reuse it. This makes sure that we stay as close to goal 1 as possible.
b2) if the old batch is of the maximum capacity, we also reuse it. It doesn't matter whether we have reached the maximum memory size - we have nowhere to grow anyway, so goal 2 has been reached.
b3) if the old batch has neither exceeded the memory limit nor reached the maximum capacity, then according to goal 2 we want to grow the capacity exponentially, so we make the new capacity double the old one. Next, if `minDesiredCapacity` is still larger than double the old capacity, we ratchet the new capacity up to the desired capacity. The reasoning is that we're still within memory limits (goal 1 is good), we increased the capacity by a factor of more than 2 (goal 2 is good), and we got exactly the desired capacity (goal 3 is good).
(Side comment: note that since we learn the batch's memory footprint only after adding rows to it, we actually are expected to exceed the memory limit rather than stay under it, but once exceeded, the footprint will not grow further. We think this is acceptable given that we're still precisely accounting for the footprint with the memory accounting system.)
Now, if we examine the behavior in b3), we see that we always double the capacity for the new batch (subject to not exceeding `coldata.BatchSize()`) even if `minDesiredCapacity` was already satisfied by the old batch. This is the case this PR attempts to address. When `desiredCapacitySufficient` is `true`, goal 2 is effectively disabled and goal 3 rises in priority.
Looking over this description, I think the behavior is a bit more nuanced than the two modes you described.
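The a)/b1)/b2)/b3) steps above can be condensed into a sketch (illustrative names and a hypothetical `decide` helper; the real method also resets the batch and updates memory accounting):

```go
package main

import "fmt"

const maxCap = 1024 // stand-in for coldata.BatchSize()

// decide sketches the capacity-allocation steps a), b1), b2), b3).
// oldCap == 0 means there is no old batch.
func decide(oldCap int, oldFootprint, memLimit int64, minDesired int, desiredCapacitySufficient bool) (newCap int, reuse bool) {
	if minDesired < 1 {
		minDesired = 1
	}
	if minDesired > maxCap {
		minDesired = maxCap // clamp to [1, maxCap]
	}
	if oldCap == 0 {
		return minDesired, false // a) no old batch
	}
	if oldFootprint >= memLimit {
		return oldCap, true // b1) over the memory limit: reuse (goal 1)
	}
	if oldCap == maxCap {
		return oldCap, true // b2) nowhere left to grow: reuse (goal 2)
	}
	if desiredCapacitySufficient && oldCap >= minDesired {
		return oldCap, true // this PR: desired capacity already satisfied
	}
	newCap = 2 * oldCap // b3) double...
	if newCap > maxCap {
		newCap = maxCap
	}
	if minDesired > newCap {
		newCap = minDesired // ...then ratchet up to the desired capacity
	}
	return newCap, false
}

func main() {
	// Old capacity 4, desired 10: doubled to 8, then ratcheted to 10.
	fmt.Println(decide(4, 100, 1<<20, 10, false)) // 10 false
}
```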
> Are we ever going to call this with a dynamic value for `desiredCapacitySufficient`? If we're always going to hardcode either true or false, I think it would be simpler for readers (and simpler for compiler inlining) to use two functions instead of sharing code. Then we can give them descriptive names like "doubleEachTime" and "ensureAtLeastThisLarge".

I think we will always hardcode either `true` or `false` there, but to me it seems the two methods would be almost the same. For example, I think it's still beneficial to have the doubling behavior when `desiredCapacitySufficient` is `true` for the case when the old batch does not have enough capacity. For example, the deselector operator knows how many rows it will copy into the output batch from its current input batch, but the next input batch can be much larger - so if the old output batch doesn't have enough capacity for the current input batch, it is probably a good idea to allocate the new output batch with some extra space. That said, most callers (like the sorter or the hash aggregator) know the total number of rows they will add to the output batch throughout its lifetime, so for those we could have a much simpler function. I guess to me having a single function with an additional parameter still seems cleaner.
> The `cFetcher` example is confusing me. All other callers with `desiredCapacitySufficient == false` use `minDesiredCapacity == 1`. The `cFetcher`, on the other hand, seems to know what it's doing with `minDesiredCapacity` and yet it still uses the (a) mode. Why?

Yeah, maybe my choice of the `cFetcher` for the example was unfortunate, given that it wants the most complicated behavior. The idea there is that if we have a limit hint, the `cFetcher` will try to allocate the first batch according to that hint (so it specifies a `minDesiredCapacity` argument equal to the hint); however, if it turns out that the limit hint's number of rows was insufficient for the caller of the `cFetcher`, then the hint was wrong, and we fall back to the doubling behavior.
Similarly, when performing a full table scan, we use the estimated row count from the table stats to size the first batch for the `cFetcher`. Imagine we're performing a full table scan of a table with 10k rows but our stats are stale and say that the table only has 100 rows - in such a scenario we create the first batch with a capacity of 100. Then it turns out that there are still more rows to read, so for the second batch we use the same `minDesiredCapacity = 100` but with `desiredCapacitySufficient = false`, so we'll get the doubled capacity of 200, then 400, etc., eventually getting to `coldata.BatchSize()`.
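A quick simulation of that stale-stats scenario (illustrative, assuming the default `coldata.BatchSize()` of 1024):

```go
package main

import "fmt"

func main() {
	// Batch capacities for a scan of a 10k-row table when stale stats
	// claim only 100 rows: the first batch uses the estimate, and later
	// batches double up to the maximum.
	const batchSize = 1024 // default coldata.BatchSize()
	capacity := 100        // first batch sized from the (stale) estimate
	caps := []int{capacity}
	for capacity < batchSize {
		capacity *= 2
		if capacity > batchSize {
			capacity = batchSize
		}
		caps = append(caps, capacity)
	}
	fmt.Println(caps) // [100 200 400 800 1024]
}
```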
Wow, I wrote up a lot of words, and I hope they make sense. 😅 Curious to hear your opinion on this setup.
I'm just a cranky old man. This is good.
Reviewed 16 of 28 files at r1.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
pkg/sql/colmem/allocator.go
line 193 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Your words make sense. I still think two functions would be clearer, but that is only an opinion, and your point that they would have a lot in common is valid.
Thank you for pointing out that `maxBatchMemSize` is the highest-priority constraint. I was overlooking that.
And thank you for explaining the `cFetcher` motivation, that helps.
lol
TFTRs!
bors r+
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
Build succeeded.