colexec: deeply reset datum-backed vectors in ResetInternalBatch
Previously, we would keep references to the old datums set in the
datum-backed vectors until those datums were overwritten, thus
unnecessarily extending the period during which the no-longer-used
datums remain live. We don't have to do this for other types because
their old space can actually be reused for new elements, so we want to
keep those references around (e.g. decimals can reuse the non-inlined
coefficient). This commit makes it so that we deeply unset the datums
in ResetInternalBatch.
Care had to be taken to keep the memory accounting up-to-date. In
particular, after deeply resetting a datum-backed vector, we want to
release the memory allocation that was taken up by the actual datums
while keeping the overhead of the tree.Datum interface in the
accounting (since the []tree.Datum slice is still fully live).

Release note: None
colexec: fix limiting the output batch size in several operators
In the cFetcher, the hash aggregator, and the ordered synchronizer we
are performing careful memory accounting in order to limit the size of
the output batches.
However, we had an incorrect assumption that could lead to batches
being larger than desired. Previously, the first time the batch exceeded
the target size, we would remember its "capacity" (i.e. the number of
rows that fit into that batch), and in the future we would always put up
to that number of rows in the batch. That works well when each row is of
roughly the same size throughout the lifetime of an operator; however,
that is not always the case. Imagine we have 1000 small rows followed by
1000 large rows - previously, all 1000 large rows would be put into
a single batch, significantly exceeding the target size.
This commit makes the limiting more precise: it removes the notion of
"max capacity" and, instead, checks whether the batch has exceeded the
target size after each row is set.
Release note: None
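As a rough illustration of the new strategy, here is a self-contained Go sketch; the `emitBatches` helper and `targetBatchSizeBytes` constant are hypothetical, not the operators' actual code. The size check runs after every appended row, so a run of large rows can no longer inherit a "capacity" learned from small rows.

```go
package main

import "fmt"

// targetBatchSizeBytes is an arbitrary target chosen for this sketch.
const targetBatchSizeBytes = 64

// emitBatches groups rows (represented by their byte sizes) into
// batches. After appending each row it checks whether the batch has
// exceeded the target size and, if so, closes the batch, mirroring the
// per-row check described above rather than caching a row count.
func emitBatches(rowSizes []int) [][]int {
	var batches [][]int
	var cur []int
	var curBytes int
	for _, sz := range rowSizes {
		cur = append(cur, sz)
		curBytes += sz
		if curBytes > targetBatchSizeBytes {
			batches = append(batches, cur)
			cur, curBytes = nil, 0
		}
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// A few small rows followed by large rows: with the per-row check
	// the large rows split across batches instead of all landing in one
	// oversized batch sized off the small rows.
	rows := []int{8, 8, 8, 8, 48, 48, 48, 48}
	for i, b := range emitBatches(rows) {
		fmt.Printf("batch %d: rows %v\n", i, b)
	}
}
```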