colexec: fix incorrect accounting when resetting datum-backed vectors #97750

Merged (1 commit) on Feb 28, 2023

yuzefovich
Member

@yuzefovich yuzefovich commented Feb 27, 2023

This commit reverts a couple of other commits:

  • "colexec: fix a "fake" memory accounting leak for intra-query period"
    (72e83fe)
  • "colexec: deeply reset datum-backed vectors in ResetInternalBatch"
    (cb93c30)

since they introduced incorrect memory accounting for the datum-backed
vectors.

Those two commits together solved another issue where we would keep
no-longer-needed datums live for longer than necessary (until they are
overwritten in the datum-backed vector) by eagerly nil-ing them out when
resetting the whole batch. This required introducing some careful
adjustment to the memory accounting in order to keep the accounting up
to date. However, that logic turned out to be faulty; in particular, it
became possible to register the allocations of the datum-backed vectors
with one account but then attempt to release some of those allocations
from another. If those releases happen enough times, the account goes
into debt, which triggers an internal error (or a crash in test
builds).

Such a scenario can occur because we have a couple of utility operators
that append a vector to a batch owned by another operator. When that
other operator resets its batch, the appended-by-utility-operator
vector is also reset, and the memory usage of the freed datum would be
deregistered from the wrong account. Precisely tracking which vectors are
owned by the owner of the batch and which were appended by another
operator would be cumbersome and error-prone, so instead of introducing
such tracking this commit removes the resetting behavior of the
datum-backed vectors. This should be bullet-proof while only slightly
increasing the amount of time references to datums are kept live.
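
For illustration only, here is a minimal sketch of the failure mode described above. It assumes a heavily simplified accounting model: the names `account`, `vec`, `batch`, `resetAndRelease`, and `demo` are hypothetical and are not the actual colexec or memory-accounting APIs, and the error text is invented for the example.

```go
package main

import "fmt"

// account is a hypothetical stand-in for a per-operator memory account.
type account struct {
	name string
	used int64
}

// grow registers n bytes with the account.
func (a *account) grow(n int64) { a.used += n }

// shrink releases n bytes, refusing to let the account go into debt.
func (a *account) shrink(n int64) error {
	if n > a.used {
		return fmt.Errorf("%s account: cannot release %d bytes, only %d registered", a.name, n, a.used)
	}
	a.used -= n
	return nil
}

// vec is a hypothetical datum-backed vector; each element holds the
// footprint in bytes of the datum stored there (0 means a nil datum).
type vec struct{ datumBytes []int64 }

// batch is a hypothetical batch that does not track which operator
// registered the memory of each of its vectors.
type batch struct{ vecs []*vec }

// resetAndRelease models the reverted behavior: nil out every datum and
// release its footprint from the single account passed by the caller.
func (b *batch) resetAndRelease(acc *account) error {
	for _, v := range b.vecs {
		for i, sz := range v.datumBytes {
			if err := acc.shrink(sz); err != nil {
				return err
			}
			v.datumBytes[i] = 0
		}
	}
	return nil
}

// demo builds a batch with one owner vector and one vector appended by a
// utility operator, then resets it using only the owner's account.
func demo() (owner, util *account, err error) {
	owner = &account{name: "owner"}
	util = &account{name: "utility"}
	b := &batch{}

	owner.grow(16) // two 8-byte datums registered with the owner
	b.vecs = append(b.vecs, &vec{datumBytes: []int64{8, 8}})

	util.grow(16) // two 8-byte datums registered with the utility operator
	b.vecs = append(b.vecs, &vec{datumBytes: []int64{8, 8}})

	err = b.resetAndRelease(owner)
	return owner, util, err
}

func main() {
	owner, util, err := demo()
	fmt.Println("owner used:", owner.used, "utility used:", util.used)
	fmt.Println("reset error:", err)
}
```

In this toy model the owner's account is drained by its own vector, so the first release for the appended vector fails, while the utility operator's account still holds bytes for datums that were already dropped.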

Fixes: #97603.

Release note (bug fix): CockroachDB could previously encounter an
internal error "no bytes in account to release ..." in rare cases and
this is now fixed. The bug was introduced in 22.1.

@blathers-crl
Copy link

blathers-crl bot commented Feb 27, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich marked this pull request as ready for review February 28, 2023 00:12
@yuzefovich yuzefovich requested a review from a team as a code owner February 28, 2023 00:12
@yuzefovich yuzefovich requested review from cucaroach, rharding6373 and DrewKimball and removed request for cucaroach February 28, 2023 00:12
Collaborator

@DrewKimball DrewKimball left a comment

:lgtm: Thanks for fixing this!

Reviewed 15 of 15 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rharding6373)

Collaborator

@rharding6373 rharding6373 left a comment

Thanks for the fix. Have you opened an issue to address the original problem that the commits were trying to solve? :lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @yuzefovich)

@yuzefovich
Member Author

> Have you opened an issue to address the original problem that the commits were trying to solve?

I think that original problem is not that big of a deal. Unfortunately, I didn't include any details in the commit message of #76463 to indicate why I decided to address that problem in the first place (my guess is that I was just modifying some related code and noticed an inefficiency).

Just to make sure we're on the same page, that problem is roughly as follows:

  • an operator on the first Next call allocates a new coldata.Batch with datumVecs in which every element is nil
  • then it populates all vectors, including the datum-backed vectors, with some data and returns the batch
  • on the second and all subsequent calls to Next, the same batch is reset at the very beginning of Next. This means that all previous data can be discarded. For most types we will reuse the same elements in each vector, but datums are immutable, so we could safely discard them here.
  • instead, those datums are kept live until the operator processes the current call to Next and overwrites the datums in the datum-backed vectors.

Thus, we could have discarded old datums in the third point but now will do so only in the fourth point. It's hard to say how much time passes between these two points, but my feeling is that it shouldn't be that long. The impact of this is that our RSS is larger than necessary for that - hopefully brief - interval.
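
As a rough illustration of the four points above, here is a toy sketch. It assumes a datum-backed vector can be modeled as a slice of references; `datumVec`, `live`, `populate`, `shallowReset`, and `deepReset` are hypothetical names, not the real coldata API.

```go
package main

import "fmt"

// datumVec is a hypothetical stand-in for a datum-backed vector: a slice
// of references to immutable values.
type datumVec []interface{}

// live counts the elements that still reference a datum.
func live(v datumVec) int {
	n := 0
	for _, d := range v {
		if d != nil {
			n++
		}
	}
	return n
}

// populate overwrites every element with a new datum for the given call.
func populate(v datumVec, call int) {
	for i := range v {
		v[i] = fmt.Sprintf("datum-%d-%d", call, i)
	}
}

// shallowReset models the behavior after the revert: old datums stay
// referenced until populate overwrites them.
func shallowReset(v datumVec) {}

// deepReset models the reverted behavior: every reference is dropped
// immediately so the old datums can be garbage collected sooner.
func deepReset(v datumVec) {
	for i := range v {
		v[i] = nil
	}
}

func main() {
	v := make(datumVec, 4) // first Next call: all elements start out nil
	populate(v, 1)

	shallowReset(v)
	fmt.Println("live after shallow reset:", live(v))

	deepReset(v)
	fmt.Println("live after deep reset:", live(v))
}
```

The only difference between the two resets in this sketch is when the old references are dropped; either way they are gone once the next call repopulates the vector.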

I spent some time today thinking about how to avoid reverting these two commits, but I couldn't see a clean and reliable way. I think that this issue is not worth spending more time on, so opening a GitHub issue to track it doesn't seem worthwhile (it'd go straight into the cold storage backlog).

TFTRs!

bors r+

@craig
Contributor

craig bot commented Feb 28, 2023

Build succeeded:

@craig craig bot merged commit 5f34f44 into cockroachdb:master Feb 28, 2023
@blathers-crl

blathers-crl bot commented Feb 28, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f2dd52c to blathers/backport-release-22.1-97750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from f2dd52c to blathers/backport-release-22.2-97750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues.

roachtest: incorrectly resetting datum-backed vectors that are appended