-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
colexec: fix incorrect accounting when resetting datum-backed vectors #97750
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
53b9211
to
3b414d9
Compare
This commit reverts a couple of other commits: - "colexec: fix a "fake" memory accounting leak for intra-query period" (72e83fe) - "colexec: deeply reset datum-backed vectors in ResetInternalBatch" (cb93c30) since they introduced incorrect memory accounting for the datum-backed vectors. Those two commits together solved another issue where we would keep no-longer-needed datums live for longer than necessary (until they are overwritten in the datum-backed vector) by eagerly nil-ing them out when resetting the whole batch. This required introducing some careful adjustment to the memory accounting in order to keep the accounting up to date. However, that logic turned out to be faulty; in particular, it became possible to register the allocations of the datum-backed vectors with one account but then attempt to release some of those allocations from another. If those releases happen enough times, it'd put the account in debt which would trigger an internal error (or a crash in test builds). Such a scenario can occur because we have a couple of utility operators that append a vector to a batch owned by another operator. When that other operator resets its batch, the appended-by-utility-operator vector is also reset, and the memory usage of the freed datum would be deregistered from the wrong account. Tracking precisely which vector is owned by the owner of the batch vs appended by another operator can be cumbersome and error-prone, so this commit instead of introducing this tracking removes the resetting behavior of the datum-backed vectors. This should be bullet-proof while only increasing slightly the amount of time references to datums are kept live. Release note (bug fix): CockroachDB could previously encounter an internal error "no bytes in account to release ..." in rare cases and this is now fixed. The bug was introduced in 22.1.
3b414d9
to
f2dd52c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rharding6373)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Have you opened an issue to address the original problem that the commits were trying to solve?
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @yuzefovich)
I think that original problem is not that big of a deal. Unfortunately, I didn't include any details into the commit message of #76463 to indicate why I decided to address that problem in the first place (my guess is that I was just modifying some related code and noticed an inefficiency). Just to make sure we're on the same page, that problem is roughly as follows:
Thus, we could have discarded old datums in the third point but now will do so only in the fourth point. It's hard to say how long between these two points take, but my feeling is that it shouldn't be that long. The impact of this is that our RSS is larger than necessary for that - hopefully brief - interval. I spent some time today thinking how to avoid the revert of these two commits, but I couldn't see a clean and reliable way. I think that this issue is not worth spending more time on, so opening up a github issue to track doesn't seem worth it (it'd go straight into the cold storage backlog). TFTRs! bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from f2dd52c to blathers/backport-release-22.1-97750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. error creating merge commit from f2dd52c to blathers/backport-release-22.2-97750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit reverts a couple of other commits:
(72e83fe)
(cb93c30)
since they introduced incorrect memory accounting for the datum-backed
vectors.
Those two commits together solved another issue where we would keep
no-longer-needed datums live for longer than necessary (until they are
overwritten in the datum-backed vector) by eagerly nil-ing them out when
resetting the whole batch. This required introducing some careful
adjustment to the memory accounting in order to keep the accounting up
to date. However, that logic turned out to be faulty; in particular, it
became possible to register the allocations of the datum-backed vectors
with one account but then attempt to release some of those allocations
from another. If those releases happen enough times, it'd put the
account in debt which would trigger an internal error (or a crash in
test builds).
Such a scenario can occur because we have a couple of utility operators
that append a vector to a batch owned by another operator. When that
other operator resets its batch, the appended-by-utility-operator
vector is also reset, and the memory usage of the freed datum would be
deregistered from the wrong account. Tracking precisely which vector is
owned by the owner of the batch vs appended by another operator can be
cumbersome and error-prone, so this commit instead of introducing this
tracking removes the resetting behavior of the datum-backed vectors.
This should be bullet-proof while only increasing slightly the amount of
time references to datums are kept live.
Fixes: #97603.
Release note (bug fix): CockroachDB could previously encounter an
internal error "no bytes in account to release ..." in rare cases and
this is now fixed. The bug was introduced in 22.1.