
release-22.2: kvstreamer: account for the overhead of GetResponse and ScanResponse #97499

Merged 1 commit into cockroachdb:release-22.2 on Feb 23, 2023

Conversation

yuzefovich (Member)

Backport 1/1 commits from #97425.

/cc @cockroachdb/release


The Streamer is careful to account for the requests (both the footprint and the overhead) as well as to estimate the footprint of the responses. However, it currently doesn't account for the overhead of the GetResponse (currently 64 bytes) and ScanResponse (120 bytes) structs. We recently saw a case where this overhead was the largest consumer of RAM and contributed to a pod OOMing. This commit fixes the accounting oversight in the following manner:

  • Prior to issuing the BatchRequest, we estimate the overhead of a response
    to each request in the batch. Notably, the BatchResponse will contain
    a RequestUnion object as well as a GetResponse or ScanResponse object
    for each request.
  • Once the BatchResponse is received, we reconcile the budget to track
    the precise memory usage of the responses (ignoring the RequestUnion
    since we don't keep a reference to it). We already tracked the
    "footprint", and now we also include the "overhead", with both being
    released to the budget on the Result.Release call.

We track this "responses overhead" usage separately from the target bytes
usage (the "footprint") since the KV server doesn't include the overhead
when determining how to handle the TargetBytes limit, and we must behave
in the same manner.

It's worth noting that the overhead of the response structs is
proportional to the number of requests included in the BatchRequest
since every request will get a corresponding (possibly empty) response.
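
The per-request arithmetic described above can be sketched as follows. The struct definitions and the 16-byte RequestUnion size here are illustrative stand-ins, not the real protobuf-generated types from the CockroachDB codebase (whose sizes, 64 and 120 bytes per the PR, depend on their generated fields):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical stand-ins for the real response structs; the actual types
// are protobuf-generated in the CockroachDB repo.
type getResponse struct{ _ [64]byte }
type scanResponse struct{ _ [120]byte }
type requestUnion struct{ _ [16]byte }

// responsesOverhead estimates, before the BatchRequest is sent, the struct
// overhead of its responses: one RequestUnion plus one GetResponse or
// ScanResponse per request in the batch. The result grows linearly with
// the number of requests, since every request gets a corresponding
// (possibly empty) response.
func responsesOverhead(numGets, numScans int) int64 {
	perGet := int64(unsafe.Sizeof(requestUnion{})) + int64(unsafe.Sizeof(getResponse{}))
	perScan := int64(unsafe.Sizeof(requestUnion{})) + int64(unsafe.Sizeof(scanResponse{}))
	return int64(numGets)*perGet + int64(numScans)*perScan
}

func main() {
	// A batch of 2 Gets and 1 Scan: 2*(16+64) + 1*(16+120) = 296 bytes.
	fmt.Println(responsesOverhead(2, 1))
}
```

After the BatchResponse arrives, this pre-reservation would be reconciled against the actual struct sizes (dropping the RequestUnion share, since no reference to it is retained), and released back to the budget on Result.Release.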

Fixes: #97279.

Release note: None

Release justification: stability improvement.

@yuzefovich yuzefovich requested a review from a team as a code owner February 22, 2023 18:39
blathers-crl (bot) commented Feb 22, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the following exceptional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

cockroach-teamcity (Member)

This change is Reviewable

yuzefovich (Member, Author)

I'll let this bake on master for a week or so in case any of the tests become flaky.

@DrewKimball (Collaborator) left a comment:

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

yuzefovich (Member, Author)

Looks like there might be an extraordinary 22.2.x release, and I want to get this fix in if possible. Nightlies on master didn't seem to reveal any flakes.

@yuzefovich yuzefovich merged commit 51bbca7 into cockroachdb:release-22.2 Feb 23, 2023
@yuzefovich yuzefovich deleted the backport22.2-97425 branch February 23, 2023 17:55