
k/p/gen: Use fragmented_vector for fetchable_partition_response #11181

Merged
merged 2 commits into redpanda-data:dev
Oct 11, 2023

Conversation

BenPope
Member

@BenPope BenPope commented Jun 3, 2023

Avoid oversize allocations by converting fetchable_partition_response to small_fragment_vector

Fixes #11017

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Improvements

  • Kafka: Avoid very large contiguous allocations during fetch.

Member

@StephanDollberg StephanDollberg left a comment


It's possible that large_fragment_vector is too far in the wrong direction

I think the general concern is that we shouldn't just blindly dump the fragmented_vector everywhere, because it will create lots of memory wastage if we only put something like a couple of 8-byte elements into it (and this is even worse for the large_fragment_vector, I guess).

fetch_response::partition_response isn't too large but I guess we are expecting many of those for topics with higher partition counts?

/**
* Assign from a std::vector.
*/
fragmented_vector& operator=(const std::vector<T>& rhs) noexcept {
Member


Removing this because it is not needed anymore or for any other reason?

Member Author


It was added to satisfy a use case that is no longer needed. If we want something like this I'd rather be explicit, as it can hide unexpected copies.

template<typename T>
fragmented_vector<T> make_fragmented_vector(std::initializer_list<T> in) {
    fragmented_vector<T> ret;
    for (auto& e : in) {
        ret.push_back(e);
    }
    return ret;
}
Member


Can one move out of an initializer_list?

Member


My hunch would be yes, but since it's a test helper, I'm not sure how much it matters?

Member Author


See #13854

@BenPope
Member Author

BenPope commented Jun 5, 2023

fetch_response::partition_response isn't too large but I guess we are expecting many of those for topics with higher partition counts?

Yeah, the oversize allocation was in: rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions

StephanDollberg
StephanDollberg previously approved these changes Jun 5, 2023
Member

@StephanDollberg StephanDollberg left a comment


Right, makes sense. I guess the question is how much we are pessimizing responses with few partitions.

If there is no new data on a partition do we return both a metadata response and a partition response?

I guess it's fine either way as long as there is only a single one for each fetch request.

@travisdowns
Member

The cover page mentions metadata response, but this change affects the fetch path, not the metadata path, right?

So it's also perf/memory sensitive.

@BenPope
Member Author

BenPope commented Jun 5, 2023

The cover page mentions metadata response, but this change affects the fetch path, not the metadata path, right?

The fetch response is the primary fix, but I reworked some stuff in the generator so that the metadata response builds into a fragmented_vector instead of a std::vector; a large allocation and copy is avoided there, too.

So it's also perf/memory sensitive.

Aye.

@BenPope
Member Author

BenPope commented Jun 5, 2023

I guess the question is how much we are pessimizing responses with few partitions.

Honestly, I think large_fragment_vector might not be the right choice for either operation. We've discussed changing fragmented_vector to have more std::vector-like behaviour for the first fragment (it doesn't make much sense for further fragments as a std::vector would roughly double anyway).

@travisdowns as the author of #8469, do you have any thoughts about the pessimisation here?

Towards the end of this PR it should be possible to avoid enable_fragmentation_resistance by using the regular path override mechanism for types.

If there is no new data on a partition do we return both a metadata response and a partition response?

They're separate responses for separate requests. Both occur quite often.

@michael-redpanda
Contributor

I noticed this has been sitting stale for a little bit. I think @BenPope has an outstanding question for @travisdowns regarding whether large_fragment_vector is the right "tool for the job".

dotnwat
dotnwat previously approved these changes Sep 27, 2023
Member

@dotnwat dotnwat left a comment


lgtm

Comment on lines 1254 to 1256
{%- if field.nullable() %}
{%- if flex %}
{{ fname }} = reader.read_nullable_flex_array([version](protocol::decoder& reader) {
{%- else %}
{{ fname }} = reader.read_nullable_array([version](protocol::decoder& reader) {
{%- endif %}
{%- else %}
{%- if flex %}
{{ fname }} = reader.read_flex_array([version](protocol::decoder& reader) {
{%- else %}
{{ fname }} = reader.read_array([version](protocol::decoder& reader) {
{%- endif %}
{%- endif %}
{%- set nullable = "nullable_" if field.nullable() else "" %}
{%- set flex = "flex_" if flex else "" %}
{{ fname }} = reader.read_{{nullable}}{{flex}}array([version](protocol::decoder& reader) {
Member


🔥


@dotnwat
Member

dotnwat commented Sep 27, 2023

Not sure if there are more outstanding questions. Maybe ping Travis?

@travisdowns
Member

@BenPope - yeah, the part where fragmented vector creates a large first segment as soon as the first entry is added is a problem.

Arguably it's worse here for fetch because this is a much hotter path in general, and it's also involved in the "poll" loop where we wake up periodically to do the whole fetch execution again but then go to sleep if we don't get enough bytes.

That said, there are 8 commits here and I think the first 7 seem non-controversial. So we could always go with those now and continue the discussion on the fetch problem? The fetch is a problem and I guess we have to solve it.

@dotnwat
Member

dotnwat commented Oct 1, 2023

@BenPope - yeah, the part where fragmented vector creates a large first segment as soon as the first entry is added is a problem.
Arguably it's worse here for fetch because this is a much hotter path in general, and it's also involved in the "poll" loop where we wake up periodically to do the whole fetch execution again but then go to sleep if we don't get enough bytes.

Is this just a matter of using a more modest segment size in the fragmented vector? Maybe not: it sounds like, between the "poll" case and large fetches, the variability is large enough that we need something adaptive. Or maybe we could ditch the vector entirely and use a boost intrusive list...

@BenPope
Member Author

BenPope commented Oct 2, 2023

I've rebased everything but the last commit into #13854 with some minor changes.

@BenPope
Member Author

BenPope commented Oct 2, 2023

I've rebased everything but the last commit into #13854 with some minor changes.

I also accidentally pushed a rebase of this branch, so I've rebased again on top of #13854

@mergify

mergify bot commented Oct 2, 2023

⚠️ The sha of the head commit of this PR conflicts with #13854. Mergify cannot evaluate rules on this PR. ⚠️

StephanDollberg
StephanDollberg previously approved these changes Oct 3, 2023
@BenPope BenPope dismissed StephanDollberg’s stale review October 3, 2023 08:14

The merge-base changed after approval.

@travisdowns
Member

Is this just a matter of using a more modest segment size in the fragmented vector? Maybe not: it sounds like, between the "poll" case and large fetches, the variability is large enough that we need something adaptive. Or maybe we could ditch the vector entirely and use a boost intrusive list...

Right, I think it's exactly a case of using a small segment size, like say 1K. 1K isn't going to bother anyone, and since these are small objects we can still fit plenty into a segment, so I don't see much of a downside here. Yes, there will be more segments, but none of the operations are really O(number_of_segments) anyway (arguably object destruction can be, for primitive types which otherwise avoid the O(n) deletion cost, but oh well).

@BenPope BenPope self-assigned this Oct 5, 2023
* Introduce `small_fragment_vector`
* Switch `fetchable_partition_response` to it

Signed-off-by: Ben Pope <[email protected]>
@BenPope
Member Author

BenPope commented Oct 10, 2023

Is this just a matter of using a more modest segment size in the fragmented vector? Maybe not: it sounds like, between the "poll" case and large fetches, the variability is large enough that we need something adaptive. Or maybe we could ditch the vector entirely and use a boost intrusive list...

Right, I think it's exactly a case of using a small segment size, like say 1K. 1K isn't going to bother anyone, and since these are small objects we can still fit plenty into a segment, so I don't see much of a downside here. Yes, there will be more segments, but none of the operations are really O(number_of_segments) anyway (arguably object destruction can be, for primitive types which otherwise avoid the O(n) deletion cost, but oh well).

Done

Member

@dotnwat dotnwat left a comment


lgtm

@piyushredpanda
Contributor

Known failure: #14053

@piyushredpanda piyushredpanda merged commit ee0e298 into redpanda-data:dev Oct 11, 2023
25 of 28 checks passed
@vbotbuildovich
Collaborator

/backport v23.2.x

@vbotbuildovich
Collaborator

/backport v23.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-11181-v23.1.x-362 remotes/upstream/v23.1.x
git cherry-pick -x 793d7af15c408ef2ee99cc0c48a95903c3fca19c f33973cd55f9ff1af86dd1e5f484ebfd958ff9b1

Workflow run logs.

@BenPope
Member Author

BenPope commented Oct 13, 2023

/backport v23.1.x

Successfully merging this pull request may close these issues.

Oversized allocation: 393216 bytes in kafka::op_context::create_response_placeholders
7 participants