Reduce garbage for requests with unpooled buffers #32228
Conversation
With this commit we avoid copying request contents on the HTTP layer (NIO and Netty 4) when the buffer that holds the request body contents is unpooled. As the unpooled allocator is usually used on smaller heaps, this helps to reduce garbage further in those situations.
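To make the idea concrete, here is a minimal sketch (not the actual diff): the handler context and the `Netty4Utils.toBytesReference` helper are assumed, and the decision point is simply whether the body buffer came from the pooled allocator.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.handler.codec.http.FullHttpRequest;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.transport.netty4.Netty4Utils;

// Sketch only: copy the HTTP request body only when it comes from a pooled
// allocator; an unpooled buffer is plain heap memory we can reference directly.
static BytesReference bodyOf(FullHttpRequest request) {
    ByteBuf content = request.content();
    if (content.alloc() instanceof UnpooledByteBufAllocator) {
        return Netty4Utils.toBytesReference(content);          // reference the bytes, no copy (assumed helper)
    }
    return new BytesArray(ByteBufUtil.getBytes(content));      // defensive copy as before
}
```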
Pinging @elastic/es-core-infra
@jasontedor can you please review the Netty parts? I plan to backport the Netty parts to 6.x as well.
Theoretically this LGTM. But I think that it will cause the build to fail once #32354 is merged.
I think we always want to release the request? Maybe for the Unpooled version we want `request.content().duplicate()` and then release the request?
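A quick sketch of that suggestion (the handler context is hypothetical and handling is shown synchronously). Note that a plain `duplicate()` shares the original's reference count, so a retained duplicate is used here to keep the bytes alive until handling finishes:

```java
import io.netty.buffer.ByteBuf;
import io.netty.handler.codec.http.FullHttpRequest;
import java.util.function.Consumer;

// Sketch: keep a retained view of the body instead of copying it, then release
// both the view and the original request once handling is done.
static void handle(FullHttpRequest request, Consumer<ByteBuf> dispatch) {
    ByteBuf body = request.content().retainedDuplicate(); // shares memory, bumps refCount
    try {
        dispatch.accept(body);                             // downstream handling
    } finally {
        body.release();                                    // balances retainedDuplicate()
        request.release();                                 // drops the original reference
    }
}
```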
I actually want to point out that this is also not strictly compliant with what is possible in netty. The following is valid:
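A minimal sketch of the kind of construct meant here, assuming Netty 4's `CompositeByteBuf` API:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.Unpooled;

// A composite buffer created via the Unpooled helper reports the unpooled
// allocator, but its components can still come from the pooled allocator.
CompositeByteBuf composite = Unpooled.compositeBuffer();
ByteBuf pooledComponent = PooledByteBufAllocator.DEFAULT.heapBuffer(128);
composite.addComponent(true, pooledComponent);
// composite.alloc() is the unpooled allocator here, yet the composite must
// still be released to return the pooled component's memory.
```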
This creates a composite byte buffer that is backed by a pooled byte buffer but uses the unpooled allocator. This would be a memory leak in this PR. I don't think this should happen since we tend to only use one allocator type at once, but it is possible. I think maybe we should address this in this PR by converting the bytes to not be a Netty thing?
I guess this is a little tricky right now. So probably not the best approach for this PR.
@tbrooks8 after our discussion yesterday I added an additional assert that checks that all associated buffers are unpooled, so we get failures if that assumption is violated. Can you please have a look?
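Not the exact assertion from the change, but a sketch of the shape such a check could take:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.UnpooledByteBufAllocator;

// Sketch: recursively assert that a (possibly composite) buffer only contains
// components handed out by the unpooled allocator.
static boolean assertUnpooled(ByteBuf buffer) {
    if (buffer instanceof CompositeByteBuf) {
        CompositeByteBuf composite = (CompositeByteBuf) buffer;
        for (int i = 0; i < composite.numComponents(); i++) {
            assert assertUnpooled(composite.internalComponent(i));
        }
    } else {
        assert buffer.alloc() instanceof UnpooledByteBufAllocator
            : "expected an unpooled buffer but allocator was " + buffer.alloc();
    }
    return true;
}
```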
LGTM -
Looking at netty - if you set `-Dio.netty.allocator.numHeapArenas=0` it looks like the pool allocator delegates to Unpooled:
```java
// Netty's PooledByteBufAllocator#newHeapBuffer: with numHeapArenas=0 the heap
// arena is null, so allocation falls through to unpooled heap buffers.
if (heapArena != null) {
    buf = heapArena.allocate(cache, initialCapacity, maxCapacity);
} else {
    buf = PlatformDependent.hasUnsafe() ?
            new UnpooledUnsafeHeapByteBuf(this, initialCapacity, maxCapacity) :
            new UnpooledHeapByteBuf(this, initialCapacity, maxCapacity);
}
```
We should maybe look at a follow-up where we set that to 0 in the unpooled case, i.e. we set both the unpooled system property (making it the default) and heap arenas = 0 (so a manual usage of the pooled allocator gets delegated to unpooled). And then maybe the logic here can just be that those settings are enabled? See the sketch below.
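For reference, the knobs being discussed are plain Netty system properties; a sketch of what such an ergonomic default might set (the property names are Netty's, combining them this way is the follow-up idea, and `numDirectArenas` is included here only for symmetry):

```java
// Prefer the unpooled allocator by default and disable all arenas so that any
// direct use of the pooled allocator also falls through to unpooled buffers.
// These must be set before any Netty classes are initialized.
System.setProperty("io.netty.allocator.type", "unpooled");
System.setProperty("io.netty.allocator.numHeapArenas", "0");
System.setProperty("io.netty.allocator.numDirectArenas", "0");
```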
Thanks for the review and good point on the follow-up. When we choose the unpooled allocator ergonomically, this is a pretty simple change (code-wise). When it is set by the user in …
I just want to provide an update on why I held back merging this change. First of all, I managed to fix this issue for all allocator types (pooled and unpooled) by releasing the buffer at the appropriate place. Then I ran several benchmarks and it turns out that with this change indexing throughput actually decreases by roughly 10% (I tested several of our Rally tracks on 1GB and 4GB heaps). Profiling reveals the reason: the bulk API uses the buffer's random access API (i.e. …).

As a next step I want to benchmark the performance impact on our NIO implementation, but I'll wait until #32757 is merged.
After running several benchmarks for the NIO implementation as well, with different workloads and heap sizes, it turns out that we do not get a practical benefit from avoiding this copy. Therefore, I'm closing this PR unmerged.
* Copying the request is not necessary here. We can simply release it once the response has been generated and save a lot of `Unpooled` allocations that way.
* Relates elastic#32228
* I think the issue that prevented that PR from being merged was solved by elastic#39634, which moved the bulk index marker search to `ByteBuf` bulk access, so the composite buffer shouldn't require many additional bounds checks (I'd argue the bounds checks we add, we save when copying the composite buffer); see the sketch below.
* I couldn't necessarily reproduce much of a speedup from this change, but I could reproduce a very measurable reduction in GC time with e.g. Rally's PMC (a 4g-heap node and bulk requests of size 5k saw a reduction in young GC time of ~10% for me).
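A sketch of the per-byte vs. bulk-access difference referenced above (illustrative only; `ByteProcessor.FIND_LF` is Netty's stock newline matcher):

```java
import io.netty.buffer.ByteBuf;
import io.netty.util.ByteProcessor;

// Per-byte indexed access: on a CompositeByteBuf every getByte(i) first has to
// locate the backing component, which is what made the contiguous copy faster.
static int findNewlinePerByte(ByteBuf buf, int from) {
    for (int i = from; i < buf.writerIndex(); i++) {
        if (buf.getByte(i) == '\n') {
            return i;
        }
    }
    return -1;
}

// Bulk access: forEachByte walks each component's backing memory directly, so
// the composite buffer costs roughly one bounds check per component instead.
static int findNewlineBulk(ByteBuf buf, int from) {
    return buf.forEachByte(from, buf.writerIndex() - from, ByteProcessor.FIND_LF);
}
```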