[BUG] netty OOM with MULTITHREADED shuffle #9153

Closed · abellina opened this issue Aug 31, 2023 · 6 comments · Fixed by #9344
Labels: bug (Something isn't working) · performance (A performance related task/issue) · reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) · shuffle (things that impact the shuffle plugin)

Comments

@abellina (Collaborator) commented Aug 31, 2023:

I ran NDS at 100TB and hit a Netty OOM in query50. This needs further investigation; either our limits are not working or there is a leak somewhere:

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 20999453 byte(s) of direct memory (used: 15263094153, max: 15271460864)
  at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802)
  at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
  at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:632)
  at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:621)
  at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:213)
  at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
  at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
  at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:395)
  at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
  at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
  at io.netty.buffer.CompositeByteBuf.allocBuffer(CompositeByteBuf.java:1872)
  at io.netty.buffer.CompositeByteBuf.consolidate0(CompositeByteBuf.java:1751)
  at io.netty.buffer.CompositeByteBuf.consolidate(CompositeByteBuf.java:1739)
  at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:179)
  at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
abellina added the bug, ? - Needs Triage, reliability, and shuffle labels on Aug 31, 2023
abellina changed the title to "[BUG] netty OOM with MULTITHREADED shuffle" on Aug 31, 2023
mattahrens removed the ? - Needs Triage label on Sep 6, 2023
@abellina (Collaborator, Author) commented Sep 6, 2023:

Note that I also see this with Parquet when the number of cores is high. As a test, I increased spark.executor.cores from 16 to 64. I am not recommending this config yet, but it seems to push the limits of the host memory available for shuffle. With this many cores the shuffle stage has more tasks in flight, so something is definitely amiss with the limits we have in place. A rough sketch of this kind of test setup is below.
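A minimal sketch of the test setup described above, assuming a standard SparkSession and the spark-rapids MULTITHREADED shuffle mode; the values are only the ones mentioned in this comment and are not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: raise executor cores from 16 to 64 so more shuffle tasks
// run concurrently per executor, stressing the host memory used by the
// multithreaded shuffle. In a real cluster these settings would normally be
// passed at submit time rather than in the session builder.
val spark = SparkSession.builder()
  .appName("nds-mt-shuffle-stress")
  .config("spark.executor.cores", "64")                 // baseline run used 16
  .config("spark.rapids.shuffle.mode", "MULTITHREADED") // multithreaded shuffle, per the issue title
  .getOrCreate()
```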

@abellina (Collaborator, Author) commented Sep 8, 2023:

Setting spark.rapids.shuffle.multiThreaded.maxBytesInFlight fixed this for me. I calculated a value that stays within what Netty computes as its max size (https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L1231C51-L1231C70). Netty allows overriding this maximum with a Java system property (-Dio.netty.maxDirectMemory); by default it falls back to the maximum the JVM reports is available off heap.

Normally this should default to -Xmx, according to a similar Spark issue (https://issues.apache.org/jira/browse/SPARK-27991). In this case the value is lower than -Xmx (I started with a 16GB heap), so it's not clear why that would be the case. The Spark strategy for dealing with this issue, which we also follow, is to retry when an allocation like this fails.

We may need to decrease our default maxBytesInFlight or configure this dynamically to make it easier to use.
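A minimal sketch, assuming Netty 4.1's internal PlatformDependent API, of how to see the cap Netty is enforcing (the "max" reported in the stack trace above); this is only a diagnostic aid, not part of the fix described in this comment:

```scala
import io.netty.util.internal.PlatformDependent

object NettyDirectMemoryCheck {
  def main(args: Array[String]): Unit = {
    // The cap Netty enforces for direct allocations: io.netty.maxDirectMemory
    // if the property is set (a raw byte count), otherwise a value Netty
    // derives from what the JVM reports is available off heap.
    println(s"Netty max direct memory: ${PlatformDependent.maxDirectMemory()} bytes")
  }
}
```

To raise that cap explicitly on executors, the property would have to reach the executor JVMs, for example via spark.executor.extraJavaOptions=-Dio.netty.maxDirectMemory=17179869184 (16 GiB written as a raw byte count, since the 4.1 code linked above reads the property as a plain long); treat the exact value as an illustrative assumption to be tuned, not a recommendation.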

@revans2 (Collaborator) commented Sep 8, 2023:

Great work @abellina. We really need to figure out how to document this and how it fits in with our off-heap memory limit story.

abellina added the performance label on Sep 18, 2023
@abellina (Collaborator, Author) commented:

I added the reliability tag because we would like to document the behavior of the maxBytesInFlight setting with respect to performance, come up with better defaults, and likely also provide a way to tie this into the host allocator interface.

@abellina (Collaborator, Author) commented:

Given the runs I have made, I believe setting this config to somewhere between 64MB and 128MB will be a good enough default. I am running more iterations, but that's the current status.

@abellina (Collaborator, Author) commented:

Here are the findings at 3TB, running in the cloud (Dataproc) on 4 x n1-standard-16 nodes with 16 threads for the MT shuffle. There is a lot of noise in here, but I have another datapoint from our performance cluster for query67 specifically (this query is particularly sensitive to the MT shuffle). Given the Dataproc and perf cluster data, I am going to set the config to 128MB, and we'll need to continue this work as part of the host memory framework (#8900).

[Screenshot from 2023-09-29 09-17-30 attached]
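A hedged sketch of applying the 128MB value settled on above, assuming a standard SparkSession builder; 134217728 is 128 MiB expressed in bytes, and the shuffle mode key is the one implied by the issue title:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: cap the multithreaded shuffle's in-flight bytes at 128 MiB
// (134217728 bytes), the value this thread converges on.
val spark = SparkSession.builder()
  .config("spark.rapids.shuffle.mode", "MULTITHREADED")
  .config("spark.rapids.shuffle.multiThreaded.maxBytesInFlight", "134217728")
  .getOrCreate()
```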
