[FLINK-36979][rpc] Reverting pekko version bump in Flink 1.20 #25866

Status: Open · wants to merge 1 commit into base release-1.20
Conversation

@XComp (Contributor) commented Dec 29, 2024

This reverts commit 4776c96.

What is the purpose of the change

Reverts the pekko version bump that includes an upgrade to netty 4.x. Corresponding discussion happened in FLINK-36510.

Brief change log

  • Plain revert

Verifying this change

  • no additional verification done aside from CI

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@XComp changed the title from "[FLINK-36979][rpc] Reverting pekko version bump" to "[FLINK-36979][rpc] Reverting pekko version bump in Flink 1.20" on Dec 29, 2024
@flinkbot (Collaborator) commented Dec 29, 2024

CI report:

Bot commands: The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@He-Pin (Member) commented Dec 31, 2024

@ferenc-csaky @XComp Hi, I think I can confirm that there is no leak on the Pekko side.

@He-Pin (Member) commented Dec 31, 2024

I would like to suggest:

Rerun the tests with -Dio.netty.tryReflectionSetAccessible=true and -Dio.netty.leakDetection.level=PARANOID to see why and where it leaks, or get a heap dump.

We have some applications (not Flink applications) that still need -Dio.netty.tryReflectionSetAccessible=true to avoid OOMs in production.

Alternatively, you could make Netty's ByteBuf allocator unpooled by default with -Dio.netty.allocator.type=unpooled.
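For reference, a minimal Java sketch of the programmatic equivalents of these switches, against vanilla (unshaded) Netty 4.1. The system properties must be set before any Netty class loads, since they are read in static initializers, and Flink's shaded Netty may use relocated property names:

    import io.netty.util.ResourceLeakDetector;

    public final class NettyDebugSettings {
        public static void main(String[] args) {
            // Must run before Netty's PlatformDependent/allocator classes initialize,
            // because these properties are read in their static initializers.
            System.setProperty("io.netty.tryReflectionSetAccessible", "true");
            System.setProperty("io.netty.allocator.type", "unpooled");

            // In-code counterpart of -Dio.netty.leakDetection.level=PARANOID:
            // every buffer is tracked, and leak reports carry full access records.
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

            System.out.println("leak detection level: " + ResourceLeakDetector.getLevel());
        }
    }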

@XComp (Contributor, Author) commented Jan 3, 2025

@He-Pin are you sure? We see the following stacktrace in this e2e test failure:

Jan 02 06:20:02 org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Cannot reserve 4194304 bytes of direct buffer memory (allocated: 140396831, limit: 141557760) (connection to 'localhost/127.0.0.1:42031 [localhost:45071-b0167d]')
Jan 02 06:20:02 	at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:175) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Jan 02 06:20:02 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]

Anyway, I might have another look into it on the Flink side as well.
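For what it's worth, that message format ("Cannot reserve ... bytes of direct buffer memory (allocated: ..., limit: ...)") looks like the JDK's own direct-memory accounting in java.nio.Bits, i.e. the cleaner-based ByteBuffer.allocateDirect path that Netty falls back to when its no-cleaner unsafe buffers are unavailable, which would be consistent with the tryReflectionSetAccessible suggestion above. The 4194304 bytes also match one pooled-allocator chunk (pageSize 8192 << maxOrder 9 = 4 MiB with recent Netty 4.1 defaults). A minimal, hypothetical reproduction of the same failure mode under a tight direct-memory cap, against vanilla Netty 4.1:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    // Run with: java -XX:MaxDirectMemorySize=7m DirectMemoryExhaustion
    // The second 4 MiB chunk reservation (2 x 4 MiB > 7 MiB) fails with an
    // out-of-memory error from either Netty's or the JDK's accounting,
    // depending on whether the no-cleaner unsafe path is available.
    public final class DirectMemoryExhaustion {
        public static void main(String[] args) {
            PooledByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
            ByteBuf first = alloc.directBuffer(4 * 1024 * 1024);  // reserves chunk #1
            ByteBuf second = alloc.directBuffer(4 * 1024 * 1024); // chunk #2 breaches the cap
            first.release();
            second.release();
        }
    }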

@He-Pin (Member) commented Jan 3, 2025

@XComp So it would be better to turn on -Dio.netty.leakDetection.level=PARANOID to see who holds the buffers, and to turn on -Dio.netty.allocator.type=unpooled too, so that all ByteBufs are allocated outside the pool.

@normanmaurer do you have any suggestions? Thanks.

@He-Pin (Member) commented Jan 3, 2025

Another optimization: apache/pekko#1667.

@He-Pin (Member) commented Jan 6, 2025

@XComp Is there any update you can share? Thanks.

@XComp (Contributor, Author) commented Jan 6, 2025

> @XComp Is there any update you can share? Thanks.

Currently, I have a lot of other stuff on my plate. I wouldn't mind if somebody else could help pick this up.

@He-Pin (Member) commented Jan 6, 2025

It would be nice if anyone could add:

-Dio.netty.tryReflectionSetAccessible=true
-Dio.netty.leakDetection.level=PARANOID
-Dio.netty.allocator.type=unpooled

when running the tests.
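For a one-off local run, one possible way to thread these through Maven Surefire is its argLine user property (a hypothetical invocation, untested against Flink's build, and it overrides any argLine the build itself configures):

    mvn test -pl flink-runtime \
        -DargLine="-Dio.netty.tryReflectionSetAccessible=true -Dio.netty.leakDetection.level=PARANOID -Dio.netty.allocator.type=unpooled"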

@ferenc-csaky (Contributor)

Hi! I'm back from my holiday, so I can take this and try some runs; I will update the Jira ticket with any progress.

@davidradl (Contributor)

@ferenc-csaky @He-Pin @XComp It looks like this PR is all about reverting the pekko version bump until we understand the cause of the OOM. All the comments seem to relate to resolving the OOM. Can I suggest we merge this revert and investigate the OOM separately, as suggested in the Jira?

@He-Pin (Member) commented Jan 9, 2025

This is not a leak but a user-side error; this is how Netty works. Would you set the limit to 7 MB in production, @davidradl?
The best Pekko can do is:

  1. add an option to support using an unpooled allocator so that the 7 MB memory tests pass. Is that the right way to go?

If you think that's really needed, please send a PR to Pekko that makes the allocator type of the channels configurable.
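In Netty terms, such an option would presumably boil down to something like the following at channel-bootstrap time (a sketch with vanilla Netty types; the actual Pekko config surface would look different):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.buffer.UnpooledByteBufAllocator;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public final class UnpooledChannelFactory {
        // A client bootstrap whose channels allocate unpooled buffers,
        // approximating the Netty 3 memory behavior discussed here.
        public static Bootstrap clientBootstrap(EventLoopGroup group) {
            return new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .option(ChannelOption.ALLOCATOR, UnpooledByteBufAllocator.DEFAULT);
        }
    }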

@ferenc-csaky (Contributor)

@davidradl My understanding is that Netty 4 does not leak memory; compared to Netty 3 it simply does not work the same way by default and reserves a bit more memory.

But with -Dio.netty.tryReflectionSetAccessible=true the default memory footprint will be smaller, and -Dio.netty.allocator.type=unpooled can also help with that, although performance-wise unpooled allocation is not really advisable.

@He-Pin 7 MB is not realistic in any kind of production use case; it is only set that way for the failing test because that test validates how much memory Netty uses, which is why it sets num-arenas to 1 and the memory limit to 7 MB.

My suggestion would be to fix this test instead of reverting, either by giving it more memory or by providing the necessary Netty configs so it can function within that memory limit.

For the sake of completeness: on master, commit 338d024 already increased the memory to 90 MB, so backporting that commit to 1.20 and 1.19 could fix the test case mentioned here earlier. Although I'm not sure that test will still be meaningful with those changes, because num-arenas=1 is also removed in that commit.

My original idea was to set -Dio.netty.tryReflectionSetAccessible=true for this test case, because that way it fits into 7 MB; test execution succeeded multiple times in my local runs on both JDK 11 and JDK 17.
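To make the pooled-vs-unpooled distinction concrete, a small sketch (vanilla Netty 4.1; Flink's shaded copy relocates these packages) showing that the pooled allocator pins a whole chunk for a tiny request, while the unpooled allocator reserves only roughly what was asked for:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;
    import io.netty.buffer.UnpooledByteBufAllocator;

    public final class AllocatorFootprint {
        public static void main(String[] args) {
            PooledByteBufAllocator pooled = new PooledByteBufAllocator(true);
            ByteBuf a = pooled.directBuffer(1024); // 1 KiB request...
            // ...but a whole chunk (4 MiB with recent defaults) is now pinned in an arena:
            System.out.println("pooled reserved:   " + pooled.metric().usedDirectMemory());

            UnpooledByteBufAllocator unpooled = new UnpooledByteBufAllocator(true);
            ByteBuf b = unpooled.directBuffer(1024);
            System.out.println("unpooled reserved: " + unpooled.metric().usedDirectMemory());

            a.release();
            b.release();
        }
    }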

@He-Pin (Member) commented Jan 9, 2025

@ferenc-csaky Our high-throughput Java applications running on Java 11/21 have used -Dio.netty.tryReflectionSetAccessible=true since we upgraded from Java 8; it is really needed to avoid OOMs and reduce GC pressure. The old setup did not need it because it was unpooled, which generates many more GC cycles and hurts performance.

I vote for adding this by default, or at least adding -Dio.netty.tryReflectionSetAccessible=true to the release notes. People who want the old behavior can enable it with -Dio.netty.allocator.type=unpooled if they prefer more GC.

@He-Pin (Member) commented Jan 9, 2025

I implemented a Netty 4-based remoting transport once while working for a game company, but the Akka team did not accept that PR for some reason, so we could only use it internally. Years later the Pekko fork happened and we gained control of the code, so we can do the right thing now. I'm using Netty at $Work too.

We should not simply revert blindly; let's do this the right way. The CVEs are really annoying.

@davidradl (Contributor) left a comment

It sounds like the title of the PR is no longer in line with what we want based on the PR comments. There is appetite to fix this properly on the higher pekko version, so I am removing my approval, as it referred to the reversion code that is currently in the PR.

@He-Pin (Member) commented Jan 10, 2025

@davidradl Is there any investigation result you can share? Thanks.

@ferenc-csaky (Contributor)

Opened #25955, which I believe should supersede this PR.

@davidradl (Contributor)

@ferenc-csaky sounds good - can we close this PR?

@ferenc-csaky (Contributor)

IMO yes; I will close both this one and the 1.19 equivalent on Monday if there are no objections until then.

@XComp (Contributor, Author) commented Jan 13, 2025

Sorry for not getting back earlier. Thanks for looking into the issue, @ferenc-csaky. But on a more general note, and given the concerns @zentol shared in FLINK-36510:

  • Shouldn't we merge the revert at least for 1.19, as the stable release (i.e., stick to the older netty version)?
  • I see your point about upgrading to netty 4.x for 1.20, as this seems to be the LTS version for Flink 2.x. One other option we have here is doing the upgrade for 1.20.2 and not including it in 1.20.1 (which @afedulov is preparing right now). That would give us a few more CI release cycles. WDYT?

@ferenc-csaky (Contributor)

@XComp I do not see any problem with your suggested approach for 1.19; it makes sense.

I can also accept releasing it with 1.20.2 on the 1.20 line if everybody else agrees, but personally I am a bit more reluctant about that. I guess that release will happen a couple of months after 1.20.1 in the best case, and it will surely give us more time to see whether the CI runs are more stable in this regard. But even if not, based on my current investigation it will probably come down to some other necessary configuration, not a serious memory bug. And on the other hand, the pesky Netty 3 CVEs won't go anywhere. @afedulov WDYT, any requirements from your side?

@afedulov (Contributor) commented Jan 13, 2025

I believe that since we are not dealing with a memory leak but rather with different memory allocation behavior (thanks @ferenc-csaky for confirming this!), we should aim to include the upgrade in both the 1.19 and 1.20 releases. While there is some risk of exposing users to OOM kills by pushing some workloads over hard memory limits, that risk has to be weighed against the risks posed by leaving numerous critical CVEs exposed on the network stack.

Netty 3.10.6 is the last 3.x release and officially reached EOL more than 8 years ago. I have also read reports of it suffering from multiple GC and memory-management issues, so it is not as if we are transitioning away from something rock-solid that works perfectly to an experimental release.

The approach proposed by @He-Pin in the above comment sounds reasonable to me. If we adopt it, I think we should actually add the -Dio.netty.tryReflectionSetAccessible=true parameter as a default in the startup scripts, not merely mention it in the documentation. An open question remains: should we retain the 3.x-style memory management by using unpooled allocations? My current understanding is that using both parameters should approximate the existing behavior, but unpooled allocation does not seem strictly required in our case. @He-Pin, what's your perspective on this?
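If we go that route, a hedged illustration of what it might look like in flink-conf.yaml (assuming the env.java.opts.all option of recent 1.x releases; whether the flag belongs there or hard-coded in the startup scripts is exactly what the change would need to settle):

    env.java.opts.all: -Dio.netty.tryReflectionSetAccessible=true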

As for fixing it in 1.20.2 vs. 1.20.1: I am not convinced that having CI run more times will give us the required confidence. The primary concern lies in breaching memory limits in existing deployments rather than stability. If we agree this change is eventually necessary for a patch release, it would be logical to apply it now for both 1.19.2 and 1.20.1. I will make sure this potential concern is explicitly mentioned in the release notes.

@ferenc-csaky Do we have a rough understanding of how much additional memory consumption this new version induces? Does it look like a fixed amount, or something that scales with the number of connections?

@He-Pin (Member) commented Jan 13, 2025

@afedulov Thanks for the ping.

  1. For the Netty 4 migration: we can't ship a library with many CVEs, so it took me nearly 5 weekends to migrate from Netty 3 to Netty 4, and I had Flink in mind too. The multi-jvm-plugin was also migrated from Netty 3 to Netty 4, and we upstreamed that to Akka, where it was still on Netty 3, and it got merged (Migrate akka-multi-node-testkit to Netty4, akka/akka#32005).
  2. We did encounter this in production (a Spring Boot application with Netty-based RPC): after we migrated from Java 8 to Java 11 we hit OOMs, all of which were solved with a simple -Dio.netty.tryReflectionSetAccessible=true.
  3. Netty 4 works differently than Netty 3, mainly for performance and less GC; that's just how it works. Yes, it may be my fault for using the PooledByteBufAllocator in the first place. I could keep the old behavior with an unpooled heap allocator, but I don't think that's best practice, so I'm not putting it behind a config option either.
  4. You can take a look at the current implementation; I think it follows best practice. It did take me a day to figure out whether it really leaks on the Pekko side, but after that I can confirm it does NOT.
  5. Since Pekko 1.1.0 shipped, we have received no leak reports.

So I think keeping the Netty 3 version would be rather ill-advised, especially with the detailed investigation @ferenc-csaky has done. Keeping Netty 3 would expose the whole downstream supply chain to CVEs, and that's not right.

But I do suggest we do some long-running stress testing of this (e.g. 1 or 2 days). I looked at some issues inside Flink, e.g. parallelism serialization, which I think can be done in the current classic transport too.

In short: stress-test to make sure it works smoothly, then ship the Netty 4 version. That's my +1.

@He-Pin (Member) commented Jan 13, 2025

BTW, if Flink also supports JDK 17+, please add the following when running on JDK 17 or higher:

--add-opens java.base/java.nio=ALL-UNNAMED
--add-opens java.base/jdk.internal.misc=ALL-UNNAMED
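These flags matter because Netty, with tryReflectionSetAccessible enabled, reflectively reaches into java.nio internals to build its no-cleaner direct buffers. A tiny probe in the same spirit (illustrative only, not Netty's actual check) that fails on JDK 17+ unless java.base/java.nio is opened:

    import java.lang.reflect.Field;
    import java.nio.Buffer;

    public final class AddOpensProbe {
        public static void main(String[] args) throws Exception {
            // Netty reads java.nio.Buffer.address (among other internals).
            // Without --add-opens java.base/java.nio=ALL-UNNAMED this throws
            // InaccessibleObjectException on JDK 17+, where strong encapsulation
            // is the default.
            Field address = Buffer.class.getDeclaredField("address");
            address.setAccessible(true);
            System.out.println("java.nio is open; reflective access works");
        }
    }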

@He-Pin (Member) commented Jan 13, 2025

@davidradl I think making the old behavior (unpooled) configurable can be done, but that will need @pjfanning to confirm backporting.

The change is in how the PooledByteBufAllocator works: it caches some arenas. But TBH, 7 MB is very small for any workload.
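For reference, those defaults can be printed directly (vanilla Netty 4.1 API; the values shown in comments are typical recent defaults and can differ per version and platform), which shows why a 7 MB cap cannot even hold two pooled chunks:

    import io.netty.buffer.PooledByteBufAllocator;

    public final class PoolDefaults {
        public static void main(String[] args) {
            int pageSize = PooledByteBufAllocator.defaultPageSize(); // typically 8192
            int maxOrder = PooledByteBufAllocator.defaultMaxOrder(); // typically 9 in recent 4.1.x
            System.out.println("direct arenas: " + PooledByteBufAllocator.defaultNumDirectArena());
            System.out.println("chunk size:    " + (pageSize << maxOrder) + " bytes"); // 4 MiB
        }
    }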

@He-Pin (Member) commented Jan 13, 2025

@XComp @ferenc-csaky @davidradl @afedulov I just prepared a PR for this, apache/pekko#1707; not sure if @pjfanning agrees with backporting it to the 1.1.4 release.

@tomncooper (Contributor)

+1 for keeping the newer Pekko and Netty 4 in 1.20.1 (and not merging this PR). This is the LTS release, and my 2c is that having a more secure base and fixing any issues that arise is the better path.

@He-Pin (Member) commented Jan 14, 2025

A backport is pending in apache/pekko#1709, as requested by @afedulov.
