[FLINK-36979][rpc] Reverting pekko version bump in Flink 1.20 #25866
base: release-1.20
This reverts commit 4776c96.
@ferenc-csaky @XComp Hi, I think I can confirm that there is no leak on the Pekko side.
I would like to suggest rerunning the tests with -Dio.netty.tryReflectionSetAccessible=true and -Dio.netty.leakDetection.level=PARANOID to see why and where it leaks, or to get a heap dump. We have some applications (not Flink applications) that still need this or , you could set Netty's ByteBuf allocator with
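A minimal sketch of how the two diagnostic flags from the comment above could be appended to a test JVM's argument list. The class `NettyLeakFlags`, the method `withLeakDetection`, and the `-Xmx512m` placeholder are hypothetical and are not part of Flink, Pekko, or Netty; only the two `-D` flags come from the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink, Pekko, or Netty. */
final class NettyLeakFlags {
    // Both flags are quoted from the comment above.
    static final String REFLECTION = "-Dio.netty.tryReflectionSetAccessible=true";
    static final String LEAK_LEVEL = "-Dio.netty.leakDetection.level=PARANOID";

    /** Returns a copy of baseArgs with the two diagnostic flags appended, skipping any already present. */
    static List<String> withLeakDetection(List<String> baseArgs) {
        List<String> args = new ArrayList<>(baseArgs);
        for (String flag : List.of(REFLECTION, LEAK_LEVEL)) {
            if (!args.contains(flag)) {
                args.add(flag);
            }
        }
        return args;
    }

    public static void main(String[] unused) {
        // "-Xmx512m" is an assumed placeholder for whatever options the test run already uses.
        System.out.println(String.join(" ", withLeakDetection(List.of("-Xmx512m"))));
    }
}
```

How the resulting arguments reach the forked test JVM depends on the build setup, which this sketch does not cover.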
@He-Pin are you sure? We see the following stacktrace in this e2e test failure:
Anyway, I might have another look into it on the Flink side as well.
@XComp So it is better to turn on
@normanmaurer Do you have any suggestions? Thanks.
Another optimization is in apache/pekko#1667.
@XComp Is there any update you can share? Thanks.
Currently, I have a lot of other stuff on my plate. I wouldn't mind if somebody else could help pick this up.
It would be nice if anyone could add : to run tests.
Hi! I'm back from my holiday, so I can take it and try some runs. I will update the Jira ticket with any progress.
@ferenc-csaky @He-Pin @XComp It looks like this PR is all about reverting the Pekko version until we understand the cause of the OOM, while all the comments relate to resolving the OOM. Can I suggest we merge this revert and investigate the OOM separately, as suggested in the Jira?
This is not a leak but a user-side error. This is how Netty works. Will you set it to 7MB in production, @davidradl?
If you think that's really needed, please send a PR to Pekko that can set up the allocator type of the channels.
@davidradl My understanding is that Netty 4 does not leak memory; compared to Netty 3, it simply works differently by default and reserves a bit more memory. But with
@He-Pin 7MB is not realistic in any kind of production use case. The failing test only sets it that way because it validates how much memory Netty uses, that's why it sets
My suggestion would be to fix this test instead of reverting, either by giving it more memory or by providing the necessary Netty configs so it can function with that much memory.
For the sake of completeness, on
My original idea was to set
@ferenc-csaky Our high-throughput Java applications, running on Java 11/21, are using I vote for adding this by default or adding the
I implemented the Netty 4-based remoting transport once when I was working for a game company, but the Akka team did not accept that PR for some reason, so we could only use it internally. Years later, the Pekko fork happened and we now have control of the code, so we can do the right thing. I'm using Netty at $Work too. We should not blindly revert; let's do this the right way. The CVEs are really annoying.
It sounds like the title of the PR is not in line with what we want to do, judging by the PR comments. There seems to be appetite to fix this properly on the higher Pekko version. So I am removing my approval, as it refers to the reversion code that is currently in the PR.
@davidradl Is there any update on the investigation from your side? Thanks.
Opened #25955, which I believe should supersede this PR.
@ferenc-csaky Sounds good. Can we close this PR?
IMO yes, I will close both this one and the 1.19 equivalent on Monday if there are no objections by then.
Sorry for not getting back earlier. Thanks for looking into the issue, @ferenc-csaky . But on a more general note and with the concerns @zentol shared in FLINK-36510:
@XComp I do not see any problem with your suggested approach for 1.19; it makes sense. I can also accept releasing it with 1.20.2 on the 1.20 line if everybody else agrees, but personally I am a bit more reluctant about that. I guess that release will happen a couple of months after 1.20.1 in the best-case scenario, and that will surely give us more time to see whether the CI runs are more stable in this respect. But even if they are not, based on my current investigation it will probably come down to some other necessary configuration, not a serious memory bug. And on the other hand, the pesky Netty 3 CVEs won't go anywhere. @afedulov WDYT, any requirements from your side?
I believe that since we are not dealing with a memory leak but rather with different memory allocation (thanks @ferenc-csaky for confirming this!), we should aim to include the upgrade in both the 1.19 and 1.20 releases. While there is some risk of exposing users to OOM kills by pushing some workloads over hard memory limits, we have to weigh that against the risk of leaving numerous critical CVEs exposed on the network stack. Netty 3.10.6 is the last 3.x release and therefore officially reached EOL more than 8 years ago. I also read reports of it suffering from multiple GC and memory management issues, so it is not like we are transitioning away from something that is rock solid and works perfectly to an experimental release.
The approach proposed by @He-Pin in the above comment sounds reasonable to me. If we adopt this approach, I think we should actually add
As for fixing it in 1.20.2 or 1.20.1: I am not convinced that having CI run more times will give us the required confidence. The primary concern lies in breaching memory limits in existing deployments rather than in stability. If we agree this change is eventually necessary for a patch release, it would be logical to apply it now for both 1.19.2 and 1.20.1. I will make sure this potential concern is explicitly mentioned in the release notes.
@ferenc-csaky Do we have a rough understanding of how much more memory this new version consumes? Does it look like a fixed amount or something that scales with the number of connections?
@afedulov Thanks for the ping.
So I think keeping the Netty 3 version seems a little smooth-brained, especially now that @ferenc-csaky has done a detailed investigation. Keeping Netty 3 will expose the whole downstream supply chain to CVEs, which is not right. But I do suggest we do some long-running stress testing of this (e.g. 1 or 2 days). I looked at some issues inside Flink, e.g. parallelism serialization, which I think can be done in the current classical transport too. In short: stress testing to make sure it works smoothly and then shipping the Netty 4 version gets my +1.
BTW, if Flink supports JDK 17+ too, please also add the below when running on JDK 17 or higher
@davidradl I think making the old behavior (unpooled) configurable can be done, but that will need @pjfanning to confirm backporting. The change comes from how the PooledByteBufAllocator works: it caches some arenas. But TBH, 7M is very small for any workload.
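To make the pooled versus unpooled point concrete, here is a toy model of why an arena-caching allocator can reserve more memory than the buffers actually in use. It does not call Netty. The class `AllocatorFootprint`, the 4 MiB chunk size, the two arenas, and the 64 KiB buffers are all assumptions chosen for illustration; only the 7MB budget comes from the thread.

```java
/** Toy model only: it does not call Netty, and all sizes are assumptions. */
final class AllocatorFootprint {
    /** Unpooled model: each request reserves exactly the bytes it asks for. */
    static long unpooledReserved(long[] requestBytes) {
        long total = 0;
        for (long r : requestBytes) {
            total += r;
        }
        return total;
    }

    /** Pooled model: every arena that serves requests reserves whole chunks and keeps them cached. */
    static long pooledReserved(long[][] requestsPerArena, long chunkBytes) {
        long total = 0;
        for (long[] arena : requestsPerArena) {
            long used = unpooledReserved(arena);
            long chunks = (used + chunkBytes - 1) / chunkBytes; // ceiling division
            total += chunks * chunkBytes;
        }
        return total;
    }

    public static void main(String[] unused) {
        long mib = 1024L * 1024L;
        long chunk = 4 * mib;                               // assumed chunk size
        long[][] perArena = { {64 * 1024}, {64 * 1024} };   // assumed: two arenas, one 64 KiB buffer each
        long budget = 7 * mib;                              // the 7MB limit discussed in the thread
        long pooled = pooledReserved(perArena, chunk);
        System.out.println("unpooled bytes: " + unpooledReserved(new long[] {64 * 1024, 64 * 1024}));
        System.out.println("pooled bytes:   " + pooled);
        System.out.println("pooled exceeds budget: " + (pooled > budget));
    }
}
```

Under these assumed numbers, two small buffers already reserve 8 MiB in the pooled model, which is how a 7MB limit can be breached without any leak.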
@XComp @ferenc-csaky @davidradl @afedulov I just prepared a PR, apache/pekko#1707, for this. I'm not sure whether @pjfanning agrees with backporting it to the 1.1.4 release.
+1 for keeping the newer Pekko and Netty 4 in 1.20.1 (and not merging this PR). This is the LTS and my 2c is that having a more secure base and fixing any issues that arise is the better path.
A backport is pending in apache/pekko#1709, as @afedulov requested.
What is the purpose of the change
Reverts the pekko version bump that includes an upgrade to netty 4.x. Corresponding discussion happened in FLINK-36510.
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation