[FLINK-28372][rpc] Migrate to Akka Artery #22271

Draft
wants to merge 2 commits into master from akka-artery-migration

Conversation

ferenc-csaky (Contributor)

What is the purpose of the change

Changes the Akka remoting mechanism from the classic Netty-based one to Artery.

Brief change log

  • Akka RPC no longer depends on Netty.
  • Adjusts the Akka configuration: Artery uses some different config options, and most of the classic options are no longer needed.

Verifying this change

After deploying a job, verify that the JobManager and TaskManagers show up correctly on the Flink dashboard.

This change is already covered by existing tests under the flink-rpc module.

There are some parts that may require discussion. I disabled the RemoteAkkaRpcActorTest#failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable test case, because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes. Also, in the current iteration I did not expose the watch-failure-detector related fields in AkkaOptions, which probably should be done, but first I wanted to get some opinions on the overall approach.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no (it removes Netty from flink-rpc)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

flinkbot commented Mar 24, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

zentol self-assigned this Mar 27, 2023

zentol commented Mar 27, 2023

because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes

And there is no way to disable this / make artery think they are on different nodes?

zentol (Contributor) left a comment:


How much testing have you done with actual workloads in different environments?

That was the big question mark on this ticket; how to ensure things actually still work as expected.

Should we model this as an opt-in/out that we can try out in a release?

.add(" }")
.add(" }")
.add(" }")
.add(" log-remote-lifecycle-events = " + logLifecycleEvents)
zentol (Contributor):

what happened to this option?

ferenc-csaky (Contributor Author):

This option does not exist in Artery. What we can control in Artery are these options:

      # If this is "on", all inbound remote messages will be logged at DEBUG level,
      # if off then they are not logged
      log-received-messages = off

      # If this is "on", all outbound remote messages will be logged at DEBUG level,
      # if off then they are not logged
      log-sent-messages = off

      # Logging of message types with payload size in bytes larger than
      # this value. Maximum detected size per message type is logged once,
      # with an increase threshold of 10%.
      # By default this feature is turned off. Activate it by setting the property to
      # a value in bytes, such as 1000b. Note that for all messages larger than this
      # limit there will be extra performance and scalability cost.
      log-frame-size-exceeding = off

I have not added this yet, because there are multiple options here IMO:

  1. Keep the old config option in Flink and apply it to both sent and received messages (see the sketch below).
  2. Deprecate the existing log option and add two new ones, one for received and one for sent messages.

We can also consider what to do with the frame-size-exceeding events.
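
For illustration, a minimal sketch of what option 1 could look like if the Artery flags are assembled into a config string the same way as the builder code above; the class, method, and variable names here are made up for the example and are not the PR's actual code:

    import java.util.StringJoiner;

    class ArteryConfigSketch {
        /** Sketch: drive both Artery message-logging flags from one Flink option. */
        static String arteryLoggingConfig(boolean logMessages) {
            String flag = logMessages ? "on" : "off";
            return new StringJoiner("\n")
                    .add("akka.remote.artery {")
                    // Artery logs received and sent messages separately; option 1
                    // applies the single legacy Flink flag to both directions.
                    .add("  log-received-messages = " + flag)
                    .add("  log-sent-messages = " + flag)
                    .add("}")
                    .toString();
        }
    }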

zentol (Contributor):

These options seem to be doing something different than the lifecycle event one.

We may have to manually subscribe to and log these events as shown in akka/akka#28003 (comment) to replicate this option.
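
For reference, a minimal sketch of that subscription approach, assuming the akka.remote.RemotingLifecycleEvent types are still published on the event stream under Artery (that assumption is exactly what would need verifying):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.event.Logging;
    import akka.event.LoggingAdapter;
    import akka.remote.RemotingLifecycleEvent;

    /** Sketch: re-create log-remote-lifecycle-events by logging the events ourselves. */
    class LifecycleEventLogger extends AbstractActor {
        private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(RemotingLifecycleEvent.class, event -> log.info(event.toString()))
                    .build();
        }

        static void install(ActorSystem system) {
            ActorRef listener = system.actorOf(Props.create(LifecycleEventLogger.class));
            // Receive all remoting lifecycle events published to the event stream.
            system.getEventStream().subscribe(listener, RemotingLifecycleEvent.class);
        }
    }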

.add(" connection-timeout = " + akkaTCPTimeout)
.add(" maximum-frame-size = " + akkaFramesize)
.add(" tcp-nodelay = on")
.add(" client-socket-worker-pool {")
zentol (Contributor):

With Artery there's no additional thread pool? Do we potentially need to adjust the Akka thread pools to accommodate?

ferenc-csaky (Contributor Author):

The client and server socket worker pools were Netty specific options and there is no alternative in the Artery config.

zentol (Contributor):

That doesn't quite answer my question 😅

Are there no options for artery because it has no additional thread pool?

ferenc-csaky (Contributor Author):

I do not have a definitive answer at this point, but I assume so. The docs are not very specific, so I fairly quickly ended up checking all the configuration options, and they do not provide any additional thread pool config for Artery.

zentol (Contributor):

Ok.

Then I assume they use one of Akka's thread pools; it would be good to know which one that is in case we need to make it configurable.

<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.10.6.Final</version>
zentol (Contributor):

needs a NOTICE update

@@ -235,6 +254,8 @@ public static boolean isForceRpcInvocationSerializationEnabled(Configuration con
.text("Min number of threads to cap factor-based number to.")
.build());

/** @deprecated Don't use this option anymore. It has no effect on Flink. */
@Deprecated
zentol (Contributor):

needs regeneration of the docs

.withDescription(
"Milliseconds a gate should be closed for after a remote connection was disconnected.");
/** Retry outbound connection only after this backoff. */
public static final ConfigOption<String> OUTBOUND_RESTART_BACKOFF =
zentol (Contributor):

Is this a de-facto replacement for RETRY_GATE_CLOSED_FOR?

ferenc-csaky (Contributor Author):

I made this change based on the configuration comments.

classic doc:

      # After failed to establish an outbound connection, the remoting will mark the
      # address as failed. This configuration option controls how much time should
      # be elapsed before reattempting a new connection. While the address is
      # gated, all messages sent to the address are delivered to dead-letters.
      # Since this setting limits the rate of reconnects setting it to a
      # very short interval (i.e. less than a second) may result in a storm of
      # reconnect attempts.
      retry-gate-closed-for = 5 s

artery doc:

        # Retry outbound connection after this backoff.
        # Only used when transport is tcp or tls-tcp.
        outbound-restart-backoff = 1 second

The outbound-restart-backoff comments are a lot less specific, but they point in the same direction.

zentol (Contributor):

they point in the same direction

That was also my conclusion from the docs. Shall we add retry-gate-closed-for as a deprecated key to the new option?

ferenc-csaky (Contributor Author):

Yeah, that makes sense. I also thought about using the value of retry-gate-closed-for if that is set explicitly and outbound-restart-backoff is missing.
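
Flink's ConfigOptions can express that fallback through deprecated keys, so a sketch of the new option might look like the following; the key names are placeholders, the real ones would follow whatever AkkaOptions already uses:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;

    class AkkaOptionsSketch {
        /** Sketch: new Artery backoff option that falls back to the old key. */
        static final ConfigOption<String> OUTBOUND_RESTART_BACKOFF =
                ConfigOptions.key("akka.outbound-restart-backoff")
                        .stringType()
                        .defaultValue("1 s")
                        // The old key is read when the new one is not set explicitly.
                        .withDeprecatedKeys("akka.retry-gate-closed-for")
                        .withDescription(
                                "Backoff after which a failed outbound connection is retried.");
    }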

ferenc-csaky marked this pull request as draft on March 27, 2023 10:20
ferenc-csaky (Contributor Author):

Thanks for the review @zentol! First things first, I moved the PR back to draft; I am sure there will be some tuning and probably some modifications as well, since at first I just wanted to show what is possible without changing too many things.

because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes

And there is no way to disable this / make artery think they are on different nodes?

At this point I have not gone very far in that direction, but there is no obvious way to change this with config options (I have not gone through all parts thoroughly yet).

How much testing have you done with actual workloads in different environments?

For now, I tested manually with a standalone and a YARN setup. On YARN, I tested with internal security enabled as well to trigger the TLS parts. I plan to do the same on a K8s setup as well.

That was the big question mark on this ticket; how to ensure things actually still work as expected.

Should we model this as an opt-in/out that we can try out in a release?

After making these changes I am kind of leaning towards a pluggable solution myself as well, so we can make sure it does not break any existing functionality and can correct any remaining problems on the go.

So far I have not touched or checked the e2e tests, so I expect some failures on that front, but I am working on that too.

ferenc-csaky force-pushed the akka-artery-migration branch from d8170dd to 204bd70 on March 27, 2023 11:14

zentol commented Mar 27, 2023

So far I have not touched or checked the e2e tests, so I expect some failures on that front, but I am working on that too.

hint: Plenty of these failures might be due to expected and perfectly fine error messages that are being picked up by the pesky "no exception in the log" rules.
