[FLINK-28372][rpc] Migrate to Akka Artery #22271

Draft
wants to merge 2 commits into master from akka-artery-migration

Conversation

ferenc-csaky (Contributor)

What is the purpose of the change

Changes the Akka remoting mechanism from the classic Netty-based one to Artery.

Brief change log

  • Akka RPC no longer depends on Netty.
  • Adjusts the Akka configuration: Artery uses some different config options, and most of the classic options are no longer needed.

Verifying this change

After deploying a job, verify that the JobManager and TaskManagers show up correctly on the Flink dashboard.

This change is already covered by existing tests under the flink-rpc module.

There are some parts that may require discussion. I disabled the RemoteAkkaRpcActorTest#failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable test case, because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes. Also, in the current iteration I did not expose the watch-failure-detector related fields in AkkaOptions, which probably should be done, but first I wanted to get some opinions on the overall approach.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no (it removes Netty from flink-rpc)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

flinkbot commented Mar 24, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

zentol self-assigned this Mar 27, 2023

zentol commented Mar 27, 2023

because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes

And there is no way to disable this / make artery think they are on different nodes?

zentol (Contributor) left a comment:


How much testing have you done with actual workloads in different environments?

That was the big question mark on this ticket; how to ensure things actually still work as expected.

Should we model this as an opt-in/out that we can try out in a release?

.add(" }")
.add(" }")
.add(" }")
.add(" log-remote-lifecycle-events = " + logLifecycleEvents)
zentol (Contributor):

what happened to this option?

ferenc-csaky (Contributor Author):

This option does not exist in Artery. What we can control in Artery are these options:

      # If this is "on", all inbound remote messages will be logged at DEBUG level,
      # if off then they are not logged
      log-received-messages = off

      # If this is "on", all outbound remote messages will be logged at DEBUG level,
      # if off then they are not logged
      log-sent-messages = off

      # Logging of message types with payload size in bytes larger than
      # this value. Maximum detected size per message type is logged once,
      # with an increase threshold of 10%.
      # By default this feature is turned off. Activate it by setting the property to
      # a value in bytes, such as 1000b. Note that for all messages larger than this
      # limit there will be extra performance and scalability cost.
      log-frame-size-exceeding = off

I have not added this yet, because there are multiple options here IMO:

  1. Keep the old config option in Flink and apply it to both sent and received messages (see the sketch below).
  2. Deprecate the existing log option and add two new ones, one for received and one for sent messages.

We can also consider what to do with the frame-size-exceeding events.
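
For illustration, a minimal sketch of what option 1 could look like if the Artery flags are assembled into a config string the same way as the builder code above; the class, method, and variable names here are made up for the example and are not the PR's actual code:

    import java.util.StringJoiner;

    class ArteryConfigSketch {
        /** Sketch: drive both Artery message-logging flags from one Flink option. */
        static String arteryLoggingConfig(boolean logMessages) {
            String flag = logMessages ? "on" : "off";
            return new StringJoiner("\n")
                    .add("akka.remote.artery {")
                    // Artery logs received and sent messages separately; option 1
                    // applies the single legacy Flink flag to both directions.
                    .add("  log-received-messages = " + flag)
                    .add("  log-sent-messages = " + flag)
                    .add("}")
                    .toString();
        }
    }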

zentol (Contributor):

These options seem to be doing something different than the lifecycle event one.

We may have to manually subscribe to and log these events as shown in akka/akka#28003 (comment) to replicate this option.
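
For reference, a minimal sketch of that subscription approach, assuming the akka.remote.RemotingLifecycleEvent types are still published on the event stream under Artery (that assumption is exactly what would need verifying):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.event.Logging;
    import akka.event.LoggingAdapter;
    import akka.remote.RemotingLifecycleEvent;

    /** Sketch: re-create log-remote-lifecycle-events by logging the events ourselves. */
    class LifecycleEventLogger extends AbstractActor {
        private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(RemotingLifecycleEvent.class, event -> log.info(event.toString()))
                    .build();
        }

        static void install(ActorSystem system) {
            ActorRef listener = system.actorOf(Props.create(LifecycleEventLogger.class));
            // Receive all remoting lifecycle events published to the event stream.
            system.getEventStream().subscribe(listener, RemotingLifecycleEvent.class);
        }
    }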

.add(" connection-timeout = " + akkaTCPTimeout)
.add(" maximum-frame-size = " + akkaFramesize)
.add(" tcp-nodelay = on")
.add(" client-socket-worker-pool {")
zentol (Contributor):

With Artery there's no additional thread pool? Do we potentially need to adjust the Akka thread pools to accommodate?

ferenc-csaky (Contributor Author):

The client and server socket worker pools were Netty specific options and there is no alternative in the Artery config.

zentol (Contributor):

That doesn't quite answer my question 😅

Are there no options for artery because it has no additional thread pool?

ferenc-csaky (Contributor Author):

I do not have a definitive answer at this point, but I assume so. The docs are not very specific, so I fairly quickly ended up checking all the configuration options, and they do not provide any additional thread pool config for Artery.

zentol (Contributor):

Ok.

Then I assume they use one of Akka's thread pools; it would be good to know which one that is in case we need to make it configurable.

<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.10.6.Final</version>
zentol (Contributor):

needs a NOTICE update

@@ -235,6 +254,8 @@ public static boolean isForceRpcInvocationSerializationEnabled(Configuration con
.text("Min number of threads to cap factor-based number to.")
.build());

/** @deprecated Don't use this option anymore. It has no effect on Flink. */
@Deprecated
zentol (Contributor):

needs regeneration of the docs

.withDescription(
"Milliseconds a gate should be closed for after a remote connection was disconnected.");
/** Retry outbound connection only after this backoff. */
public static final ConfigOption<String> OUTBOUND_RESTART_BACKOFF =
zentol (Contributor):

Is this a de-facto replacement for RETRY_GATE_CLOSED_FOR?

ferenc-csaky (Contributor Author):

I made this change based on the configuration comments.

classic doc:

      # After failed to establish an outbound connection, the remoting will mark the
      # address as failed. This configuration option controls how much time should
      # be elapsed before reattempting a new connection. While the address is
      # gated, all messages sent to the address are delivered to dead-letters.
      # Since this setting limits the rate of reconnects setting it to a
      # very short interval (i.e. less than a second) may result in a storm of
      # reconnect attempts.
      retry-gate-closed-for = 5 s

artery doc:

        # Retry outbound connection after this backoff.
        # Only used when transport is tcp or tls-tcp.
        outbound-restart-backoff = 1 second

The outbound-restart-backoff comments are a lot less specific, but they point in the same direction.

zentol (Contributor):

they point in the same direction

That was also my conclusion from the docs. Shall we add retry-gate-closed-for as a deprecated key to the new option?

ferenc-csaky (Contributor Author):

Yeah, that makes sense. I also thought about using the value of retry-gate-closed-for if that is set explicitly and outbound-restart-backoff is missing.
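
Flink's ConfigOptions can express that fallback through deprecated keys, so a sketch of the new option might look like the following; the key names are placeholders, the real ones would follow whatever AkkaOptions already uses:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;

    class AkkaOptionsSketch {
        /** Sketch: new Artery backoff option that falls back to the old key. */
        static final ConfigOption<String> OUTBOUND_RESTART_BACKOFF =
                ConfigOptions.key("akka.outbound-restart-backoff")
                        .stringType()
                        .defaultValue("1 s")
                        // The old key is read when the new one is not set explicitly.
                        .withDeprecatedKeys("akka.retry-gate-closed-for")
                        .withDescription(
                                "Backoff after which a failed outbound connection is retried.");
    }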

ferenc-csaky marked this pull request as draft on March 27, 2023 10:20
ferenc-csaky (Contributor Author):

Thanks for the review @zentol! First things first, I moved the PR back to draft; I am sure there will be some tuning and probably some modifications as well, since at first I just wanted to show what is possible without changing too many things.

because with Artery, lifecycle monitoring is only triggered if the two RPC services are on different nodes

And there is no way to disable this / make artery think they are on different nodes?

At this point I have not gone very far in that direction, but there is no obvious way to change this with config options (I have not gone through all parts thoroughly yet).

How much testing have you done with actual workloads in different environments?

For now, I tested manually with a standalone and a YARN setup. On YARN, I tested with internal security enabled as well to trigger the TLS parts. I plan to do the same on a K8s setup as well.

That was the big question mark on this ticket; how to ensure things actually still work as expected.

Should we model this as an opt-in/out that we can try out in a release?

After making these changes I am kind of leaning towards a pluggable solution myself as well, so we can make sure it does not break any existing functionality and can correct any remaining problems on the go.

So far I have not touched or checked the e2e tests, so I expect some failures on that front, but I am working on that too.

ferenc-csaky force-pushed the akka-artery-migration branch from d8170dd to 204bd70 on March 27, 2023 11:14

zentol commented Mar 27, 2023

So far I have not touched or checked the e2e tests, so I expect some failures on that front, but I am working on that too.

hint: Plenty of these failures might be due to expected and perfectly fine error messages that are being picked up by the pesky "no exception in the log" rules.
