Best Practices Guide
Like any non-trivial system, Aeron has a set of current best practices associated with using it. This guide aims to describe the best practices for using Aeron for your message-based communications. It is intended to be a living document.
Systems utilising Aeron as a transport should consider the number of Channels and Streams as well as the number of Publishers and Subscribers on each channel and stream. A stream, while only being a number, requires a fixed amount of resources. This includes buffering between the Media Driver and the client applications. Aeron has been designed on the assumption that the number of streams in use will normally be in the 10s or 100s. 1000s are possible, but very unlikely.
Because streams exist within a channel, the number of channels is assumed to be small as well.
Systems that need large numbers of streams for separation should consider a framing (and muxing) protocol on top of Aeron.
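As an illustration of what such a framing protocol could look like, here is a minimal sketch in plain Java. It is not part of Aeron: the class name, the 4-byte logical stream id prefix, and the helper methods are all assumptions made for this example. The idea is that many logical streams share one Aeron stream, and the receiver demultiplexes on the prefix.

```java
import java.nio.ByteBuffer;

/** Hypothetical framing (an assumption, not an Aeron API): [int logicalStreamId][payload bytes]. */
final class MuxFrameSketch {
    static final int HEADER_LENGTH = Integer.BYTES;

    /** Prefixes the payload with a logical stream id so many logical streams can share one Aeron stream. */
    static ByteBuffer encode(int logicalStreamId, byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(HEADER_LENGTH + payload.length);
        frame.putInt(logicalStreamId);
        frame.put(payload);
        frame.flip(); // make the frame readable from its start
        return frame;
    }

    /** Reads the logical stream id without moving the frame's position. */
    static int logicalStreamId(ByteBuffer frame) {
        return frame.getInt(frame.position());
    }

    /** Copies out the payload that follows the header, leaving the frame untouched. */
    static byte[] payload(ByteBuffer frame) {
        byte[] payload = new byte[frame.remaining() - HEADER_LENGTH];
        ByteBuffer view = frame.duplicate();
        view.position(frame.position() + HEADER_LENGTH);
        view.get(payload);
        return payload;
    }
}
```

A receiver would read the id first and dispatch the payload to whichever handler is registered for that logical stream.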
Aeron has many settings that can be used to tweak various aspects of operation. However, it should only be necessary to adjust these settings to get the most out of the system. In most cases, the defaults should operate just fine and provide, if not optimal, at least a decent starting point for further optimisations via resource trade-offs.
Scattered throughout the topics below, you will see some settings mentioned. This is not an exhaustive list of settings. For that, please see the source or feel free to ask questions.
Applications as well as the Media Driver may have a number of threads concerned with various aspects of Aeron operation.
A Media Driver, whether run embedded or not, needs up to 3 threads to perform its operation. The system property aeron.threading.mode controls how many threads a Media Driver instance uses.
There are three main Agents in the driver:
- Conductor: Responsible for reacting to client requests and housekeeping duties as well as detecting loss, sending NAKs, rotating buffers, etc.
- Sender: Responsible for shovelling messages from publishers to the network.
- Receiver: Responsible for shovelling messages from the network to subscribers.
The value of aeron.threading.mode can be one of:
- INVOKER: No threads. The client is responsible for using the MediaDriver.Context.driverAgentInvoker() to invoke the duty cycle directly.
- SHARED: All Agents share a single thread. 1 thread in total.
- SHARED_NETWORK: Sender and Receiver share a thread, and the Conductor has its own thread. 2 threads in total.
- DEDICATED: The default. Dedicates one thread per Agent. 3 threads in total.
For performance, it is recommended to use DEDICATED as long as the number of busy threads is less than or equal to the number of spare cores on the machine. If there are not enough cores to dedicate, consider sharing some with SHARED_NETWORK or SHARED. INVOKER can be used in low-resource environments where the application using Aeron can invoke the Media Driver duty cycle at a regular interval.
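The recommendation above can be sketched as a small helper. This is only an illustration: the class, the method, and the exact core thresholds are assumptions made for this example, not something Aeron provides. The only name taken from this guide is the aeron.threading.mode property and its four values.

```java
/** Hypothetical helper (not an Aeron API): maps spare cores to an aeron.threading.mode value. */
final class ThreadingModeSketch {
    /** Assumed mapping: one spare core per busy driver thread, falling back as cores run out. */
    static String chooseThreadingMode(int spareCores) {
        if (spareCores >= 3) {
            return "DEDICATED";      // one thread per Agent
        }
        if (spareCores == 2) {
            return "SHARED_NETWORK"; // Sender + Receiver share, Conductor separate
        }
        if (spareCores == 1) {
            return "SHARED";         // all Agents on one thread
        }
        return "INVOKER";            // the application drives the duty cycle itself
    }

    public static void main(String[] args) {
        // Assumption: one core is kept back for the application's own busy thread.
        int reservedForApplication = 1;
        int spareCores = Math.max(0, Runtime.getRuntime().availableProcessors() - reservedForApplication);
        String mode = chooseThreadingMode(spareCores);
        System.setProperty("aeron.threading.mode", mode);
        System.out.println("aeron.threading.mode=" + mode);
    }
}
```

The property would need to be set before the Media Driver is configured and launched. It can equally be passed on the command line as -Daeron.threading.mode=SHARED.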
Within the Media Driver and possibly within some applications, Idle Strategies may be used to specify what Agent duty cycles should do when no work is done. An Idle Strategy takes a parameter indicating how much work was done in the last duty cycle and handles idling in various ways. You can also supply your own idle strategies.
There are a couple of strategies that are important to understand.
- BusySpinIdleStrategy uses a busy spin to idle and will eat up CPU.
- BackOffIdleStrategy uses a backoff strategy of spinning, yielding, and parking to be kinder to the CPU, at the cost of being less responsive to activity after being idle for a little while.
The main difference between strategies is how responsive the idler is to new activity after being idle for a short time, and how much CPU is consumed when no work is being done. There is an inherent trade-off to consider.
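To make the trade-off concrete, here is a simplified sketch of a backoff-style idler written against the JDK only. It is not the implementation shipped with Aeron: the class name, the three-state machine, the fixed park period, and the constructor parameters are assumptions for illustration. It shows how the work count from the last duty cycle decides whether to reset or to back off further.

```java
import java.util.concurrent.locks.LockSupport;

/** Simplified backoff idler (an illustrative sketch, not the Aeron implementation). */
final class BackOffIdleSketch {
    enum State { SPINNING, YIELDING, PARKING }

    private final int maxSpins;
    private final int maxYields;
    private final long parkNanos;
    private State state = State.SPINNING;
    private int count;

    BackOffIdleSketch(int maxSpins, int maxYields, long parkNanos) {
        this.maxSpins = maxSpins;
        this.maxYields = maxYields;
        this.parkNanos = parkNanos;
    }

    /** Called once per duty cycle with the amount of work done in that cycle. */
    void idle(int workCount) {
        if (workCount > 0) {
            // Work was done: become fully responsive again.
            state = State.SPINNING;
            count = 0;
            return;
        }
        switch (state) {
            case SPINNING:
                Thread.onSpinWait(); // most responsive, burns CPU
                if (++count >= maxSpins) { state = State.YIELDING; count = 0; }
                break;
            case YIELDING:
                Thread.yield(); // gives other threads a chance to run
                if (++count >= maxYields) { state = State.PARKING; count = 0; }
                break;
            case PARKING:
                LockSupport.parkNanos(parkNanos); // kindest to the CPU, slowest to react
                break;
        }
    }

    State state() {
        return state;
    }
}
```

A busy-spin strategy corresponds to never leaving the SPINNING state, which gives the lowest reaction time at the cost of a fully occupied core.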
There are a couple of default Media Driver main functions provided for operation. A Media Driver may use one of these when run as a stand-alone process.
- MediaDriver is the default main and, by default, uses the BackOffIdleStrategy for idling. The aeron.threading.mode property can be used to further refine the threading model.
- LowLatencyMediaDriver is the primary main for performance and uses the BusySpinIdleStrategy for the Conductor and NoOpIdleStrategy for the Sender and Receiver Agents. This main function automatically uses DEDICATED threading mode.
Aeron applications have most of their threading requirements controlled by the application. However, there is a background thread per Aeron instance, called the ClientConductor, that handles housekeeping and the exchange of commands with the Media Driver. This thread may be controlled by the application by setting an Aeron.Context.threadFactory(), or Aeron can be left to spin up its own Thread.
In many cases, this thread has very simple requirements and can be run on a dirty CPU, i.e. it doesn't need a dedicated CPU to function well.
Subscriber applications have more requirements, however.
Subscribers must routinely call Subscription.poll to check for and deliver messages to the application. For the lowest latency and highest throughput, it is recommended to use a high frame limit for this call as well as BusySpinIdleStrategy or equivalent application control and dedicate a core to reception. The Agent class could be used to encapsulate this behaviour easily.
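The shape of such a polling loop can be sketched without depending on the Aeron library. In the example below, the FragmentPoller interface is a hypothetical stand-in for a call to Subscription.poll with a fragment handler already bound, and the loop, class, and parameter names are assumptions for illustration only.

```java
import java.util.function.BooleanSupplier;

/** Busy-spinning poll loop (an illustrative sketch using a stand-in for Subscription.poll). */
final class PollLoopSketch {
    /** Hypothetical stand-in: returns the number of fragments delivered, up to fragmentLimit. */
    interface FragmentPoller {
        int poll(int fragmentLimit);
    }

    /** Polls until running reports false and returns the total number of fragments delivered. */
    static long pollWhile(BooleanSupplier running, FragmentPoller poller, int fragmentLimit) {
        long totalFragments = 0;
        while (running.getAsBoolean()) {
            int fragmentsRead = poller.poll(fragmentLimit);
            totalFragments += fragmentsRead;
            if (fragmentsRead == 0) {
                // Nothing to do: busy spin rather than yield or park, to stay responsive.
                Thread.onSpinWait();
            }
        }
        return totalFragments;
    }
}
```

In a real application the thread running this loop would typically be pinned to its own core, and the running flag would be cleared on shutdown.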
The Aeron MTU value impacts a lot of things. The default MTU is set to a value that is a good trade-off. However, it is suboptimal for some use cases involving very large (> 4KB) messages and for maximizing throughput above everything else. Various checks during publication and subscription/connection setup are done to verify a decent relationship with MTU. However, it is good to understand these relationships.
aeron.mtu.length on the Media Driver controls the length of the MTU of data frames. This value is communicated to the Aeron clients during registration, so applications do not have to configure the MTU themselves: they use the same value as the Media Driver.
An MTU value over the interface MTU will cause IP to fragment the datagram. This may increase the likelihood of loss under several circumstances. If increasing the MTU over the interface MTU, consider various ways to increase the interface MTU first in preparation.
The MTU value indicates the largest message that Aeron will send as a single data frame.
MTU length also has implications for socket buffer sizing. Please see below.
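The relationship between the Aeron MTU and the interface MTU can be checked with simple arithmetic. The sketch below assumes that the Aeron MTU is the UDP payload length and that the path uses IPv4 without options (20-byte IP header plus 8-byte UDP header). The class and method names are made up for this example.

```java
/** Illustrative check (not an Aeron API): does an Aeron MTU fit in one IP datagram? */
final class MtuCheckSketch {
    static final int IPV4_HEADER_LENGTH = 20; // assumption: IPv4 with no options
    static final int UDP_HEADER_LENGTH = 8;

    /** True if a full-length data frame can be sent without IP fragmentation. */
    static boolean fitsWithoutIpFragmentation(int aeronMtuLength, int interfaceMtu) {
        return aeronMtuLength + IPV4_HEADER_LENGTH + UDP_HEADER_LENGTH <= interfaceMtu;
    }

    public static void main(String[] args) {
        int aeronMtuLength = 8192;  // example value for aeron.mtu.length
        int interfaceMtu = 1500;    // typical Ethernet interface MTU
        System.out.println(fitsWithoutIpFragmentation(aeronMtuLength, interfaceMtu)
            ? "fits in a single datagram"
            : "IP will fragment: raise the interface MTU or lower aeron.mtu.length");
    }
}
```

On a live system the interface MTU could be read with java.net.NetworkInterface.getMTU() rather than hard-coded.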
Aeron instances in applications, commonly referred to as "clients", communicate with Media Drivers via a set of buffers. These buffers are normally located in the OS file system. By default, java.io.tmpdir or /dev/shm/ is used to hold these files. However, it can be advantageous to move them to other places. The following property controls the directory that Media Drivers and Aeron instances use:
- aeron.dir is the directory containing the Aeron files.
Bounds checks are done by the buffer primitives by default in Aeron. These do take up some CPU cycles, but the branches are normally predicted well. However, they can be disabled by setting aeron.disable.bounds.checks to true.
The length of term buffers is controlled by the aeron.term.buffer.length, aeron.ipc.term.buffer.length, and aeron.term.buffer.max.length properties. The max defaults to 1GB and is the maximum length that any sender's term buffers may be. If a term buffer is larger than this, an exception will be generated and shown on the Media Driver console. Setting the term buffer length is mostly a concern for how far ahead a Publisher might get over Subscribers. As a quick and dirty rule, the length of a single term buffer is that measure. For more details see Flow Control.
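A quick sanity check of a candidate term buffer length can be sketched as follows. Only the 1GB maximum comes from this guide. The 64KB minimum and the power-of-two requirement are stated here as assumptions based on how Aeron term lengths are commonly described, so verify them against the version in use. The class and method names are hypothetical.

```java
/** Illustrative validation of a term buffer length (not an Aeron API). */
final class TermLengthSketch {
    static final long ASSUMED_MIN_LENGTH = 64L * 1024;            // assumption, not stated in this guide
    static final long DEFAULT_MAX_LENGTH = 1024L * 1024 * 1024;   // 1GB default max from this guide

    /** True if the length is within bounds and, as assumed here, a power of two. */
    static boolean isValidTermLength(long termLength, long maxLength) {
        return termLength >= ASSUMED_MIN_LENGTH
            && termLength <= maxLength
            && Long.bitCount(termLength) == 1;
    }

    public static void main(String[] args) {
        long candidate = 16L * 1024 * 1024; // example value for aeron.term.buffer.length
        System.out.println(candidate + " valid? " + isValidTermLength(candidate, DEFAULT_MAX_LENGTH));
    }
}
```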
When running Aeron over a network that may be congested, and thus could experience significant loss, consider running with Congestion Control enabled. Loss can be detected by increasing NAK counters, which can be observed with AeronStat and investigated in detail with the LossReport tool.
Monitoring of various aspects of operation can be done by using the AeronStat utility to display the value of the various counters of the Media Driver and clients. In addition, reading these counters programmatically is relatively simple.
Flow control is discussed in terms of how it functions. However, the implications for usage may not be obvious.
The Receiver Window is how much data a Sender can send immediately to a Receiver. This window length has a lot to do with the maximum throughput of a stream. The larger the window, the more throughput. The default window length allows for decent rates while limiting the amount of outstanding data before a publisher is flow controlled. Increasing the length of the window to 2MB or more should be plenty in most situations to allow high throughput rates.
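The link between window length and throughput can be estimated with the usual bandwidth-delay reasoning. The sketch below is a rough upper bound only, assuming the sender can have at most one window of data outstanding per network round trip. The class and method names are made up for this example and are not Aeron APIs.

```java
/** Back-of-the-envelope window arithmetic (illustrative, not an Aeron API). */
final class ReceiverWindowSketch {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Rough ceiling on throughput: one full window per round trip. */
    static long maxThroughputBytesPerSecond(long windowLengthBytes, long roundTripNanos) {
        return Math.multiplyExact(windowLengthBytes, NANOS_PER_SECOND) / roundTripNanos;
    }

    /** Window length needed to sustain a target rate over a given round trip time. */
    static long windowLengthFor(long targetBytesPerSecond, long roundTripNanos) {
        return Math.multiplyExact(targetBytesPerSecond, roundTripNanos) / NANOS_PER_SECOND;
    }

    public static void main(String[] args) {
        long window = 2L * 1024 * 1024; // 2MB window
        long rtt = 1_000_000L;          // 1 ms round trip
        System.out.println("ceiling bytes/s: " + maxThroughputBytesPerSecond(window, rtt));
    }
}
```

For example, under these assumptions a 2MB window with a 1 ms round trip caps the stream at about 2.1 GB/s, while a 10 Gbit/s link (1.25 GB/s) with the same round trip needs a window of roughly 1.25 MB.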
Operating system socket buffers have an impact on some of the settings within Aeron.
- SO_RCVBUF can impact loss rates when too small for the given processing. If too large, this buffer can increase latency. Values that tend to work well with Aeron are 2MB to 4MB. This setting must be large enough for the MTU of the sender. If not, persistent loss can result. In addition, the receiver window length should be less than or equal to this value to allow plenty of space for burst traffic from a sender.
- SO_SNDBUF can impact loss rate. Loss can occur on the sender side due to this buffer being too small. This buffer must be large enough to accommodate the MTU as a minimum. In addition, some systems, most notably Windows, need plenty of buffering on the send side to reach adequate throughput rates. If too large, this buffer can increase latency or cause loss. This usually should be less than 2MB.
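The constraints in the list above can be written down as a small consistency check. This is a sketch only: the class, the method, and the message strings are assumptions for illustration, and the three rules encoded are exactly the ones stated above (both buffers at least as large as the MTU, and the receiver window no larger than SO_RCVBUF).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative consistency check of socket buffer settings (not an Aeron API). */
final class SocketBufferCheckSketch {
    /** Returns a description of each violated rule, or an empty list if all rules hold. */
    static List<String> check(int mtuLength, int soRcvbuf, int soSndbuf, int receiverWindowLength) {
        List<String> problems = new ArrayList<>();
        if (soRcvbuf < mtuLength) {
            problems.add("SO_RCVBUF is smaller than the MTU: persistent loss can result");
        }
        if (soSndbuf < mtuLength) {
            problems.add("SO_SNDBUF is smaller than the MTU");
        }
        if (receiverWindowLength > soRcvbuf) {
            problems.add("receiver window is larger than SO_RCVBUF: little room for bursts");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Example values only: 8KB MTU, 2MB receive buffer, 1MB send buffer, 2MB window.
        check(8192, 2 * 1024 * 1024, 1024 * 1024, 2 * 1024 * 1024).forEach(System.out::println);
    }
}
```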
As was mentioned above, changing the location of the buffers for Aeron can be a good thing. For Linux, this means that /dev/shm will be the location of the buffers if present.
Linux normally requires some sysctl values to be adjusted: net.core.rmem_max to allow larger SO_RCVBUF values and net.core.wmem_max to allow larger SO_SNDBUF values to be set.
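One way to see whether the operating system limits are in the way is to request a buffer size on a UDP socket and read back what was actually granted. The sketch below uses only the JDK. The class and method names are hypothetical, and the interpretation is deliberately loose: some systems report a different value than requested even when the request was honoured (Linux, for instance, typically reports double the value it applied).

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

/** Illustrative probe of granted socket buffer sizes (not an Aeron API). */
final class SocketBufferProbeSketch {
    /** Requests SO_RCVBUF on a throwaway UDP socket and returns the size the OS reports. */
    static int grantedReceiveBufferLength(int requestedLength) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedLength);
            return socket.getReceiveBufferSize();
        } catch (SocketException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Requests SO_SNDBUF on a throwaway UDP socket and returns the size the OS reports. */
    static int grantedSendBufferLength(int requestedLength) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSendBufferSize(requestedLength);
            return socket.getSendBufferSize();
        } catch (SocketException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        int requested = 2 * 1024 * 1024;
        System.out.println("SO_RCVBUF requested=" + requested + " granted=" + grantedReceiveBufferLength(requested));
        System.out.println("SO_SNDBUF requested=" + requested + " granted=" + grantedSendBufferLength(requested));
    }
}
```

A granted value well below the requested one suggests the relevant limit needs raising before the larger buffers can take effect.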
Windows tends to use SO_SNDBUF values that are too small. It is recommended to use values more like 1MB or so.
Mac tends to use SO_SNDBUF values that are too small. It is recommended to use larger values, like 16KB.