Draft concept: Cluster: Message Routing, Performance, Connection Handling, Inventory, Discovery #7814
Labels: area/distributed, core/evaluate, enhancement
Cluster: Message Routing, Performance and Connection Handling
Introduction
Many of the cluster's technical concepts are already documented:
Versions
The current cluster implementation is v3. This was released with Icinga 2 v2.0 in June 2014.
Before this version, two technology prototypes existed.
Version 3
#1167
#752
#1505
4c02219 - this commit adds the load balancing of checks in Cluster v3
Version 2
Endpoints had the `config_files_recursive` attribute which specified which config files would be synced. The link metric existed to do weight calculations.
The ClusterListener object held a reference to the Endpoint object. There were plans to support multiple ClusterListener objects with different Endpoints, making the cluster "multi environment aware". This functionality was removed with the distributed tree in v3.
Host and service objects specified the `domains` and `authorities` where they would run. This allowed fine-granular user control over where the objects would be executed. The `Domain` object allowed specifying ACLs where specific Endpoints could have read or write permissions. This was replaced with the Zone tree and membership in v3.
The `metric` attribute for Endpoint objects defined the weight inside our own "Spanning Tree Protocol" implementation (8105f51). v2 was replaced roughly with this commit: 7e10a2b
Version 1
Existed as `ReplicationComponent` (acfa3e6) and had Multicast and Unicast for routing the messages: https://github.com/Icinga/icinga2/blob/acfa3e6475ba22c4bf60028978acdf3bed098ea4/components/replication/replicationcomponent.cpp
Routing
To recap, the `MessageOrigin` object is an integral part of the cluster communication. It holds the details about where a message came from.
Security
`FromZone` and `FromClient` serve as security layers. The registered cluster message handlers check their values and can then determine whether the child zone endpoint is actually allowed to e.g. send an acknowledgement to the parent zone and its endpoint.
Prevent Loops
Whenever the MessageOrigin is used, it means the message originated from another endpoint in the cluster. Additional event handlers are registered and as such, they may trigger another send to a different endpoint/zone. The origin must be passed for these events to avoid the "return to sender" problem.
This works with 2 endpoints in the same zone, where the MessageOrigin is extracted each time a message is received.
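To make the loop prevention concrete, here is a minimal, self-contained sketch; `MessageOrigin`, the handler and the relay function are heavily simplified stand-ins (the names `RelayMessage`, `CheckResultHandler` and their fields are illustrative, not the real Icinga 2 signatures):

```cpp
#include <iostream>
#include <memory>
#include <string>

// Hypothetical, heavily simplified stand-in for the real cluster type.
struct MessageOrigin {
    std::string fromEndpoint; // endpoint the message was received from
    std::string fromZone;     // zone the message was received from
};

using MessageOriginPtr = std::shared_ptr<MessageOrigin>;

// Relay layer: skips the sender recorded in the origin (if any).
void RelayMessage(const MessageOriginPtr& origin, const std::string& message,
                  const std::string& targetEndpoint)
{
    if (origin && origin->fromEndpoint == targetEndpoint) {
        std::cout << "skipping " << targetEndpoint << " (return to sender)\n";
        return;
    }
    std::cout << "relaying to " << targetEndpoint << ": " << message << "\n";
}

// Event handler: fired both for local events (origin == nullptr) and for
// events received from the cluster. The origin must be forwarded, otherwise
// the relay layer cannot tell where the message came from.
void CheckResultHandler(const std::string& checkResult, const MessageOriginPtr& origin)
{
    RelayMessage(origin, checkResult, "master-2"); // pass the origin along, never drop it
}

int main() {
    // Message received from master-2: must not be sent back to master-2.
    auto origin = std::make_shared<MessageOrigin>(MessageOrigin{"master-2", "master"});
    CheckResultHandler("host1!ping is OK", origin);

    // Locally generated result: no origin, relay normally.
    CheckResultHandler("host1!http is WARNING", nullptr);
}
```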
Central Algorithm
RelayMessageOne() takes care of the routing. This involves fetching the targetZone for this message and its endpoints.
The original routing considerations stem from #1505 (May 2014).
#752
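As a rough illustration of the kind of decision this routing has to make (not the actual RelayMessageOne() implementation), a sketch with simplified, assumed zone/endpoint structures:

```cpp
#include <memory>
#include <string>
#include <vector>

struct Zone;

// Simplified stand-ins for the real configuration objects.
struct Endpoint {
    std::string name;
    bool connected = false;
};

struct Zone {
    std::string name;
    std::shared_ptr<Zone> parent;
    std::vector<std::shared_ptr<Endpoint>> endpoints;
};

// Decide which endpoints a message destined for targetZone should be relayed to.
std::vector<std::shared_ptr<Endpoint>> GetRelayTargets(
    const std::shared_ptr<Zone>& localZone,
    const std::shared_ptr<Zone>& targetZone,
    const std::string& localEndpointName)
{
    std::vector<std::shared_ptr<Endpoint>> targets;

    bool isLocal  = targetZone == localZone;
    bool isParent = targetZone == localZone->parent;
    bool isChild  = targetZone->parent == localZone;

    // Only the local zone, its direct parent and its direct children are valid
    // routing targets; anything further away is relayed hop by hop.
    if (!isLocal && !isParent && !isChild)
        return targets;

    for (const auto& endpoint : targetZone->endpoints) {
        if (endpoint->name == localEndpointName)
            continue; // never relay to ourselves
        if (endpoint->connected)
            targets.push_back(endpoint);
    }

    return targets;
}
```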
Performance
The move to Boost ASIO, Context, Coroutines and Beast improved the overall performance significantly. Still, most of the overhead occurs within the actual data processors and TLS operations.
Message Count
#7711 provides insights into how many messages are being exchanged and the overhead this creates.
The ApiListener RelayQueue WorkQueue might grow over time.
We should also evaluate whether we can reduce certain messages, or combine updates in bulk.
Message Size
There's a trade-off between compression and performance.
If we went for zlib, this would make it cluster v4 and Icinga 3.
The same applies to using Google Protocol Buffers instead of JSON in the JSON-RPC layers.
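To get a feeling for the compression-versus-performance trade-off, a small self-contained zlib example that compresses a JSON-RPC style payload; the payload and its fields are purely illustrative, and this is a measurement sketch rather than a proposed wire format:

```cpp
// Build with: g++ -std=c++17 compress_demo.cpp -lz
#include <zlib.h>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // An illustrative JSON-RPC style payload.
    std::string payload = R"({"jsonrpc":"2.0","method":"event::CheckResult","params":{"host":"agent1","service":"ping4","state":0}})";

    uLongf destLen = compressBound(payload.size());
    std::vector<Bytef> dest(destLen);

    int rc = compress2(dest.data(), &destLen,
                       reinterpret_cast<const Bytef*>(payload.data()),
                       payload.size(), Z_BEST_SPEED);

    if (rc != Z_OK) {
        std::cerr << "compress2 failed: " << rc << "\n";
        return 1;
    }

    std::cout << "original: " << payload.size()
              << " bytes, compressed: " << destLen << " bytes\n";
    return 0;
}
```

Measuring real cluster traffic this way would show whether the CPU cost of zlib is worth the bandwidth savings.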
Connection Handling
Machine Learning for Always Failing Connections
Right now, our reconnect timer is lowered from 60s to 10s, ensuring that checks are executed, synced, etc. after a deployment has happened. The price we pay here is performance, especially when the master needs to connect to many agents (one workaround is to put a satellite zone in the middle).
Modern environments deploy and install agents even if they are shut down or the service is not yet ready. Since these connection timeouts are handled by the Linux kernel, we cannot control the TCP timeout. Each long-running connection attempt blocks a coroutine/thread and may slow down the monitoring server's performance.
The idea is to implement a "marker" which calculates the health of the endpoint being connected to, or to calculate the next reconnect time or window.
#6234
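One possible shape of such a marker is a per-endpoint exponential backoff driven by consecutive connection failures. This is only a sketch of the idea, with assumed parameters (10s base interval, 10 minute cap), not a proposed implementation:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Per-endpoint connection health marker (illustrative).
struct ConnectionHealth {
    std::uint32_t consecutiveFailures = 0;
    std::chrono::steady_clock::time_point nextAttempt = std::chrono::steady_clock::now();
};

// Compute the next reconnect window: fast retries for healthy endpoints,
// exponentially growing delays for endpoints that keep failing.
void ScheduleNextAttempt(ConnectionHealth& health)
{
    using namespace std::chrono;

    const seconds base{10}; // current lower reconnect interval
    const seconds cap{600}; // assumed upper bound: 10 minutes

    seconds delay = base * (1u << std::min<std::uint32_t>(health.consecutiveFailures, 6));
    delay = std::min(delay, cap);

    health.nextAttempt = steady_clock::now() + delay;
}

void OnConnectFailed(ConnectionHealth& health)
{
    ++health.consecutiveFailures;
    ScheduleNextAttempt(health);
}

void OnConnectSucceeded(ConnectionHealth& health)
{
    health.consecutiveFailures = 0; // healthy again: back to the fast retry path
    ScheduleNextAttempt(health);
}
```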
Multiple Addresses for an Endpoint
Generally a valid request, but from a technical perspective it is hard to keep up with timeouts and retries. This ties in with the missing ML for reconnecting to endpoints; it is essentially doing the same thing all over again.
#5976
Requested Features
More than 2+ Endpoints in a Zone
In 2015, a problem was reported that checks started lagging when a zone had 3+ endpoints. From our analysis it appeared that a check result was sent in a loop, consuming CPU resources and degrading the overall performance.
This is tracked here: #3533 and linked in the docs to add more visibility. The config compiler detects if there are more than 2 endpoints in a zone, and logs a warning.
Solving the problem won't necessarily solve another request:
A worker pool of satellite checkers, as seen with Gearman job servers.
One of the latest reports says it may be working, but before enabling this for users again in an officially supported way, we need to re-test this in a cloud environment, with stress tests and many agents.
#3533 (comment)
Command Endpoints
Blackout Period for Reloading Agents
The agent/satellite notifies the parent whenever it attempts to reload itself. For this period, all checkable objects related to this endpoint are put into a "blackout window" until the endpoint connects again.
This addresses the problem that a reload disconnects the TCP session and a reconnect happens. During the sync period, unwanted UNKNOWN results may occur.
https://github.com/Icinga/icinga2/blob/master/lib/icinga/checkable-check.cpp#L584
#7186
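A hedged sketch of what such a blackout window could look like from the check result processing side; the registry, the method names and keying by endpoint name are assumptions for illustration and do not reflect the code linked above:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Illustrative registry of endpoints that announced a reload/disconnect.
struct BlackoutRegistry {
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> until;

    // Called when an agent notifies the parent that it is about to reload.
    void StartBlackout(const std::string& endpoint, std::chrono::seconds duration) {
        until[endpoint] = std::chrono::steady_clock::now() + duration;
    }

    // Called when the endpoint reconnects and is fully synced again.
    void EndBlackout(const std::string& endpoint) {
        until.erase(endpoint);
    }

    // While an endpoint is in its blackout window, results such as the
    // "command endpoint not connected" UNKNOWN should be suppressed.
    bool IsBlackedOut(const std::string& endpoint) const {
        auto it = until.find(endpoint);
        return it != until.end() && std::chrono::steady_clock::now() < it->second;
    }
};
```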
Indirect command endpoint checks
The master schedules the check, the satellite forwards it, and finally it is executed on the agent. This is currently not possible and is also detected as an error by the config compiler.
Enabling this functionality would need to break some boundaries and dependencies on the command endpoint logic.
Pinning and Replication
If you use `command_endpoint` to pin a check inside the master zone, it may occur that
This should result in an UNKNOWN state. The issue is quite old with many fixes for the command endpoint logic since then.
#3739 needs to be tested and evaluated whether this now works.
It may also be a good idea to evaluate whether it makes sense to store command_endpoint execution attempts inside the replay log, or find a different method for the infamous UNKNOWN check result.
Check Balancing for Command Endpoints
#5406 suggests allowing a list notation for `command_endpoint`, with Icinga implementing its own logic, e.g. when `agent1` is not available, executing the check on the next command endpoint.
The question is how load balancing can be detected here. The Check Scheduler would need to take this into account, and shuffle this accordingly with an offset for all `command_endpoint` checks in the queue. This mimics our HA-enabled check balancing; we should carefully investigate here and avoid duplicated features.
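For the failover part of that idea, a small sketch of selecting the next available command endpoint from a configured list; the function name and the connectivity callback are assumptions for illustration:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Pick the first reachable endpoint from the configured list, e.g.
// ["agent1", "agent2"]: if agent1 is not connected, fall back to agent2.
// isConnected is a placeholder for the real connectivity lookup.
std::optional<std::string> SelectCommandEndpoint(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& isConnected)
{
    for (const auto& endpoint : candidates) {
        if (isConnected(endpoint))
            return endpoint;
    }
    return std::nullopt; // nobody reachable: UNKNOWN / replay log territory
}
```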
It may make sense to introduce `check_zone` or `command_zone`, sending a specifically crafted execution message to this child zone, e.g. where two agents exist.
There's an immediate problem:
If there are connection problems, we need to keep the replay log as a feature again. This is known to have problems requiring a re-design, see #7752.
A similar request exists with automated failover for the object authority in HA enabled zones in #6986.
Combine Remote Command Endpoint Cluster Messages
Instead of firing command endpoint checks for a single service, all services from a host should be executed in bulk.
The agent then waits until all these "grouped" checks are done, and returns the check results in bulk.
This is dependent on the check_interval though.
#7021
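To illustrate the grouping idea, a sketch that batches the due command endpoint checks per host before a single combined execution message is sent; the grouping key and the data structures are assumptions:

```cpp
#include <map>
#include <string>
#include <vector>

// One scheduled command endpoint check (illustrative).
struct DueCheck {
    std::string host;
    std::string service;
};

// Group all due checks by host so a single cluster message per host/agent can
// be sent; the agent runs the grouped checks and returns the results in bulk.
std::map<std::string, std::vector<std::string>> GroupChecksByHost(
    const std::vector<DueCheck>& due)
{
    std::map<std::string, std::vector<std::string>> grouped;
    for (const auto& check : due)
        grouped[check.host].push_back(check.service);
    return grouped;
}
```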
Command Endpoint vs Event Handlers
If a `command_endpoint` is specified, both check and event commands are executed on the remote endpoint.
There are occasions where the event handler should be fired on the scheduling endpoint, and not the execution endpoint. This is requested in this issue: #5658
The culprit is the naming of the config attribute. `event_command_endpoint` could work, but does not cleanly align with `command_endpoint` itself. We could also introduce 2 new attributes which overrule `command_endpoint`:
Considering this clutter in the configuration objects, it is worthwhile to think about a general overhaul of the agent and `command_endpoint` components and provide a better solution for this and the other feature requests.
Inventory of the Zone/Endpoint Tree
Currently the master instance does not necessarily know about the zone tree, especially with agents configured on the satellite only (e.g. if done via Puppet). The host/service check results are processed on the master itself. Given that we may or may not have the zone/endpoint object tree on the master, the idea was to:
A draft PoC already exists: #6499
Additional metrics
Time differences
Within the `icinga::Hello` messages exchanged between endpoints, the `ts` value could be sent. With a to-be-defined delta accounting for the network round trips, this could indicate a time difference between the endpoints.
Likewise, `event::Heartbeat` event messages can be extended with more metadata.
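A rough sketch of how such a delta could be derived from the `ts` value, assuming the receiver records its own arrival timestamp and that one-way latency is roughly half the measured round trip (both assumptions for illustration):

```cpp
#include <cmath>
#include <iostream>

// Estimate the clock difference between two endpoints from a hello/heartbeat
// exchange. All values are Unix timestamps in seconds (illustrative).
double EstimateClockDelta(double sentTs,       // "ts" put into the message by the sender
                          double receivedTs,   // local time when the message arrived
                          double roundTripSec) // measured network round trip
{
    // Assume the one-way latency is roughly half the round trip.
    double oneWay = roundTripSec / 2.0;
    return receivedTs - sentTs - oneWay; // > 0: local clock ahead of the sender
}

int main() {
    double delta = EstimateClockDelta(1700000000.000, 1700000002.100, 0.200);
    if (std::fabs(delta) > 1.0)
        std::cout << "clock difference of " << delta << "s between the endpoints\n";
}
```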
Multiple Parent Environments
This is partially related to the cluster message routing, and the zone tree configuration.
The idea from #4626 is simple:
The routing needs to take care to send the result back to the requestor.
Clustered Checks with DNS Round Robin
A service might not necessarily always run on the same endpoint, e.g. with PostgreSQL cluster checks. Also, IP addresses may change for a service using DNS round robin.
There are situations where the certificate presented by the agent differs for the same IP address the connection is made to. Then the TLS CN checks fail, and the connection is dropped.
#5108 describes this request.
A specially crafted agent could allow multiple parent environments, as well as a SAN (Subject Alternative Name) from the certificates. However, re-generating certificates is not an option, so it might be overruled with an extra config setting inside the `ApiListener`, porting this from Cluster v2: `nodes` which are allowed to connect, or similar. This needs to be taken into account on the parent side as well.
Config Sync
In addition to that, it gets more complicated with accepting other cluster messages, like synced configuration - check commands, etc.
This would require Icinga to run with multiple environments. A previous attempt with icinga-envs installed a TLS routing proxy up front, with different systemd services being spawned for the chroots. There is a certain chance that this could be revamped into a simpler and more sophisticated design.
If one specifies that the parent config sync is forbidden for an agent, or there is only one true source, this might be a viable solution.
Another idea would be to extend the `command_endpoint` logic to just executing a command, and moving away from the local custom variable macro expansion on a virtual host object. There are some bugs, e.g. `check_timeout` is not passed to the remote agent and thus has no influence.
#6992
Auto-Discovery and Inventory
This plays a role in the cluster messages, forwarding details from the lowest zone layer up to the master zone.
There are elements in the cluster which are prepared for taking over.
Allowing these messages to be synced also requires a storage pool on the parent node. This needs to be persisted over restarts.
Inventory Messages
Start simple with the things we already have. Use the metrics gathered with the Icinga check and /v1/stats, and add some more local OS facts, e.g. Windows versions.
Make these inventory messages available via API event streams to allow subscribing to them.
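A hypothetical shape for such an inventory message, modelled here with plain standard containers rather than the real cluster message types; all field names and values are invented examples:

```cpp
#include <iostream>
#include <map>
#include <string>

// Build a flat key/value inventory "message" from facts we already have
// (check/stats metrics plus a few local OS facts). Field names are invented.
std::map<std::string, std::string> BuildInventoryMessage(const std::string& endpoint)
{
    std::map<std::string, std::string> facts;
    facts["endpoint"]       = endpoint;
    facts["icinga.version"] = "2.12.0";     // e.g. from /v1/status or the icinga check
    facts["os.name"]        = "Windows";    // local OS fact
    facts["os.version"]     = "10.0.17763"; // e.g. Windows build number
    facts["checks.active"]  = "4213";       // metric already gathered today
    return facts;
}

int main() {
    for (const auto& [key, value] : BuildInventoryMessage("agent1"))
        std::cout << key << "=" << value << "\n";
}
```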
Inventory Message Part 2
Add this data to the inventory messages and measure the performance impact/overhead.
Inventory Message Part 3
Discovery
Allow triggering a discovery via a top-down command, `executecommand`, maybe with a way to specify a fact list or filter later.
Final Format
Define a message format/structure which is backwards compatible.
Proposed Solutions
3 Endpoints in a Zone
Agent Role or Agent light
Bolt, Tasks, mcollective
Results - mark a CheckResult with its origin. Evaluate whether we may receive more than 1 CR - for multi checks and evaluation.
Agent Scrape Target