
Draft concept: Cluster: Message Routing, Performance, Connection Handling, Inventory, Discovery #7814

dnsmichi opened this issue Feb 5, 2020 · 1 comment

Cluster: Message Routing, Performance and Connection Handling

This is a draft concept for analysing the current Icinga cluster design.
Its purpose is to learn about the state of the implementation, the
impacts, feature requests and possible solutions.

Requestor: @lippserd
Author: @dnsmichi

Last-Updated: 2020-02-17

TOC

  1. Introduction
  2. Versions
  3. Routing
  4. Performance
  5. Connection Handling
  6. Requested Features
  7. Auto-Discovery and Inventory
  8. Proposed Solutions

Introduction

Many of the cluster's technical concepts are already documented.

Versions

The current cluster implementation is v3. This was released with Icinga 2 v2.0 in June 2014.

Before this version, two technology prototypes existed.

Version 3

#1167
#752
#1505

4c02219 - this commit adds the load balancing of checks in Cluster v3

Version 2

Endpoints had the config_files_recursive attribute which specified which config files would be synced.
The link metric existed to do weight calculations.

The ClusterListener object had a reference on the Endpoint object. There were plans to support multiple ClusterListener objects with different Endpoints, making the cluster "multi environment aware". This functionality was removed with the distributed tree in v3.

Host and service objects specified the domains and authorities where they would run. This allowed fine-granular user control over where the objects would be executed.

The Domain object allowed specifying ACLs where specific Endpoints could have read or write permissions.
This was replaced with the Zone tree and membership in v3.

The metric attribute for Endpoint objects defined the weight inside our own "Spanning tree protocol" implementation. 8105f51

v2 was replaced roughly with this commit: 7e10a2b

Version 1

Existed as ReplicationComponent (acfa3e6) and had Multicast and Unicast for routing the messages.

https://github.com/Icinga/icinga2/blob/acfa3e6475ba22c4bf60028978acdf3bed098ea4/components/replication/replicationcomponent.cpp

Routing

To recap, the MessageOrigin object is an integral part of the cluster communication. It holds the details about

  • FromZone: the zone which the connected, sending Endpoint belongs to
  • FromClient: the JsonRpcConnection bound to the sending Endpoint

Security

FromZone and FromClient serve as security layers. The registered cluster message handlers check their values and can then determine whether the child zone endpoint is actually allowed to e.g. send an acknowledgement to the parent zone and its endpoint.

Prevent Loops

Whenever the MessageOrigin is used, it means the message originated from another endpoint in the cluster. Additional event handlers are registered and as such, they may trigger another send to a different endpoint/zone. The origin must be passed for these events to avoid the "return to sender" problem.

This works with 2 endpoints in the same zone, where the MessageOrigin is extracted whenever a message is received.

Note

With 3+ endpoints, this might be the place to figure out how an endpoint determines that it already knows about a message.

Central Algorithm

RelayMessageOne() takes care of the routing. This involves fetching the targetZone for this message and its endpoints. The rules are condensed into a sketch after the list below.

  • Don’t relay messages to ourselves.
  • Don’t relay messages to disconnected endpoints.
  • Don’t relay the message to the zone through more than one endpoint unless this is our own zone.
  • Don’t relay messages back to the endpoint which we got the message from (this is the key rule for preventing loops).
  • Don’t relay messages back to the zone which we got the message from.
  • Only relay the message to the zone master if we’re not currently the zone master ourselves.
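
As a condensed, purely illustrative sketch of these rules (this is not the actual RelayMessageOne() implementation; the Endpoint, Zone and Origin structs and the ShouldRelayTo() helper are simplified stand-ins for the real classes):

struct Zone;

struct Endpoint {
    const Zone* zone = nullptr;
    bool connected = false;
};

struct Zone {
    // Only the zone's identity matters for the relay decision below.
};

struct Origin {
    const Endpoint* fromEndpoint = nullptr; // the endpoint behind FromClient
    const Zone* fromZone = nullptr;         // FromZone
};

// Decide whether 'self' should relay a message to 'target' living in 'targetZone'.
// 'alreadyRelayed' tells whether another endpoint of targetZone already received
// the message; 'zoneMaster' is the current master of self's local zone.
bool ShouldRelayTo(const Endpoint* self, const Endpoint* target, const Zone* targetZone,
                   const Origin* origin, bool alreadyRelayed, const Endpoint* zoneMaster)
{
    if (target == self)
        return false; // don't relay messages to ourselves
    if (!target->connected)
        return false; // don't relay messages to disconnected endpoints
    if (alreadyRelayed && targetZone != self->zone)
        return false; // at most one endpoint per zone, unless it is our own zone
    if (origin && target == origin->fromEndpoint)
        return false; // don't return the message to its sender
    if (origin && targetZone == origin->fromZone)
        return false; // don't send it back into the zone it came from
    if (self != zoneMaster && target != zoneMaster)
        return false; // non-masters only relay to the current zone master
    return true;      // relay; replay logging for skipped endpoints happens elsewhere
}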

The original routing considerations stem from #1505 (May 2014).
#752

Performance

The move to Boost ASIO, Context, Coroutines and Beast improved the overall performance significantly. Still, most of the overhead occurs within the actual data processors and TLS operations.

Message Count

#7711 provides insights into how many messages are being exchanged and the overhead this creates.

The ApiListener RelayQueue WorkQueue might grow over time.

Idea

Use a different implementation than the WorkQueue in all places.

We should also evaluate whether we can reduce certain messages, or combine updates in bulk.

Message Size

There's a trade-off between compression and performance.

If we were to go for zlib, this would make the cluster protocol v4 and Icinga 3.
The same applies to using Google Protocol Buffers instead of JSON in the JSON-RPC layer.
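
As a minimal sketch of what per-message zlib compression could look like (CompressMessage() is a hypothetical helper, not an existing Icinga code path; the compressed payload would still need length framing on the wire):

#include <zlib.h>

#include <stdexcept>
#include <string>
#include <vector>

// Compress a serialized JSON-RPC message with zlib before it is framed and sent.
std::vector<unsigned char> CompressMessage(const std::string& json)
{
    uLongf destLen = compressBound(static_cast<uLong>(json.size()));
    std::vector<unsigned char> out(destLen);

    int rc = compress2(out.data(), &destLen,
                       reinterpret_cast<const Bytef*>(json.data()),
                       static_cast<uLong>(json.size()), Z_BEST_SPEED);

    if (rc != Z_OK)
        throw std::runtime_error("zlib compression failed");

    out.resize(destLen); // actual compressed size
    return out;
}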

Connection Handling

Machine Learning for Always Failing Connections

Right now, our reconnect timer is lowered from 60s to 10s, ensuring that checks are executed, synced, etc. after a deployment has happened. The price we pay here is performance, especially when the master needs to connect to many agents (one workaround is to put a satellite zone in the middle).

Modern environments deploy and install agents even if the hosts are shut down or the service is not yet ready. Since these connection timeouts are handled by the Linux kernel, we cannot control the TCP timeout. Each long-running connection attempt blocks a coroutine/thread and may slow down the monitoring server's performance.

The idea is to implement a "marker" which calculates the health of the endpoint being connected to, or to calculate the next reconnect time or window.

  • Connection attempt failed 10 times -> lower the priority and skip the next 6 retries, effectively turning this into a one-minute reconnect interval (see the sketch below).

#6234
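
A minimal sketch of such a skip-based backoff marker, assuming the existing 10s reconnect timer stays in place; the ReconnectState struct and the helper functions are hypothetical:

// Per-endpoint reconnect state, evaluated on every 10s reconnect timer tick.
struct ReconnectState {
    int failedAttempts = 0; // consecutive connection failures
    int skipTicks = 0;      // number of upcoming timer ticks to skip
};

// Returns true if a connect attempt should actually be made this tick.
bool ShouldAttemptConnect(ReconnectState& state)
{
    if (state.skipTicks > 0) {
        state.skipTicks--;
        return false;
    }
    return true;
}

// Called when a connection attempt fails.
void OnConnectFailed(ReconnectState& state)
{
    state.failedAttempts++;

    // After 10 failures, skip the next 6 ticks: with the 10s timer this
    // effectively stretches the reconnect interval to roughly one minute.
    if (state.failedAttempts >= 10)
        state.skipTicks = 6;
}

// Called when a connection attempt succeeds: back to the fast 10s interval.
void OnConnectSucceeded(ReconnectState& state)
{
    state = ReconnectState{};
}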

Multiple Addresses for an Endpoint

Generally a valid request, but from a technical point of view it is hard to keep up with timeouts and retries. This ties in with the missing ML for reconnecting to endpoints: multiple addresses mean doing the same work all over again for each address.

#5976

Requested Features

More than 2+ Endpoints in a Zone

In 2015, a problem was reported that checks started lagging when a zone had 3+ endpoints. From our analysis, it seemed that a check result was sent in a loop, consuming CPU resources and draining overall performance.

This is tracked here: #3533 and linked in the docs to add more visibility. The config compiler detects if there are more than 2 endpoints in a zone, and logs a warning.

Solving the problem won't necessarily solve another request:

A worker pool of satellite checkers, as seen with Gearman job servers.

Having this issue fixed won't enable you to spin up 10 endpoints in a zone. By design, all these endpoints need to communicate with each other, and they will balance the checks amongst them. While it should work, a general pool of "dumb workers" is not what the cluster is designed for: it uses one binary with different roles defined by configuration and zone trees.

One of the latest reports tells us it may be working, but before enabling this for users again in an officially supported way, we need to re-test this in a cloud environment, with stress tests and many agents as well.

#3533 (comment)

Command Endpoints

Blackout Period for Reloading Agents

The agent/satellite notifies the parent whenever it attempts to reload itself. For this period, all checkable objects related to this endpoint are put into a "blackout window" until the endpoint connects again.

This addresses the problem that a reload disconnects the TCP session and a reconnect happens; during the sync period, unwanted UNKNOWN results may occur.

https://github.com/Icinga/icinga2/blob/master/lib/icinga/checkable-check.cpp#L584

#7186
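
A rough sketch of what such a per-endpoint blackout marker could look like; the struct, the helper and the 5-minute safety valve are all hypothetical and not part of the linked code:

#include <chrono>

// Hypothetical per-endpoint marker: set when the agent announces a reload,
// cleared again once the endpoint reconnects.
struct EndpointBlackout {
    bool reloadAnnounced = false;
    std::chrono::steady_clock::time_point since;
};

// Suppress "not connected" UNKNOWN results for checkables bound to this
// endpoint while the blackout window is active.
bool InBlackoutWindow(const EndpointBlackout& blackout,
                      std::chrono::seconds maxWindow = std::chrono::minutes(5))
{
    if (!blackout.reloadAnnounced)
        return false;

    // Safety valve: don't black out forever if the endpoint never comes back.
    return std::chrono::steady_clock::now() - blackout.since < maxWindow;
}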

Indirect command endpoint checks

The master schedules the check, the satellite forwards it, and finally it is executed on the agent. This currently is not possible and is also detected as an error by the config compiler.

Enabling this functionality would require breaking some boundaries and dependencies in the command endpoint logic.

Pinning and Replication

If you use command_endpoint to pin a check inside the master zone, it may occur that

  • The check is pinned on master2
  • Object authority is bound to master1
  • Connection drops
  • master1 cannot execute the check since the agent is not connected

This should result in an UNKNOWN state. The issue is quite old, with many fixes to the command endpoint logic since then.

#3739 needs to be tested and evaluated to see whether this now works.

It may also be a good idea to evaluate whether it makes sense to store command_endpoint execution attempts inside the replay log, or find a different method for the infamous UNKNOWN check result.

Check Balancing for Command Endpoints

#5406 suggests allowing the notation of

command_endpoint = [ "agent1", "agent2" ]

with Icinga implementing its own logic, e.g. when agent1 is not available, executing the check on the next command endpoint.
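
Purely to illustrate that failover semantic (ignoring balancing for a moment), a hypothetical helper could pick the first reachable endpoint from such a list; PickCommandEndpoint() and the connectivity callback are assumptions, not existing API:

#include <functional>
#include <optional>
#include <string>
#include <vector>

// Return the first connected endpoint from the configured candidate list,
// or nothing if none is reachable (which should yield an UNKNOWN result).
std::optional<std::string> PickCommandEndpoint(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& isConnected)
{
    for (const auto& name : candidates) {
        if (isConnected(name))
            return name;
    }
    return std::nullopt;
}

// Usage for command_endpoint = [ "agent1", "agent2" ], with IsEndpointConnected
// standing in for the real connectivity lookup:
//   auto chosen = PickCommandEndpoint({ "agent1", "agent2" }, IsEndpointConnected);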

The question is how load balancing can be handled here. The Check Scheduler would need to take this into account and shuffle all command_endpoint checks in the queue accordingly, with an offset.

This mimics our HA-enabled check balancing; we should investigate carefully here and avoid duplicating features.

It may make sense to introduce check_zone or command_zone, sending a specifically crafted execution message to this child zone, e.g. one where two agents exist.

There's an immediate problem:

  • The check must not be executed by multiple endpoints in the zone
  • Without a config object on the agent hosts, the object authority needs to be calculated "on receipt"

If there are connection problems, we need to keep the replay log as a feature again. This is known to have problems requiring a re-design, tracked in #7752

Note

Evaluate with care. This complicates the command_endpoint design even more.

A similar request exists with automated failover for the object authority in HA enabled zones in #6986.

Combine Remote Command Endpoint Cluster Messages

Instead of firing command endpoint checks for a single service, all services from a host should be executed in bulk.
The agent then waits until all these "grouped" checks are done, and returns the check results in bulk.

This is dependent on the check_interval though.

#7021

Command Endpoint vs Event Handlers

If a command_endpoint is specified, both check and event commands are executed on the remote endpoint.
There are occasions where the event handler should be fired on the scheduling endpoint, and not the execution endpoint. This is requested in this issue: #5658

The culprit is the naming of the config attribute: event_command_endpoint could work, but does not cleanly align with command_endpoint itself. We could also introduce two new attributes which overrule command_endpoint:

check_command_endpoint
event_command_endpoint

Considering this clutter in the configuration objects, it is worthwhile to think about a general overhaul of the agent and command_endpoint components to provide a better solution for this and the other feature requests.

Inventory of the Zone/Endpoint Tree

Currently the master instance does not necessarily know about the zone tree, especially with agents configured on the satellite only (e.g. if done via Puppet). The host/service check results are processed on the master itself. Given that we may or may not have the zone/endpoint object tree on the master, the idea was to:

  • Re-install the inventory timer from the bottom-up agent mode
  • Each endpoint collects its local metrics and sends them to its parent zone
  • The parent endpoint updates the zone tree, and keeps a shard of the tree in memory
  • The master(s) receive everything and expose this stats tree, queried via the REST API and streamed to Icinga DB.

A draft PoC already exists: #6499

Additional metrics

Time differences

Within the icinga::Hello messages exchanged between endpoints, the ts value could be sent. With a to-be-defined delta accounting for network round trips, this could indicate whether the endpoint clocks have drifted apart.

{
  "jsonrpc": "2.0",
  "method": "icinga::Hello"
  "params": {
    "ts": 1581933317
  }
}

Likewise, event::Heartbeat event messages can be extended with more metadata.

{
  "jsonrpc": "2.0",
  "method": "event::Heartbeat"
  "params": {
    "timeout": 120,
    "ts": 1581933317
  }
}
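
As a rough sketch of what the receiver could do with such a ts field (the 30s tolerance is an assumption, and the network round trip would still need to be factored into the to-be-defined delta):

#include <chrono>
#include <cmath>

// Compare the sender's "ts" value against the local clock and flag large drifts.
bool ClockWithinTolerance(double senderTs, double toleranceSeconds = 30.0)
{
    using namespace std::chrono;

    double localTs = duration<double>(system_clock::now().time_since_epoch()).count();
    return std::fabs(localTs - senderTs) <= toleranceSeconds;
}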

Multiple Parent Environments

This is partially related to the cluster message routing, and the zone tree configuration.

The idea from #4626 is simple:

  • One agent
  • 2+ parent zones (prod, staging, dev) which can execute commands

The routing needs to take care to send the result back to the requestor.

Clustered Checks with DNS Round Robin

A service might not necessarily always run on the same endpoint, e.g. with PostgreSQL cluster checks. Also, IP addresses may change for a service using DNS round robin.

There are situations where the certificate presented by the agent differs on the same IP address the connection is made to. Then the TLS CN checks fail and the connection is dropped.

#5108 describes this request.

A specially crafted agent could allow multiple parent environments, as well as a SAN (Subject Alternative Name) from the certificates. However, re-generating certificates is not an option, so it might be overruled with an extra config setting inside the ApiListener, porting this from Cluster v2: nodes which are allowed to connect, or similar. This needs to be taken into account on the parent side as well.

Config Sync

In addition to that, it gets more complicated with accepting other cluster messages, like synced configuration - check commands, etc.

This would require Icinga to run with multiple environments. A previous attempt with icinga-envs installed a TLS routing proxy up front, with different systemd services being spawned for the chroots. There is a certain chance that this could be revamped into a simpler and more sophisticated design.

If one specifies that the parent config sync is forbidden for an agent, or there is only one true source, this might be a viable solution.

Another idea would be to extend the command_endpoint logic to just execute a command, moving away from the local custom variable macro expansion on a virtual host object. There are some bugs, e.g. check_timeout is not passed to the remote agent and thus has no effect.

#6992

Auto-Discovery and Inventory

This plays a role in the cluster messages and in forwarding details from the lowest zone layer up to the master zone.

There are elements in the cluster which are already prepared for taking over this task.

  • Child zones always send their check results back to the parent zone. They will do so for Comments/Downtimes as well.
  • Parent zones have security measures to skip these messages; they never reach any readable state (transport is TLS).

Allowing these messages to be synced also needs a storage pool on the parent node, which must be persisted over restarts.

Inventory Messages

Start simple with the things we already have. Use the metrics gathered with the Icinga check and /v1/stats, and add some more local OS facts, e.g. the Windows version.

{
  "jsonrpc": "2.0",
  "method": "inventory::facts"
  "params": {
    "type": "os",
    "host": "icinga2-agent1.localdomain",
    "facts": {
      "cpu_count": 4.0,
      "os_version": "10",
      "os_type": "windows"
    }
  }
}

  • Allow these messages to traverse up.
  • Do not persist these messages in the replay log, they are volatile data.
  • Run inventory calls at a slow interval. For testing purposes, use 60s, but later move to 30m or 1h, similar to the Puppet agent.

Make these inventory messages available via API event streams to allow subscribing to them.

Inventory Message Part 2

  • Add some TCP checks for known services, using Boost.ASIO (see the sketch below).
    • Probe ports 3306, 5432
    • "Simulate nmap basically"

Add this data to the inventory messages and measure the performance impact/overhead.
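
A rough sketch of such a TCP probe with Boost.ASIO, using a synchronous connect for brevity; a real implementation would plug into the existing coroutine infrastructure and add timeouts:

#include <boost/asio.hpp>

#include <exception>
#include <string>

// Returns true if a plain TCP connection to host:port succeeds.
bool ProbeTcpPort(const std::string& host, unsigned short port)
{
    namespace asio = boost::asio;
    using asio::ip::tcp;

    try {
        asio::io_context io;
        tcp::resolver resolver(io);
        auto endpoints = resolver.resolve(host, std::to_string(port));

        tcp::socket socket(io);
        asio::connect(socket, endpoints); // throws on failure
        return true;
    } catch (const std::exception&) {
        return false;
    }
}

// Example: probe the ports mentioned above (MySQL/MariaDB and PostgreSQL).
//   bool mysqlUp = ProbeTcpPort("127.0.0.1", 3306);
//   bool pgsqlUp = ProbeTcpPort("127.0.0.1", 5432);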

Inventory Message Part 3

  • Probe a local file or executable to return inventory metadata as a JSON structure.
  • Figure out which format would be applicable here.
  • Write a test plugin which does a) checks and b) inventory (--icinga-inventory parameter); a toy sketch follows this list.
  • Evaluate whether we can update the plugin API
  • Evaluate whether specific Icinga components and modules can attach to this method.
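
A toy sketch of such a dual-mode plugin; the --icinga-inventory switch and the printed JSON structure are assumptions, since the actual format is still to be defined:

#include <cstring>
#include <iostream>

int main(int argc, char** argv)
{
    // Hypothetical inventory mode: print facts as JSON and exit.
    if (argc > 1 && std::strcmp(argv[1], "--icinga-inventory") == 0) {
        std::cout << R"({"type":"os","facts":{"os_type":"linux","cpu_count":4}})" << "\n";
        return 0;
    }

    // Regular check mode: plugin output plus the usual exit code.
    std::cout << "OK - everything fine\n";
    return 0; // 0 = OK
}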

Discovery

Allow triggering a discovery via a top-down command.

  • Similar to executecommand, maybe with a way to specify a fact list or filter later
  • REST API action (easier for integrating with the Director/Web later)

Final Format

Define a message format/structure which is backwards compatible.

  • Defines a message version
  • Allows adding custom fields
  • Serves only this purpose.

Proposed Solutions

  • Ensure that more than 2 endpoints in a zone reliably work.
  • Consider evaluating a PoC for "checker pools" in HA enabled zones.
  • Avoid complicating the setup and configuration
  • Evaluate assigning the "agent" role explicitly
    • This could lead to replacing the command_endpoint facility with something new.
  • Create a PoC for inventory/discovery messages

3 Endpoints in a Zone

Agent Role or Agent light

Bolt, Tasks, mcollective

  • Create a new cluster message type for executing a command
  • Pre-calculate the command arguments/line on the host which initiates the check execution
  • Distribute the message

Results - mark a CheckResult with its origin. Evaluate whether we may receive more than 1 CR - for multi checks and evaluation.

  • Receive 1 or more check results on the caller
  • Process the check result - if it was requested to be one.
    • Evaluate a way to execute a "check simulation" where the check is executed, but the result is not processed, only visible on the host/service as "test", or returned to the caller who is listening on the API endpoint.

Agent Scrape Target

  • Let the agent run checks on its own, deploy the configuration from above.
  • Provide a scrape endpoint via /metrics
  • Provide a bulk fetch/push for these check results to the parent nodes
  • Automatically create pre-activated objects for agents? Or, provide an inventory pool for allowing external interfaces to interact with them, and import/activate the objects.

A possible message for fact inventory could look like this, as discussed with @LordHepipud.

{
  "jsonrpc": "2.0",
  "method": "inventory::UpdateFacts",
  "params": {
    "host": NodeName,
    "facts": {
      "icinga": {
        "michi.int.netways.de": {
          "features": [ "mainlog", "checker", "api" ]
        }
      },

      "custom": {
        "michi.int.netways.de": {
          "disks": {
            "C": {
              "Partition": "1", 
              "Disk": "0",
              "Size": 510779191296, 
              "Free Space": 63.0008049
            }, 
            "E": {
              "Partition": "1", 
              "Disk": "1",
              "Size": 965841252352, 
              "Free Space": 27.6378059
            }, 
            "D": {
              "Partition": "0", 
              "Disk": "1",
              "Size": 34347155456, 
              "Free Space": 98.56515
            }
          }
        },
        "christian.int.netways.de": {
          "disks": [ ]
        }
      }
    }
  }
}
