Improve Cluster Config Sync #6716

Closed
dnsmichi opened this issue Oct 23, 2018 · 10 comments · Fixed by #6727

dnsmichi commented Oct 23, 2018

Improve Cluster Config Sync

The purpose of this document is to provide solutions for existing sync problems and broken stages.

Scenarios

Client doesn't have the (global) zone configured that objects reference

The master configuration validates fine with a combination of a global zone for templates and the satellite host configuration.

vim zones.d/global-templates/templates.conf

template Host "linux-host-tmpl" {
  check_command = "hostalive"
}
vim zones.d/satellite/hosts.conf

object Host "linux-sat1" {
  import "linux-host-tmpl"
}

The master's zones.conf looks like this:

vim zones.conf

object Zone "master" { ... }
object Zone "satellite" { ... }
object Zone "global-templates" { global = true }

The satellite, acting as a client instance, doesn't have the global-templates zone configured yet and will reject the synced configuration.

vim zones.conf

object Zone "master" { ... }
object Zone "satellite" { ... }

This leads to an incomplete configuration as the host object wants to import a template which does not exist.

[2018-09-28 12:06:05 +0200] critical/config: Error: Import references unknown template: 'linux-host-tmpl'
Location: in icinga2b/lib/icinga2/api/zones/satellite/_etc/hosts.conf: 4:3-4:26
icinga2b/lib/icinga2/api/zones/satellite/_etc/hosts.conf(2):   import "linux-host-tmpl"
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^
icinga2b/lib/icinga2/api/zones/satellite/_etc/hosts.conf(3): }
icinga2b/lib/icinga2/api/zones/satellite/_etc/hosts.conf(4):

[2018-09-28 12:06:05 +0200] critical/config: 1 error
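
For completeness, the immediate manual workaround on the client side (independent of the staged sync proposed below) is to add the missing global zone to its own zones.conf, mirroring the master's entry:

vim zones.conf

object Zone "master" { ... }
object Zone "satellite" { ... }
object Zone "global-templates" { global = true }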

Solution

Do not immediately sync the received configuration into /var/lib/icinga2/api/zones but use a staging directory first.
Validate the staged configuration first, and when successful, copy it into production.

  • Client receives configuration via config::Update message
  • Stage directory is purged, config diff is taken between production and cluster message config
  • Configuration is written to the stage directory
  • All paths for this sync are collected for later processing
  • Only the current synced files are taken into account for a change - no extra configuration will be copied (deleted files, etc.)
  • When a config change is detected, this spawns a new process which runs the config validation
  • A callback is registered upon completion:
      • Exit code 0: validation was OK; rmdir(api/zones) and cp(api/zones-stage api/zones). This then triggers the full restart (a rough sketch of this step follows right after this list).
      • Error: the configuration restart is aborted and nothing is copied from the stage into production. The error output is logged into the ApiListener status for REST API access.
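
A minimal, non-authoritative sketch of that validate-then-swap step (the helper name, paths and the external validation command are placeholders for illustration; the real ApiListener spawns the validation asynchronously and reacts in a callback, as listed above):

// Sketch only: validate the staged zones directory, and only on success
// replace the production directory with the stage.
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

bool SwapStageIntoProduction(const std::string& stageDir, const std::string& prodDir,
                             const std::string& validateCommand) {
    // Exit code 0 means the staged configuration validated fine (simplified;
    // std::system() actually returns a wait status on POSIX).
    if (std::system(validateCommand.c_str()) != 0) {
        std::cerr << "Stage validation failed, keeping the production config untouched.\n";
        return false; // startup.log / status stay in the stage for inspection
    }

    // Validation OK: purge production and copy the stage over it.
    fs::remove_all(prodDir);
    fs::copy(stageDir, prodDir, fs::copy_options::recursive);
    return true; // the caller triggers the full restart afterwards
}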

This solves

#4354
#4301

User renames/deletes zone configuration in zones.conf on the client

This isn't related to the config sync per se, and requires configuration changes followed by reloads on the client itself (admin action).

Currently, the entire configuration in /var/lib/icinga2/api/zones is included during config compilation.
At this stage in the config compiler, no objects have been activated yet, so unfortunately we cannot check against Zone objects which are not configured.

/*
object Zone "global-templates" { global = true }
*/

This leads to the following perpetuum mobile situation:

The configuration is loaded and compiled ...

mbmif /usr/local/tests/icinga2/master-slave (master *+) # cat icinga2b/lib/icinga2/api/zones/global-templates/_etc/commands.conf
object CheckCommand "sleep" {
  command = [ "/bin/sleep", 30 ]
}

... but the config validation fails since the Zone object 'global-templates' does not exist.

[2018-09-28 11:45:47 +0200] critical/config: Error: Validation failed for object 'sleep' of type 'CheckCommand'; Attribute 'zone': Object 'global-templates' of type 'Zone' does not exist.
Location: in icinga2b/lib/icinga2/api/zones/global-templates/_etc/commands.conf: 1:0-1:26
icinga2b/lib/icinga2/api/zones/global-templates/_etc/commands.conf(1): object CheckCommand "sleep" {
                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
icinga2b/lib/icinga2/api/zones/global-templates/_etc/commands.conf(2):   command = [ "/bin/sleep", 30 ]
icinga2b/lib/icinga2/api/zones/global-templates/_etc/commands.conf(3): }

[2018-09-28 11:45:47 +0200] critical/config: 1 error

Solution

This is a real problem, since we can only guess which zones will survive the config compiler. Since zones containing errors would fail configuration validation later on anyway, we can use this to our advantage and enforce a guess.

How?

By making the bold assumption that you cannot sync Zone objects via the cluster config sync for this specific instance.

This way, we can fairly assume that zones are configured statically or via an API package.
The solution is to check whether a config item of type "Zone" matching the directory name already exists.

/* We don't have an activated zone object yet. We may forcefully guess from configitems
 * to not include this specific synced zones directory.
 */
if(!ConfigItem::GetByTypeAndName(Type::GetByName("Zone"), zoneName)) {
        Log(LogWarning, "config")
                << "Ignoring directory '" << path << "' for unknown zone '" << zoneName << "'.";
        return;
}

This solves #3323

Upon the next config sync, the directories are then automatically purged (this requires the staged-sync solution).

User renames zones on the client

If the production config imports a template which was previously available in a zone, this is a different error:
the user forcefully renamed something and broke the production configuration.

Future versions of Icinga 2 won't expose this problem as visibly, since the cluster config sync is now staged and the production config is only replaced from the stage if it is valid.

If the user renames a zone and doesn't copy the zone's directory in /var/lib/icinga2/api/zones, manual action will still be needed. Icinga 2 may detect a difference, but doesn't know which zone was renamed from A to B.

If such zone renames take place, the best option still is to purge /var/lib/icinga2/api/zones manually and restart the client.

Since zone renaming usually happens in the early stages of a cluster design, this problem is not as important as syncing partial configuration or including removed zones.

User renames zone on the config master or deletes files

Simulate an empty zone directory, where all files should now be removed upon sync:

cd icinga2a/etc/icinga2/zones.d/
mv master master-test
mkdir master

The config sync then receives an empty directory and needs to diff against it. This can lead to errors,
and the old files are left intact in production.

Solution

With the cluster config sync, we need to purge the stage before putting any new files in there. Since the config::Update message doesn't reference what was removed, the diff mentioned above takes place. The defined state
from the cluster message is put into the stage and then into production, without any leftovers.

This requires the staged cluster config sync with additional cleanup handling.
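
For illustration only (not the actual config sync code), the deletion part of that diff can be thought of as a set difference over relative file paths; the helper names below are hypothetical:

// Sketch: collect relative file paths present in the production zone directory
// but missing from the freshly staged one; those were deleted on the config master.
#include <filesystem>
#include <set>
#include <string>
#include <vector>

namespace fs = std::filesystem;

static std::set<std::string> CollectRelativeFiles(const fs::path& root) {
    std::set<std::string> files;
    if (!fs::exists(root))
        return files;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file())
            files.insert(fs::relative(entry.path(), root).generic_string());
    }
    return files;
}

std::vector<std::string> FilesDeletedOnMaster(const fs::path& prodZoneDir, const fs::path& stageZoneDir) {
    std::set<std::string> prod = CollectRelativeFiles(prodZoneDir);
    std::set<std::string> stage = CollectRelativeFiles(stageZoneDir);

    std::vector<std::string> deleted;
    for (const std::string& file : prod) {
        if (stage.find(file) == stage.end())
            deleted.push_back(file); // removed on the config master
    }
    return deleted;
}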

This solves #4191

Cleanup

Stage Cleanup

On a new sync via cluster message, the stage directory is purged entirely.

This is mainly to ensure that configuration files and directories deleted on the config master are not left behind, where they would accidentally cause false positives during config validation.
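
A minimal sketch of that purge step (an assumption for illustration, not the actual ApiListener code):

// Sketch: wipe the per-zone stage directory before writing the files from the
// cluster message, so files deleted on the config master cannot survive there.
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

void PurgeZoneStage(const fs::path& stageBaseDir, const std::string& zoneName) {
    fs::path zoneStage = stageBaseDir / zoneName; // e.g. ".../api/zones-stage/master"
    std::error_code ec;
    fs::remove_all(zoneStage, ec);      // remove any leftovers, ignore "not found"
    fs::create_directories(zoneStage);  // start from an empty stage for this zone
}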

This also forbids custom modifications made by the local user ...

mbmif /usr/local/tests/icinga2/master-slave (master *+) # vim icinga2b/lib/icinga2/api/zones-stage/master/_etc/myown.conf

<restart here>

mbmif /usr/local/tests/icinga2/master-slave (master *+) # ls icinga2b/lib/icinga2/api/zones-stage/master/_etc/myown.conf
ls: icinga2b/lib/icinga2/api/zones-stage/master/_etc/myown.conf: No such file or directory

... for example when attempting to fix the synced configuration. This really needs to be fixed on the config master.

Production Cleanup

When the staged configuration validates successfully, the production configuration is purged.

This is to ensure that deleted or modified files are really deleted in production, and that exactly the same
config files are loaded into memory as were validated in the stage.

Until a successful validation happens, the broken stage remains in place ("active"), and users can read the "startup.log" and "status" files generated on disk.

Notes

ref/NC/581835

dnsmichi added the enhancement (New feature or request) and area/distributed (Distributed monitoring: master, satellites, clients) labels on Oct 23, 2018
dnsmichi added this to the 2.11.0 milestone on Oct 23, 2018
dnsmichi self-assigned this on Oct 23, 2018
@dnsmichi

This also includes improved logging for configuration originating from an endpoint where we keep an authoritative copy (the central config master). This isn't a warning either, so it is logged as information now.

[Screenshot: screen shot 2018-10-25 at 11 41 46]

@dnsmichi

Design Drafts

From what I did last summer .. just joking, a few weeks ago.

[Design draft photos: img_2188, img_2189, img_2190]

@dnsmichi

Sync local zones directory

[Design draft photo: img_2191]

@dnsmichi

@lippserd Here's a first draft for comparing config checksums between the received config and the local production configuration.

No change

[Screenshot: screen shot 2018-10-26 at 14 12 26]

Changes

[Screenshot: screen shot 2018-10-26 at 14 13 16]
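
For illustration, the comparison boils down to matching two path → checksum maps, one computed locally from production and one received via the cluster message. A sketch (the exclusion of the internal marker files is an assumption here, not necessarily what the final implementation does):

// Sketch: decide whether a received zone config differs from production.
#include <map>
#include <string>

bool ConfigChecksumsDiffer(const std::map<std::string, std::string>& productionSums,
                           const std::map<std::string, std::string>& receivedSums) {
    auto relevant = [](const std::map<std::string, std::string>& sums) {
        std::map<std::string, std::string> out;
        for (const auto& [path, sum] : sums) {
            // Skip internal markers so e.g. a new timestamp alone doesn't force a reload.
            if (path != "/.timestamp" && path != "/.authoritative")
                out[path] = sum;
        }
        return out;
    };
    return relevant(productionSums) != relevant(receivedSums);
}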

@dnsmichi

Another round of explanations with @Al2Klimov

[Image: icinga2_cluster_config_sync_session_AK]

My task is still to finalize the PR implementation.

@dnsmichi

I'm moving the part with "User renames/deletes zone configuration in zones.conf on the client" into a separate PR.

dnsmichi pushed a commit that referenced this issue Apr 15, 2019
The culprit is that we're in the configuration compilation stage here;
we don't have access to `Zone::GetByName()` as objects have not
been activated yet.

Our best guess is from a config item loaded before (e.g. from zones.conf)
since no-one can sync zones via cluster config sync either.

It may not be 100% correct since the zone object itself may be invalid.
Still, if the zone object validator fails later, the config breaks either way.

The problem with the removal of these directories is dealt with by the
cluster config sync with stages.

refs #6727
refs #6716
@dnsmichi

ref/NC/509507
ref/NC/469941
ref/NC/581835
ref/NC/481777


dnsmichi commented Jun 6, 2019

Sync child object with missing template in master zone

michi@mbpmif ~/dev/testing/i2-local-cluster (master *=) $ mkdir etc-a/icinga2/zones.d/agent

michi@mbpmif ~/dev/testing/i2-local-cluster (master *=) $ cat etc-a/icinga2/zones.d/master/agent-tmpl.conf
template Host "agent-host" {
  check_command = "random"
  vars.bumsti = "keksi"
}

michi@mbpmif ~/dev/testing/i2-local-cluster (master *=) $ cat etc-a/icinga2/zones.d/agent/host.conf
object Host "sync-agent" {
  import "agent-host"
}

Master 2

It receives the configuration, which is valid (since the template exists in the master zone). A reload is triggered.

[2019-06-06 16:56:52 +0200] information/ApiListener: Received configuration for zone 'agent' from endpoint 'master1'. Comparing the checksums.
[2019-06-06 16:56:52 +0200] critical/ApiListener: Comparing old (0): '{}' to new (4): '{"/.authoritative":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","/.checksums":"7ede1276a9a32019c1412a52779804a976e163943e268ec4066e6b6ec4d15d73","/.timestamp":"c5ab56a3b481ce564a3a46e820c27735576394a1d0e2121523e06c2d7bd82e7b","/_etc/host.conf":"35d4823684d83a5ab0ca853c9a3aa8e592adfca66210762cdf2e54339ccf0a44"}'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/agent//.checksums' for zone 'agent'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/agent//.timestamp' for zone 'agent'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/agent//_etc/host.conf' for zone 'agent'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Applying configuration file update for path 'var-b/lib/icinga2/api/zones-stage/agent' (154 Bytes).
[2019-06-06 16:56:52 +0200] information/ApiListener: Received configuration for zone 'master' from endpoint 'master1'. Comparing the checksums.
[2019-06-06 16:56:52 +0200] critical/ApiListener: Comparing old (0): '{}' to new (4): '{"/.authoritative":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","/.checksums":"eb5b7d42d1e56e5ff11534eb41dcce72bf459b873274cd7976868ccc9946046a","/.timestamp":"89c41ee9b7b80f6d442b5bee86e9b6c85fa6c01ac06c97e27f615a12a1e60e0a","/_etc/agent-tmpl.conf":"ea8a1dbf48e3967f05300fd885303955f1209e25634aabeb8b636d5ebe4a7d24"}'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/master//.checksums' for zone 'master'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/master//.timestamp' for zone 'master'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Stage: Updating received configuration file 'var-b/lib/icinga2/api/zones-stage/master//_etc/agent-tmpl.conf' for zone 'master'.
[2019-06-06 16:56:52 +0200] information/ApiListener: Applying configuration file update for path 'var-b/lib/icinga2/api/zones-stage/master' (191 Bytes).
[2019-06-06 16:56:52 +0200] information/ApiListener: Received configuration from endpoint 'master1' is different to production, triggering validation and reload.
[2019-06-06 16:56:52 +0200] information/ApiListener: Config validation for stage 'var-b/lib/icinga2/api/zones-stage/' was OK, replacing into 'var-b/lib/icinga2/api/zones/' and triggering reload.

Agent

Config is received but the stage is broken.

[2019-06-06 16:56:51 +0200] information/ApiListener: Received configuration for zone 'agent' from endpoint 'master1'. Comparing the checksums.
[2019-06-06 16:56:51 +0200] critical/ApiListener: Comparing old (0): '{}' to new (4): '{"/.authoritative":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","/.checksums":"7ede1276a9a32019c1412a52779804a976e163943e268ec4066e6b6ec4d15d73","/.timestamp":"c5ab56a3b481ce564a3a46e820c27735576394a1d0e2121523e06c2d7bd82e7b","/_etc/host.conf":"35d4823684d83a5ab0ca853c9a3aa8e592adfca66210762cdf2e54339ccf0a44"}'.
[2019-06-06 16:56:51 +0200] information/ApiListener: Stage: Updating received configuration file 'var-c/lib/icinga2/api/zones-stage/agent//.checksums' for zone 'agent'.
[2019-06-06 16:56:51 +0200] information/ApiListener: Stage: Updating received configuration file 'var-c/lib/icinga2/api/zones-stage/agent//.timestamp' for zone 'agent'.
[2019-06-06 16:56:51 +0200] information/ApiListener: Stage: Updating received configuration file 'var-c/lib/icinga2/api/zones-stage/agent//_etc/host.conf' for zone 'agent'.
[2019-06-06 16:56:51 +0200] information/ApiListener: Applying configuration file update for path 'var-c/lib/icinga2/api/zones-stage/agent' (154 Bytes).
[2019-06-06 16:56:51 +0200] information/ApiListener: Received configuration from endpoint 'master1' is different to production, triggering validation and reload.
[2019-06-06 16:56:52 +0200] critical/ApiListener: Config validation failed for staged cluster config sync in 'var-c/lib/icinga2/api/zones-stage/'. Aborting. Logs: 'var-c/lib/icinga2/api/zones-stage//startup.log'

It even points us to a log file ...

[Screenshot: Screen Shot 2019-06-06 at 17 00 01]
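
As an aside (not part of the test above, just the standard recommendation): the agent-side validation failure would be avoided by keeping shared templates in a global zone that the agent also has configured, instead of the master zone, e.g.:

vim etc-a/icinga2/zones.d/global-templates/agent-tmpl.conf

template Host "agent-host" {
  check_command = "random"
  vars.bumsti = "keksi"
}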


dnsmichi commented Jun 6, 2019

I've just found a small issue with syncing that .authoritative marker.

@dnsmichi

As always, the technical concept plus upgrade docs are provided.
