Improve Cluster Config Sync #6716
@lippserd Here's a first draft for local config checksums between the received config and the local production configuration.
Another round of explanations with @Al2Klimov. My task is still to finalize the PR implementation.
I'm moving the part with "User renames/deletes zone configuration in zones.conf on the client" into a separate PR.
The culprit is that we're in the configuration compilation stage here, so we don't have access to `Zone::GetByName()` because objects have not been activated yet. Our best guess is a config item loaded earlier (e.g. from zones.conf), since no one can sync zones via the cluster config sync either. This may not be 100% correct since the zone object itself may be invalid. Still, if the zone object validator fails later, the config breaks either way. The problem with the removal of these directories is dealt with by the cluster config sync with stages. refs #6727 refs #6716
ref/NC/509507
Sync child object with missing template in master zone:
- Master 2: Receives the configuration; it is valid since the template exists in the master zone. Reload is triggered.
- Agent: Config is received, but the stage is broken.
It even points us to a log file ...
I've just found a small issue with syncing that `.authoritative` marker.
As always, a technical concept plus upgrade docs are provided.
Improve Cluster Config Sync
The purpose of this document is to provide solutions for existing sync problems and broken stages.
Scenarios
Client doesn't have the (global) zone configured that objects reference
The master configuration validates fine with a combination of a global zone for templates and the satellite host configuration.
The master's zones.conf looks like this:
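A minimal sketch of such a zones.conf, with illustrative endpoint and zone names (not taken from the original setup):

```
object Endpoint "master1" { }
object Endpoint "satellite1" { }

object Zone "master" {
  endpoints = [ "master1" ]
}

object Zone "satellite" {
  endpoints = [ "satellite1" ]
  parent = "master"
}

// Global zone carrying shared templates; only instances that also
// configure this zone receive and accept its files.
object Zone "global-templates" {
  global = true
}
```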
The satellite as client instance doesn't have `global-templates` configured yet, and will deny the synced configuration. This leads to an incomplete configuration, as the host object wants to import a template which does not exist.
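For illustration, assuming hypothetical object names: the master ships a host in the satellite zone which imports a template from the global zone, so a client that doesn't know `global-templates` cannot resolve the import.

```
// Shipped in the global zone, e.g. zones.d/global-templates/templates.conf:
template Host "generic-host" {
  check_interval = 1m
  retry_interval = 30s
}

// Shipped in the satellite zone, e.g. zones.d/satellite/hosts.conf:
object Host "app-server01" {
  import "generic-host"     // unresolvable on a client without "global-templates"
  check_command = "hostalive"
  address = "192.0.2.10"
}
```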
Solution
Do not immediately sync the configuration received via the `config::Update` message into `/var/lib/icinga2/api/zones`, but write it into a staging directory first. Validate the staged configuration, and when successful, copy it into production.

This solves #4354 and #4301.
User renames/deletes zone configuration in zones.conf on the client
This isn't related to the config sync per se, and requires configuration changes followed by reloads on the client itself (admin action).
Currently, the entire configuration in `/var/lib/icinga2/api/zones` is included during config compilation. At this stage in the config compiler, we don't have any objects activated yet, so we unfortunately cannot check against non-configured zone objects.
This leads to the following perpetuum mobile situation:
The configuration is loaded and compiled ...
... but the config validation fails since the Zone object 'global-templates' does not exist.
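As an illustration (file path and object names are assumptions, not taken from the original issue): the client still includes a previously synced directory whose objects belong to a zone that is no longer defined locally.

```
/* zones.conf on the client: the admin removed (or renamed) this definition ...

object Zone "global-templates" {
  global = true
}
*/

// ... but a previously synced file such as
// /var/lib/icinga2/api/zones/global-templates/_etc/templates.conf
// is still included during config compilation. Its objects belong to the
// now-unknown zone, so validation fails because the Zone object
// 'global-templates' does not exist.
template Host "generic-host" {
  check_interval = 1m
}
```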
Solution
This really is a problem, since we can only guess which zones may survive the config compiler. But since zones containing errors would break the configuration validation later on anyway, we can use this to our advantage and enforce a guess.
How?
By making the bold statement that you cannot sync Zone objects via the cluster config sync to this specific instance.
This way, we can safely assume that the zone configuration was done statically or via an API package.
The solution is to check whether a config item of the type "Zone" with the directory's name already exists.
This solves #3323
Upon the next config sync, the directories are then automatically purged (this requires the staged sync solution).
User renames zones on the client
If the production config imports a template which was previously available in a zone, this is a different error.
The user forcefully renamed something and broke the production configuration.
Future versions of Icinga 2 won't expose this problem as visibly, since the cluster config sync is now staged and the production config is only synced from above if it is valid.
If the user renames a zone and doesn't copy the zone's directory in `/var/lib/icinga2/api/zones`, manual action will still be needed. Icinga 2 may detect a difference, but doesn't know which zone was renamed from A to B. If such zone renames take place, the best option still is to purge `/var/lib/icinga2/api/zones` manually and restart the client. Since zone renaming usually happens in the early stages of a cluster design, this problem is less important than syncing partial configuration or including removed zones.
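A sketch of the rename situation, with hypothetical zone names (not taken from the original issue):

```
// zones.conf on the client, after the rename:

// object Zone "satellite" {            // old definition, removed by the admin
//   endpoints = [ "satellite1" ]
//   parent = "master"
// }

object Zone "dmz-satellite" {           // new name chosen by the admin
  endpoints = [ "satellite1" ]
  parent = "master"
}

// The previously synced directory /var/lib/icinga2/api/zones/satellite/ still
// exists on disk. Icinga 2 only sees that "satellite" is gone and
// "dmz-satellite" is new; it cannot tell that one was renamed into the other.
```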
User renames zone on the config master or deletes files
Simulate an empty directory from which all files have been removed before the sync.
The config sync then receives an empty directory and needs to diff it. This leads to possible errors,
and the files are left intact in production.
Solution
With the cluster config sync, we need to purge the stage before putting any new files in there. Since the `config::Update` message doesn't contain any reference to removed files, the diff mentioned before would otherwise take place. The state defined by the cluster message is put into the stage and then into production, without any leftovers.
This requires the cluster config sync stages with additional cleanup handling.
This solves #4191
Cleanup
Stage Cleanup
On a new sync via cluster message, the stage directory is purged entirely.
This is mainly to ensure that configuration files and directories deleted on the config master are not left there, where they could accidentally cause false positives during config validation.
This also forbids custom modifications made by the local user ...
... for example when attempting to fix the synced configuration. This really needs to be fixed on the config master.
Production Cleanup
When the stage configuration validates successfully, the production configuration is purged.
This is to ensure that deleted or modified files are really deleted in production, and the same
config files are loaded into memory as validated before inside the stage.
Up until a successful validation, the broken stage remains "active", and users can read the "startup.log" and "status" files generated on disk.
Notes
ref/NC/581835