-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Icinga2 looses state history and notifications during restart #10179
Comments
Here is everything you will need to reproduce the same errors locally:
Here is the archive with all needed data: |
I could also reproduce the same problem on both Debian and RHEL 9, but it was orders of magnitude less likely to happen there. |
I have now also reproduced the same issue with icingadb: How to Reproduce:
From my local tests on a VM with 4 Cores and 8GB of RAM, the error should be observable in around 10 Minutes / two deploys. To run that, download the tar -xaf dropped_state_query.tar.gz
cd dropped_state_query
cargo run --release -- --host 127.0.0.1 --user icingadb --password icingadb --icingadb | tee analyzed-history.log Hint: Rust can be installed easily from https://rustup.rs/ Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history. |
Hi all, Thank you for all the details. I am trying to reproduce the scenario, but so far without success. When was the screenshot taken? Immediately after executing the queries to check if something is missing in the database? As there is a high chance that Icinga DB has not inserted everything yet. Also, missing entries in the database do not necessarily mean that Icinga has not sent a notification. That should rather be verified using custom check and notification plugins. Best regards, |
Hello Eric. I have run the tests over the weekend. Even after several days they do not appear in icingadb. This is of course obvious to you, as I posted the screenshot on the 14th, while the state changes happened on the 10th. And as you know, if it were a few minutes ago, it would not have shown the date, but rather the delta time since then. As for the notifications, we became aware of the problem, because one of our services went into critical without sending any notifications. That does normally work and we have tried the configuration to make sure it works. It was on pure coincident that we noticed that, which lead us to investigate. |
@lippserd I have opened a PR to fix the issue. |
@lippserd Have you seen my PR? I would really appreciate some feedback. |
The discussion in #10191 was rapidly superseded by the other comments, which is why I am attempting to clarify it her. I would like to begin by emphasising that I do really appreciate the analyses and PR you have presented and have no intention of discrediting them in any way. However, your script relies heavily on data from the database. I am not saying that the data from the database is entirely inaccurate, but there are other factors at play like Redis, Icinga DB etc. Thus, you can't trace a bug in Icinga 2 solely based on the database info. Though, while writing down this comment, I was able to successfully trigger some of the described behaviours with just a few hosts. But first, I'll start by outlining the current deactivation order (without your PR) of the config objects. Initially, the checker is deactivated before anything else, and typically, during a normal circumstance, there won't be running checks after scheduler stops its thread. icinga2/lib/checker/checkercomponent.cpp Lines 79 to 81 in 7a20d98
icinga2/lib/checker/checkercomponent.cpp Lines 107 to 108 in 7a20d98
Afterwards, it will not schedule any new checks, but it will not stop already scheduled checks either. As a result, you may still observe running checks even after deactivating the checker. However, such running checks left overs should not cause a state change as once the Host or Service is also deactivated, no check result is processed (should not are the key words here! It is challenging, but it is possible to trigger state change events when the timings are perfectly matched). As illustrated in the example below, even if the check plugin has reported a problem state, most of the check results are nevertheless discarded because the corresponding hosts have already been deactivated. icinga2/lib/icinga/checkable-check.cpp Lines 151 to 152 in 7a20d98
Icinga2.confinclude <itl>
object CheckerComponent "checker" {}
object NotificationComponent "notification" {}
object IcingaDB "icingadb" {
host = "127.0.0.1"
port = 6381
}
object NotificationCommand "send" {
command = ["true"]
}
apply Notification "recover" to Host {
command = "send"
users = ["icingaadmin"]
types = [ Recovery ]
assign where true
}
apply Notification "problem" to Host {
command = "send"
users = ["icingaadmin"]
types = [ Problem ]
assign where true
}
object User "icingaadmin" {
types = [ Problem, Recovery ]
}
template Host "default" default {
check_command = "dummy"
max_check_attempts = 1
check_interval = 1s
var that = this
vars.dummy_state = () use (that) => {
sleep(3s)
if (that.state != 0.0) {
log("Host recovered")
return 0
} else {
log("Host run into or is in problem state")
return 2
}
}
vars.dummy_text = "I'm just testing something"
}
object Host "test" {}
object Host "test1" {}
object Host "test2" {}
object Host "test3" {}
object Host "test4" {} Local Changesdiff --git a/lib/base/configobject.cpp b/lib/base/configobject.cpp
index 4317771d1..7fee0cc71 100644
--- a/lib/base/configobject.cpp
+++ b/lib/base/configobject.cpp
@@ -599,7 +599,7 @@ void ConfigObject::StopObjects()
for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
#ifdef I2_DEBUG
- Log(LogDebug, "ConfigObject")
+ Log(LogInformation, "ConfigObject")
<< "Deactivate() called for config object '" << object->GetName() << "' with type '" << type->GetName() << "'.";
#endif /* I2_DEBUG */
object->Deactivate();
diff --git a/lib/checker/checkercomponent.cpp b/lib/checker/checkercomponent.cpp
index d92101f4e..37a7d3699 100644
--- a/lib/checker/checkercomponent.cpp
+++ b/lib/checker/checkercomponent.cpp
@@ -104,9 +104,12 @@ void CheckerComponent::CheckThreadProc()
while (idx.begin() == idx.end() && !m_Stopped)
m_CV.wait(lock);
- if (m_Stopped)
+ if (m_Stopped) {
+ Log(LogInformation, "Check Scheduler") << "Scheduler stopped";
break;
+ }
+ Log(LogInformation, "Check Scheduler") << "Scheduling a check";
auto it = idx.begin();
CheckableScheduleInfo csi = *it;
diff --git a/lib/icinga/checkable-check.cpp b/lib/icinga/checkable-check.cpp
index 6e3b8764b..509a98881 100644
--- a/lib/icinga/checkable-check.cpp
+++ b/lib/icinga/checkable-check.cpp
@@ -148,6 +148,8 @@ Checkable::ProcessingResult Checkable::ProcessCheckResult(const CheckResult::Ptr
}
+ Log(LogInformation, "Checkable") << "ProcessCheckResult IsActive - " << IsActive() << " state - " << cr->GetState() << " output - " << cr->GetOutput();
+
if (!IsActive())
return Result::CheckableInactive;
diff --git a/lib/icingadb/icingadb-objects.cpp b/lib/icingadb/icingadb-objects.cpp
index 920251969..102e3f3e6 100644
--- a/lib/icingadb/icingadb-objects.cpp
+++ b/lib/icingadb/icingadb-objects.cpp
@@ -34,6 +34,7 @@
#include <chrono>
#include <cmath>
#include <cstdint>
+#include <iostream>
#include <iterator>
#include <map>
#include <memory>
@@ -116,6 +117,7 @@ void IcingaDB::ConfigStaticInitialize()
const NotificationType& type, const CheckResult::Ptr& cr, const String& author, const String& text,
const MessageOrigin::Ptr&
) {
+ std::cerr << "IcingaDB::Checkable::OnNotificationSentToAllUsers." << std::endl;
IcingaDB::NotificationSentToAllUsersHandler(notification, checkable, users, type, cr, author, text);
});
@@ -125,6 +127,7 @@ void IcingaDB::ConfigStaticInitialize()
Checkable::OnFlappingChange.connect(&IcingaDB::FlappingChangeHandler);
Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
+ std::cerr << "IcingaDB::Checkable::OnNewCheckResult" << std::endl;
IcingaDB::NewCheckResultHandler(checkable);
});
@@ -1773,6 +1776,7 @@ void IcingaDB::SendSentNotification(
)
{
if (!GetActive()) {
+ std::cerr << "IcingaDB::SendSentNotification dropped a notification" << std::endl;
return;
} $ icinga2 daemon -c icinga2.conf
...
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test1' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test3' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test4' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test2' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
^C[2024-11-13 15:17:26 +0100] information/Application: Received request to shut down.
[2024-11-13 15:17:26 +0100] information/Application: Shutting down...
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'checker' with type 'CheckerComponent'.
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduler stopped
[2024-11-13 15:17:26 +0100] information/CheckerComponent: 'checker' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'notification' with type 'NotificationComponent'.
[2024-11-13 15:17:26 +0100] information/NotificationComponent: 'notification' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'IcingaDB'.
[2024-11-13 15:17:26 +0100] information/IcingaDB: Flushing history data buffer to Redis.
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 2 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:28 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-11-13 15:17:28 +0100] information/Notification: Sending 'Problem' notification 'test!problem' for user 'icingaadmin'
IcingaDB::Checkable::OnNotificationSentToAllUsers.
IcingaDB::SendSentNotification dropped a notification <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[2024-11-13 15:17:28 +0100] information/Notification: Completed sending 'Problem' notification 'test!problem' for checkable 'test' and user 'icingaadmin' using command 'send'.
[2024-11-13 15:17:28 +0100] critical/IcingaDB: history: 2 queries failed (attempt #3) while we're about to shut down. Giving up and discarding additional 8 queued history queries.
[2024-11-13 15:17:28 +0100] information/IcingaDB: 'icingadb' stopped.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingaadmin' with type 'User'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'CheckCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'send' with type 'NotificationCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'app' with type 'IcingaApplication'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Dumping program state to file '/Users/yhabteab/Workspace/icinga2/prefix/var/lib/icinga2/icinga2.state'
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/IcingaApplication: Icinga has shut down. It is possible that the issue you are experiencing may be resolved by the proposed PR #10191, but not because of the checker is stopped last, rather because of the changes to IDO and the Notification class. Therefore, I guess it makes sense to stop IDO, IcingaDB and the Notification component last, as well as to start them before the other objects, as is the case now. |
This did indeed alleviate the problem, but it still does not completely solve it. |
Would you mind sharing a few more details on what you've done and what isn't still fixed? :) |
I took this suggestion and changed the |
I had also a look on the CheckerComponent::ObjectHandler function and saw that if an object was paused, it was erased from m_IdleCheckables and m_PendingCheckables. For testing I cleared both before the shutdown and have not seen any problems since, but I'm still testing. I hope I understood the use of void CheckerComponent::Stop(bool runtimeRemoved)
{
{
std::unique_lock<std::mutex> lock(m_Mutex);
m_Stopped = true;
if (!runtimeRemoved) {
m_IdleCheckables.clear();
m_PendingCheckables.clear();
}
m_CV.notify_all();
}
m_ResultTimer->Stop(true);
m_Thread.join();
Log(LogInformation, "CheckerComponent")
<< "'" << GetName() << "' stopped.";
ObjectImpl<CheckerComponent>::Stop(runtimeRemoved);
} |
Have you verified in the logs whether these objects were stopped last? And how did you detect the missing state change? You can also add some custom logs to trace the execution of the check without having to enable debug logs for all components. I just quickly did some poor man's patching and testing, and couldn't find any ignored events from Icinga DB. diff --git a/lib/base/configobject.cpp b/lib/base/configobject.cpp
index 4317771d1..4435456f4 100644
--- a/lib/base/configobject.cpp
+++ b/lib/base/configobject.cpp
@@ -586,6 +586,13 @@ void ConfigObject::StopObjects()
std::vector<Type::Ptr> types = Type::GetAllTypes();
std::sort(types.begin(), types.end(), [](const Type::Ptr& a, const Type::Ptr& b) {
+ if (a->GetName() == "IcingaDB" || a->GetName() == "NotificationComponent") {
+ return false;
+ }
+ if (b->GetName() == "IcingaDB" || b->GetName() == "NotificationComponent") {
+ return true;
+ }
+
if (a->GetActivationPriority() > b->GetActivationPriority())
return true;
return false;
@@ -599,7 +606,7 @@ void ConfigObject::StopObjects()
for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
#ifdef I2_DEBUG
- Log(LogDebug, "ConfigObject")
+ Log(LogInformation, "ConfigObject")
<< "Deactivate() called for config object '" << object->GetName() << "' with type '" << type->GetName() << "'.";
That parameter indicates whether a configuration object was deleted at runtime, e.g. via the REST API or via a cluster message from some other endpoint. It is unlikely that this will happen for the Checker component though, as it is typically managed in configuration files and not through the API. |
Yes, I did as you suggested in the comment and logged the deactivate on LogInformation.
As always by looking at the database.
Thanks, exactly as I expected. With the last fix I posted, I can no longer see any missed State changes. I have seen one lost notifications, which is a problem, as we rely on the notifications to trigger a webhook, but that might have effectively just been the db connection, as I can't replicate it anymore. |
@yhabteab Thank you for your insight. It was really helpful to get at the bottom of this bug. With the changes I have made to the PR, this issue does no longer occur in my test environment. If there are any concerns left, please let me know and I'll do my best to address them. |
Describe the bug
If an object has a state change during an icinga2 restart (e.g. during a deploy), it is sometimes not written to the database and does not trigger the notifications.
To Reproduce
icingacli director basket restore < icinga-lost-statechange-basket.json
icingacli director config deploy
With that configuration running, deploy icinga2 a few times:
icingacli director config deploy --force --wait
Soon there will be state changes in the state history that should not be possible:
In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.
To find lost state histories quicker I used the following script:
dropped_state_query.tar.gz
It needs as parameters the endpoint, user and password. If the db is postgres, it can be run with the
--postgres
flag.Expected behavior
I expect icinga2 to not loose state changes like that.
Your Environment
Include as many relevant details about the environment you experienced the problem in
icinga2 --version
):icinga2 feature list
):icinga2 daemon -C
):Additional context
I could observe the loss of notifications in production, have however not yet reproduced that behavior locally. I suspect however that the two behavior are linked.
We could also observe the same behavior when creating objects over the icinga2 api and then immediately sending a check-result. Once again, I have not replicated this locally yet, but I suspect the problem is the same in all these cases.
The text was updated successfully, but these errors were encountered: