The following proposal aims to reduce the memory footprint of the Icinga DB feature within Icinga 2 and to make it more resistant to both slow Redis query processing and high object update rates originating from Icinga 2.
Summary of the status quo
Let's take the processing of a check result and the corresponding state update in Redis as an example. The following summary is a bit simplified, but it should contain everything relevant for following along.
Icinga DB registers a handler for our corresponding Boost signal that will be called for each check result (icinga2/lib/icingadb/icingadb-objects.cpp, lines 127 to 129 in 5e9e0bb).
This handler just calls two functions on all currently active Icinga DB feature objects (technically there can be multiple; in practice, there will typically be only one); see icinga2/lib/icingadb/icingadb-objects.cpp, lines 2843 to 2849 in 5e9e0bb.
Let's ignore SendNextUpdate() for the moment and just focus on UpdateState(). That function already generates the corresponding Redis update queries and hands them over to the Redis connection (icinga2/lib/icingadb/icingadb-objects.cpp, lines 1154 to 1159 in 5e9e0bb). FireAndForgetQueries() then just appends them to a queue [1] (icinga2/lib/icingadb/redisconnection.cpp, lines 151 to 155 in 5e9e0bb).
From that queue, the queries are then written to the Redis socket, but that part is no longer relevant for this proposal, so I won't go into more detail here.
Room for improvement
There are two things here that I believe can be improved.
Obsolete pending queries: if there is a bit of backlog in the query queue, it can happen that there are two state updates for the same checkable in that queue. The first update is made obsolete by the second one, so there's little reason to still execute it. However, as it's already in the queue, it will still be executed, doing redundant work.
Memory footprint: by generating the full queries that early, the queue ends up containing quite a bit of bloat. Given that the queries are stored as std::vector<icinga::String>, the queue may contain numerous copies of strings like "HSET", "icinga:host:state", etc. (see the sketch below).
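To make the second point a bit more concrete, here is roughly what a single pending state update looks like in the queue today. This is a sketch with placeholder values (objectId, stateJson, stateChecksum); the exact commands and key layout in the real code may differ:

// Today: each pending state update consists of several fully expanded queries,
// each stored as an std::vector<icinga::String> holding its own copies of the
// command and key strings:
std::vector<icinga::String> query1 {"HSET", "icinga:host:state", objectId, stateJson};
std::vector<icinga::String> query2 {"HSET", "icinga:checksum:host:state", objectId, stateChecksum};
// ... and so on, for every update of every checkable that currently sits in the backlog.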
Proposal
In short: Instead of using the existing query queue, use a queue of object pointers and a per-object set of dirty flags. Make sure that the object is only present once in the queue.
Given that newer queries make previous queries obsolete, it should suffice to just place the object in the queue with an extra flag saying that its state information has to be updated. Once it's the object's turn to be updated in Redis, we can just take the most up-to-date state information and write it to Redis. This may be newer than when the object was put into the queue, but it's the current information at the time of the write, so it should be just fine (or even better as the latest information is available faster).
Pseudo-C++-ish example of what I imagine: first of all, a rough idea for the dirty bits (probably not final, the comments should give an idea what they mean, maybe some Redis keys are missing there, maybe some can even be split further, like state and next update):
enum DirtyBits {
// For example for hosts, update icinga:host, icinga:host:customvar, icinga:checksum:host
// and generate corresponding runtime updates.
Config = 1<<0,
// For example for hosts, update icinga:host:state, icinga:checksum:host:state, icinga:nextupdate:host
StateVolatile = 1<<1,
// For example for hosts, send a runtime update for icinga:host:state
StateRuntimeUpdate = 1<<2,
};
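One small C++ detail, just so the later snippets hold up: with a plain enum like this, the bitwise operators yield an int, so something like pendingObjectUpdates[host] |= StateVolatile wouldn't compile as-is. Either the map value could simply be an integer type, or the enum could get a pair of tiny operator overloads, for example:

inline DirtyBits operator|(DirtyBits a, DirtyBits b)
{
    // Combine two sets of dirty bits into one value.
    return static_cast<DirtyBits>(static_cast<int>(a) | static_cast<int>(b));
}

inline DirtyBits& operator|=(DirtyBits& a, DirtyBits b)
{
    return a = a | b;
}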
Then, for the queue data structures, something like this (again, not final, maybe this can be done nicer with boost::multi_index_container, maybe not):
// Queue of pending objects, must not contain duplicates.
std::queue<ConfigObject::Ptr> pendingObjects;
// Dirty bits per object, must contain a key if and only if it's also contained in pendingObjects.
// This allows it to be used for checking if objects are already enqueued.
std::unordered_map<ConfigObject::Ptr, DirtyBits> pendingObjectUpdates;
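A related detail (an assumption on my side, not something this proposal hinges on): std::unordered_map needs a hash for its key type, and a smart-pointer type like ConfigObject::Ptr may not come with a std::hash specialization. Assuming it exposes the underlying raw pointer via get(), a trivial identity-based hasher would do:

// Hash pending objects by the identity of the pointed-to object.
struct PendingObjectHash {
    std::size_t operator()(const ConfigObject::Ptr& object) const noexcept
    {
        return std::hash<const void*>{}(object.get());
    }
};

std::unordered_map<ConfigObject::Ptr, DirtyBits, PendingObjectHash> pendingObjectUpdates;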
Queueing an object would look something like this (not showing the synchronization; of course, that's a critical section that has to be protected):
// If the object has dirty bits, it's already contained in the queue.
if (!pendingObjectUpdates.contains(host)) {
pendingObjects.push(host);
}
// Set the desired dirty bit.
pendingObjectUpdates[host] |= StateVolatile;
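For completeness, the synchronization left out above could be as simple as one mutex around both containers. A hypothetical helper (the name is made up for illustration, not existing Icinga 2 API) might look like this:

std::mutex pendingMutex;

void EnqueueObjectUpdate(const ConfigObject::Ptr& host, DirtyBits bits)
{
    std::lock_guard<std::mutex> lock(pendingMutex);

    // If the object has dirty bits, it's already contained in the queue.
    if (!pendingObjectUpdates.contains(host)) {
        pendingObjects.push(host);
    }

    // Merge in the newly requested updates.
    pendingObjectUpdates[host] |= bits;
}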
Processing an object from the queue would be something like (same for synchronization as before):
object = pendingObjects.front();
dirty = pendingObjectUpdates[object];
pendingObjects.pop();
// Clears the dirty bits and signals that the object is no longer in the queue.
pendingObjectUpdates.erase(object);
// Next: check dirty bits and generate and write all necessary queries.
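That last step could then simply dispatch on the dirty bits, along these lines. The Send*() helpers are placeholders for the existing query-building code, not actual Icinga 2 function names:

if (dirty & Config) {
    // Regenerate icinga:host, icinga:host:customvar, icinga:checksum:host
    // and the corresponding runtime updates from the object's current config.
    SendConfigUpdate(object);
}

if (dirty & StateVolatile) {
    // Regenerate icinga:host:state, icinga:checksum:host:state, icinga:nextupdate:host
    // from the state the object has right now (possibly newer than at enqueue time).
    SendVolatileStateUpdate(object);
}

if (dirty & StateRuntimeUpdate) {
    // Send the runtime update for icinga:host:state.
    SendStateRuntimeUpdate(object);
}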
Regarding the memory footprint, this will reduce the queue size significantly: instead of fully expanded Redis queries, it now only contains a pointer and a small bitset per object. Additionally, the queue size is bounded: once all objects are in the queue, it will not grow any further. With the current implementation, if the queries can't be processed fast enough, the queue grows indefinitely; with this proposal, some updates would just be skipped, so for example maybe only every fifth check result would be written to Redis. That would result in some extra latency overall, but apart from this, it would work just fine. Also, as simple as that implementation looks, it would actually result in quite fair scheduling: the next object to be updated is the one that has been waiting the longest for its update.
Overall, this would also make the Icinga DB implementation less sensitive to suboptimal/excessive triggering of our Boost signals. If the same signal is called multiple times in quick succession, there's a good chance that the extra calls just do nothing, since the dirty bits are still set because the update was not yet processed. In fact, this is actually part of the reason for this proposal: while looking into #10082, it turned out that this is somewhat related to how SetNextCheck() is called (see that issue and #10082 (comment) for more information). If Icinga DB just triggered the related Redis updates based on Checkable::OnNextCheckChanged (i.e. every time Checkable::SetNextCheck() is called), the issue would be gone [2]. However, SetNextCheck() is called multiple times for every check execution, so with the current Redis update logic, that's not an easy fix, as it has the downside of producing pointless Redis queries. With this proposal, these extra calls wouldn't matter, making it an easy fix [3].
Remark: This change is not possible for all queries. In particular, for history entries, newer queries don't obsolete previous ones, so they still have to be queued [4].
Footnotes
1. Well, actually there are two queues involved. asio::post() enqueues a task in its work queue and that task then inserts the queries into the actual query queue.
2. Currently, it uses Checkable::OnNextCheckUpdated (caution: similar but not identical name, definitely not confusing at all) instead, which is called less often but does not work reliably across the cluster.
3. If the server and connection are fast enough to process the redundant queries, there's still a chance that these pointless queries are sent. If that's an issue, it can be addressed by introducing a small deliberate delay in the query processing: for each object, additionally store when it was placed in the queue and only process the object, say, a second later. Then, from the first update signaled to Icinga DB, the object has a time window of a second to finish processing the check result, for example, and all the dirty bits accumulated in that window will then be handled at once (a rough sketch of this idea follows below the footnotes).
4. There may still be some room for improvement there: for example, instead of storing a full query, a more compact struct only containing the dynamic values for the query could be put into the queue. However, this proposal is already large enough, so I wouldn't make this part of it.
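For footnote 3, a minimal sketch of how that delay could look (purely illustrative; it assumes the map value is extended from the bare DirtyBits to a small struct carrying the enqueue time):

struct PendingUpdate {
    DirtyBits Bits;
    std::chrono::steady_clock::time_point EnqueuedAt; // set only when the object is first inserted
};

// Consumer side: only pop the head of the queue once its grace period has expired.
auto object = pendingObjects.front();
auto pending = pendingObjectUpdates[object];

auto graceEnd = pending.EnqueuedAt + std::chrono::seconds(1);
if (std::chrono::steady_clock::now() < graceEnd) {
    // Too early: wait until the window has passed so that further dirty bits
    // from the same check result can still accumulate and be handled at once.
    std::this_thread::sleep_until(graceEnd);
}

pendingObjects.pop();
pendingObjectUpdates.erase(object);
// ... then process according to pending.Bits as before.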