
[Proposal] Icinga DB: Better config and state update queueing #10186

Open

julianbrost opened this issue Oct 10, 2024 · 0 comments

Labels: area/icingadb (New backend), core/quality (Improve code, libraries, algorithms, inline docs)

julianbrost (Contributor) commented Oct 10, 2024

The following proposal has the goal of reducing the memory footprint of the Icinga DB feature within Icinga 2 and of making it more resistant to both slow Redis query processing and high object update rates originating from Icinga 2.

Summary of the status quo

Let's take the processing of a check result and the corresponding state update in Redis as an example. The following summary is a bit simplified, but it should contain everything relevant for following along.

Icinga DB registers a handler for our corresponding Boost signal that will be called for each check result:

Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
	IcingaDB::NewCheckResultHandler(checkable);
});

This handler just calls two functions on all currently active Icinga DB feature objects (technically there can be multiple; in practice, there will typically be only one):

void IcingaDB::NewCheckResultHandler(const Checkable::Ptr& checkable)
{
	for (auto& rw : ConfigType::GetObjectsByType<IcingaDB>()) {
		rw->UpdateState(checkable, StateUpdate::Volatile);
		rw->SendNextUpdate(checkable);
	}
}

Let's ignore SendNextUpdate() for the moment and just focus on UpdateState(). That function already generates the corresponding Redis update queries and hands them over to the Redis connection:

if (mode & StateUpdate::Volatile) {
	m_Rcon->FireAndForgetQueries({
		{"HSET", redisStateKey, objectKey, JsonEncode(stateAttrs)},
		{"HSET", redisChecksumKey, objectKey, JsonEncode(new Dictionary({{"checksum", checksum}}))},
	}, Prio::RuntimeStateSync);
}

FireAndForgetQueries() then just appends them to a queue¹:

asio::post(m_Strand, [this, item, priority, ctime, affects]() {
	m_Queues.Writes[priority].emplace(WriteQueueItem{nullptr, item, nullptr, nullptr, nullptr, ctime, affects});
	m_QueuedWrites.Set();
	IncreasePendingQueries(item->size());
});

From that queue, the queries are then written to the Redis socket, but what happens after that is no longer relevant for this proposal, so I won't go into more detail here.

Room for improvement

There are two things here that I believe can be improved.

  1. Obsolete pending queries: if there is a bit of backlog in the query queue, it can happen that the queue contains two state updates for the same checkable. The first update is made obsolete by the second one, so there is little reason to still execute it. Yet, as it's already in the queue, it will be executed anyway, doing redundant work (see the sketch after this list).
  2. Memory footprint: by generating the full queries that early, the queue contains quite a bit of bloat. Given that the queries are stored as std::vector<icinga::String>, the queue may contain numerous copies of strings like "HSET", "icinga:host:state", etc.
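
Both points can be seen in the state update from above. Conceptually, the queue then holds something like this (illustrative C++ only; the actual key and JSON values are elided):

icinga::String objectKey = "...";     // checkable ID (actual value elided)
icinga::String stateJsonV1 = "{...}"; // first, now outdated state JSON
icinga::String stateJsonV2 = "{...}"; // second, current state JSON

// Both queries target the same hash field, so executing the first one is pure
// overhead. Note also the duplicated "HSET" and "icinga:host:state" strings
// that point 2 is about.
std::vector<std::vector<icinga::String>> queuedQueries{
	{"HSET", "icinga:host:state", objectKey, stateJsonV1},
	{"HSET", "icinga:host:state", objectKey, stateJsonV2},
};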

Proposal

In short: Instead of using the existing query queue, use a queue of object pointers and a per-object set of dirty flags. Make sure that the object is only present once in the queue.

Given that newer queries make previous queries obsolete, it should suffice to just place the object in the queue with an extra flag saying that its state information has to be updated. Once it's the object's turn to be updated in Redis, we can just take the most up-to-date state information and write it to Redis. This may be newer than when the object was put into the queue, but it's the current information at the time of the write, so it should be just fine (or even better as the latest information is available faster).

Pseudo-C++-ish example of what I imagine: first of all, a rough idea for the dirty bits (probably not final; the comments should give an idea of what they mean, maybe some Redis keys are missing there, and maybe some bits can even be split further, like state and next update):

enum DirtyBits {
	// For example for hosts, update icinga:host, icinga:host:customvar, icinga:checksum:host
	// and generate corresponding runtime updates.
	Config = 1<<0,

	// For example for hosts, update icinga:host:state, icinga:checksum:host:state, icinga:nextupdate:host
	StateVolatile = 1<<1,

	// For example for hosts, send a runtime update for icinga:host:state
	StateRuntimeUpdate = 1<<2,
};
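
One small detail: since DirtyBits is a plain enum, an expression like dirty |= StateVolatile doesn't compile out of the box, so the usual bitwise operators would have to be defined (plain-C++ sketch; an existing bitmask helper could be reused instead if one is available):

inline DirtyBits operator|(DirtyBits a, DirtyBits b)
{
	return static_cast<DirtyBits>(static_cast<int>(a) | static_cast<int>(b));
}

inline DirtyBits& operator|=(DirtyBits& a, DirtyBits b)
{
	return a = a | b;
}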

Then, for the queue data structures, something like this (again, not final, maybe this can be done nicer with boost::multi_index_container, maybe not):

// Queue of pending objects, must not contain duplicates.
std::queue<ConfigObject::Ptr> pendingObjects;

// Dirty bits per object, must contain a key if and only if it's also contained in pendingObjects.
// This allows it to be used for checking if objects are already enqueued.
std::unordered_map<ConfigObject::Ptr, DirtyBits> pendingObjectUpdates;

Queueing an object would look something like this (not showing the synchronization; of course that's a critical section that has to be protected, see the synchronized sketch after the processing example below):

// If the object has dirty bits, it's already contained in the queue.
if (!pendingObjectUpdates.contains(host)) {
	pendingObjects.push(host);
}

// Set the desired dirty bit.
pendingObjectUpdates[host] |= StateVolatile;

Processing an object from the queue would be something like (same for synchronization as before):

object = pendingObjects.front();
dirty = pendingObjectUpdates[object];

pendingObjects.pop();
// Clears the dirty bits and signals that the object is no longer in the queue.
pendingObjectUpdates.erase(object);

// Next: check dirty bits and generate and write all necessary queries.
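
For completeness, here's a self-contained sketch of how both operations could look with the synchronization included. This is purely illustrative: the class and member names are made up, T stands in for ConfigObject::Ptr (and must be hashable, e.g. a std::shared_ptr), and it assumes the DirtyBits operators sketched above.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <utility>

template<typename T>
class DirtyObjectQueue
{
public:
	// Mark an object dirty; enqueue it only if it isn't queued already.
	void Enqueue(const T& object, DirtyBits bits)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);

		if (m_PendingUpdates.find(object) == m_PendingUpdates.end()) {
			m_PendingObjects.push(object);
			m_CV.notify_one();
		}

		m_PendingUpdates[object] |= bits;
	}

	// Block until an object is available, then remove and return it together
	// with all dirty bits accumulated while it was waiting in the queue.
	std::pair<T, DirtyBits> Dequeue()
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_CV.wait(lock, [this] { return !m_PendingObjects.empty(); });

		T object = m_PendingObjects.front();
		m_PendingObjects.pop();

		auto it = m_PendingUpdates.find(object);
		DirtyBits bits = it->second;
		// Erasing the entry signals that the object is no longer queued.
		m_PendingUpdates.erase(it);

		return {object, bits};
	}

private:
	std::mutex m_Mutex;
	std::condition_variable m_CV;
	std::queue<T> m_PendingObjects;
	std::unordered_map<T, DirtyBits> m_PendingUpdates;
};

With something like that, the signal handlers would boil down to a single Enqueue() call, and the Redis writer would Dequeue() objects one by one and build the queries from the then-current object state.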

Regarding the memory footprint, this will reduce the queue size significantly: instead of fully expanded Redis queries, it now only contains a pointer and a small bitset per object. Additionally, the queue size is bounded: once all objects are in the queue, it will not grow any further. With the current implementation, if the queries can't be processed fast enough, the queue grows indefinitely. With this proposal, some updates would simply be skipped instead, so for example maybe only every fifth check result would be written to Redis. That would result in some extra latency overall, but apart from that, it would work just fine. Also, as simple as that implementation looks, it would actually result in quite fair scheduling: the next object to be updated is the one that has been waiting the longest for its update.

Overall, this would also make the Icinga DB implementation less sensitive to suboptimal/excessive triggering of our Boost signals. If the same signal is fired multiple times in quick succession, there's a good chance that the extra calls just do nothing, as the dirty bits are still set because the update was not yet processed. In fact, this is part of the reason for this proposal: while looking into #10082, it turned out that the issue is somewhat related to how SetNextCheck() is called (see that issue and #10082 (comment) for more information). If Icinga DB just triggered the related Redis updates based on Checkable::OnNextCheckChanged (i.e. every time Checkable::SetNextCheck() is called), the issue would be gone². However, SetNextCheck() is called multiple times for every check execution, so with the current Redis update logic, that's not an easy fix, as it has the downside of producing pointless Redis queries. With this proposal, these extra calls wouldn't matter, making it an easy fix³.

Remark: this change is not possible for all queries. In particular, for history entries, newer queries don't obsolete previous ones, so they still have to be queued individually⁴.

Footnotes

  1. Well, actually there are two queues involved: asio::post() enqueues a task in its work queue, and that task then inserts the queries into the actual query queue.

  2. Currently, it uses Checkable::OnNextCheckUpdated (caution: similar but not identical name, definitely not confusing at all) instead, which is called less often but does not work reliably across the cluster.

  3. If the server and connection are fast enough to process the redundant queries, there's still a chance that these pointless queries are sent. If that's an issue, it can be addressed by introducing a small deliberate delay in the query processing: for each object, additionally store when it was placed in the queue and only process the object, say, a second later. Then, from the first update signaled to Icinga DB, the object has a time window of a second in which, for example, the check result can finish processing, and all the dirty bits accumulated in that window will be handled at once (a small sketch follows after these footnotes).

  4. There may still be some room for improvement there: for example, instead of storing a full query, a more compact struct containing only the dynamic values for the query could be put into the queue. However, this proposal is already large enough, so I wouldn't make this part of it.
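
A minimal, self-contained sketch of the deliberate delay from footnote 3 (all names are made up; ObjectId stands in for ConfigObject::Ptr):

#include <chrono>
#include <queue>

// Each queue entry remembers when it was enqueued; the consumer only takes
// the front entry once it has aged for a fixed window, letting dirty bits
// from rapid-fire signals accumulate into a single Redis update.
struct PendingEntry
{
	int ObjectId; // stand-in for ConfigObject::Ptr in this sketch
	std::chrono::steady_clock::time_point EnqueuedAt;
};

// True once the oldest entry has waited out its accumulation window.
inline bool ReadyToProcess(const std::queue<PendingEntry>& queue,
    std::chrono::steady_clock::duration window = std::chrono::seconds(1))
{
	return !queue.empty() &&
	    std::chrono::steady_clock::now() - queue.front().EnqueuedAt >= window;
}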
