Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken downtime comment sync #10000

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

yhabteab
Copy link
Member

@yhabteab yhabteab commented Feb 12, 2024

All objects must be synced sorted by their load dependency. Otherwise, downtimes and/or comments might get synced before their respective Checkables, which will result in comments and downtimes being ignored by the other endpoint since it does not yet know about their checkables. Given that the runtime config updates event does not trigger a reload on the remote endpoint, these objects won't be synced again until the next reload.

~/master2/icinga2 (bundled-cluster-fixes ✗) ls prefix/var/lib/icinga2/api/packages/_api/*/conf.d/downtimes | wc -l
    3501
~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    1501
~/master1/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5665/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501

After master2 reload:

~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501

closes #7786
closes #9873

TODO

@yhabteab yhabteab added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) ref/IP area/runtime Downtimes, comments, dependencies, events labels Feb 12, 2024
@cla-bot cla-bot bot added the cla/signed label Feb 12, 2024
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
@log1-c
Copy link
Contributor

log1-c commented Jul 22, 2024

Quick question: would this also fix the following issue:
Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

@Mordecaine
Copy link

Mordecaine commented Aug 13, 2024

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

@yhabteab
Copy link
Member Author

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

Hi, we don't know for sure whether this will fix #10078 as we still haven't identified exactly what is going wrong there, other than something is not working as expected. It's unlikely that this PR will fix #10078, but we can't tell you for sure until the cause for #10078 is identified.

@Al2Klimov Al2Klimov removed their assignment Aug 20, 2024
@Al2Klimov Al2Klimov self-requested a review August 20, 2024 11:16
@dgiesselbach
Copy link

@yhabteab When will this request be completed? Is there a timeline?

@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch from e3c289f to 78095b8 Compare September 16, 2024 07:21
@yhabteab yhabteab changed the base branch from master to enhanced-sort-types-by-load-dependencies September 16, 2024 07:23
@yhabteab yhabteab force-pushed the enhanced-sort-types-by-load-dependencies branch 4 times, most recently from cb4fe57 to eb97676 Compare September 20, 2024 14:18
@Al2Klimov Al2Klimov changed the base branch from enhanced-sort-types-by-load-dependencies to master September 20, 2024 15:30
@Al2Klimov Al2Klimov marked this pull request as draft September 20, 2024 15:31
@julianbrost
Copy link
Contributor

I believe similar problems will still exist for other types where there's no load_after dependency at the moment: many objects can refer to a time period, however, there's not a single load_after TimePeriod. There are more such examples: Host/Service -> *Command, Notification -> NotificationCommand (not necessarily a complete list).

@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch 2 times, most recently from 8645d1e to 1261254 Compare December 3, 2024 07:58
@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch from 1261254 to 163964f Compare December 3, 2024 10:58
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I've overseen an elephant, despite I like small diffs. Sorry😅

lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch 3 times, most recently from 845a61b to 2fd478a Compare December 3, 2024 14:39
@yhabteab yhabteab requested a review from Al2Klimov December 3, 2024 14:41
Al2Klimov
Al2Klimov previously approved these changes Dec 3, 2024
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFTM

@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch 2 times, most recently from 3254f36 to 770b730 Compare December 4, 2024 07:16
@yhabteab
Copy link
Member Author

yhabteab commented Dec 4, 2024

Fortunately, @julianbrost found out during testing that the objects are not being synchronised in the correct order, as they are supposed to be with this PR. Thus, I made some modifications in the DependencyGraph class to achieve the desired result without producing too many code changes.

Before

[2024-12-04 08:45:47 +0100] critical/config: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2):  groups = [ "dns-groups" ]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3):  version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4):  zone = "master"
[2024-12-04 08:45:47 +0100] critical/config: 1 error
[2024-12-04 08:45:47 +0100] critical/ApiListener: Could not create object 'dsl-groups':
[2024-12-04 08:45:47 +0100] critical/ApiListener: Error: Validation failed for object 'dsl-groups' of type 'HostGroup'; Attribute 'groups': Object 'dns-groups' of type 'HostGroup' does not exist.
Location: in /Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf: 2:2-2:26
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(1): object HostGroup "dsl-groups" {
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(2):  groups = [ "dns-groups" ]
                                                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(3):  version = 1733298340.337465
/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/packages/_api/354bb4c7-3d94-4df3-bba3-add86e471cd3/conf.d/hostgroups/dsl-groups.conf(4):  zone = "master"
[2024-12-04 08:45:47 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.

After

[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/overflow.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:01 +0100] information/ApiListener: Copying file 'master//_etc/services.conf' from config sync staging to production zones directory.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dns-groups' of type 'HostGroup'.
[2024-12-04 08:51:02 +0100] information/ConfigObjectUtility: Created and activated object 'dsl-groups' of type 'HostGroup'.
[2024-12-04 08:51:03 +0100] information/Application: Got reload command: Starting new instance.

@yhabteab yhabteab requested a review from Al2Klimov December 4, 2024 07:55
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try the below suggestions starting with "💡" myself. (I'm kind of "waiting" for #10254 and #9990.)

lib/base/dependencygraph.cpp Outdated Show resolved Hide resolved
lib/base/dependencygraph.cpp Show resolved Hide resolved
lib/base/dependencygraph.cpp Outdated Show resolved Hide resolved
lib/base/dependencygraph.hpp Show resolved Hide resolved
lib/base/dependencygraph.hpp Show resolved Hide resolved
lib/base/dependencygraph.hpp Show resolved Hide resolved
lib/base/dependencygraph.cpp Outdated Show resolved Hide resolved
lib/base/dependencygraph.hpp Show resolved Hide resolved
boost::multi_index::indexed_by<
// The first indexer is used for lookups by the Edge object itself, thus it
// needs its own hash function and comparison predicate.
boost::multi_index::hashed_non_unique<boost::multi_index::identity<Edge>, Edge::HashPredicate, Edge::HashPredicate>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why NON unique if you don't use this as multi-map anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But let's keep this IF you'll use this as multimap. (#10000 (comment))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why NON unique if you don't use this as multi-map anyway?

Especially given there IS a unique version.

@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch from 770b730 to dd6f9b9 Compare December 4, 2024 11:39
@yhabteab yhabteab requested a review from Al2Klimov December 4, 2024 11:43

if (it->second == 0)
refs.erase(it);
std::unique_lock<std::mutex> lock(m_Mutex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional

Suggested change
std::unique_lock<std::mutex> lock(m_Mutex);
std::unique_lock lock(m_Mutex);

Copy link
Member

@Al2Klimov Al2Klimov Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but highly desired for NEW code which won't be backported to v2.13. (Doesn't CLion annoy you like ugh can be omitted?)

Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any doubt that the terminology used so far by DependencyGraph is misleading? This terminology has caused confusion in the past (it was overlooked for quite some time that it returned the opposite direction from what was needed/expected in this PR), causes confusion right no (it makes my head hurt while looking at the PR) and will almost certainly cause more confusion in the future.

To keep this naming, you already have added multiple "this is stupid naming and should be the opposite" comments and I believe it would need even more warnings (basically the same warning/explanation of a strange child/parent definition for each method) that it raises the question why we shouldn't simply fix the terminology instead. Alternative would be to use better naming for the new wording for the new method to leave the existing callers untouched (something like GetRequirementsOf() for the new method), but mixing that sounds like another recipe for confusion.

Comment on lines +54 to +89
struct HashPredicate
{
/**
* Generates a unique hash of the given Edge object.
*
* Note, the hash value is generated only by combining the hash values of the parent and child pointers.
*
* @param edge The Edge object to be hashed.
*
* @return size_t The resulting hash value of the given object.
*/
size_t operator()(const Edge& edge) const
{
size_t seed = 0;
boost::hash_combine(seed, edge.parent);
boost::hash_combine(seed, edge.child);

return seed;
}

/**
* Compares whether the two Edge objects contain the same parent and child pointers.
*
* Note, the member property count is not taken into account for equality checks.
*
* @param a The first Edge object to compare.
* @param b The second Edge object to compare.
*
* @return bool Returns true if the two objects are equal, false otherwise.
*/
bool operator()(const Edge& a, const Edge& b) const
{
return a.parent == b.parent && a.child == b.child;
}
};
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining both these operations in a single type technically works, but it quite uncommon. You'd typically define two separate types so that you have Hash::operator() and Equal::operator() to do the corresponding operation (and it's clear without function overloading). That's also the reason there are two template argument.

Comment on lines +94 to +96
// The first indexer is used for lookups by the Edge object itself, thus it
// needs its own hash function and comparison predicate.
boost::multi_index::hashed_non_unique<boost::multi_index::identity<Edge>, Edge::HashPredicate, Edge::HashPredicate>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"by the Edge object itself" is a bit vague as Edge includes count which isn't included in the lookup. I'd rather say something like "the edge from child to parent".

Comment on lines +97 to +99
// The second and third indexers are used for lookups by the parent and child pointers.
boost::multi_index::hashed_non_unique<boost::multi_index::member<Edge, Object*, &Edge::parent>>,
boost::multi_index::hashed_non_unique<boost::multi_index::member<Edge, Object*, &Edge::child>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saying first/second/third here is somewhat confusing as accessing them is 0-indexed.

@Al2Klimov
Copy link
Member

why we shouldn't simply fix the terminology instead.

boost::multi_index::indexed_by<
// The first indexer is used for lookups by the Edge object itself, thus it
// needs its own hash function and comparison predicate.
boost::multi_index::hashed_non_unique<boost::multi_index::identity<Edge>, Edge::HashPredicate, Edge::HashPredicate>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why NON unique if you don't use this as multi-map anyway?

Especially given there IS a unique version.


if (it->second == 0)
refs.erase(it);
std::unique_lock<std::mutex> lock(m_Mutex);
Copy link
Member

@Al2Klimov Al2Klimov Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but highly desired for NEW code which won't be backported to v2.13. (Doesn't CLion annoy you like ugh can be omitted?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants