Ensure dangling cluster checks can be re-scheduled #3035
Conversation
Thanks for the investigation and the fix!
That was introduced late in #2862. Before that, Schedule was a simple loop calling add, with no side effects.
Your solution makes the configuration go through patchConfiguration again, which will duplicate the cluster_name tag and possibly have other side effects.
I think it's safer to patch dispatcher.run to not use Schedule as a shortcut, and directly call add on every item of the slice instead. WDYT?
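For illustration, here is a minimal sketch of that suggestion, assuming a dispatcher shaped roughly like this; the type and method names are stand-ins, not the actual datadog-agent code:

```go
// Sketch only: re-queue dangling configurations by calling add directly on
// each item, instead of going through Schedule, which re-applies
// patchConfiguration and would duplicate the cluster_name tag.
package dispatcher

import "time"

// Config is a stand-in for the scheduled check configuration.
type Config struct {
	Name         string
	ClusterCheck bool
}

type dispatcher struct {
	stop chan struct{}
}

// add stores a single configuration, with no patching side effects.
func (d *dispatcher) add(c Config) {
	// ... put the config back into the dispatching store ...
}

// retrieveAndClearDangling returns the configs whose node went away.
func (d *dispatcher) retrieveAndClearDangling() []Config {
	// ... collect configs left dangling by churned nodes ...
	return nil
}

func (d *dispatcher) run() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-d.stop:
			return
		case <-ticker.C:
			// Call add on every dangling item; do not use Schedule as a
			// shortcut, to avoid patching the config a second time.
			for _, c := range d.retrieveAndClearDangling() {
				d.add(c)
			}
		}
	}
}
```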
Codecov Report
@@            Coverage Diff            @@
##           master    #3035    +/-   ##
=========================================
+ Coverage   53.88%   53.89%   +<.01%
=========================================
  Files         537      537
  Lines       38014    38193     +179
=========================================
+ Hits        20485    20584      +99
- Misses      16316    16385      +69
- Partials     1213     1224      +11
Force-pushed from 9426d88 to 9cd2cea
🎉
I think this makes the lifecycle clearer.
* ensure dangling cluster checks can be re-scheduled
What does this PR do?
Fix proposal for a bug I encountered during QA of the Datadog Cluster Agent 1.2.0.
Essentially, configurations that had been scheduled once could not be rescheduled as Cluster Level Checks.
Investigation
Context: the Cluster Agent schedules a check on a Cluster Check Worker. When an update to the Cluster Check Worker is rolled out, the new Worker does not pick up the Cluster Level Checks left dangling by its churned predecessor(s).
What we observe
In the Cluster Agent:
Adding some logging when scheduling: note the true value, as the Cluster Level Check configuration is processed for the first time.
Once patched, it's now false.
Although the configuration will be retrieved by retrieveAndClearDangling, it will not be rescheduled in the next scheduling pass, as it is marked as a non Cluster Check (from the patching above).
As a reminder, we need to patch this variable so that the checks are not scheduled on the Cluster Agent, but can be scheduled on a Node Agent. See here.
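For context, here is a hedged sketch of what that patching step amounts to; patchConfiguration is mentioned above, but this body and signature are simplified assumptions, not the real implementation:

```go
// Simplified sketch, not the real patchConfiguration: the Cluster Agent
// flips the cluster-check flag off before handing the config to a node,
// so the node agent treats it as a regular check instead of ignoring it
// as a cluster-level one.
package dispatcher

// Config is a stand-in for the scheduled check configuration.
type Config struct {
	ClusterCheck bool
	Tags         []string
}

func patchConfiguration(in Config, clusterName string) Config {
	out := in
	// The dispatched copy must look like a normal check to the node agent.
	out.ClusterCheck = false
	// Other patching also happens here, e.g. appending the cluster_name tag.
	out.Tags = append(out.Tags, "cluster_name:"+clusterName)
	return out
}
```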
With this fix:
Note that the DanglingConfigs config is marked as a true Cluster Check config after having been removed from the node ip-10-1XXX2-149.ec2.internal.
Later on, we can see the same configuration kubernetes_state:fa9da93b28dcbfae is scheduled on ip-1XXX1-98.ec2.internal.
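A minimal sketch of that behaviour, with hypothetical names: when a node churns away, its configs are stored as dangling with the cluster-check flag restored, so the next scheduling pass can dispatch them again.

```go
// Hedged sketch of the fix described above (hypothetical names): when a
// node disappears, its configs are stored as dangling with the
// cluster-check flag set back to true, so the next scheduling pass treats
// them as Cluster Level Checks again and dispatches them to a live node.
package dispatcher

// Config is a stand-in for the scheduled check configuration.
type Config struct {
	Name         string
	ClusterCheck bool
}

type store struct {
	dangling map[string]Config
}

func newStore() *store {
	return &store{dangling: map[string]Config{}}
}

// markDangling records a config whose node disappeared.
func (s *store) markDangling(c Config) {
	// The flag was flipped to false when the config was patched for the
	// node agent; restore it so the config is eligible for re-dispatching.
	c.ClusterCheck = true
	s.dangling[c.Name] = c
}
```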
Additional Notes
Tried adding a comprehensive test to reproduce this scenario; it fails on master.
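For reference, here is a toy, self-contained model of such a test, with made-up names; the actual test in this PR exercises the real dispatcher:

```go
// Toy model of the scenario, with made-up names; the real test runs against
// the actual dispatcher and fails on master without the fix.
package dispatcher

import "testing"

type config struct {
	name         string
	clusterCheck bool
}

type fakeDispatcher struct {
	byNode   map[string][]config
	dangling []config
}

func (d *fakeDispatcher) scheduleOn(node string, c config) {
	c.clusterCheck = false // patched before being handed to the node agent
	d.byNode[node] = append(d.byNode[node], c)
}

func (d *fakeDispatcher) removeNode(node string) {
	for _, c := range d.byNode[node] {
		c.clusterCheck = true // the fix: restore the flag when dangling
		d.dangling = append(d.dangling, c)
	}
	delete(d.byNode, node)
}

func TestDanglingConfigIsRescheduled(t *testing.T) {
	d := &fakeDispatcher{byNode: map[string][]config{}}
	d.scheduleOn("node-a", config{name: "kubernetes_state", clusterCheck: true})
	d.removeNode("node-a")

	if len(d.dangling) != 1 || !d.dangling[0].clusterCheck {
		t.Fatalf("dangling config should still be a cluster check: %+v", d.dangling)
	}
}
```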