
Ensure dangling cluster checks can be re-scheduled #3035

Merged (3 commits into master, Feb 14, 2019)
Conversation

@CharlyF (Contributor) commented Feb 14, 2019

What does this PR do?

This is a fix proposal for a bug I encountered during QA of the Datadog Cluster Agent 1.2.0.
Essentially, configurations that had been scheduled once could not be rescheduled as Cluster Level Checks.

Investigation

Context: the Cluster Agent schedules a check on a Cluster Check Worker. When an update to the Cluster Check Worker is rolled out, the new Worker does not pick up the Cluster Level Checks left dangling by its churned predecessor(s).

What we observe

In the Cluster Agent:

    Check Configurations: 2
      - Dispatched: 0
      - Unassigned: 0

Adding some logging:

When scheduling:

2019-02-13 22:22:56 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:51 in Schedule) | [DEV] Scheduling [{kubernetes_state [[107 ...105 110 103 10]] [123 125] [] [] [kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 true 1}

Note the true: the configuration is being processed as a Cluster Level Check for the first time.

Once patched:

2019-02-13 22:22:56 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:61 in Schedule) | [DEV] Add {kubernetes_state [[101 109 ...110 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 false 1}

It's now false.

Although the configuration is retrieved by retrieveAndClearDangling, it will not be rescheduled on the next Schedule call, because it is marked as a non Cluster Check (from the patching above).

As a reminder, we need to patch this variable so that the checks are not scheduled on the Cluster Agent, but can be scheduled on a Node Agent. See here.
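The patching step described above can be sketched as follows. This is a minimal illustration, assuming simplified stand-ins: `Config` and `patchConfig` here are not the agent's real `integration.Config` and `patchConfiguration`, only the one relevant field is kept.

```go
package main

import "fmt"

// Config is a trimmed-down stand-in for integration.Config; only the
// ClusterCheck flag relevant to this bug is kept. (Illustrative, not
// the real struct.)
type Config struct {
	Name         string
	ClusterCheck bool
}

// patchConfig mimics the dispatcher's patching step: the ClusterCheck
// flag is cleared before the config is handed to a node agent, so the
// node agent schedules it like any local check instead of handing it
// back to the Cluster Agent.
func patchConfig(c Config) Config {
	c.ClusterCheck = false
	return c
}

func main() {
	c := Config{Name: "kubernetes_state", ClusterCheck: true}
	fmt.Println(patchConfig(c).ClusterCheck) // prints "false"
}
```

The bug follows from this one-way flip: once a config has gone through the patch, nothing restores the flag, so a later scheduling pass rejects it as a non Cluster Check.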

With this fix:

2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:107 in expireNodes) | [DEV] Removing fa9da93b28dcbfae
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | [DEV] d.store.digestToNode map[fa9da93b28dcbfae:ip-10-1XXX2-149.ec2.internal]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:110 in expireNodes) | [DEV] DanglingConfigs map[7a5d781a60553256:{kubernetes_state [[101 109 112 116 ... 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 true 1}]
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_configs.go:118 in retrieveAndClearDangling) | [DEV] State of danglingConfigs after clear map[]
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:61 in Schedule) | [DEV] Add {kubernetes_state [[101 109 112 ... 105 110 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 false 1}
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:90 in add) | Dispatching configuration kubernetes_state:fa9da93b28dcbfae to node ip-1XXX1-98.ec2.internal

Note that the dangling config is marked as a true Cluster Check config after having been removed from node ip-10-1XXX2-149.ec2.internal.
Later on, we can see the same configuration kubernetes_state:fa9da93b28dcbfae being scheduled on ip-1XXX1-98.ec2.internal.

Additional Notes

I added a comprehensive test reproducing this scenario; it fails on master.

@CharlyF CharlyF requested a review from a team as a code owner February 14, 2019 00:16
@CharlyF CharlyF added this to the DCA_1.2.0 milestone Feb 14, 2019
@CharlyF CharlyF requested a review from xvello February 14, 2019 00:17
@xvello (Contributor) left a comment
Thanks for the investigation and the fix!
That was introduced late in #2862. Before that, Schedule was a simple loop calling add, with no side effects.

Your solution makes the configuration go through patchConfiguration again, which will duplicate the cluster_name tag and possibly have other side effects.

I think it's safer to patch dispatcher.run to not use Schedule as a shortcut, and directly call add on every item of the slice instead. WDYT?
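The suggested change could be sketched roughly like this. All names here are simplified assumptions: the real dispatcher carries a store, locking, and richer add/Schedule signatures that are omitted.

```go
package main

import "fmt"

// Config is a simplified stand-in for integration.Config.
type Config struct {
	Name         string
	ClusterCheck bool
}

// dispatcher is a minimal stand-in for the cluster-checks dispatcher.
type dispatcher struct {
	dispatched []Config
}

// add dispatches a single configuration, with no patching side effects.
func (d *dispatcher) add(c Config) {
	d.dispatched = append(d.dispatched, c)
}

// reschedule shows the shape of the suggestion: iterate the dangling
// slice and call add directly, instead of routing it back through
// Schedule (which would run patchConfiguration a second time).
func (d *dispatcher) reschedule(configs []Config) {
	for _, c := range configs {
		d.add(c)
	}
}

func main() {
	d := &dispatcher{}
	d.reschedule([]Config{{Name: "kubernetes_state", ClusterCheck: true}})
	fmt.Println(len(d.dispatched)) // prints "1"
}
```

The point of the design is that only the first pass through Schedule mutates the config; every later dispatch of an already-known config goes straight to add.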

@codecov bot commented Feb 14, 2019

Codecov Report

Merging #3035 into master will increase coverage by <.01%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master    #3035      +/-   ##
==========================================
+ Coverage   53.88%   53.89%   +<.01%     
==========================================
  Files         537      537              
  Lines       38014    38193     +179     
==========================================
+ Hits        20485    20584      +99     
- Misses      16316    16385      +69     
- Partials     1213     1224      +11
| Flag | Coverage Δ |
| --- | --- |
| #linux | 56.86% <ø> (-0.04%) ⬇️ |
| #windows | 55.67% <75%> (+0.01%) ⬆️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pkg/clusteragent/clusterchecks/dispatcher_nodes.go | 97.59% <100%> (+0.02%) ⬆️ |
| pkg/clusteragent/clusterchecks/dispatcher_main.go | 80.24% <71.42%> (+0.24%) ⬆️ |
| pkg/logs/config/processing_rules.go | 36.84% <0%> (-26.32%) ⬇️ |
| pkg/logs/config/integration_config.go | 71.42% <0%> (-23.02%) ⬇️ |
| pkg/logs/status/status.go | 78.88% <0%> (-21.12%) ⬇️ |
| pkg/logs/processor/processor.go | 32.72% <0%> (-2.83%) ⬇️ |
| pkg/trace/writer/stats.go | 87.62% <0%> (-2.07%) ⬇️ |
| pkg/forwarder/worker.go | 92.5% <0%> (-1.25%) ⬇️ |
| pkg/logs/scheduler/scheduler.go | 60.13% <0%> (-0.61%) ⬇️ |
| pkg/logs/input/kubernetes/launcher.go | 33.06% <0%> (-0.55%) ⬇️ |
| ... and 5 more | |

@CharlyF CharlyF force-pushed the fix-dangling-config branch from 9426d88 to 9cd2cea Compare February 14, 2019 13:43
@xvello (Contributor) left a comment
🎉

@CharlyF (Contributor, Author) commented Feb 14, 2019

2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:104 in expireNodes) | Expiring out node ip-10-1XXX-65.ec2.internal, last status report 44 seconds ago
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | Adding kubernetes_state:fa9da93b28dcbfae as a dangling Cluster Check config
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | Adding kubernetes_state:7a5d781a60553256 as a dangling Cluster Check config
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_main.go:82 in reschedule) | Rescheduling the check kubernetes_state:fa9da93b28dcbfae
2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:94 in add) | Dispatching configuration kubernetes_state:fa9da93b28dcbfae to node ip-10-1XXX-152.ec2.internal
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_main.go:82 in reschedule) | Rescheduling the check kubernetes_state:7a5d781a60553256
2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:94 in add) | Dispatching configuration kubernetes_state:7a5d781a60553256 to node ip-10-12XXX-152.ec2.internal

I think this makes the lifecycle clearer.
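The lifecycle shown in the final logs could be sketched like this. Again a sketch under assumptions: `expireNode` and `retrieveAndClearDangling` here are simplified illustrations keyed on the config digests seen in the logs, not the agent's real functions.

```go
package main

import "fmt"

// Config is a simplified stand-in for integration.Config.
type Config struct {
	Name         string
	ClusterCheck bool
}

type dispatcher struct {
	dangling map[string]Config // keyed by config digest
}

// expireNode stores an expired node's configs as dangling, restoring
// the ClusterCheck flag so the next scheduling pass accepts them again
// instead of discarding them as non Cluster Checks.
func (d *dispatcher) expireNode(configs map[string]Config) {
	for digest, c := range configs {
		c.ClusterCheck = true
		d.dangling[digest] = c
	}
}

// retrieveAndClearDangling returns the stored dangling configs and
// resets the map, mirroring the helper referenced in the logs.
func (d *dispatcher) retrieveAndClearDangling() []Config {
	configs := make([]Config, 0, len(d.dangling))
	for _, c := range d.dangling {
		configs = append(configs, c)
	}
	d.dangling = make(map[string]Config)
	return configs
}

func main() {
	d := &dispatcher{dangling: make(map[string]Config)}
	// A config that was already patched (flag false) comes back from
	// an expired node with the flag restored.
	d.expireNode(map[string]Config{
		"fa9da93b28dcbfae": {Name: "kubernetes_state", ClusterCheck: false},
	})
	for _, c := range d.retrieveAndClearDangling() {
		fmt.Println(c.Name, c.ClusterCheck) // prints "kubernetes_state true"
	}
	fmt.Println(len(d.dangling)) // prints "0"
}
```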

@CharlyF CharlyF merged commit bb4fad8 into master Feb 14, 2019
@CharlyF CharlyF deleted the fix-dangling-config branch February 14, 2019 14:04
CharlyF pushed a commit that referenced this pull request Feb 14, 2019
* ensure dangling cluster checks can be re-scheduled