
Ensure dangling cluster checks can be re-scheduled #3035

Merged (3 commits into master, Feb 14, 2019)
Conversation

@CharlyF (Contributor) commented Feb 14, 2019

What does this PR do?

This is a fix proposal for a bug I encountered during QA of the Datadog Cluster Agent 1.2.0.
Essentially, configurations that had been scheduled once could not be rescheduled as Cluster Level Checks.

Investigation

Context: the Cluster Agent schedules a check on a Cluster Check Worker. When an update to the Cluster Check Worker is rolled out, the new Worker does not pick up the Cluster Level Checks left dangling by its churned predecessor(s).

What we observe

In the Cluster Agent:

    Check Configurations: 2
      - Dispatched: 0
      - Unassigned: 0

Adding some logging:

When scheduling:

2019-02-13 22:22:56 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:51 in Schedule) | [DEV] Scheduling [{kubernetes_state [[107 ...105 110 103 10]] [123 125] [] [] [kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 true 1}

Note the true: the configuration is being processed as a Cluster Level Check for the first time.

Once patched:

2019-02-13 22:22:56 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:61 in Schedule) | [DEV] Add {kubernetes_state [[101 109 ...110 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 false 1}

It's now false.

Although the configuration is retrieved by retrieveAndClearDangling, it will not be rescheduled on the next Schedule call, because it is marked as a non Cluster Check (from the patching above).

As a reminder, we need to patch this variable so that the checks are not scheduled on the Cluster Agent, but can be scheduled on a Node Agent. See here.
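The patching step described above can be sketched as follows. This is a minimal illustration, assuming simplified stand-ins: `Config` and `patchConfig` here are not the agent's real `integration.Config` and `patchConfiguration`, only the one relevant field is kept.

```go
package main

import "fmt"

// Config is a trimmed-down stand-in for integration.Config; only the
// ClusterCheck flag relevant to this bug is kept. (Illustrative, not
// the real struct.)
type Config struct {
	Name         string
	ClusterCheck bool
}

// patchConfig mimics the dispatcher's patching step: the ClusterCheck
// flag is cleared before the config is handed to a node agent, so the
// node agent schedules it like any local check instead of handing it
// back to the Cluster Agent.
func patchConfig(c Config) Config {
	c.ClusterCheck = false
	return c
}

func main() {
	c := Config{Name: "kubernetes_state", ClusterCheck: true}
	fmt.Println(patchConfig(c).ClusterCheck) // prints "false"
}
```

The bug follows from this one-way flip: once a config has gone through the patch, nothing restores the flag, so a later scheduling pass rejects it as a non Cluster Check.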

With this fix:

2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:107 in expireNodes) | [DEV] Removing fa9da93b28dcbfae
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | [DEV] d.store.digestToNode map[fa9da93b28dcbfae:ip-10-1XXX2-149.ec2.internal]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:110 in expireNodes) | [DEV] DanglingConfigs map[7a5d781a60553256:{kubernetes_state [[101 109 112 116 ... 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 true 1}]
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_configs.go:118 in retrieveAndClearDangling) | [DEV] State of danglingConfigs after clear map[]
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:61 in Schedule) | [DEV] Add {kubernetes_state [[101 109 112 ... 105 110 103 10]] [123 125] [] [] [] kubernetes-services kube_service://24dc8f6a-1845-11e9-b2b4-12b196e72d60 false 1}
[...]
2019-02-13 22:54:16 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:90 in add) | Dispatching configuration kubernetes_state:fa9da93b28dcbfae to node ip-1XXX1-98.ec2.internal

Note that the dangling config is marked as a true Cluster Check config after having been removed from node ip-10-1XXX2-149.ec2.internal.
Later on, we can see the same configuration kubernetes_state:fa9da93b28dcbfae being scheduled on ip-1XXX1-98.ec2.internal.

Additional Notes

I added a comprehensive test reproducing this scenario; it fails on master.

@CharlyF CharlyF requested a review from a team as a code owner February 14, 2019 00:16
@CharlyF CharlyF added this to the DCA_1.2.0 milestone Feb 14, 2019
@CharlyF CharlyF requested a review from xvello February 14, 2019 00:17
@xvello (Contributor) left a comment
Thanks for the investigation and the fix!
That was introduced late in #2862. Before that, Schedule was a simple loop calling add, with no side effects.

Your solution makes the configuration go through patchConfiguration again, which will duplicate the cluster_name tag and possibly have other side effects.

I think it's safer to patch dispatcher.run to not use Schedule as a shortcut, and directly call add on every item of the slice instead. WDYT?
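The suggested change could be sketched roughly like this. All names here are simplified assumptions: the real dispatcher carries a store, locking, and richer add/Schedule signatures that are omitted.

```go
package main

import "fmt"

// Config is a simplified stand-in for integration.Config.
type Config struct {
	Name         string
	ClusterCheck bool
}

// dispatcher is a minimal stand-in for the cluster-checks dispatcher.
type dispatcher struct {
	dispatched []Config
}

// add dispatches a single configuration, with no patching side effects.
func (d *dispatcher) add(c Config) {
	d.dispatched = append(d.dispatched, c)
}

// reschedule shows the shape of the suggestion: iterate the dangling
// slice and call add directly, instead of routing it back through
// Schedule (which would run patchConfiguration a second time).
func (d *dispatcher) reschedule(configs []Config) {
	for _, c := range configs {
		d.add(c)
	}
}

func main() {
	d := &dispatcher{}
	d.reschedule([]Config{{Name: "kubernetes_state", ClusterCheck: true}})
	fmt.Println(len(d.dispatched)) // prints "1"
}
```

The point of the design is that only the first pass through Schedule mutates the config; every later dispatch of an already-known config goes straight to add.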

@codecov bot commented Feb 14, 2019

Codecov Report

Merging #3035 into master will increase coverage by <.01%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master    #3035      +/-   ##
==========================================
+ Coverage   53.88%   53.89%   +<.01%     
==========================================
  Files         537      537              
  Lines       38014    38193     +179     
==========================================
+ Hits        20485    20584      +99     
- Misses      16316    16385      +69     
- Partials     1213     1224      +11
| Flag | Coverage Δ |
| --- | --- |
| #linux | 56.86% <ø> (-0.04%) ⬇️ |
| #windows | 55.67% <75%> (+0.01%) ⬆️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pkg/clusteragent/clusterchecks/dispatcher_nodes.go | 97.59% <100%> (+0.02%) ⬆️ |
| pkg/clusteragent/clusterchecks/dispatcher_main.go | 80.24% <71.42%> (+0.24%) ⬆️ |
| pkg/logs/config/processing_rules.go | 36.84% <0%> (-26.32%) ⬇️ |
| pkg/logs/config/integration_config.go | 71.42% <0%> (-23.02%) ⬇️ |
| pkg/logs/status/status.go | 78.88% <0%> (-21.12%) ⬇️ |
| pkg/logs/processor/processor.go | 32.72% <0%> (-2.83%) ⬇️ |
| pkg/trace/writer/stats.go | 87.62% <0%> (-2.07%) ⬇️ |
| pkg/forwarder/worker.go | 92.5% <0%> (-1.25%) ⬇️ |
| pkg/logs/scheduler/scheduler.go | 60.13% <0%> (-0.61%) ⬇️ |
| pkg/logs/input/kubernetes/launcher.go | 33.06% <0%> (-0.55%) ⬇️ |
| ... and 5 more | |

@CharlyF CharlyF force-pushed the fix-dangling-config branch from 9426d88 to 9cd2cea Compare February 14, 2019 13:43
@xvello (Contributor) left a comment
🎉

@CharlyF (Contributor, Author) commented Feb 14, 2019

2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:104 in expireNodes) | Expiring out node ip-10-1XXX-65.ec2.internal, last status report 44 seconds ago
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | Adding kubernetes_state:fa9da93b28dcbfae as a dangling Cluster Check config
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_nodes.go:108 in expireNodes) | Adding kubernetes_state:7a5d781a60553256 as a dangling Cluster Check config
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_main.go:82 in reschedule) | Rescheduling the check kubernetes_state:fa9da93b28dcbfae
2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:94 in add) | Dispatching configuration kubernetes_state:fa9da93b28dcbfae to node ip-10-1XXX-152.ec2.internal
2019-02-14 14:00:49 UTC | DEBUG | (pkg/clusteragent/clusterchecks/dispatcher_main.go:82 in reschedule) | Rescheduling the check kubernetes_state:7a5d781a60553256
2019-02-14 14:00:49 UTC | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:94 in add) | Dispatching configuration kubernetes_state:7a5d781a60553256 to node ip-10-12XXX-152.ec2.internal

I think this makes the lifecycle clearer.
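The lifecycle shown in the final logs could be sketched like this. Again a sketch under assumptions: `expireNode` and `retrieveAndClearDangling` here are simplified illustrations keyed on the config digests seen in the logs, not the agent's real functions.

```go
package main

import "fmt"

// Config is a simplified stand-in for integration.Config.
type Config struct {
	Name         string
	ClusterCheck bool
}

type dispatcher struct {
	dangling map[string]Config // keyed by config digest
}

// expireNode stores an expired node's configs as dangling, restoring
// the ClusterCheck flag so the next scheduling pass accepts them again
// instead of discarding them as non Cluster Checks.
func (d *dispatcher) expireNode(configs map[string]Config) {
	for digest, c := range configs {
		c.ClusterCheck = true
		d.dangling[digest] = c
	}
}

// retrieveAndClearDangling returns the stored dangling configs and
// resets the map, mirroring the helper referenced in the logs.
func (d *dispatcher) retrieveAndClearDangling() []Config {
	configs := make([]Config, 0, len(d.dangling))
	for _, c := range d.dangling {
		configs = append(configs, c)
	}
	d.dangling = make(map[string]Config)
	return configs
}

func main() {
	d := &dispatcher{dangling: make(map[string]Config)}
	// A config that was already patched (flag false) comes back from
	// an expired node with the flag restored.
	d.expireNode(map[string]Config{
		"fa9da93b28dcbfae": {Name: "kubernetes_state", ClusterCheck: false},
	})
	for _, c := range d.retrieveAndClearDangling() {
		fmt.Println(c.Name, c.ClusterCheck) // prints "kubernetes_state true"
	}
	fmt.Println(len(d.dangling)) // prints "0"
}
```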

@CharlyF CharlyF merged commit bb4fad8 into master Feb 14, 2019
@CharlyF CharlyF deleted the fix-dangling-config branch February 14, 2019 14:04
CharlyF pushed a commit that referenced this pull request Feb 14, 2019
* ensure dangling cluster checks can be re-scheduled