
[profiles] DAP slow rollout on DS creation #1365

Merged · 6 commits merged into main on Sep 4, 2024

Conversation

khewonc (Contributor) commented Aug 21, 2024

What does this PR do?

When DaemonSets (DSs) are first created, they spin up pods on every node matching their selector/affinity. With the node agent and profiles, this can be problematic in large clusters, where hundreds to thousands of node agent pods would have to be spun up at once. This PR progressively labels the nodes that profiles target, slowing down the initial rollout. Subsequent updates to the DS follow the update strategy defined on the DS.
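The pacing can be sketched roughly as follows (a minimal sketch with hypothetical names; the real logic lives in pkg/agentprofile/agent_profile.go and tracks its state in the profile status): each reconcile labels at most a fixed budget of additional matching nodes, so pods are created in waves rather than all at once.

```go
package main

import "fmt"

// nodesToLabelNow returns how many additional nodes to label in this
// reconcile pass: at most the per-pass budget, and never more than the
// matching nodes that are still unlabeled. Hypothetical helper for
// illustration only.
func nodesToLabelNow(matching, alreadyLabeled, budget int) int {
	remaining := matching - alreadyLabeled
	if remaining < 0 {
		remaining = 0
	}
	if remaining < budget {
		return remaining
	}
	return budget
}

func main() {
	// 10 matching nodes with a budget of 4 per pass label in waves of 4, 4, 2.
	labeled := 0
	for {
		n := nodesToLabelNow(10, labeled, 4)
		if n == 0 {
			break
		}
		labeled += n
		fmt.Printf("labeled %d more nodes, %d total\n", n, labeled)
	}
}
```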

Motivation

https://datadoghq.atlassian.net/browse/CECO-1382

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Write here any instructions and details you may have to test your PR.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@khewonc khewonc added the enhancement New feature or request label Aug 21, 2024
@khewonc khewonc added this to the v1.9.0 milestone Aug 21, 2024
codecov-commenter commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 64.84848% with 58 lines in your changes missing coverage. Please review.

Project coverage is 47.21%. Comparing base (93b9f8f) to head (e820793).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
pkg/agentprofile/agent_profile.go 62.38% 36 Missing and 5 partials ⚠️
...ers/datadogagent/controller_reconcile_v2_common.go 42.85% 16 Missing ⚠️
...ontrollers/datadogagent/controller_reconcile_v2.go 75.00% 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1365      +/-   ##
==========================================
- Coverage   47.29%   47.21%   -0.09%     
==========================================
  Files         217      217              
  Lines       18710    18968     +258     
==========================================
+ Hits         8849     8955     +106     
- Misses       9399     9544     +145     
- Partials      462      469       +7     
Flag Coverage Δ
unittests 47.21% <64.84%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...is/datadoghq/v1alpha1/datadogagentprofile_types.go 100.00% <ø> (ø)
...rollers/datadogagent/controller_reconcile_agent.go 53.44% <100.00%> (ø)
pkg/agentprofile/status.go 100.00% <100.00%> (+24.39%) ⬆️
...ontrollers/datadogagent/controller_reconcile_v2.go 53.91% <75.00%> (ø)
...ers/datadogagent/controller_reconcile_v2_common.go 29.96% <42.85%> (+2.50%) ⬆️
pkg/agentprofile/agent_profile.go 72.22% <62.38%> (-4.62%) ⬇️

... and 7 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

celenechang (Contributor) left a comment:
Overall I think the logic is good; I just left some suggestions about naming to make names more intuitive and/or concise. (There may be a couple of other places that can be updated in the same vein.)

A suggestion for the future, especially if we add it to a CRD: perhaps we can call it CreationStrategy to parallel the existing UpdateStrategy.

for node, hasCorrectProfileLabel := range nodesThatMatchProfile {
    if useSlowStart {
        if hasCorrectProfileLabel {
            profileStatus.SlowStart.NodesLabeled++
Contributor:

LabeledNodesCount

Contributor (Author):

Are you suggesting to change the CRD value as well?

    NodesLabeled int32 `json:"nodesLabeled"`

Should the others change to match that format? PodsReady -> ReadyPodsCount and MaxUnavailable -> MaxUnavailableCount? We don't use Count for the DaemonSetStatus, so I tried to base the names off of that format.

Contributor:

I see, thanks for explaining. Yes, I was thinking we could change the CRD so that it's more intuitive that it's a number and not a list. If you prefer a couple of letters fewer, we could use Num instead: NumReadyPods, NumMaxUnavailable, NumLabeledNodes? Either is OK with me.

Comment on lines 123 to 139:

for node, hasCorrectProfileLabel := range nodesThatMatchProfile {
    if useSlowStart {
        if hasCorrectProfileLabel {
            profileStatus.SlowStart.NodesLabeled++
        } else {
            if numNodesToLabel <= 0 {
                continue
            }
            numNodesToLabel--
            profileStatus.SlowStart.NodesLabeled++
        }
    }

    profileAppliedByNode[node] = types.NamespacedName{
        Namespace: profile.Namespace,
        Name:      profile.Name,
    }
}
Contributor:

Suggested change:

for node, hasCorrectProfileLabel := range nodesThatMatchProfile {
    if useSlowStart {
        if hasCorrectProfileLabel {
            profileStatus.SlowStart.NodesLabeled++
        } else if numNodesToLabel > 0 {
            profileAppliedByNode[node] = types.NamespacedName{
                Namespace: profile.Namespace,
                Name:      profile.Name,
            }
            numNodesToLabel--
            profileStatus.SlowStart.NodesLabeled++
        }
    } else {
        profileAppliedByNode[node] = types.NamespacedName{
            Namespace: profile.Namespace,
            Name:      profile.Name,
        }
    }
}

Would that work to avoid applying labels to nodes that are already correct?

Contributor (Author):

All nodes should be added to profileAppliedByNode if a profile applies to them. Otherwise, their profile label will be removed later in the reconcile.
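That removal step can be sketched as follows (hypothetical helper names; the operator's actual reconcile logic differs):

```go
package main

import "fmt"

// staleProfileLabels returns the nodes whose profile label should be
// removed: nodes that currently carry a label but no longer appear in
// profileAppliedByNode. Hypothetical names for illustration only.
func staleProfileLabels(labeledNodes []string, profileAppliedByNode map[string]bool) []string {
	var stale []string
	for _, node := range labeledNodes {
		if !profileAppliedByNode[node] {
			stale = append(stale, node)
		}
	}
	return stale
}

func main() {
	applied := map[string]bool{"node-a": true, "node-b": true}
	labeled := []string{"node-a", "node-b", "node-c"}
	fmt.Println(staleProfileLabels(labeled, applied)) // node-c no longer matches a profile
}
```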

Contributor:

I find this code a bit challenging to read and well worth refactoring at some point.


- profileAppliedByNode, err = agentprofile.ProfileToApply(logger, &profile, nodeList, profileAppliedByNode, now)
+ maxUnavailable := agentprofile.GetMaxUnavailable(logger, dda, &profile, len(nodeList))
+ profileAppliedByNode, err = agentprofile.ApplyProfile(logger, &profile, nodeList, profileAppliedByNode, now, maxUnavailable)

🟠 Code Vulnerability (static analysis)

Potential memory range aliasing. Avoid using the memory reference.

Implicit memory aliasing in for loops refers to a scenario in Go where two or more pointers reference the same memory location, creating unexpected side effects. This is a common mistake among Go programmers due to the 'range' clause.

Consider this example, where a slice of pointers is created in a loop:

data := []int{1, 2, 3}
pointers := make([]*int, 3)
for i, v := range data {
    pointers[i] = &v
}

You might expect the 'pointers' slice to hold addresses of elements in 'data' slice, but that's not the case. In each iteration of the loop, variable 'v' gets a new value but its memory address doesn't change because it's a loop variable. As a result, each element in 'pointers' slice points to the same memory location - the address of the loop variable 'v'. The final value of 'v' is '3', and since all pointers point to 'v', dereferencing the pointers would yield '3' regardless of the pointer's index in the slice.

To avoid implicit memory aliasing in for loops in Go, you should address the actual elements in the original data structure, like so:

data := []int{1, 2, 3}
pointers := make([]*int, 3)
for i := range data {
    pointers[i] = &data[i]
}

In this example, each pointer in the 'pointers' slice correctly points to the respective element in the 'data' slice.


Contributor (Author):

According to the linked Stack Overflow issue, this should be fine with Go 1.22.

@@ -349,8 +349,8 @@ func (r *Reconciler) profilesToApply(ctx context.Context, logger logr.Logger, no

     sortedProfiles := agentprofile.SortProfiles(profilesList.Items)
     for _, profile := range sortedProfiles {
-        profileAppliedByNode, err = agentprofile.ProfileToApply(logger, &profile, nodeList, profileAppliedByNode, now)
+        maxUnavailable := agentprofile.GetMaxUnavailable(logger, dda, &profile, len(nodeList))

🟠 Code Vulnerability (static analysis): Potential memory range aliasing (same finding and explanation as above).

Contributor (Author):

Same response as above: this should be fine with Go 1.22.

@khewonc khewonc marked this pull request as ready for review August 29, 2024 21:02
@khewonc khewonc requested a review from a team as a code owner August 29, 2024 21:02
levan-m (Contributor) left a comment:

Certain things can be refactored and test coverage improved, but functionally this looks good to me. Given that this functionality is behind the SlowStart check, we can merge and take care of those items in the next release.

@khewonc khewonc merged commit ac2e823 into main Sep 4, 2024
19 checks passed
@khewonc khewonc deleted the khewonc/dap-slow-rollout branch September 4, 2024 14:07
mftoure pushed a commit that referenced this pull request Oct 3, 2024
* First pass slow rollout dap

* Fix autogenerated files

* Remove timeout max_unavailable env vars

* review suggestions

* unblock rollout after completion

* no slow start after completion
Labels
enhancement New feature or request

4 participants