Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APMON-406] Re-define Single Step Instrumentation logic for existing configuration options #21673

Merged
merged 12 commits into from
Jan 5, 2024

Conversation

liliyadd
Copy link
Contributor

@liliyadd liliyadd commented Dec 19, 2023

What does this PR do?

  1. Changes SSI logic for introduced earlier configuration options https://docs.google.com/document/d/1QJoTwJLvpAsBsLHuDedbXi1FlmW_askMsp57Z03qNLU/edit?usp=sharing
  2. Disable SSI in the namespace running Datadog resources.
  3. Do not allow explicitly enabling SSI in kube-system or Datadog namespaces.

Motivation

Existing SSI configuration options might be a bit confusing for the customers. While we are still in Beta for SSI, we would like to change the feature behavior for the configuration we defined earlier.
This PR also move the logic of disabling SSI in Datadog namespace from Helm/Operator to Cluster Agent where it belongs.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

  1. Build Cluster Agent Dev image.
  2. Us the image from the previous step to deploy Datadog agent in minikube, using various combination of SSI configurations.
  3. Deploy a test app (without injection annotations and labels) in the cluster and check if depending on SSI configuration, the test app is instrumented:
    3.1. datadog.apm.instrumentation.enabled=false - no libraries should be injected.
    3.2. datadog.apm.instrumentation.enabled=false && datadog.apm.instrumentation.enabledNamespaces=[apps], where apps is the ns of the test app - no libraries should be injected.
    3.3. datadog.apm.instrumentation.enabled=false && datadog.apm.instrumentation.disabledNamespaces=[apps2], where apps is the ns of the test app - no libraries should be injected.
    3.4. datadog.apm.instrumentation.enabled=false && datadog.apm.instrumentation.enabledNamespaces=[apps] && datadog.apm.instrumentation.disabledNamespaces=[apps2] - helm install should fail
    3.5. datadog.apm.instrumentation.enabled=true && datadog.apm.instrumentation.enabledNamespaces=[apps], where apps is the ns of the test app - APM libraries should be injected into the test app.
    3.6. datadog.apm.instrumentation.enabled=true && datadog.apm.instrumentation.enabledNamespaces=[apps2], where apps is the ns of the test app - no libraries should be injected in the test app in the namespace apps
    3.7. datadog.apm.instrumentation.enabled=true && datadog.apm.instrumentation.disabledNamespaces=[apps], where apps is the ns of the test app - no libraries should be injected in the test app in the namespace apps
    3.8. datadog.apm.instrumentation.enabled=true && datadog.apm.instrumentation.disabledNamespaces=[apps2], where apps is the ns of the test app - APM libraries should be injected into the test app in the namespace apps
    3.9. datadog.apm.instrumentation.enabled=true && datadog.apm.instrumentation.enabledNamespaces=[apps] &&datadog.apm.instrumentation.disabledNamespaces=[apps2], where apps is the ns of the test app -helm install should fail

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Use the major_change label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be use during QA to make sure each team pay extra attention to the changed behavior. For any customer facing change use a releasenote.
  • A release note has been added or the changelog/no-changelog label has been applied.
  • Changed code has automated tests for its functionality.
  • Adequate QA/testing plan information is provided. Except if the qa/skip-qa label, with required either qa/done or qa/no-code-change labels, are applied.
  • At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • If applicable, the need-change/operator and need-change/helm labels have been applied.
  • If applicable, the k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature.
  • If applicable, the config template has been updated.

@liliyadd liliyadd added changelog/no-changelog need-change/operator Add this label if you change request also a change in the datadog-operator need-change/helm-chart Add this label if your change require also a change in the datadog helm chart team/apm-onboarding labels Dec 19, 2023
@liliyadd liliyadd added this to the 7.51.0 milestone Dec 19, 2023
@liliyadd liliyadd requested a review from a team December 19, 2023 20:05
@liliyadd liliyadd requested a review from a team as a code owner December 19, 2023 20:05
@liliyadd liliyadd changed the title [APM Instrumentation] Re-define Single Step Instrumentation logic for existing configuration options [APMON-406] Re-define Single Step Instrumentation logic for existing configuration options Dec 19, 2023
@pr-commenter
Copy link

pr-commenter bot commented Dec 19, 2023

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 88c376bb-ae37-4d7e-9b4f-3c2c45f13805
Baseline: 6e461e3
Comparison: c0a8062
Total CPUs: 7

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI
file_tree memory utilization +0.43 [+0.32, +0.53]
idle memory utilization -0.09 [-0.12, -0.06]
file_to_blackhole % cpu utilization -0.72 [-7.26, +5.82]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
file_tree memory utilization +0.43 [+0.32, +0.53]
trace_agent_msgpack ingress throughput +0.02 [+0.01, +0.04]
trace_agent_json ingress throughput +0.00 [-0.03, +0.03]
uds_dogstatsd_to_api ingress throughput +0.00 [-0.03, +0.03]
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.05, +0.05]
process_agent_standard_check memory utilization -0.00 [-0.05, +0.05]
process_agent_real_time_mode memory utilization -0.02 [-0.06, +0.02]
idle memory utilization -0.09 [-0.12, -0.06]
tcp_syslog_to_blackhole ingress throughput -0.35 [-0.42, -0.29]
process_agent_standard_check_with_stats memory utilization -0.40 [-0.45, -0.35]
file_to_blackhole % cpu utilization -0.72 [-7.26, +5.82]
otel_to_otel_logs ingress throughput -1.58 [-2.32, -0.83]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

return true
}
// Always disable Single Step Instrumentation in kube-system and Datadog namespaces
if (namespace == namespaceWithAlwaysDisabledInjections) || (namespace == apiServerCommon.GetResourcesNamespace()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I think that defining a function | filter module could be better for code maintainability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using filter won't work well in this case. Each filter is not a simple yes or no, but has three states - yes, no or check the next filter. Making this logic to work with filters is possible, but I don't think it will be clean or easy to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I follow the point of @AliDatadog, having a function func filterNamespace(ns string) bool, can extract from this function the logic from line 243 and 260.

we can even create a struct to encapsulate the exiting generic container filter: https://github.com/DataDog/datadog-agent/blob/c3a2b5547319212e098e53882d2b0e35e088123a/pkg/util/containers/filter.go#L220C19-L220C29

By doing this, it will allow:

  • reuse exiting filtering capability, and align the APM admission controller namespace filtering logic with other products.
  • regex can be supported easily in the future
type namespaceFilter struct {
  filter *containers.Filter
}

func newNamespaceFilter(cfg config.Config) (namespaceFilter, error) {
  var includeList, excludeList []string
  // TODO:
  //   - build the exclude and include lists, from the config
  //   - add the prefix `kube_namespace:` to the namespace name in the lists to be compatible
  //     with the containers.Filter logic

  filter, err := containers.NewFilter(containers.GlobalFilter, includeList, excludeList)
  if err != nil {
    return nil, err
  }
  return namespaceFilter{ filter: filter}, nil
}

func (f* namespaceFilter) filter(namespace string) bool {
  return !f.filter.IsExcluded(nil,"","",namespace)
}

So it is maybe not necessary for this PR. but it can be a good refactoring

Copy link
Contributor Author

@liliyadd liliyadd Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support three checks for the namespace value:

  1. if len(includeList) > 0 && len(excludeList) > 0 - this cannot be checked by filter. From the filter point of view, it's valid to have both includeList and excludeList set at the same time. So we cannot replace this check by the filter.
  2. if (namespace == "kube_system") || (namespace == "datadog") - we could add kube_system and datadog namespaces to the excludeList; however since includeList takes priority, users can add kube_system to the includeList and do APM Instrumentation in Kubernetes namespace. That goes against expected feature behavior.
if len(includeList) > 0 {
    if slices.Contains[[]string, string](includeList, namespace) {
	return true
    }
    return false
}
if slices.Contains[[]string, string](excludeList, namespace) {
    return false
}

I think this logic can be almost replaced by !f.filter.IsExcluded(nil,"","",namespace), except for returning false if includeList is non-empty, but doesn't contain current namespace. Maybe I'm missing how it can be done?

I could see the filtering logic getting complicated enough to warrant using filters. But, I cleaned up the logic a bit and now it's super simpler and readable. I think it will be clearer to our future selves especially those without knowledge of the filter API, to use the current solution. Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 👍 for moving this check to run once when admission controller starts
  2. I do think that instrumenting kube-system could be too dangerous even if users explicitly want it. But I'm OK to do it if it's OK from product POV
  3. If includeList is non-empty, but doesn't contain current namespace, filter.IsExcluded will return false - . But we want the opposite - if includeList is non-empty and the current namespace is not in this list, the namespace should be excluded, i.e. IsExcluded should return true.

@AliDatadog
Copy link
Contributor

Could you enhance the PR description with details on how to test it ?

@liliyadd liliyadd requested a review from a team January 2, 2024 18:27
// filterNamespace returns a bool indicating if Single Step Instrumentation on the namespace
func filterNamespace(ns string) bool {
apmEnabledNamespaces := config.Datadog.GetStringSlice("apm_config.instrumentation.enabled_namespaces")
apmDisabledNamespaces := config.Datadog.GetStringSlice("apm_config.instrumentation.disabled_namespaces")

// apm.instrumentation.enabled_namespaces and apm.instrumentation.disabled_namespaces configuration cannot be set at the same time
if len(apmEnabledNamespaces) > 0 && len(apmDisabledNamespaces) > 0 {
Copy link
Contributor

@clamoriniere clamoriniere Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better that we define a priority between enabled and disabled lists (similar to include/exclude container.Filter capability) if the user never sets both.

Because with the current implementation the admission controller logs an error and we consider that the NS is discarded. But the user might not see the log error, and it will consider it doesn't work.

IMO we can keep the log. but also said that enabled_namespaces got the priority on disabled_namespaces if a namespace is present in both, we consider that it is enabled.
It is the logic that we apply with the container.Filter too.

Also I think the helm-chart already trigger a configuration error when the user run helm install|upgrade if both parameters are set.

@liliyadd liliyadd merged commit 30c3524 into main Jan 5, 2024
97 of 111 checks passed
@liliyadd liliyadd deleted the liliya.belaus/ssi-exclude-dda-namespace branch January 5, 2024 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog need-change/helm-chart Add this label if your change require also a change in the datadog helm chart need-change/operator Add this label if you change request also a change in the datadog-operator team/apm-onboarding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants