
Cut tagger stream event responses into chunks of 4MB max size each #30192

Conversation

@adel121 adel121 commented Oct 16, 2024

What does this PR do?

This PR modifies the remote tagger server so that it streams the tagger events in chunks with each chunk having a size of 4 MB at most.
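
At a high level, the server accumulates tagger events, splits them into chunks whose serialized size stays under the limit, and sends one stream response per chunk. A minimal sketch of the idea (the handler structure and the sender/sendChunked names below are illustrative, not the exact code in this PR; pb refers to the agent's tagger proto package):

const maxMessageSize = 4 * 1024 * 1024 // 4 MB, the gRPC default

// sender is the subset of the gRPC server-stream interface used here.
type sender interface {
	Send(*pb.StreamTagsResponse) error
}

// sendChunked splits events so that each response stays under maxMessageSize
// and sends one StreamTagsResponse per chunk.
func sendChunked(out sender, events []*pb.StreamTagsEvent) error {
	for _, chunk := range splitEvents(events, maxMessageSize) {
		if err := out.Send(&pb.StreamTagsResponse{Events: chunk}); err != nil {
			return err
		}
	}
	return nil
}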

Motivation

Avoid failure of communication between client and server on large clusters on which the size of the message might exceed 4MB.

Describe how to test/QA your changes

Ensure that the remote tagger still works as expected.

  1. Deploy the agent and cluster agent with the orchestrator cluster check running in a CLC runner and using the cluster tagger. Also configure namespace labels as tags and enable Kubernetes tags collection so that the cluster tagger has tags for pods and namespaces:
datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  clusterChecks:
    enabled: true
  kubelet:
    tlsVerify: false

  clusterTagger:
    collectKubernetesTags: true

  namespaceLabelsAsTags:
    kubernetes.io/metadata.name: name

clusterChecksRunner:
  enabled: true
  replicas: 1
  env:
    - name: DD_CLC_RUNNER_REMOTE_TAGGER_ENABLED
      value: "true"

clusterAgent:
  enabled: true
  replicas: 1

  advancedConfd:
    orchestrator.d:
      1.yaml: |-
        cluster_check: true
        init_config:
        instances:
          - collectors:
            - pods
            skip_leader_election: true

  2. Check that the cluster tagger contains tags for pods and namespaces:
kubectl exec <cluster-agent-pod-name> -- agent tagger-list

=== Entity kubernetes_metadata:///namespaces//default ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:default]
===

=== Entity kubernetes_metadata:///namespaces//kube-public ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-public]
===

=== Entity kubernetes_pod_uid://481c1f63-89ef-4f7e-8b92-25b9e4df2add ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:clusterchecks-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-clusterchecks kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-clusterchecks-574dfbd86f kube_qos:BestEffort kube_replica_set:datadog-agent-clusterchecks-574dfbd86f pod_name:datadog-agent-clusterchecks-574dfbd86f-ptp8t pod_phase:running]
===

=== Entity kubernetes_metadata:///namespaces//kube-system ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-system]
===

=== Entity kubernetes_metadata:///namespaces//local-path-storage ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:local-path-storage]
===

=== Entity kubernetes_pod_uid://73d766d6-869e-4e89-8a15-63c11eb6ffb3 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-apiserver-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_pod_uid://75491183-12b9-483f-a8bd-d573f983a53b ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:default kube_qos:BestEffort pod_name:unschedulable-pod pod_phase:pending]
===

=== Entity kubernetes_pod_uid://ebb6d4f5-9850-4b4a-9a4c-5c7418b2f912 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-scheduler-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_metadata:///namespaces//kube-node-lease ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-node-lease]
===

=== Entity kubernetes_pod_uid://3956c4f6-32df-4f6a-b5be-72149745bb00 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-controller-manager-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_pod_uid://541a6a3e-f829-434c-9bb7-dcf3b8840bb2 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_daemon_set:kube-proxy kube_namespace:kube-system kube_ownerref_kind:daemonset kube_ownerref_name:kube-proxy kube_priority_class:system-node-critical kube_qos:BestEffort pod_name:kube-proxy-d2bj9 pod_phase:running]
===

=== Entity kubernetes_pod_uid://620e8961-83ea-4f3d-9256-0bb67b2b959d ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:local-path-provisioner kube_namespace:local-path-storage kube_ownerref_kind:replicaset kube_ownerref_name:local-path-provisioner-6bc4bddd6b kube_qos:BestEffort kube_replica_set:local-path-provisioner-6bc4bddd6b pod_name:local-path-provisioner-6bc4bddd6b-qjrtf pod_phase:running]
===

=== Entity kubernetes_pod_uid://8ef52246-fd65-48ec-b04d-afc5ab949c25 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_daemon_set:kindnet kube_namespace:kube-system kube_ownerref_kind:daemonset kube_ownerref_name:kindnet kube_qos:Guaranteed pod_name:kindnet-4lzgr pod_phase:running]
===

=== Entity kubernetes_pod_uid://fce5e4a8-db5e-4b4b-8cab-be6029e1b762 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:etcd-kind-control-plane pod_phase:running]
===

=== Entity internal://global-entity-id ===
== Source workloadmeta-static =
=Tags: [kube_cluster_name:adel-orchestrator-clc]
===

=== Entity kubernetes_pod_uid://184c26b0-f03b-4887-896c-092250f61138 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:cluster-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-cluster-agent kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-cluster-agent-78cdcb7c8 kube_qos:BestEffort kube_replica_set:datadog-agent-cluster-agent-78cdcb7c8 pod_name:datadog-agent-cluster-agent-78cdcb7c8-sf9l4 pod_phase:running]
===

=== Entity kubernetes_pod_uid://23b009dc-8790-41df-a6ad-d315ec695b7b ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:coredns kube_namespace:kube-system kube_ownerref_kind:replicaset kube_ownerref_name:coredns-5d78c9869d kube_priority_class:system-cluster-critical kube_qos:Burstable kube_replica_set:coredns-5d78c9869d pod_name:coredns-5d78c9869d-7tsr5 pod_phase:running]
===

=== Entity kubernetes_pod_uid://b77e5126-07bd-4c37-b613-616fb8ba38c1 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:clusterchecks-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-clusterchecks kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-clusterchecks-574dfbd86f kube_qos:BestEffort kube_replica_set:datadog-agent-clusterchecks-574dfbd86f pod_name:datadog-agent-clusterchecks-574dfbd86f-fwkw5 pod_phase:running]
===

=== Entity kubernetes_pod_uid://c4c80235-9a96-4427-9b60-e146a9477eda ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:coredns kube_namespace:kube-system kube_ownerref_kind:replicaset kube_ownerref_name:coredns-5d78c9869d kube_priority_class:system-cluster-critical kube_qos:Burstable kube_replica_set:coredns-5d78c9869d pod_name:coredns-5d78c9869d-frvkc pod_phase:running]
===

=== Entity kubernetes_pod_uid://c4ec4af1-9b22-4273-98dd-6780ceb42563 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_daemon_set:datadog-agent kube_namespace:default kube_ownerref_kind:daemonset kube_ownerref_name:datadog-agent kube_qos:BestEffort pod_name:datadog-agent-d5j2t pod_phase:running]
===

  3. Verify that only namespace tags are synced into the CLC runner remote tagger:
kubectl exec <clc_runner_pod_name> -- agent tagger-list


=== Entity kubernetes_metadata:///namespaces//default ===
== Source remote =
=Tags: [name:default]
===

=== Entity kubernetes_metadata:///namespaces//kube-node-lease ===
== Source remote =
=Tags: [name:kube-node-lease]
===

=== Entity kubernetes_metadata:///namespaces//kube-public ===
== Source remote =
=Tags: [name:kube-public]
===

=== Entity kubernetes_metadata:///namespaces//kube-system ===
== Source remote =
=Tags: [name:kube-system]
===

=== Entity kubernetes_metadata:///namespaces//local-path-storage ===
== Source remote =
=Tags: [name:local-path-storage]
===

Possible Drawbacks / Trade-offs

Additional Notes

@adel121 adel121 added this to the 7.60.0 milestone Oct 16, 2024
@adel121 adel121 force-pushed the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch from 7fa0a76 to 83a1cb7 on October 16, 2024 17:51
@adel121 adel121 force-pushed the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch from 83a1cb7 to 61bdcf0 on October 16, 2024 17:52
@adel121 adel121 marked this pull request as ready for review October 16, 2024 17:52
@adel121 adel121 requested review from a team as code owners October 16, 2024 17:52

agent-platform-auto-pr bot commented Oct 16, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=47031106 --os-family=ubuntu

Note: This applies to commit a5c6aca

// The size of each item is calculated using computeSize
//
// This function assumes that the size of each single item of the initial slice is not larger than maxChunkSize
func splitBySize[T any](slice []T, maxChunkSize int, computeSize func(T) int) [][]T {
Contributor Author

Declared this generic function to make it easy to unit-test the split functionality.
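
For reference, a minimal sketch of what a size-based split can look like (illustrative only; the actual implementation is in this PR's diff, where computeSize is presumably the serialized proto size of each event):

func splitBySize[T any](slice []T, maxChunkSize int, computeSize func(T) int) [][]T {
	var chunks [][]T
	var current []T
	currentSize := 0

	for _, item := range slice {
		itemSize := computeSize(item)
		// Start a new chunk when adding this item would exceed the limit.
		// Assumes no single item is larger than maxChunkSize.
		if len(current) > 0 && currentSize+itemSize > maxChunkSize {
			chunks = append(chunks, current)
			current = nil
			currentSize = 0
		}
		current = append(current, item)
		currentSize += itemSize
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

Because the function is generic, a unit test can exercise it with plain ints and a trivial computeSize, independently of the proto types.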

Comment on lines +56 to +57
grpc.MaxRecvMsgSize(maxMessageSize),
grpc.MaxSendMsgSize(maxMessageSize),
Contributor Author

The default maximum message size in gRPC is 4 MB.

I added these options explicitly to make sure the tagger server uses the same maximum message size when splitting the response into chunks.
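
For context, a hedged sketch of wiring these options into a gRPC server (the agent's actual server construction also adds interceptors, TLS, and so on; the newServer name is assumed):

import "google.golang.org/grpc"

const maxMessageSize = 4 * 1024 * 1024 // 4 MB, matching the gRPC default

func newServer() *grpc.Server {
	// Setting both limits explicitly keeps the server's send/receive caps
	// aligned with the chunk size used when splitting tagger stream responses.
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMessageSize),
		grpc.MaxSendMsgSize(maxMessageSize),
	)
}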

Comment on lines +131 to +132
grpc.MaxSendMsgSize(maxMessageSize),
grpc.MaxRecvMsgSize(maxMessageSize),
Contributor Author

The default maximum message size in gRPC is 4 MB.

I added these options explicitly to make sure the tagger server uses the same maximum message size when splitting the response into chunks.

Contributor

@ogaca-dd ogaca-dd left a comment

LGTM for files owned by ASC

Contributor

@clamoriniere clamoriniere left a comment

Hi @adel121

Change looks good. I added 2 small comments.

@@ -48,18 +48,21 @@ func (server *apiServer) startCMDServer(

// gRPC server
authInterceptor := grpcutil.AuthInterceptor(parseToken)
const maxMessageSize = 4 * 1024 * 1024 // 4 MB
Contributor

nit: move the const definition to the beginning of the file (after line 33)

}

// splitEvents splits the array of events to chunks with at most maxChunkSize each
func splitEvents(events []*pb.StreamTagsEvent, maxChunkSize int) [][]*pb.StreamTagsEvent {
Contributor

This function works, and because the result is a slice of slices of *pb.StreamTagsEvent, the memory used to build the new slices should not be too large.

However, for memory optimisation, I'm wondering whether you looked into returning an iterator instead of creating new slices, like the slices.Chunk function does: https://pkg.go.dev/slices#Chunk
It would help to build each new slice sequentially, as the proto messages are created.
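
For illustration, a hypothetical Go 1.23+ variant (splitBySizeSeq is an assumed name, not part of this PR) that yields chunks lazily instead of materialising the full [][]T, in the spirit of slices.Chunk:

import "iter"

func splitBySizeSeq[T any](slice []T, maxChunkSize int, computeSize func(T) int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		var current []T
		currentSize := 0
		for _, item := range slice {
			itemSize := computeSize(item)
			// Yield the current chunk before it would exceed the limit.
			if len(current) > 0 && currentSize+itemSize > maxChunkSize {
				if !yield(current) {
					return
				}
				current, currentSize = nil, 0
			}
			current = append(current, item)
			currentSize += itemSize
		}
		if len(current) > 0 {
			yield(current)
		}
	}
}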

Contributor Author

Iterators were introduced in Go 1.23.0, and the slices.Chunk function you mention was itself added in 1.23.0.

Currently, the agent is on Go 1.22.0.

IMO it isn't worth upgrading the Go version now just to make this change.

We can create a card to make the change once the agent's Go version is upgraded, so that we don't forget this optimisation.

WDYT?

Contributor

Yes, perfect.
Thanks for looking at it 🙇

Contributor

If you can, please add a TODO comment in the code.

@adel121 adel121 requested a review from clamoriniere October 18, 2024 10:08
Contributor

@clamoriniere clamoriniere left a comment

Thanks for this fix and optimisation 👍


Regression Detector


adel121 commented Oct 21, 2024

/merge


dd-devflow bot commented Oct 21, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 87369f7 into main Oct 21, 2024
221 of 224 checks passed
@dd-mergequeue dd-mergequeue bot deleted the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch October 21, 2024 14:02