
Cut tagger stream event responses into chunks of 4MB max size each #30192

Conversation

@adel121 adel121 commented Oct 16, 2024

What does this PR do?

This PR modifies the remote tagger server so that it streams the tagger events in chunks with each chunk having a size of 4 MB at most.
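
At a high level, the server accumulates tagger events, splits them into chunks whose serialized size stays under the limit, and sends one stream response per chunk. A minimal sketch of the idea (the handler structure and the sender/sendChunked names below are illustrative, not the exact code in this PR; pb refers to the agent's tagger proto package):

const maxMessageSize = 4 * 1024 * 1024 // 4 MB, the gRPC default

// sender is the subset of the gRPC server-stream interface used here.
type sender interface {
	Send(*pb.StreamTagsResponse) error
}

// sendChunked splits events so that each response stays under maxMessageSize
// and sends one StreamTagsResponse per chunk.
func sendChunked(out sender, events []*pb.StreamTagsEvent) error {
	for _, chunk := range splitEvents(events, maxMessageSize) {
		if err := out.Send(&pb.StreamTagsResponse{Events: chunk}); err != nil {
			return err
		}
	}
	return nil
}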

Motivation

Avoid failure of communication between client and server on large clusters on which the size of the message might exceed 4MB.

Describe how to test/QA your changes

Ensure that the remote tagger still works as expected.

  1. Deploy the agent and cluster agent with the orchestrator cluster check running in a CLC runner and using the cluster tagger. Also configure namespace labels as tags and enable Kubernetes tags collection so that the cluster tagger has tags for pods and namespaces:
datadog:
  apiKeyExistingSecret: datadog-secret
  appKeyExistingSecret: datadog-secret
  clusterChecks:
    enabled: true
  kubelet:
    tlsVerify: false

  clusterTagger:
    collectKubernetesTags: true

  namespaceLabelsAsTags:
    kubernetes.io/metadata.name: name

clusterChecksRunner:
  enabled: true
  replicas: 1
  env:
    - name: DD_CLC_RUNNER_REMOTE_TAGGER_ENABLED
      value: "true"

clusterAgent:
  enabled: true
  replicas: 1

  advancedConfd:
    orchestrator.d:
      1.yaml: |-
        cluster_check: true
        init_config:
        instances:
          - collectors:
            - pods
            skip_leader_election: true

  2. Check that the cluster tagger contains tags for pods and namespaces:
kubectl exec <cluster-agent-pod-name> -- agent tagger-list

=== Entity kubernetes_metadata:///namespaces//default ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:default]
===

=== Entity kubernetes_metadata:///namespaces//kube-public ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-public]
===

=== Entity kubernetes_pod_uid://481c1f63-89ef-4f7e-8b92-25b9e4df2add ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:clusterchecks-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-clusterchecks kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-clusterchecks-574dfbd86f kube_qos:BestEffort kube_replica_set:datadog-agent-clusterchecks-574dfbd86f pod_name:datadog-agent-clusterchecks-574dfbd86f-ptp8t pod_phase:running]
===

=== Entity kubernetes_metadata:///namespaces//kube-system ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-system]
===

=== Entity kubernetes_metadata:///namespaces//local-path-storage ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:local-path-storage]
===

=== Entity kubernetes_pod_uid://73d766d6-869e-4e89-8a15-63c11eb6ffb3 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-apiserver-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_pod_uid://75491183-12b9-483f-a8bd-d573f983a53b ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:default kube_qos:BestEffort pod_name:unschedulable-pod pod_phase:pending]
===

=== Entity kubernetes_pod_uid://ebb6d4f5-9850-4b4a-9a4c-5c7418b2f912 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-scheduler-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_metadata:///namespaces//kube-node-lease ===
== Source workloadmeta-kubernetes_metadata =
=Tags: [name:kube-node-lease]
===

=== Entity kubernetes_pod_uid://3956c4f6-32df-4f6a-b5be-72149745bb00 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:kube-controller-manager-kind-control-plane pod_phase:running]
===

=== Entity kubernetes_pod_uid://541a6a3e-f829-434c-9bb7-dcf3b8840bb2 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_daemon_set:kube-proxy kube_namespace:kube-system kube_ownerref_kind:daemonset kube_ownerref_name:kube-proxy kube_priority_class:system-node-critical kube_qos:BestEffort pod_name:kube-proxy-d2bj9 pod_phase:running]
===

=== Entity kubernetes_pod_uid://620e8961-83ea-4f3d-9256-0bb67b2b959d ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:local-path-provisioner kube_namespace:local-path-storage kube_ownerref_kind:replicaset kube_ownerref_name:local-path-provisioner-6bc4bddd6b kube_qos:BestEffort kube_replica_set:local-path-provisioner-6bc4bddd6b pod_name:local-path-provisioner-6bc4bddd6b-qjrtf pod_phase:running]
===

=== Entity kubernetes_pod_uid://8ef52246-fd65-48ec-b04d-afc5ab949c25 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_daemon_set:kindnet kube_namespace:kube-system kube_ownerref_kind:daemonset kube_ownerref_name:kindnet kube_qos:Guaranteed pod_name:kindnet-4lzgr pod_phase:running]
===

=== Entity kubernetes_pod_uid://fce5e4a8-db5e-4b4b-8cab-be6029e1b762 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_namespace:kube-system kube_ownerref_kind:node kube_ownerref_name:kind-control-plane kube_priority_class:system-node-critical kube_qos:Burstable pod_name:etcd-kind-control-plane pod_phase:running]
===

=== Entity internal://global-entity-id ===
== Source workloadmeta-static =
=Tags: [kube_cluster_name:adel-orchestrator-clc]
===

=== Entity kubernetes_pod_uid://184c26b0-f03b-4887-896c-092250f61138 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:cluster-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-cluster-agent kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-cluster-agent-78cdcb7c8 kube_qos:BestEffort kube_replica_set:datadog-agent-cluster-agent-78cdcb7c8 pod_name:datadog-agent-cluster-agent-78cdcb7c8-sf9l4 pod_phase:running]
===

=== Entity kubernetes_pod_uid://23b009dc-8790-41df-a6ad-d315ec695b7b ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:coredns kube_namespace:kube-system kube_ownerref_kind:replicaset kube_ownerref_name:coredns-5d78c9869d kube_priority_class:system-cluster-critical kube_qos:Burstable kube_replica_set:coredns-5d78c9869d pod_name:coredns-5d78c9869d-7tsr5 pod_phase:running]
===

=== Entity kubernetes_pod_uid://b77e5126-07bd-4c37-b613-616fb8ba38c1 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:clusterchecks-agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_deployment:datadog-agent-clusterchecks kube_namespace:default kube_ownerref_kind:replicaset kube_ownerref_name:datadog-agent-clusterchecks-574dfbd86f kube_qos:BestEffort kube_replica_set:datadog-agent-clusterchecks-574dfbd86f pod_name:datadog-agent-clusterchecks-574dfbd86f-fwkw5 pod_phase:running]
===

=== Entity kubernetes_pod_uid://c4c80235-9a96-4427-9b60-e146a9477eda ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_cluster_name:adel-orchestrator-clc kube_deployment:coredns kube_namespace:kube-system kube_ownerref_kind:replicaset kube_ownerref_name:coredns-5d78c9869d kube_priority_class:system-cluster-critical kube_qos:Burstable kube_replica_set:coredns-5d78c9869d pod_name:coredns-5d78c9869d-frvkc pod_phase:running]
===

=== Entity kubernetes_pod_uid://c4ec4af1-9b22-4273-98dd-6780ceb42563 ===
== Source workloadmeta-kubernetes_pod =
=Tags: [kube_app_component:agent kube_app_instance:datadog-agent kube_app_managed_by:Helm kube_app_name:datadog-agent kube_cluster_name:adel-orchestrator-clc kube_daemon_set:datadog-agent kube_namespace:default kube_ownerref_kind:daemonset kube_ownerref_name:datadog-agent kube_qos:BestEffort pod_name:datadog-agent-d5j2t pod_phase:running]
===

  3. Verify that only namespace tags are synced into the CLC runner remote tagger:
kubectl exec <clc_runner_pod_name> -- agent tagger-list


=== Entity kubernetes_metadata:///namespaces//default ===
== Source remote =
=Tags: [name:default]
===

=== Entity kubernetes_metadata:///namespaces//kube-node-lease ===
== Source remote =
=Tags: [name:kube-node-lease]
===

=== Entity kubernetes_metadata:///namespaces//kube-public ===
== Source remote =
=Tags: [name:kube-public]
===

=== Entity kubernetes_metadata:///namespaces//kube-system ===
== Source remote =
=Tags: [name:kube-system]
===

=== Entity kubernetes_metadata:///namespaces//local-path-storage ===
== Source remote =
=Tags: [name:local-path-storage]
===

Possible Drawbacks / Trade-offs

Additional Notes

@adel121 adel121 added this to the 7.60.0 milestone Oct 16, 2024
@adel121 adel121 force-pushed the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch from 7fa0a76 to 83a1cb7 on October 16, 2024 17:51
@adel121 adel121 force-pushed the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch from 83a1cb7 to 61bdcf0 on October 16, 2024 17:52
@adel121 adel121 marked this pull request as ready for review October 16, 2024 17:52
@adel121 adel121 requested review from a team as code owners October 16, 2024 17:52

agent-platform-auto-pr bot commented Oct 16, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=47031106 --os-family=ubuntu

Note: This applies to commit a5c6aca

// The size of each item is calculated using computeSize
//
// This function assumes that the size of each single item of the initial slice is not larger than maxChunkSize
func splitBySize[T any](slice []T, maxChunkSize int, computeSize func(T) int) [][]T {
Contributor Author

Declared this generic function to make it easy to unit-test the split functionality.
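
For reference, a minimal sketch of what a size-based split can look like (illustrative only; the actual implementation is in this PR's diff, where computeSize is presumably the serialized proto size of each event):

func splitBySize[T any](slice []T, maxChunkSize int, computeSize func(T) int) [][]T {
	var chunks [][]T
	var current []T
	currentSize := 0

	for _, item := range slice {
		itemSize := computeSize(item)
		// Start a new chunk when adding this item would exceed the limit.
		// Assumes no single item is larger than maxChunkSize.
		if len(current) > 0 && currentSize+itemSize > maxChunkSize {
			chunks = append(chunks, current)
			current = nil
			currentSize = 0
		}
		current = append(current, item)
		currentSize += itemSize
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

Because the function is generic, a unit test can exercise it with plain ints and a trivial computeSize, independently of the proto types.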

Comment on lines +56 to +57
grpc.MaxRecvMsgSize(maxMessageSize),
grpc.MaxSendMsgSize(maxMessageSize),
Contributor Author

The default maximum message size in gRPC is 4 MB.

I added these options explicitly to make sure the tagger server uses the same maximum message size when splitting the response into chunks.
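
For context, a hedged sketch of wiring these options into a gRPC server (the agent's actual server construction also adds interceptors, TLS, and so on; the newServer name is assumed):

import "google.golang.org/grpc"

const maxMessageSize = 4 * 1024 * 1024 // 4 MB, matching the gRPC default

func newServer() *grpc.Server {
	// Setting both limits explicitly keeps the server's send/receive caps
	// aligned with the chunk size used when splitting tagger stream responses.
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMessageSize),
		grpc.MaxSendMsgSize(maxMessageSize),
	)
}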

Comment on lines +131 to +132
grpc.MaxSendMsgSize(maxMessageSize),
grpc.MaxRecvMsgSize(maxMessageSize),
Contributor Author

The default maximum message size in gRPC is 4 MB.

I added these options explicitly to make sure the tagger server uses the same maximum message size when splitting the response into chunks.

Contributor

@ogaca-dd ogaca-dd left a comment

LGTM for files owned by ASC

Contributor

@clamoriniere clamoriniere left a comment

Hi @adel121

Change looks good. I added 2 small comments.

@@ -48,18 +48,21 @@ func (server *apiServer) startCMDServer(

// gRPC server
authInterceptor := grpcutil.AuthInterceptor(parseToken)
const maxMessageSize = 4 * 1024 * 1024 // 4 MB
Contributor

nit: move the const definition to the beginning of the file (after line 33)

}

// splitEvents splits the array of events to chunks with at most maxChunkSize each
func splitEvents(events []*pb.StreamTagsEvent, maxChunkSize int) [][]*pb.StreamTagsEvent {
Contributor

This function works, and because the result is a slice of slices of *pb.StreamTagsEvent, the memory used to build the new slices should not be too large.

However, for memory optimisation, I'm wondering whether you looked into returning an iterator instead of creating new slices, like the slices.Chunk function does: https://pkg.go.dev/slices#Chunk
It would help to build each new slice sequentially, as the proto messages are created.
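
For illustration, a hypothetical Go 1.23+ variant (splitBySizeSeq is an assumed name, not part of this PR) that yields chunks lazily instead of materialising the full [][]T, in the spirit of slices.Chunk:

import "iter"

func splitBySizeSeq[T any](slice []T, maxChunkSize int, computeSize func(T) int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		var current []T
		currentSize := 0
		for _, item := range slice {
			itemSize := computeSize(item)
			// Yield the current chunk before it would exceed the limit.
			if len(current) > 0 && currentSize+itemSize > maxChunkSize {
				if !yield(current) {
					return
				}
				current, currentSize = nil, 0
			}
			current = append(current, item)
			currentSize += itemSize
		}
		if len(current) > 0 {
			yield(current)
		}
	}
}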

Contributor Author

Iterators were introduced in Go 1.23.0, and the slices.Chunk function you mention was itself added in 1.23.0.

Currently, the agent is on Go 1.22.0.

IMO it isn't worth upgrading the Go version now just to make this change.

We can create a card to make the change once the agent's Go version is upgraded, so that we don't forget this optimisation.

WDYT?

Contributor

Yes, perfect.
Thanks for looking at it 🙇

Contributor

If you can, please add a TODO comment in the code.

@adel121 adel121 requested a review from clamoriniere October 18, 2024 10:08
Contributor

@clamoriniere clamoriniere left a comment

Thanks for this fix and optimisation 👍


Regression Detector


adel121 commented Oct 21, 2024

/merge


dd-devflow bot commented Oct 21, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 87369f7 into main Oct 21, 2024
221 of 224 checks passed
@dd-mergequeue dd-mergequeue bot deleted the adelhajhassan/cut_taggerstream_into_4mb_chunks_with_remote_tagger branch October 21, 2024 14:02