
Metrics stop updating after k8s agent is redeployed. #64

Closed
tcrozier opened this issue Jul 24, 2018 · 13 comments

@tcrozier

tcrozier commented Jul 24, 2018

I am running a .NET Core client using dogstatsd-csharp-client to send custom metrics to a v6 Datadog agent. The client runs in a Kubernetes pod and talks to the agent running on the same node. When everything is first deployed, the metrics are transmitted and received correctly. However, if I redeploy the Datadog agent, the metrics are no longer received. I can, however, run the following command from the pod the client is running in and still see custom metrics arrive:

  • echo -n "custom_metric:50|g|#shell" >/dev/udp/172.21.97.95/8125

If I redeploy the client afterwards everything starts to communicate correctly once again.
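
For reference, a minimal sketch of this kind of client setup (the host IP and metric name are illustrative, and the StatsdClient API usage is assumed rather than taken verbatim from our code):

using StatsdClient;

class MetricsDemo
{
    static void Main()
    {
        // Point the client at the DogStatsD hostPort exposed on the node.
        // The IP here is illustrative; in a real pod it would normally be
        // injected rather than hardcoded.
        DogStatsd.Configure(new StatsdConfig
        {
            StatsdServerName = "172.21.97.95",
            StatsdPort = 8125
        });

        // Roughly the equivalent of the shell test above: a gauge tagged "shell".
        DogStatsd.Gauge("custom_metric", 50, tags: new[] { "shell" });
    }
}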

@brantburnett

Based on the symptoms I've seen, I feel like this is a problem with the C# client not reconnecting when the DataDog agent is recycled.

Deploying a new version of the agent causes the agent to go down and come back up. After this, stats are no longer delivered from the application, yet manually delivered stats still work (via the command line mentioned above). Additionally, recycling the application causes stats to start flowing again.

@brantburnett

To clarify a bit further, we're sending custom metrics via the DogStatsD server included with the DataDog agent, following these steps revolving around "hostPort": https://www.datadoghq.com/blog/monitor-kubernetes-docker/#bind-the-dogstatsd-port-to-a-host-port

When we upgrade the DD agent in the K8S DaemonSet, it kills the pod on each node and replaces it with a new pod. This is when the problem occurs, and we have to recycle our application pods before custom metrics resume flowing.

I've attempted to reproduce this issue in plain Docker on my local machine without success, which makes me suspect that it's at least partially related to some detail of Kubernetes networking.
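
For anyone following that guide, the relevant piece on the application side is just resolving the node's IP at runtime. A hedged sketch, assuming the pod spec injects the node IP via the Kubernetes downward API into an environment variable (the name DD_AGENT_HOST here is illustrative; use whatever your manifest defines):

using System;
using StatsdClient;

class StatsdBootstrap
{
    static void Main()
    {
        // Assumes the pod spec contains something like:
        //   env:
        //   - name: DD_AGENT_HOST
        //     valueFrom:
        //       fieldRef:
        //         fieldPath: status.hostIP
        var agentHost = Environment.GetEnvironmentVariable("DD_AGENT_HOST") ?? "127.0.0.1";

        DogStatsd.Configure(new StatsdConfig
        {
            StatsdServerName = agentHost,
            StatsdPort = 8125
        });
    }
}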

@brantburnett

So I think I've tracked this down to the way Kubernetes (in this case on AWS with the Calico CNI networking layer) is handling UDP connections.

  • When the DD Agent pod is launched as hostPort, Kubernetes registers a NAT rule in iptables on the host to forward port 8125 to the pod
  • When the client app sends a UDP packet to the DogStatsD agent via the host IP, a UDP session is tracked in NAT on the host, based on source and dest IP and port
  • You can see this session on the K8S host using sudo cat /proc/net/ip_conntrack | grep "dport=8125"
  • This session has a lifetime of approximately 30 seconds (seems to be the default)
  • If the DD Agent pod is killed and recreated, i.e. during DaemonSet upgrade or via kubectl delete pod, it acquires a new IP
  • Any further traffic from the same source IP and port continues to be routed to the old pod IP for as long as the session stays active (packets arriving less than 30 seconds apart), and therefore never reaches the new pod

The dogstatsd-csharp-client maintains a single .NET Socket instance for the lifetime of the application, so the source port never changes, which keeps this stale session alive for as long as you keep sending stats.
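
To illustrate why the source port never changes, here is a small sketch with nothing SDK-specific, just a raw .NET socket (the node IP is illustrative):

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

class ConstantSourcePortDemo
{
    static void Main()
    {
        // One long-lived UDP socket, as the SDK effectively has: the OS binds an
        // ephemeral source port on the first send and keeps it for the socket's
        // lifetime, so every packet matches the same conntrack entry on the node.
        using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp))
        {
            var agent = new IPEndPoint(IPAddress.Parse("172.21.98.62"), 8125);
            var payload = Encoding.ASCII.GetBytes("custom_metric:50|g|#shell");

            for (var i = 0; i < 100; i++)
            {
                socket.SendTo(payload, agent);
                // Prints the same local port on every iteration.
                Console.WriteLine($"sent from local port {((IPEndPoint)socket.LocalEndPoint).Port}");
                // Well under the ~30 second conntrack timeout, so the entry never expires.
                Thread.Sleep(TimeSpan.FromSeconds(3));
            }
        }
    }
}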

I was able to replicate this in my Kubernetes cluster using these steps on the command line:

  1. kubectl create ns test to create a test namespace
  2. kubectl annotate ns test scheduler.alpha.kubernetes.io/node-selector=test=test to restrict all pods in this namespace to nodes with the test=test label
  3. kubectl label node ip-172-21-98-62.ec2.internal test=test to label a single node in my cluster (running DD agent) so we can test on just this node
  4. kubectl run -n test cmdline --rm -it --image=ubuntu --restart=Never to launch an instance of Ubuntu on that node
  5. apt-get update && apt-get install -y netcat to install netcat on the instance of Ubuntu
  6. Send a stat roughly every 3 seconds, 100 times. Note that the IP is the host node's IP address.
    for i in `seq 1 100`; do echo "custom_metric:50|g|#shell" | nc -u -p 40001 -q 1 172.21.98.62 8125; sleep 2; done
  7. After you see stats arriving, kubectl delete pod dd-agent-abcdef to delete the DD agent on the node being used (in a separate terminal so that the loop above is still running)
  8. Notice that stats have stopped arriving
  9. To start stats arriving again, cancel the loop and restart but change the source port from 40001 to 40002, or wait at least 30 seconds before restarting the loop

I think the solution may be to occasionally dispose and recreate the Socket in the .NET SDK. However, that still wouldn't prevent the loss of some statistics. And since UDP sockets are stateless, I'm not sure how to detect the situation and fix it sooner. @crhairr suggested that it might be possible to use QoS to detect that a packet wasn't delivered, but I'm not sure about this idea's performance implications, difficulty, or compatibility with Kubernetes.
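
To make that idea concrete, a rough sketch of what periodically disposing and recreating the socket could look like. This is not the SDK's implementation; RecyclingUdpSender and its parameters are made up for illustration:

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

// Hypothetical mitigation: recreate the UDP socket periodically so a fresh
// ephemeral source port is used and stale conntrack entries are left to expire.
class RecyclingUdpSender : IDisposable
{
    private readonly IPEndPoint _agent;
    private readonly TimeSpan _maxAge;
    private Socket _socket;
    private DateTime _createdAt;

    public RecyclingUdpSender(IPEndPoint agent, TimeSpan maxAge)
    {
        _agent = agent;
        _maxAge = maxAge;
    }

    public void Send(string metric)
    {
        // Recycle the socket once it is older than maxAge (e.g. 30 seconds, to
        // roughly match the conntrack UDP timeout observed above).
        if (_socket == null || DateTime.UtcNow - _createdAt > _maxAge)
        {
            _socket?.Dispose();
            _socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
            _createdAt = DateTime.UtcNow;
        }

        _socket.SendTo(Encoding.ASCII.GetBytes(metric), _agent);
    }

    public void Dispose() => _socket?.Dispose();
}

Even with something like this, a stale conntrack entry would still swallow metrics for up to one recycle interval, so it mitigates rather than fixes the problem.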

@truthbk
Member

truthbk commented Aug 13, 2018

@brantburnett did you try killing the conntrack entry with conntrack -D -p udp -s <srcIP> -d <dstIP> --sport <origSrcPort> --dport 8125 (or similar)?

This might be a Calico bug more than anything else.

@brantburnett

@truthbk I did not. I would assume this would work, based on the symptoms, but it's not something I specifically tested. I'm traveling this week, so I doubt I'll have time to look at it, but I'll let you know once I try it.

@brantburnett

@truthbk

I can confirm that the conntrack command line you provided does indeed correct the problem (after running apt-get update && apt-get install -y conntrack on the Kubernetes host node).

Unfortunately, I'm not seeing any error logs in Calico indicating a failure:

2018-08-15 22:32:24.460 [INFO][105] calc_graph.go 211: Local endpoint deleted id=WorkloadEndpoint(node=ip-172-21-98-62, orchestrator=k8s, workload=default.datadog-agent-s5rmm, name=eth0)
2018-08-15 22:32:24.460 [INFO][105] int_dataplane.go 574: Received *proto.WorkloadEndpointRemove update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"default.datadog-agent-s5rmm" endpoint_id:"eth0" >
2018-08-15 22:32:24.461 [INFO][105] int_dataplane.go 574: Received *proto.ActiveProfileRemove update from calculation graph msg=id:<name:"k8s_ns.default" >
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-pri-k8s_ns.default" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="chain removal" table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-pro-k8s_ns.default" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] int_dataplane.go 690: Applying dataplane updates
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 465: Workload removed, deleting its chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-tw-calice46a2b65f1" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 422: Queing deletion of chain. chainName="cali-fw-calice46a2b65f1" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 470: Workload removed, deleting old state. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] table.go 398: Queueing update of chain. chainName="cali-from-wl-dispatch" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] table.go 398: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] endpoint_mgr.go 331: Re-evaluated workload endpoint status adminUp=false failed=false known=false operUp=false status="" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}
2018-08-15 22:32:24.461 [INFO][105] route_table.go 447: Syncing routes: removing old route. dest=100.119.52.73/32 ifaceName="calice46a2b65f1" ipVersion=0x4
2018-08-15 22:32:24.461 [INFO][105] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2018-08-15 22:32:24.461 [INFO][105] conntrack.go 78: Removing conntrack flows ip=100.119.52.73
2018-08-15 22:32:24.470 [INFO][105] status_combiner.go 86: Reporting endpoint removed. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default.datadog-agent-s5rmm", EndpointId:"eth0"}

Interestingly, Calico seems to use the conntrack command line utility to handle some of this cleanup: https://github.com/projectcalico/felix/blob/3d43b13f7cdb9e42bb98e828241dcd30a28ace74/conntrack/conntrack.go#L83. However, it appears to only delete entries for traffic routed to the IP address of the deleted DataDog pod. When I look at the session table before deleting, it looks like this:

udp      17 29 src=100.119.52.74 dst=172.21.98.62 sport=40001 dport=8125 [UNREPLIED] src=100.119.52.70 dst=100.119.52.74 sport=8125 dport=40001 mark=0 use=2

After deleting, it looks like this:

udp      17 29 src=100.119.52.74 dst=172.21.98.62 sport=40001 dport=8125 [UNREPLIED] src=172.21.98.62 dst=100.119.52.74 sport=8125 dport=40001 mark=0 use=2

Note that the IP changed from the IP of the old DataDog pod to the IP of the actual host node. Here's my revised theory:

  1. The old DataDog pod is deleted
  2. Calico deletes the session correctly routing traffic to the old pod
  3. A new packet is sent by an application pod, as expected being routed to the host IP address
  4. The new DataDog pod is not up and running with its hostPort yet, so Linux creates a new session that simply goes to the host IP without NAT
  5. The new DataDog pod comes up, is assigned its hostPort, and NAT routes are created by Calico
  6. Traffic continues to be routed to the host IP only, without NAT forwarding, because of the existing session

I think, based on this, that Calico needs to add a handling step when a hostPort is registered that clears any sessions running to the host IP (not NAT'd) on that port number. Just a theory, anyway.

@truthbk
Member

truthbk commented Aug 20, 2018

@brantburnett this is absolutely awesome and thorough; I really appreciate you taking the time to not only reproduce and debug this, but also run that little cleanup experiment for us. It is, as I suspected, more of a CNI-layer (Calico) issue than anything on the library side. Because UDP is stateless/connectionless (in the generally accepted sense of "connection"), we have very few options for detecting a failure on the other end, other than perhaps inspecting ICMP responses, which is not something a client would typically do, and even then it would still be unreliable. TCP would be a different story.

I tend to agree with your theory, especially in light of all the information you've provided. The only doubt I have is why the host would add tracking for a port that nothing on the host is listening on. That's a bit surprising, but it might be what's happening.

This is probably worth reporting to Calico, though it's not an easy problem to solve: during the race while the new pod comes up, it's hard to tell whether those UDP packets are meant for the host or for the container. 🤔 I can see Calico arguing it's not something the CNI should handle.

Regardless, thank you very much for all the time you've put in to help us debug this.

@brantburnett

I've filed an issue with Calico: https://github.com/projectcalico/felix/issues/1880, including a pure command line reproduction using netcat.

@brantburnett

I've also found this command line workaround using kubectl. It must be run after all DataDog pods are completely replaced, and it assumes you are using Calico as your CNI. Also, it's in PowerShell (sorry, Bash people), but it should be convertible.

kubectl get pod -n kube-system -l k8s-app=calico-node -o name | % {kubectl exec -n kube-system $_.Split('/')[1] -c calico-node -- conntrack -D -p udp --dport 8125}

It deletes all open UDP sessions running to port 8125 on every node in the Kubernetes cluster by running conntrack on each Calico pod.

@brantburnett

Upon further research, this appears not to be an issue with the Calico CNI plugin but rather with the portmap CNI plugin. I've reported it there:

containernetworking/plugins#123

@truthbk
Member

truthbk commented Sep 14, 2018

I'm going to close this issue as I believe it's not really an issue on our side. Thank you so much for the investigation and help @brantburnett.

Please feel free to ping us if you believe it should be re-opened.

@Diggsey

Diggsey commented Jan 3, 2019

For people on Google Cloud, this command works:

kubectl get pod -n kube-system -l component=kube-proxy -o name | cut -c5- | xargs -I{} kubectl exec -n kube-system {} -- conntrack -D -p udp --dport 8125

@wyardley

We hit this issue with other clients as well, but I wonder: how does this handle the case where a container crashes and respawns, rather than an intentional redeploy?

I get that UDP is stateless, but why not give the socket a finite lifetime after which it is closed and re-opened (say, every couple of minutes), so the client doesn't end up sending metrics to nowhere indefinitely?

Someone mentioned that the Ruby client may check the agent lifetime directly, but I haven't investigated it too deeply.
