Metrics stop updating after k8s agent is redeployed. #64
Based on the symptoms I've seen, I believe this is a problem with the C# client not reconnecting when the DataDog agent is recycled. Deploying a new version of the agent causes the agent to go down and come back up. After this, stats are no longer delivered from the application. However, manually delivered stats still work (via the command line mentioned in the original report). Additionally, recycling the application causes stats to start delivering again.
To clarify a bit further, we're sending custom metrics via the DogStatsD server included with the DataDog agent, following these steps revolving around "hostPort": https://www.datadoghq.com/blog/monitor-kubernetes-docker/#bind-the-dogstatsd-port-to-a-host-port. When we upgrade the DD agent in the Kubernetes DaemonSet, it kills the pod on each node and replaces it with a new pod. This is when the problem occurs, and we have to recycle our application pods before custom metrics resume flowing. I've attempted to reproduce this issue in plain Docker on my local machine without success, which makes me suspect that it might be at least partially related to some detail of Kubernetes networking.
So I think I've tracked this down to the way Kubernetes (in this case on AWS with the Calico CNI networking layer) handles UDP connections.
The dogstatsd-csharp-client maintains a single .NET Socket instance for the lifetime of the application. Therefore, the source port is always constant, which keeps triggering this situation as long as you keep trying to send stats. I was able to replicate this in my Kubernetes cluster from the command line.
I think the solution may be to occasionally dispose of and recreate the Socket in the .NET SDK. However, that still wouldn't prevent the loss of some statistics, and since UDP sockets are stateless, I'm not sure how to detect the situation and fix it sooner. @crhairr suggested that it might be possible to use QoS to detect that a packet wasn't delivered, but I'm not sure about that idea's performance implications, difficulty, or compatibility with Kubernetes.
@brantburnett did you try killing the stale conntrack entries (e.g. with `conntrack -D -p udp --dport 8125`)? This might be a Calico bug more than anything else.
@truthbk I did not. I would assume this would work, based on the symptoms, but it's not something I specifically tested. I'm traveling this week, so I doubt I'll have time to look at it, but I'll let you know once I try it.
I can confirm that the conntrack command line you provided does indeed correct the problem. Unfortunately, I'm not seeing any error logs in Calico indicating a failure:
Interestingly, Calico seems to use the conntrack command-line utility to handle some of this cleanup: https://github.com/projectcalico/felix/blob/3d43b13f7cdb9e42bb98e828241dcd30a28ace74/conntrack/conntrack.go#L83. However, it appears to work by deleting entries for traffic routed to the IP address of the deleted DataDog pod. When I look at the session table before deleting, it looks like this:
After deleting, it looks like this:
Note that the IP changed from the IP of the old DataDog pod to the IP of the actual host node. Here's my revised theory:
Based on this, I think Calico needs to add a handling step when a hostPort is registered that clears any sessions going to the host IP (not NAT'd) on that port number. Just a theory, anyway.
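For anyone who wants to inspect this themselves, here's a rough sketch (addresses invented, not output from the cluster above) of how the relevant entries can be viewed on a node; the reply-side `src=` shows where the traffic is actually being DNAT'd:

```sh
# Hypothetical sketch: list UDP conntrack entries for the DogStatsD hostPort
# on a node.
conntrack -L -p udp --dport 8125

# An entry looks roughly like this (IPs invented). The reply-side src= is the
# DNAT'd destination: a DataDog pod IP while the hostPort mapping applies, or
# the node's own IP when it does not.
#   udp  17 117 src=10.244.1.23 dst=172.20.1.10 sport=40000 dport=8125 \
#        [UNREPLIED] src=10.244.1.5 dst=10.244.1.23 sport=8125 dport=40000
```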
@brantburnett this is absolutely awesome and thorough, I really appreciate you taking the time to not only reproduce and debug, but also run that little cleanup experiment for us. It is, as I suspected, a CNI layer (Calico) issue rather than anything on the library side. UDP being stateless/connectionless (in the generally accepted sense of "connection") gives us very few options to detect a failure on the other end, other than perhaps inspecting all ICMP responses, which is not typically something a client would do, and even then it would still be unreliable. TCP would be a different story. I tend to agree with your theory, especially in light of all the information you've provided. The only doubt I have is why the host would add tracking for a port that nothing on the host is listening on; that's a bit surprising, but it might be happening. This is probably worth reporting to Calico, though it's not an easy problem to solve, because it's hard to tell whether the UDP packets arriving in the race before the new pod comes up are actually for the host or for the container. 🤔 I can see Calico arguing it's not something the CNI should handle. Regardless, thank you very much for all the time you've put in to help us debug this.
I've filed an issue with Calico: https://github.com/projectcalico/felix/issues/1880, including a pure command-line reproduction using netcat.
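A rough sketch of that kind of netcat test, with `$NODE_IP` standing in for the node address and an arbitrary fixed source port to mimic the long-lived client socket, looks something like this (not the exact commands from the linked issue):

```sh
# Rough sketch only. Keep sending DogStatsD-style UDP packets to the node's
# hostPort while pinning the source port, as a long-lived client socket would:
while true; do
  echo -n "custom.metric:1|c" | nc -u -w1 -p 40000 "$NODE_IP" 8125
  sleep 1
done

# In another shell, delete the agent pod that owns hostPort 8125 and wait for
# its replacement; the stale conntrack entry for the pinned source port keeps
# the packets from reaching the new pod until it is flushed, e.g.:
#   conntrack -D -p udp --dport 8125
```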
I've also found a command-line workaround using kubectl. It must be run after all DataDog pods have been completely replaced, and it assumes you are using Calico as your CNI. Also, it's in PowerShell (sorry, Bash people), but should be convertible:

`kubectl get pod -n kube-system -l k8s-app=calico-node -o name | % {kubectl exec -n kube-system $_.Split('/')[1] -c calico-node -- conntrack -D -p udp --dport 8125}`

It deletes all open UDP sessions to port 8125 on every node in the Kubernetes cluster by running conntrack in each Calico pod.
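For the Bash people, an untested sketch of the same workaround (same label selector and container name assumed) would look roughly like this:

```sh
# Untested Bash equivalent: flush UDP conntrack entries for port 8125 from
# within every calico-node pod.
for pod in $(kubectl get pod -n kube-system -l k8s-app=calico-node -o name | cut -d/ -f2); do
  kubectl exec -n kube-system "$pod" -c calico-node -- conntrack -D -p udp --dport 8125
done
```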
Upon further research, this appears not to be an issue with the Calico CNI plugin but rather an issue with the portmap CNI plugin. I've reported it there.
I'm going to close this issue as I believe it's not really an issue on our side. Thank you so much for the investigation and help @brantburnett. Please feel free to ping us if you believe it should be re-opened.
For people on Google Cloud, this command works:
|
We have this issue with other clients too, but I wonder: how does this handle the case where a container crashes and respawns, rather than an intentional redeploy? I get that UDP is stateless, but why not give the socket a finite lifetime after which it's closed and re-opened (say, every couple of minutes), so that you don't end up with the client sending metrics to nowhere indefinitely? Someone mentioned that the Ruby client may check the agent lifetime directly, but I haven't investigated it too deeply.
I am running a .NET Core client using dogstatsd-csharp-client to send custom metrics to a v6 Datadog agent. The client is running in a Kubernetes pod, talking to the agent running on the same node. When everything is deployed, the metrics are transmitted and received correctly. However, if I redeploy the Datadog agent, the metrics are no longer received. I can, however, run the following command from the pod the client is running on and still get custom metrics.
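A typical way to send such a test metric by hand, assuming the standard DogStatsD port and with `$DD_AGENT_HOST` as a placeholder for the agent/node address, is something along these lines (an illustrative stand-in, not necessarily the exact command used):

```sh
# Illustrative only: send a single DogStatsD counter over UDP to port 8125.
echo -n "custom.metric:1|c" | nc -u -w1 "$DD_AGENT_HOST" 8125
```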
If I redeploy the client afterwards, everything starts to communicate correctly once again.