Kubernetes agent stops sending data #4187

Closed

msiebuhr opened this issue Sep 20, 2019 · 6 comments

msiebuhr commented Sep 20, 2019

Output of the info page (if this is a bug)

kubectl exec -it datadog-agent-ltqjf agent status

Gives

Getting the status from the agent.

===============
Agent (v6.12.1)
===============

  Status date: 2019-09-20 14:28:04.929879 UTC
  Agent start: 2019-09-20 12:07:00.392791 UTC
  Pid: 337
  Python Version: 2.7.16
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 1.953ms
    System UTC time: 2019-09-20 14:28:04.929879 UTC

  Host Info
  =========
    bootTime: 2019-09-19 20:58:46.000000 UTC
    kernelVersion: 4.14.127+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.0
    procs: 65
    uptime: 15h8m36s

  Hostnames
  =========
    host_aliases: [xxx]
    hostname: xxx.internal
    socket-fqdn: datadog-agent-ltqjf
    socket-hostname: datadog-agent-ltqjf
    host tags:
      gke-tactile-default-9488e323-node
      zone:us-east1-b
      instance-type:n1-standard-4
      internal-hostname:xxx.internal
      instance-id:7332968292827419921
      project:tactile-webservices
      numeric_project_id:1006178009280
      instance-template:projects/1006178009280/global/instanceTemplates/gke-tactile-default-default-4-222c9b59
      cluster-location:us-east1-b
      cluster-name:tactile-default
      disable-legacy-endpoints:true
      gci-ensure-gke-docker:true
      enable-oslogin:false
      kubelet-config:apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /etc/srv/kubernetes/pki/ca-certificates.crt
authorization:
  mode: Webhook
cgroupRoot: /
clusterDNS:
- 10.15.240.10
clusterDomain: cluster.local
configMapAndSecretChangeDetectionStrategy: Cache
enableDebuggingHandlers: true
evictionHard:
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
featureGates:
  DynamicKubeletConfig: false
  ExperimentalCriticalPodAnnotation: true
  RotateKubeletServerCertificate: true
kind: KubeletConfiguration
kubeReserved:
  cpu: 80m
  ephemeral-storage: 41Gi
  memory: 2536Mi
readOnlyPort: 10255
serverTLSBootstrap: true
staticPodPath: /etc/kubernetes/manifests

      google-compute-enable-pcid:true
      kube-labels:beta.kubernetes.io/fluentd-ds-ready=true,cloud.google.com/gke-nodepool=default-4,cloud.google.com/gke-os-distribution=cos
      cluster-uid:xxx
      created-by:xxx
      gci-update-strategy:update_disabled
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

=========
Collector
=========



  Running Checks
  ==============
    
    cpu
    ---
      Instance ID: cpu [OK]
      Total Runs: 563
      Metric Samples: Last Run: 6, Total: 3,372
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
    
    disk (2.2.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Total Runs: 563
      Metric Samples: Last Run: 188, Total: 105,844
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2.29s
      
    
    docker
    ------
      Instance ID: docker [OK]
      Total Runs: 562
      Metric Samples: Last Run: 1,327, Total: 759,115
      Events: Last Run: 0, Total: 9
      Service Checks: Last Run: 1, Total: 562
      Average Execution Time : 553ms
      
    
    file_handle
    -----------
      Instance ID: file_handle [OK]
      Total Runs: 563
      Metric Samples: Last Run: 5, Total: 2,815
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
    
    io
    --
      Instance ID: io [OK]
      Total Runs: 562
      Metric Samples: Last Run: 130, Total: 72,970
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
    
    kubelet (3.2.1)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Total Runs: 563
      Metric Samples: Last Run: 1,205, Total: 689,303
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 2,252
      Average Execution Time : 4.222s
      
    
    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 563
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
    
    load
    ----
      Instance ID: load [OK]
      Total Runs: 563
      Metric Samples: Last Run: 6, Total: 3,378
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
    
    memory
    ------
      Instance ID: memory [OK]
      Total Runs: 563
      Metric Samples: Last Run: 17, Total: 9,571
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms
      
    
    network (1.11.0)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Total Runs: 563
      Metric Samples: Last Run: 284, Total: 162,772
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 13ms
      
    
    ntp
    ---
      Instance ID: ntp:b4579e02d1981c12 [OK]
      Total Runs: 563
      Metric Samples: Last Run: 1, Total: 563
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 563
      Average Execution Time : 18ms
      
    
    uptime
    ------
      Instance ID: uptime [OK]
      Total Runs: 563
      Metric Samples: Last Run: 1, Total: 563
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      
========
JMXFetch
========

  Initialized checks
  ==================
    no checks
    
  Failed checks
  =============
    no checks
    
=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 563
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 53
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 1,179
    TimeseriesV1: 563

  API Keys status
  ===============
    API key ending with bcf96: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - bcf96

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 1.8 M
  Dogstatsd Metric Sample: 532,976
  Event: 10
  Events Flushed: 10
  Number Of Flushes: 563
  Series Flushed: 1.8 M
  Service Check: 9,578
  Service Checks Flushed: 10,128

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 532,975
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 532,976
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0


Describe what happened:

The container/pod above has stopped sending data to Datadog. It looks to be related to a failing liveness probe, which causes the pod to restart.
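For context, the liveness probe on the agent container is roughly of this shape (a minimal sketch; the command is the agent's own health check, but the timings and thresholds below are illustrative, not copied from our manifest):

  livenessProbe:
    exec:
      command:                  # "agent health" is what reports "Agent health: FAIL" in probe events
      - agent
      - health
    initialDelaySeconds: 15     # illustrative values, not our actual settings
    periodSeconds: 15
    timeoutSeconds: 5
    failureThreshold: 3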

We also get a lot of the following in the logs, which I suspect may be correlated.

 2019-09-20 14:14:51 UTC | CORE | WARN | (pkg/util/docker/containers.go:223 in parseContainerNetworkAddresses) | Unable to parse IP:  for container: /k8s_POD_<pod name>-5dc78d46c5-d8fxg_default_c021f22a-db86-11e9-be56-42010a8e0219_0

Describe what you expected:
Fewer errors, more metrics.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):
Running the agent (v6.12.1, per the status output above) on Kubernetes (GKE), with non-local DogStatsD enabled (DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true).
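For reference, the DogStatsD-related bits of the agent container spec look roughly like this (a sketch, not our full manifest; 8125 is the default DogStatsD UDP port, the port name is arbitrary):

  env:
  - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC   # accept StatsD traffic from other pods/hosts
    value: "true"
  ports:
  - containerPort: 8125                    # default DogStatsD UDP port
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP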

msiebuhr (Author) commented Sep 20, 2019

By comparison, here's some output from an agent that is sending metrics. It has been up for about 2x as long, but has processed roughly 100x as many metric packets. (I haven't checked how skewed our load is at the moment, but it is definitely not 100x...)

===============
Agent (v6.12.1)
===============

  Status date: 2019-09-20 14:50:17.701342 UTC
  Agent start: 2019-09-20 09:27:40.975672 UTC
  Pid: 337
 ...
=========
Aggregator
=========
  Checks Metric Sample: 2.2 M
  Dogstatsd Metric Sample: 63 M
  Event: 43
  Events Flushed: 43
  Number Of Flushes: 1,290
  Series Flushed: 11.4 M
  Service Check: 21,967
  Service Checks Flushed: 23,241

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 63 M
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 63 M
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

msiebuhr changed the title from "Many parseContainerNetworkAddress()-errors in Kubernetes logs" to "Kubernetes agent stops sending data" on Sep 23, 2019
@msiebuhr (Author)

So I found some of these in kubectl describe pod/datadog-agent-xxxxx

  Warning  Unhealthy  24m  kubelet, gke-tactile-default-default-4-222c9b59-pjcx  Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 10 unhealthy components ===
ad-config-provider-docker, ad-config-provider-kubernetes, ad-dockerprovider, ad-kubeletlistener, ad-servicelistening, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 10 unhealthy components

Which components are healthy and unhealthy varies a bit from event to event, but we are seeing quite a few of these probe failures.

hkaj (Member) commented Sep 23, 2019

Hi @msiebuhr, thanks for reaching out.
A couple of pointers:

  • you may want to upgrade the agent to get rid of that kubelet-config tag. The config is passed as a node label, and we used to collect it by default, which was a mistake. This is fixed in the latest version.
  • this warning log is unlikely to be the issue. This behavior (the agent going unhealthy without crashing or spewing error logs) is usually due to the agent being throttled on CPU or memory. Maybe try increasing the limits until it stops choking, and turn them back down once you've found what it's comfortable with (rough sketch below)?
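Something along these lines, for example (the numbers are only a starting point to experiment with, not a sizing recommendation):

  # Illustrative agent container resources; raise the limits until the health
  # checks stop flapping, then dial them back down.
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi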

If that doesn't help, feel free to reach out to support; they'll ask you to send a flare from one of these agents so we can troubleshoot deeper.

@msiebuhr (Author)

Thanks, @hkaj. I have updated, and things haven't broken for a few hours (though we're not at peak traffic yet). I'm also in contact with support to move things further along.

But what about the parseContainerNetworkAddresses warning? We still see quite a few of those, and it would be nice to get that fixed...

msiebuhr (Author) commented Oct 8, 2019

Fixed by applying the initContainer from containernetworking/plugins#123 (comment) before starting the Datadog agent.
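For anyone landing here later: the workaround is an initContainer of roughly this shape, run before the agent container. The actual conntrack invocation comes from the linked comment and is deliberately not reproduced here; names and the image are placeholders.

  initContainers:
  - name: init-conntrack                  # name is illustrative
    image: <image-with-conntrack>         # placeholder, see the linked comment
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]                # conntrack needs CAP_NET_ADMIN
    command: ["sh", "-c", "<conntrack command from the linked comment>"]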

msiebuhr closed this as completed on Oct 8, 2019
msiebuhr (Author) commented Oct 8, 2019

Upon closer inspection of the service, having an initContainer running conntrack ... didn't solve the issue.
