container_oom_events_total always returns 0 #3015
Comments
Just tested this with 0.44.0 - issue persists.
I believe the problem is that we are updating the OOMEvent count on the container itself on this line (line 1256 in 24e7a98).
To my understanding, when an OOM event occurs the container is destroyed, effectively removing it from the metric data. An example from my testing in AWS:
I0525 14:24:39.889468 1 manager.go:1044] Destroyed container: "/ecs/ID/CONTAINER_ID" (aliases: [alias], namespace: "docker")
So we increment the OOM metric and then deregister the metric :( Is my understanding of your implementation correct, @kragniz? If the expectation is for the container to be restarted after an OOM, this makes the metric unusable in environments where containers are always replaced rather than restarted (such as ECS).
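To illustrate the sequencing problem described above, here is a minimal sketch using prometheus/client_golang. This is not cadvisor's actual code, and the label name and container ID are made up: the point is that if the per-container series is deleted when the container is destroyed, an increment that happens just before deletion may never be scraped, so the metric appears to stay at 0.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-container OOM counter, keyed by a container label.
// This illustrates the race described above, not cadvisor's real implementation.
var oomEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "container_oom_events_total",
		Help: "Count of OOM events observed for the container.",
	},
	[]string{"container"},
)

func main() {
	prometheus.MustRegister(oomEvents)

	// 1. An OOM event arrives and the counter for the container is incremented.
	oomEvents.WithLabelValues("/ecs/ID/CONTAINER_ID").Inc()

	// 2. The container is destroyed almost immediately afterwards and its
	//    series is deleted. If no scrape happened between steps 1 and 2,
	//    the increment is never exposed, so the metric looks like it never moved.
	oomEvents.DeleteLabelValues("/ecs/ID/CONTAINER_ID")
}
```

Anything that ties the counter's lifetime to the container's lifetime has this problem; a counter that survives container churn, or that is guaranteed to be scraped before deregistration, would not.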
I have the same issue. One of my containers is running out of memory, and I can see the OOM event in syslog and kmsg, but container_oom_events_total stays at 0. compose.yml:
Having the same issue here. The event is not showing up in Kubernetes either.
Any idea how to solve this? Thanks!
Also having the same issue with cAdvisor 0.46.0.
Hitting this issue in Kubernetes as well. Commenting for visibility.
I have done various tests of OOMKills under Kubernetes. I have, so far, seen only one use-case where I have observed the OOM metric not being lost.
See #3278 (comment)
We encountered this problem just now too.
@Creatone @bobbypage could I gently drag you over to this issue about the bugs with the OOM metrics? Is there anything to be done to get this problem addressed or the PR reviewed?
An update on my previous comment: k8s …
regardless of the k8s version or whether …
This is exactly my point: the only use-case where I have seen that the OOM metric was not lost was removed in k8s 1.28. Whether or not cAdvisor should provide the OOM metric is a separate discussion. It is only relevant if the container is not deleted after being OOMKilled, which doesn't make a lot of sense for any managed container environment, to be honest.
On my side, upgrading conmon solved the issue (Debian 12 at minimum).
Essentially, in a DevSecOps world, engineers want to be able to track container application behavior when memory usage exceeds the memory requests and the kubelet has to step in to say, "whoah". That turns into an OOM kill (exit code 137, as stated above). For example, I want to have a query like … or something similar.
missing-container-metrics offers this exact metric, albeit 3 years old and littered with CVEs now.
Given that, I am unsure what container_oom_events_total is intended for if not the above user story. Please do help all of us following this thread if we are misunderstanding the intention of this metric.
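As a point of comparison, one way to count OOM kills independently of container lifetime is to watch the kernel log and increment a long-lived counter. The sketch below is only an illustration of that idea, not how missing-container-metrics or cadvisor actually work; the metric name and the kmsg matching string are assumptions.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// A counter that outlives any individual container, so OOM kills are not
// lost when the container is destroyed and replaced.
var oomKills = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "oom_kills_total", // hypothetical metric name
	Help: "OOM kills observed via /dev/kmsg.",
})

func main() {
	prometheus.MustRegister(oomKills)
	// Exposing the registry over HTTP (promhttp) is omitted for brevity.

	// Reading the kernel ring buffer requires sufficient privileges.
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// The kernel logs a line containing "Killed process" when the OOM
		// killer fires; the exact wording varies by kernel version.
		if strings.Contains(scanner.Text(), "Killed process") {
			oomKills.Inc()
		}
	}
}
```

A counter like this survives container replacement, which is what ECS-style environments need, though it lacks the per-container labels that make container_oom_events_total attractive in the first place.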
Running Docker (swarm): when OOM events occur, the counter never increases. For reference, the node-exporter metric node_vmstat_oom_kill does increase.
Running cAdvisor v0.43.0.