container_oom_events_total always returns 0 #3015

Open
darkl0rd opened this issue Nov 23, 2021 · 18 comments · May be fixed by #3278

Comments

@darkl0rd

Running Docker (Swarm): when OOM events occur, the counter never increases. For reference, the node-exporter metric (node_vmstat_oom_kill) does increase.

Running cAdvisor v0.43.0.

@darkl0rd
Author

Just tested this with 0.44.0 - issue persists.

@juan-ramirez-sp

juan-ramirez-sp commented May 25, 2022

I believe the problem is that we are updating the OOMEvent count on the container itself on this line.

atomic.AddUint64(&cont.oomEvents, 1)

To my understanding, when an OOM event occurs the container is destroyed, which effectively removes it from the metric data.

An example from my testing in AWS:

```
I0525 14:24:39.848012 1 manager.go:1223] Created an OOM event in container "/ecs/ID/CONTAINER_ID" at 2022-05-25 14:24:40.574306117 +0000 UTC m=+135.585319105
I0525 14:24:39.889468 1 manager.go:1044] Destroyed container: "/ecs/ID/CONTAINER_ID" (aliases: [alias], namespace: "docker")
```

So we increment the OOM metric then deregister the metric :(

Is my understanding of your implementation correct, @kragniz?

If the expectation is that the container is restarted after an OOM, this makes the metric unusable in environments where containers are always replaced rather than restarted (such as ECS).
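
The sequence in those two log lines can be mimicked with a small, self-contained sketch. This is not cAdvisor's code: the `registry` type, its methods, and the container names are hypothetical stand-ins, and only the `atomic.AddUint64` increment mirrors the line quoted above. It illustrates how a counter stored on the container object drops out of the scrape output once the container is destroyed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// containerData is a hypothetical stand-in for a watched container.
type containerData struct {
	oomEvents uint64
}

// registry is a hypothetical stand-in for the manager's container map.
type registry struct {
	containers map[string]*containerData
}

func newRegistry() *registry {
	return &registry{containers: make(map[string]*containerData)}
}

func (r *registry) create(name string) {
	r.containers[name] = &containerData{}
}

// recordOOM increments the counter on the container itself.
func (r *registry) recordOOM(name string) {
	if cont, ok := r.containers[name]; ok {
		atomic.AddUint64(&cont.oomEvents, 1)
	}
}

// destroy removes the container, and its counter along with it.
func (r *registry) destroy(name string) {
	delete(r.containers, name)
}

// scrape returns one value per container that is still registered.
func (r *registry) scrape() map[string]uint64 {
	out := make(map[string]uint64, len(r.containers))
	for name, cont := range r.containers {
		out[name] = atomic.LoadUint64(&cont.oomEvents)
	}
	return out
}

func main() {
	reg := newRegistry()
	reg.create("/ecs/ID/CONTAINER_ID")
	reg.recordOOM("/ecs/ID/CONTAINER_ID")
	reg.destroy("/ecs/ID/CONTAINER_ID")
	// The replacement container starts with a fresh counter.
	reg.create("/ecs/ID/NEW_CONTAINER_ID")
	fmt.Println(reg.scrape())
}
```

In this sketch the scrape only ever sees the replacement container with a count of 0, which matches the behavior reported in this issue.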

@tasiotas

tasiotas commented Jan 3, 2023

I have the same issue.

One of my containers is running out of memory. I can see the OOM event in syslog and kmsg, but container_oom_events_total is always 0. Any clues on how to get it working, or another way to detect OOMs in containers?

compose.yml

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.46.0
    container_name: cadvisor
    network_mode: host
    user: root
    privileged: true
    healthcheck:
      disable: true
    restart: unless-stopped
    ports:
      - '8080:8080'
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    command:
      - --url_base_prefix=/bRs47jH13fdsBFMQ93/cadvisor
      - --housekeeping_interval=15s
      - --docker_only=true
      - --store_container_labels=false
      - --enable_metrics=disk,diskIO,cpu,cpuLoad,process,memory,network,oom_event
    devices:
      - /dev/kmsg:/dev/kmsg

@SebastienTolron

Having the same issue here.

The event is not showing up in Kubernetes either.

 Last State:     Terminated
      Reason:       Error
      Exit Code:    137

Any idea how to solve this?

Thanks!

@gfonseca-tc

Also having the same issue with cAdvisor 0.46.0

@chengjoey

Has there been any progress on this bug? I am encountering the same issue on k8s.


[image: describe output for the go-demo pod]

@chengjoey linked a pull request on Mar 22, 2023 that will close this issue
@chengjoey

In k8s, a new container is always created to replace an OOM-killed one, so the container_oom_events_total metric will always be 0. I tried keeping the containers that were deleted because of an OOM kill, and was then able to query the metrics.


In addition, if the cluster is running on minikube + Docker, the container name obtained from /dev/kmsg has an additional /docker/{{id}} prefix, which does not match the watched container name. In that case the metric will also always be 0.
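
A minimal sketch of the kind of name normalization that would make the two names comparable. It assumes the extra prefix has the form `/docker/<id>` with a hexadecimal ID; the function name and the sample paths are hypothetical, and this is not a description of what cAdvisor or the linked PR does.

```go
package main

import (
	"fmt"
	"regexp"
)

// dockerPrefix matches a leading "/docker/<hex id>" prefix.
// Assumption: the ID consists of lowercase hexadecimal characters.
var dockerPrefix = regexp.MustCompile(`^/docker/[0-9a-f]+(/|$)`)

// normalizeName drops a leading "/docker/<id>" prefix when a path follows it,
// so a name read from /dev/kmsg can be compared with a watched name.
func normalizeName(name string) string {
	loc := dockerPrefix.FindStringIndex(name)
	if loc == nil || loc[1] == len(name) {
		// No prefix, or nothing after the prefix: leave the name unchanged.
		return name
	}
	return "/" + name[loc[1]:]
}

func main() {
	fmt.Println(normalizeName("/docker/0123abcd/kubepods/pod1/c1"))
	fmt.Println(normalizeName("/kubepods/pod1/c1"))
}
```

Names that consist only of `/docker/<id>` are left alone on purpose, since on a plain Docker host that is the container's own name rather than an extra prefix.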

@NesManrique

Hitting this issue in kubernetes as well. Commenting for visibility.

@tsipo

tsipo commented Apr 26, 2024

I have done various tests of OOMKills under Kubernetes. I have - so far - seen only one use-case where I have observed container_oom_events_total > 0 (specifically container_oom_events_total == 1).
An OOMKill does not mean in all cases deletion of the container (which will deregister its container_oom_events_total). The main process of the container (aka pid 1) may fork one or more Linux processes (actually fork, not using the exec command). If one of these other processes gets OOMKilled and this does not cause pid 1 to exit as well (at least not until the next cAdvisor scrape), the container will continue to live and you'll see container_oom_events_total == 1.

@pschichtel

@tsipo track the referenced PR #3278 or help pushing it over the line.

@tsipo

tsipo commented Apr 29, 2024

> @tsipo track the referenced PR #3278 or help pushing it over the line.

See #3278 (comment)

@rd-yan-farba

rd-yan-farba commented Aug 7, 2024

We encountered this problem just now too.

@frittentheke

frittentheke commented Aug 16, 2024

@Creatone @bobbypage could I gently drag you to this very issue here about bugs with the OOM metrics?
Since there is a PR potentially fixing this issue, please see the recent comments there: #3278 (comment)

Is there anything to be done to get this problem addressed / the PR reviewed?

@tsipo

tsipo commented Aug 20, 2024

> I have done various tests of OOMKills under Kubernetes. I have - so far - seen only one use-case where I have observed container_oom_events_total > 0 (specifically container_oom_events_total == 1). An OOMKill does not mean in all cases deletion of the container (which will deregister its container_oom_events_total). The main process of the container (aka pid 1) may fork one or more Linux processes (actually fork, not using the exec command). If one of these other processes gets OOMKilled and this does not cause pid 1 to exit as well (at least not until the next cAdvisor scrape), the container will continue to live and you'll see container_oom_events_total == 1.

An update on my previous comment: k8s 1.28 has enabled cgroup grouping (assuming cgroups v2) - see here. That means the use case I mentioned, "hidden OOMKills" (an OOMKill of a container process that is not pid 1), should no longer happen on k8s >= 1.28 with cgroups v2.

@chengjoey

> An update on my previous comment: k8s 1.28 has enabled cgroup grouping (assuming cgroups v2) - see here. That means the use-case I have mentioned of "hidden OOMKills" (OOMKill of a container process which is not pid 1) should not happen anymore for k8s >= 1.28 and cgroups v2.

Regardless of the k8s version or whether cgroups v2 is used, the OOM metric is lost as long as the OOM causes the container to be rebuilt. I think the kubelet or container manager will still monitor system OOM events, then kill and rebuild the container, at which point the OOM metric is lost.
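
One way to picture a counter that survives the rebuild is to key it on an identity that outlives a single container instance, instead of storing it on the container object. The sketch below is purely illustrative: the `oomTracker` type and the pod/container key are assumptions, not cAdvisor's design or the approach taken in the linked PR.

```go
package main

import (
	"fmt"
	"sync"
)

// slotKey identifies a workload slot that outlives any one container instance.
// Assumption: pod name plus container name is stable across rebuilds.
type slotKey struct {
	pod       string
	container string
}

// oomTracker keeps OOM counts outside the per-container state.
type oomTracker struct {
	mu     sync.Mutex
	counts map[slotKey]uint64
}

func newOOMTracker() *oomTracker {
	return &oomTracker{counts: make(map[slotKey]uint64)}
}

// record adds one OOM event for the slot, whichever instance was running.
func (tr *oomTracker) record(k slotKey) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.counts[k]++
}

// total returns the number of OOM events recorded for the slot so far.
func (tr *oomTracker) total(k slotKey) uint64 {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	return tr.counts[k]
}

func main() {
	tracker := newOOMTracker()
	k := slotKey{pod: "go-demo", container: "app"}
	tracker.record(k) // first instance is OOM-killed and deleted
	tracker.record(k) // its replacement is OOM-killed as well
	fmt.Println(tracker.total(k))
}
```

A map like this grows without bound unless entries are expired at some point, so a real implementation would need a retention policy, which this sketch leaves out.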

@tsipo

tsipo commented Aug 22, 2024

> > An update on my previous comment: k8s 1.28 has enabled cgroup grouping (assuming cgroups v2) - see here. That means the use-case I have mentioned of "hidden OOMKills" (OOMKill of a container process which is not pid 1) should not happen anymore for k8s >= 1.28 and cgroups v2.
>
> regardless of the k8s version or whether cgroups v2 is used, as long as OOM causes the container to be rebuilt, the OOM metric is lost. I think kubelet or container-manager will still monitor the system OOM events, then kill the container and rebuild it, then the OOM metric will be lost

This is exactly my point: the only use case where I have seen the OOM metric survive was removed in k8s 1.28.

Whether or not cAdvisor should provide the OOM metric is a separate discussion. The metric is only relevant if the container is not deleted after being OOMKilled, which, to be honest, doesn't make a lot of sense in any managed container environment.
BTW, OOM kills can be monitored using the kube-state-metrics metrics kube_pod_container_status_last_terminated_exitcode (a value of 137, i.e. 128 + SIGKILL, is what an OOMKill produces) and the recently added kube_pod_container_status_last_terminated_timestamp. These do not go away, because the pod is not lost after the container is deleted and rebuilt.
At node level, node-exporter provides node_vmstat_oom_kill (a counter of all processes, not containers, that were OOM-killed).

@SebastienTolron

On my side, upgrading conmon solved the issue (this requires at least Debian 12).

@sellers

sellers commented Nov 26, 2024

Essentially, in a DevSecOps world, engineers want to be able to track the behavior of containerized applications when their memory usage exceeds the memory requests and the kubelet has to step in to say, "whoa". That turns into an OOM (137, as stated above).
If that happens X times during a period of Y, as an engineer, I want to know so that I can address performance issues or business SLA/continuity issues.

e.g. I want to have a query like
sum by (container_label_X, image_id) (increase(container_oom_events_total{namespace!="playground"}[5m]) > 1)

or something similar

missing-container-metrics offers this exact metric, although it is 3 years old and littered with CVEs by now.

Given that, I am unsure what container_oom_events_total is intended for, if not the user story above. Please help all of us following this thread if we are misunderstanding the intention of this metric.
