
Daemonset deployment failed due to failed to create "memory_limiter" processor, permission denied #543

Closed
wadexu007 opened this issue Dec 7, 2022 · 27 comments

@wadexu007

wadexu007 commented Dec 7, 2022

Issue Summary:
Daemonset deployment fails with failed to create "memory_limiter" processor, permission denied when the hostMetrics preset is enabled.

Environment:

  • Chart version opentelemetry-collector-0.40.7
  • App version 0.66.0
  • GKE 1.22
  • Helm v3.5.0

Steps to reproduce:

  1. vim test.yaml
fullnameOverride: "otel-collector-ds"

mode: daemonset

presets:
  hostMetrics:
    enabled: true

  2. helm install otel-collector open-telemetry/opentelemetry-collector -f test.yaml
  3. otel-collector pods CrashLoopBackOff
  4. check pod logs
Error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "logs": failed to get total memory, use fixed memory settings (limit_mib): open /hostfs/var/lib/docker/overlay2/32e749532f5c2cb81779c4079637174f93b3a18c3220638fa7a43078d0360859/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied
2022-12-07 09:15:09.499024 I | collector server run finished with error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "logs": failed to get total memory, use fixed memory settings (limit_mib): open /hostfs/var/lib/docker/overlay2/32e749532f5c2cb81779c4079637174f93b3a18c3220638fa7a43078d0360859/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied

The workaround is to set a fixed value for memory_limiter's limit_mib:

config:
  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 1600
      spike_limit_mib: 500

Can you help take a look at this issue?

Thanks

@TylerHelmuth
Member

@wadexu007 it appears your collector doesn't have permission to see how much memory it has been allocated. I've not encountered this error before; is there anything else in your environment you can share that would cause a container to have restricted permissions? @dmitryax have you seen this error before?

I don't see any known issues with the collector not being able to access this information, but if we can find the root cause of the permission issue, and it turns out to be valid, then we may need to consider reverting to hard-coded values by default. cc @puckpuck

@TylerHelmuth TylerHelmuth added bug Something isn't working chart:collector Issue related to opentelemetry-collector helm chart labels Dec 7, 2022
@puckpuck
Contributor

puckpuck commented Dec 7, 2022

Please share details about the K8s environment this was deployed to

@wadexu007
Author

wadexu007 commented Dec 8, 2022

Please share details about the K8s environment this was deployed to

@puckpuck
It's Google Kubernetes Engine 1.22

@TylerHelmuth
Member

@wadexu007 does the issue happen on k8s 1.23+?

@povilasv
Contributor

povilasv commented Dec 8, 2022

One of my colleagues mentioned a similar issue; we are running 1.23.8. I didn't dig into this further, but the resolution for now was to set limit_mib and spike_limit_mib.

@wadexu007
Author

The permission issue could be in the node's operating system configuration: GKE's host nodes run a version of Container-Optimized OS, which comes with a lot of security hardening by default.
https://cloud.google.com/container-optimized-os/docs/concepts/security#security-hardened_kernel

@a-thaler
Contributor

I just ran into the same problem on a Gardener cluster with k8s 1.23.13.
I was using the helm chart successfully to deploy my otel-collector as a daemonset until I enabled the hostMetrics preset. The exception above shows the memory limiter reading from /hostfs/... Exactly that folder is mounted by the hostmetrics preset, and it points at the root of the host filesystem, so the memoryLimiter is not reading the container's own filesystem but the root folder of the mounted host.

I assume the hostmetrics preset should mount the host's root folder into a custom directory and then configure the receiver's root_path to point at it, to stay isolated from other functionality.

@TylerHelmuth If you think that is going in the right direction, I could give it a try and work on a PR.
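
Roughly the idea, sketched with the chart's extraVolumes/extraVolumeMounts values and the hostmetrics receiver's root_path option (the mount path name here is only an example, not the final preset wiring):

mode: daemonset

extraVolumes:
  - name: host-root
    hostPath:
      path: /
extraVolumeMounts:
  - name: host-root
    # example mount point instead of the preset's /hostfs
    mountPath: /host-root
    readOnly: true

config:
  receivers:
    hostmetrics:
      # point the receiver at the dedicated mount rather than the container's root
      root_path: /host-root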

@a-thaler
Contributor

a-thaler commented Dec 15, 2022

I changed the folder name where the root filesystem for hostMetrics gets mounted to host-metrics-root, but the memoryLimiter still tries to access it:

2022-12-15T21:00:38.600Z	debug	components/components.go:28	Beta component. May change in the future.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "stability": "Beta"}
Error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "logs": failed to get total memory, use fixed memory settings (limit_mib): open /host-metrics-root/run/containerd/io.containerd.runtime.v2.task/k8s.io/8e4219dc410b288b2976fd082b65d80037827e4fdd0317e2c08577621990a467/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied
2022/12/15 21:00:38 collector server run finished with error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "logs": failed to get total memory, use fixed memory settings (limit_mib): open /host-metrics-root/run/containerd/io.containerd.runtime.v2.task/k8s.io/8e4219dc410b288b2976fd082b65d80037827e4fdd0317e2c08577621990a467/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied

When the memoryLimiter is disabled, an extension seems to run into the same problem:

Error: failed to start extensions: open /host-metrics-root/run/containerd/io.containerd.runtime.v2.task/k8s.io/f7cfaf66715b7cc28f776c7a448fe6597909815482c402cfe5c83295b05b1aad/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied
2022/12/15 20:29:16 collector server run finished with error: failed to start extensions: open /host-metrics-root/run/containerd/io.containerd.runtime.v2.task/k8s.io/f7cfaf66715b7cc28f776c7a448fe6597909815482c402cfe5c83295b05b1aad/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied

@TylerHelmuth
Member

@povilasv did you run into this issue when updating the hostmetrics preset recently?

@a-thaler
Contributor

a-thaler commented Dec 15, 2022

It seems the switch from chart version 0.39.2 to 0.39.3 of the opentelemetry-collector chart is what breaks it:
opentelemetry-collector-0.39.2...opentelemetry-collector-0.39.3

So the change by @puckpuck to use percentage-based limiting is probably what surfaces the symptoms: #513

@TylerHelmuth
Member

@a-thaler the problem is definitely caused by trying to use percentages, because that tells the memory limiter to do some lookups, and it's the lookups that are failing. I am surprised it is not working as expected, though; I get no errors locally in kind.

For now the workaround is to configure the memory_limiter processor in .Values.config with hardcoded values.
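
Something along the lines of the snippet the issue author posted above, e.g. in the chart values (the numbers are only examples):

config:
  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 1600
      spike_limit_mib: 500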

@puckpuck
Contributor

I'm confused why this only happens when the hostMetrics preset is used, since the change in #513 doesn't touch that preset.

@povilasv
Contributor

povilasv commented Dec 16, 2022

@povilasv did you run into this issue when updating the hostmetrics preset recently?

Nope, we had this error prior to the hostMetrics change, and we didn't change any mounting logic in that PR; we only removed some envvars: https://github.com/open-telemetry/opentelemetry-helm-charts/pull/549/files#diff-d3c8687b50b2f7b2ca10ff878367b16b76ead0cfdf62548091b5bcc507dc2d68

Maybe those envvars had some impact?

@povilasv
Contributor

povilasv commented Dec 16, 2022

I did some more digging and I think I have a theory. I believe the issue is in the memory limiter code; basically, what happens is:

The memory limiter reads /proc/self/mountinfo to find the cgroups mount. So if there are multiple cgroups mounts, it might not work reliably? I.e., in the case of hostMetrics, it might read the one mounted from the host?

Call path:

  1. https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/memorylimiter.go#L145-L148
  2. https://github.com/open-telemetry/opentelemetry-collector/blob/main/internal/iruntime/total_memory_linux.go#L44-L52
  3. https://github.com/open-telemetry/opentelemetry-collector/blob/e63aed8fefe5e02249894960f050c5de4baa0350/internal/cgroups/cgroups.go#L121
  4. https://github.com/open-telemetry/opentelemetry-collector/blob/e63aed8fefe5e02249894960f050c5de4baa0350/internal/cgroups/cgroups.go#L86
  5. https://github.com/open-telemetry/opentelemetry-collector/blob/e63aed8fefe5e02249894960f050c5de4baa0350/internal/cgroups/subsys.go#L95-L120

Based on this, the issue affects only folks using cgroups v1, and at the moment the workaround is to set MemoryLimitMB.

Next steps:

  • To verify this, I can run a container with a host mount and check /proc/self/mountinfo
  • Copy the data and add a unit test in the collector to check if it fails
  • Figure out how to adapt the code (this is the hard part)

@TylerHelmuth
Member

@povilasv and @a-thaler thanks for digging in.

I am also curious why it isn't a consistent issue. I am able to use the memory_limiter with the hostmetrics preset without issue.

@povilasv
Contributor

povilasv commented Dec 16, 2022

@TylerHelmuth I believe you are on cgroup v2 (not v1), which does not involve this parsing of mounts and should have no problems. Could you check?

@a-thaler
Contributor

The workaround of configuring fixed sizes for the memory-limiter and the ballast extension works fine.
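
A rough sketch of what such a fixed-size configuration can look like in the chart values (the numbers are only examples; size_mib belongs to the memory_ballast extension):

config:
  extensions:
    memory_ballast:
      size_mib: 600
  processors:
    memory_limiter:
      check_interval: 5s
      limit_mib: 1600
      spike_limit_mib: 500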

/sys/fs/cgroup records where the cgroups v2 files are located, and that is what the limiter reads. By re-mounting the host's root filesystem, does the original path taken from that file become a soft link that the limiter has no permission to read?
Is it maybe required to explicitly mount all subdirectories individually? Just guessing around.

@TylerHelmuth
Member

TylerHelmuth commented Dec 16, 2022

@povilasv you're right, I've got cgroupv2

@jayasai470

Got the same issue in our k8s deployments; not sure if it's cgroups v1 or v2.

Error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "traces": failed to get total memory, use fixed memory settings (limit_mib): open /hostfs/run/containerd/io.containerd.runtime.v2.task/k8s.io/76210864068d6c8ccefae3f68379c0e9d884dd770fcdcba366dd5f1c5777d73d/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied
2023-01-20 15:44:16.050642 I | collector server run finished with error: cannot build pipelines: failed to create "memory_limiter" processor, in pipeline "traces": failed to get total memory, use fixed memory settings (limit_mib): open /hostfs/run/containerd/io.containerd.runtime.v2.task/k8s.io/76210864068d6c8ccefae3f68379c0e9d884dd770fcdcba366dd5f1c5777d73d/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied

Also used the static config of memory_limiter with limit_mib, but got an error like the one below:

Error: failed to start extensions: open /hostfs/run/containerd/io.containerd.runtime.v2.task/k8s.io/5ab97d724bff8ff6c8a82bc5b204d46d8fe45eb6880ccf6a82cc11bba130ae36/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied; failed to shutdown pipelines: no existing monitoring routine is running; no existing monitoring routine is running
2023/01/20 16:39:17 collector server run finished with error: failed to start extensions: open /hostfs/run/containerd/io.containerd.runtime.v2.task/k8s.io/5ab97d724bff8ff6c8a82bc5b204d46d8fe45eb6880ccf6a82cc11bba130ae36/rootfs/sys/fs/cgroup/memory/memory.limit_in_bytes: permission denied; failed to shutdown pipelines: no existing monitoring routine is running; no existing monitoring routine is running

We are running on:

EKS 1.24
containerd

@povilasv
Contributor

povilasv commented Jan 22, 2023

It seems that the bug is in the opentelemetry-collector repository and we are still waiting for maintainers to take a look and give more feedback -> open-telemetry/opentelemetry-collector#6825

@ckt114

ckt114 commented Feb 3, 2023

I use the same config and deployed the collector as a StatefulSet, and this error doesn't happen; it does throw this error when running as a DaemonSet.

@povilasv
Contributor

povilasv commented Feb 3, 2023

Please assign this to me for now; there is some work/discussion going on in open-telemetry/opentelemetry-collector#6825 :)

@povilasv
Contributor

povilasv commented Feb 23, 2023

The newest release should have the fix, so this should be solved by #655.

@povilasv
Contributor

povilasv commented Mar 2, 2023

Maybe someone can test the newest chart and check if the issue is gone?

@wadexu007 or @a-thaler maybe?

@a-thaler
Contributor

a-thaler commented Mar 3, 2023

@povilasv will test it today and will provide feedback

@a-thaler
Contributor

a-thaler commented Mar 3, 2023

@povilasv I can confirm that the issue is solved. With chart version 0.49.1 the collector starts properly using the hostMetrics preset in combination with percentage-based memory_limiter settings.

@povilasv
Contributor

povilasv commented Mar 3, 2023

Awesome, thanks for verifying. Closing the issue :)

@povilasv povilasv closed this as completed Mar 3, 2023