cgroups not accounting for relative paths that happen when reading from a container mount #132

Closed
fearful-symmetry opened this issue Mar 8, 2024 · 2 comments · Fixed by #136
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team

Comments

@fearful-symmetry
Contributor

The /proc/[pid]/cgroup file lists the paths of the cgroups that the process belongs to. Most importantly, each path is relative to the cgroup root of the process reading the file, not to a fixed global root. From the man page:

              [3]  This field contains the pathname of the control group
                   in the hierarchy to which the process belongs.  This
                   pathname is relative to the mount point of the
                   hierarchy.

This means that if you read the cgroup file of a host process from inside a container (for example, through a /hostfs mount of the host's /proc), you get a relative path:

docker exec -it 5b938d5468d3 /bin/bash
metricbeat@motmot:~$ cat /hostfs/proc/114856/cgroup
0::/../elastic-agent.service
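
To make the shape of that line concrete, here is a minimal Go sketch that splits a cgroup v2 entry into its three colon-separated fields and flags a path that climbs above the reader's cgroup root. `parseCgroupLine` and `escapesRoot` are hypothetical helpers written for illustration; they are not functions from this library.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupEntry holds the three colon-separated fields of one line of
// /proc/[pid]/cgroup: hierarchy ID, controller list, and cgroup path.
type cgroupEntry struct {
	HierarchyID string
	Controllers string
	Path        string
}

// parseCgroupLine is a hypothetical helper (not part of this library).
// It splits on the first two colons only, so the path field is kept whole.
func parseCgroupLine(line string) (cgroupEntry, error) {
	fields := strings.SplitN(strings.TrimSpace(line), ":", 3)
	if len(fields) != 3 {
		return cgroupEntry{}, fmt.Errorf("malformed cgroup line: %q", line)
	}
	return cgroupEntry{
		HierarchyID: fields[0],
		Controllers: fields[1],
		Path:        fields[2],
	}, nil
}

// escapesRoot reports whether the path starts by climbing above the
// cgroup root of the process that read the file.
func escapesRoot(cgroupPath string) bool {
	return cgroupPath == "/.." || strings.HasPrefix(cgroupPath, "/../")
}

func main() {
	entry, err := parseCgroupLine("0::/../elastic-agent.service")
	if err != nil {
		panic(err)
	}
	fmt.Printf("path=%s escapesRoot=%t\n", entry.Path, escapesRoot(entry.Path))
}
```

For the line shown above, the controller field is empty (as is normal for cgroup v2) and the path begins with `/..`, which is the case the issue is about.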

Right now this library doesn't seem to cope with this, as it assumes a universal base path:

{"log.level":"debug","@timestamp":"2024-03-08T20:32:58.467Z","log.logger":"processes","log.origin":{"file.name":"process/process.go","file.line":173},"message":"Error fetching PID info for 1023161, skipping: cgroups.GetStatsForPid: error fetching cgroupV2 controllers for cgroup location '/hostfs/sys/fs/cgroup' and path line '0::/../../user.slice/user-1000.slice/session-212.scope': open /hostfs/sys/user.slice/user-1000.slice/session-212.scope: no such file or directory","service.name":"metricbeat","ecs.version":"1.6.0"}
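
The path in that error is what you get from a plain join of the cgroup mount location and the path field, because the `..` segments climb out of the mount point during path cleaning. The Go sketch below reproduces that arithmetic with the values from the log line. It is an illustration consistent with the error message, not a copy of the library's code; `naiveCgroupPath` is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
)

// naiveCgroupPath is a hypothetical helper that assumes every cgroup path
// can simply be appended to one universal base path.
func naiveCgroupPath(mountpoint, cgroupPath string) string {
	// path.Join cleans the result, so each ".." removes one directory
	// from the mount point itself.
	return path.Join(mountpoint, cgroupPath)
}

func main() {
	got := naiveCgroupPath(
		"/hostfs/sys/fs/cgroup",
		"/../../user.slice/user-1000.slice/session-212.scope",
	)
	fmt.Println(got)
	// Prints: /hostfs/sys/user.slice/user-1000.slice/session-212.scope
}
```

The two `..` segments strip `cgroup` and `fs` from the base, which yields the nonexistent `/hostfs/sys/user.slice/...` path reported above.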

I'm not sure why we didn't catch this before. Perhaps it takes a particular combination of docker versions, bugs and configs that most people haven't run into.

@fearful-symmetry
Contributor Author

fearful-symmetry commented Mar 20, 2024

A temporary workaround is to set --cgroupns host, but that obviously has some downsides.

https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html

@pierrehilbert pierrehilbert added the bug Something isn't working label Mar 25, 2024
@pierrehilbert

Blocked, as elastic/beats#38241 is required first

fearful-symmetry added a commit that referenced this issue Apr 4, 2024
## What does this PR do?

This makes it so cgroup-specific errors don't fail an entire PID, so the
PID will still be reported even if we have cgroup errors.
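
The general pattern can be sketched in Go as follows. This is an assumption-laden illustration, not the code from this commit: `procInfo`, `cgroupStats`, `collect`, and `failingFetch` are all hypothetical names. The idea is that a cgroup failure is returned as a non-fatal warning alongside the process data instead of replacing it.

```go
package main

import (
	"errors"
	"fmt"
)

// cgroupStats and procInfo are hypothetical types for this sketch.
type cgroupStats struct {
	Path string
}

type procInfo struct {
	PID    int
	Name   string
	Cgroup *cgroupStats // nil when cgroup data could not be read
}

// failingFetch stands in for a cgroup lookup that hits the bug in #132.
func failingFetch(pid int) (*cgroupStats, error) {
	return nil, errors.New("no such file or directory")
}

// collect always returns the process info. A cgroup failure is reported
// as a warning next to the result rather than aborting the whole PID.
func collect(pid int, name string, fetch func(int) (*cgroupStats, error)) (procInfo, error) {
	info := procInfo{PID: pid, Name: name}
	stats, err := fetch(pid)
	if err != nil {
		return info, fmt.Errorf("cgroup metrics unavailable for PID %d: %w", pid, err)
	}
	info.Cgroup = stats
	return info, nil
}

func main() {
	info, warn := collect(1023161, "bash", failingFetch)
	if warn != nil {
		fmt.Println("warning:", warn)
	}
	fmt.Printf("still reporting PID %d (cgroup data present: %t)\n", info.PID, info.Cgroup != nil)
}
```

The caller decides how loudly to log the warning; the process entry itself is never dropped.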

## Why is it important?

This is a sort of band-aid for #132. I'm not sure how long a "proper"
fix for that will take, as it might require significant changes to the
cgroup client, as well as a considerable degree of testing.

## Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added an entry in `CHANGELOG.md`

---------

Co-authored-by: Lee E Hinman <[email protected]>
fearful-symmetry added a commit that referenced this issue Apr 18, 2024
…tainer (#140)

## What does this PR do?
Closes elastic/beats#38241

This adds a lightweight test framework that runs a set of system tests
under a container with the goal of monitoring the host system. The goal
with these tests is to catch the numerous edge cases that happen when
the system metrics function from a `/hostfs` path inside a container.

The tests have a fairly large matrix of configurations, as we need to
test both a wide variety of container permission settings and the
differences in how linux distros configure cgroups.

The framework here was designed with the goal of being relatively
idiomatic; you can just run the framework with `go test` as you would
normally.

You can run the tests yourself with `go test -v ./tests`

As you may have noticed, there are a number of TODO statements here.
These tests were built to trigger a bunch of existing bugs, so certain
parts of the tests will remain unimplemented until those bugs are
fixed.

## Why is it important?

See elastic/beats#38241; we really need tests
for this particular case.

## List of bugs that are responsible for TODO statements in the tests:

- #141
- #135
- #139
- #132
- elastic/go-sysinfo#12

## Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added an entry in `CHANGELOG.md`
fearful-symmetry added a commit that referenced this issue May 1, 2024
## What does this PR do?

Fixes #132 and fixes #139

This sanitizes the "relative" namespace mounts that we get when we
monitor the host system from within newer versions of docker.
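
One way to picture the sanitizing step is the Go sketch below. It is a simplified assumption, not the implementation in this PR: `stripParentRefs` and `resolve` are hypothetical names, the sketch drops every `..` segment rather than reasoning about namespaces, and it assumes the cgroup mount under `/hostfs` is the host's cgroup root.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// stripParentRefs is a hypothetical helper that removes "..", "." and
// empty segments from a cgroup path, leaving a path anchored at "/".
func stripParentRefs(cgroupPath string) string {
	segments := strings.Split(cgroupPath, "/")
	kept := make([]string, 0, len(segments))
	for _, s := range segments {
		if s == "" || s == "." || s == ".." {
			continue
		}
		kept = append(kept, s)
	}
	return "/" + strings.Join(kept, "/")
}

// resolve joins the sanitized path onto the cgroup v2 root path, so the
// result can no longer climb out of that root.
func resolve(root, cgroupPath string) string {
	return path.Join(root, stripParentRefs(cgroupPath))
}

func main() {
	fmt.Println(resolve(
		"/hostfs/sys/fs/cgroup",
		"/../../user.slice/user-1000.slice/session-212.scope",
	))
	// Prints: /hostfs/sys/fs/cgroup/user.slice/user-1000.slice/session-212.scope
}
```

Whether the resolved path exists still depends on which part of the host hierarchy is mounted at the chosen root, which is why the root path selection matters as much as the sanitizing.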

This also adds much more complex logic for setting the v2 rootpath we
use for fetching metrics from v2 cgroups paths.

This also aggressively cleans up the unit tests in `cgroup`, because
they were a mess.

With this change, cgroups metric collection works properly on docker
configs where the container is using a private cgroup namespace. The
cgroup namespace mode is configurable from docker API version 1.41+.

## Checklist


- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added an entry in `CHANGELOG.md`


## How to test this PR

- run `go test ./tests` from the repo root
- build metricbeat with this version of `elastic-agent-system-metrics`,
run `system/process` metrics.