Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[memorylimiter] Fix cgroups parsing #7145

Conversation

povilasv
Copy link
Contributor

@povilasv povilasv commented Feb 7, 2023

#6825 was accidentally merged and reverted, this PR reverts the revert commit -> #7142

There are more discussion in the original PR #6825

Important (read before submitting)

Description:

Fixing Memory limiter bug, when container mount's hosts sys filesystem, cgroups might incorrectly parse current container's memory.

The only way I figure we can fix this is to limit the search to /sys ? But I'm not sure if there are kernel's that would mount this in a different place?

Link to tracking Issue: #6826
and open-telemetry/opentelemetry-helm-charts#543

Testing:

  • Added unit tests

Documentation: < Describe the documentation added.>

@povilasv povilasv changed the title Revert "Revert "[memorylimiter] Fix cgroups parsing"" [memorylimiter] Fix cgroups parsing Feb 7, 2023
@povilasv povilasv marked this pull request as ready for review February 7, 2023 07:21
@povilasv povilasv requested review from a team and djaglowski February 7, 2023 07:21
@povilasv
Copy link
Contributor Author

povilasv commented Feb 7, 2023

@jpkrohling let's continue the conversation from #6825 (comment) here

So in this case, unit test fails without the fix for you, as per comment #6825 (comment)


        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-/sys/fs/cgroup/memory/large
        	            	+/var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory/large
        	Test:       	TestNewCGroups

Basically to get it's memory info the program is trying to read /var/lib/docker/... (hostmounted container's memory info) instead of /sys/fs/cgroup/memory/large

So it seems everything is okay and the fix is working? Or do you have any other issues?

@codecov
Copy link

codecov bot commented Feb 7, 2023

Codecov Report

Base: 90.84% // Head: 90.84% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (16970fa) compared to base (e2a6cd7).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7145   +/-   ##
=======================================
  Coverage   90.84%   90.84%           
=======================================
  Files         293      293           
  Lines       14302    14304    +2     
=======================================
+ Hits        12993    12995    +2     
  Misses       1041     1041           
  Partials      268      268           
Impacted Files Coverage Δ
internal/cgroups/cgroups.go 92.00% <100.00%> (+0.21%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@jpkrohling
Copy link
Member

jpkrohling commented Feb 7, 2023

True, perhaps something changed when I moved to Fedora 37?

Actually, it does not fail with the main branch:

jpkroehling@caju ~/P/g/o/o/i/cgroups (main)> git describe
processor/memorylimiterprocessor/v0.70.0-76-g1ca481b8
jpkroehling@caju ~/P/g/o/o/i/cgroups (main)> go test -v -run TestNewCGroups
=== RUN   TestNewCGroups
--- PASS: TestNewCGroups (0.00s)
=== RUN   TestNewCGroupsWithErrors
--- PASS: TestNewCGroupsWithErrors (0.00s)
PASS
ok  	go.opentelemetry.io/collector/internal/cgroups	0.002s
jpkroehling@caju ~/P/g/o/o/i/cgroups (main)> pwd
/home/jpkroehling/Projects/github.com/open-telemetry/opentelemetry-collector/internal/cgroups

@jpkrohling
Copy link
Member

I re-ran the tests based on #7105, after rebasing on main and these are the results, which look different than before:

> go test -v -run TestNewCGroups
=== RUN   TestNewCGroups
cgroupsSubsystems map[cpu:0xc00009d470 cpuacct:0xc00009d470 cpuset:0xc00009d4d0 memory:0xc00009d410]
mountPoint &{6 5 0:5 / /sys/fs/cgroup/cpuset [rw nosuid nodev noexec relatime] [shared:6] cgroup cgroup [rw cpuset]}
sybsys cpuset
cgroupPath /sys/fs/cgroup/cpuset
mountPoint &{7 5 0:6 /docker /sys/fs/cgroup/cpu,cpuacct [rw nosuid nodev noexec relatime] [shared:7] cgroup cgroup [rw cpu cpuacct]}
sybsys cpu
cgroupPath /sys/fs/cgroup/cpu,cpuacct
sybsys cpuacct
cgroupPath /sys/fs/cgroup/cpu,cpuacct
mountPoint &{8 5 0:7 /docker /sys/fs/cgroup/memory [rw nosuid nodev noexec relatime] [shared:8] cgroup cgroup [rw memory]}
sybsys memory
cgroupPath /sys/fs/cgroup/memory/large
--- PASS: TestNewCGroups (0.00s)
=== RUN   TestNewCGroupsWithErrors
cgroupsSubsystems map[]
cgroupsSubsystems map[]
cgroupsSubsystems map[cpu:0xc00009d890 cpuacct:0xc00009d8f0]
mountPoint &{31 23 0:24 / /sys/fs/cgroup/cpu [rw nosuid nodev noexec relatime] [shared:1] cgroup cgroup [rw cpu]}
sybsys cpu
cgroupPath /sys/fs/cgroup/cpu/docker
mountPoint &{32 23 0:25 /docker/0123456789abcdef /sys/fs/cgroup/cpuacct [rw nosuid nodev noexec relatime] [shared:2] cgroup cgroup [rw cpuacct]}
sybsys cpuacct
--- PASS: TestNewCGroupsWithErrors (0.00s)
PASS
ok  	go.opentelemetry.io/collector/internal/cgroups	0.002s

@povilasv
Copy link
Contributor Author

povilasv commented Feb 7, 2023

mountPoint &{6 5 0:5 / /sys/fs/cgroup/cpuset [rw nosuid nodev noexec relatime] [shared:6] cgroup cgroup [rw cpuset]}
sybsys cpuset

I had the same thing when I rebased to the main branch. I checked and for some reason rebasing got rid of the change in testdata/proc/cgroups/mountinfo. @jpkrohling can you check if your testdata/proc/cgroups/mountinfo has:

9 1 0:8 / /var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
10 9 0:9 /docker /var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,memory

(https://github.com/open-telemetry/opentelemetry-collector/blob/e154c51c2b1b7dd0155b5447dc17a194644dc63d/internal/cgroups/testdata/proc/cgroups/mountinfo) this change?

Maybe the merged commit and reverted commit made git rebase work weirdly?

I rebased this #7105 to main and added testdata change. I believe it should fail consistently? Atleast does it for me and CI (https://github.com/open-telemetry/opentelemetry-collector/actions/runs/4115675943/jobs/7104796263#step:6:214)

Could you do another run of #7105?

@jpkrohling
Copy link
Member

Re-running #7105 got me this:

> go test -v -run TestNewCGroups
=== RUN   TestNewCGroups
cgroupsSubsystems map[cpu:0xc00009d470 cpuacct:0xc00009d470 cpuset:0xc00009d4d0 memory:0xc00009d410]
mountPoint &{6 5 0:5 / /sys/fs/cgroup/cpuset [rw nosuid nodev noexec relatime] [shared:6] cgroup cgroup [rw cpuset]}
sybsys cpuset
cgroupPath /sys/fs/cgroup/cpuset
mountPoint &{7 5 0:6 /docker /sys/fs/cgroup/cpu,cpuacct [rw nosuid nodev noexec relatime] [shared:7] cgroup cgroup [rw cpu cpuacct]}
sybsys cpu
cgroupPath /sys/fs/cgroup/cpu,cpuacct
sybsys cpuacct
cgroupPath /sys/fs/cgroup/cpu,cpuacct
mountPoint &{8 5 0:7 /docker /sys/fs/cgroup/memory [rw nosuid nodev noexec relatime] [shared:8] cgroup cgroup [rw memory]}
sybsys memory
cgroupPath /sys/fs/cgroup/memory/large
mountPoint &{10 9 0:9 /docker /var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory [ro nosuid nodev noexec relatime] [master:17] cgroup cgroup [rw memory]}
sybsys memory
cgroupPath /var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory/large
    cgroups_test.go:70: 
        	Error Trace:	/home/jpkroehling/Projects/github.com/open-telemetry/opentelemetry-collector/internal/cgroups/cgroups_test.go:70
        	Error:      	Not equal: 
        	            	expected: "/sys/fs/cgroup/memory/large"
        	            	actual  : "/var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory/large"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-/sys/fs/cgroup/memory/large
        	            	+/var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory/large
        	Test:       	TestNewCGroups
        	Messages:   	"/sys/fs/cgroup/memory/large" expected for `cgroups["memory"].path`, got "/var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup/memory/large"
--- FAIL: TestNewCGroups (0.00s)
=== RUN   TestNewCGroupsWithErrors
cgroupsSubsystems map[]
cgroupsSubsystems map[]
cgroupsSubsystems map[cpu:0xc00009da40 cpuacct:0xc00009daa0]
mountPoint &{31 23 0:24 / /sys/fs/cgroup/cpu [rw nosuid nodev noexec relatime] [shared:1] cgroup cgroup [rw cpu]}
sybsys cpu
cgroupPath /sys/fs/cgroup/cpu/docker
mountPoint &{32 23 0:25 /docker/0123456789abcdef /sys/fs/cgroup/cpuacct [rw nosuid nodev noexec relatime] [shared:2] cgroup cgroup [rw cpuacct]}
sybsys cpuacct
--- PASS: TestNewCGroupsWithErrors (0.00s)
FAIL
exit status 1
FAIL	go.opentelemetry.io/collector/internal/cgroups	0.002s

This is how #7105 looks like for me:

> git describe
processor/memorylimiterprocessor/v0.70.0-77-ge154c51c

@povilasv
Copy link
Contributor Author

povilasv commented Feb 7, 2023

So fails without fix as expected, and we seem to need to filter only the /sys values as I'm trying to introduce here -> https://github.com/open-telemetry/opentelemetry-collector/pull/7145/files#diff-afab5068f65749242d0fc18e8f475af84d21d54515b3422a6e4e3eda368e582cR107-R109

@jpkrohling do you have any concerns about this change? or something I can take a look / resolve?

@jpkrohling
Copy link
Member

jpkrohling commented Feb 7, 2023

So fails without fix as expected

It's not failing on main though:

> go test -v -run TestNewCGroups
=== RUN   TestNewCGroups
--- PASS: TestNewCGroups (0.00s)
=== RUN   TestNewCGroupsWithErrors
--- PASS: TestNewCGroupsWithErrors (0.00s)
PASS
ok  	go.opentelemetry.io/collector/internal/cgroups	0.002s

> git log -1
commit 1ca481b86e1856ad25dbe2598340cc9066d7f950 (HEAD -> main, upstream/main, origin/main, origin/HEAD)
Author: Alex Boten <[email protected]>
Date:   Mon Feb 6 12:56:10 2023 -0800

    [chore] update chlog update call (#7144)
    
    The previous change only modified the PR comment, instead of the call to `chlog-update` itself.
    
    Signed-off-by: Alex Boten <[email protected]>

I guess there's something I'm not fully understanding here? Would you like to do a live session to sort this out? Ping me on Slack and we can find a time that works for both of us.

@@ -6,3 +6,5 @@
6 5 0:5 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,cpuset
7 5 0:6 /docker /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,cpu,cpuacct
8 5 0:7 /docker /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,memory
9 1 0:8 / /var/lib/docker/overlay2/9054a95f2cf7296867089e1bd37931742a17eb3308a795d51adb2654ee2276df/merged/sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change I was looking for!

@povilasv
Copy link
Contributor Author

povilasv commented Feb 8, 2023

This is probably good to merge, as previous (incidentally merged and reverted) was Approved by three other folks #6825 :)

@jpkrohling
Copy link
Member

@codeboten, would you do the honors?

@bogdandrutu
Copy link
Member

Please rebase

Copy link
Contributor

@codeboten codeboten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase to unblock the 1.20 unit test check :)

@povilasv povilasv force-pushed the revert-7142-revert-6825-fix-memory-limiter branch from 47f73b7 to 16970fa Compare February 9, 2023 07:21
@povilasv
Copy link
Contributor Author

povilasv commented Feb 9, 2023

Done :)

@bogdandrutu bogdandrutu merged commit 2476ab9 into open-telemetry:main Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants