reduce resource usage #103

florianl · 2023-10-26T08:48:04Z

What does this PR do?

Reduce resource usage by preallocating memory and using caches.

Why is it important?

elastic/elastic-agent-system-metrics is a direct dependency of beats like metricbeat.

As shown in the above screenshot, metricbeat spends most of its time in elastic/elastic-agent-system-metrics doing syscalls to walk directories.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have added tests that prove my fix is effective or that my feature works~~
~~I have added an entry in CHANGELOG.md~~
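
The preallocation mentioned in the description can be illustrated with a minimal sketch. This is not code from the PR: `collectNames` and the sample map are hypothetical, and the only point is that reserving capacity up front lets `append` avoid growing the backing array, which should reduce allocations and GC work.

```go
package main

import "fmt"

// collectNames is a hypothetical helper, not code from this PR. It copies
// the keys of a map into a slice whose capacity is reserved up front, so
// append never has to grow the backing array.
func collectNames(entries map[string]int) []string {
	// Length 0, capacity len(entries): one allocation for the whole loop.
	names := make([]string, 0, len(entries))
	for name := range entries {
		names = append(names, name)
	}
	return names
}

func main() {
	names := collectNames(map[string]int{"cpu": 1, "memory": 2, "io": 3})
	fmt.Println(len(names), cap(names)) // prints "3 3"
}
```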

florianl · 2023-10-26T11:16:05Z

The impact of this proposed change is shown in the following screenshot:

The "regular" metricbeat (left side) represents the current status that is shipped with https://snapshots.elastic.co/8.12.0-f9668451/summary-8.12.0-SNAPSHOT.html. The right side of the screenshot shows metricbeat-own, built with the proposed change; note that it has fewer purple vmlinux frames. This translates directly into fewer syscalls and therefore lower CPU usage for metricbeat-own.

florianl · 2023-10-26T12:25:14Z

Differential function view comparing metricbeat (as of https://snapshots.elastic.co/8.12.0-f9668451/summary-8.12.0-SNAPSHOT.html) with metricbeat-own (metricbeat with this patch applied):

metric/system/cgroup/util.go

approved by mistake

belimawr

LGTM, but I'm not an expert in the system metrics. While I do not see any issue in caching the information for 5min, I'd like to have a second opinion.

@fearful-symmetry do you see any issues with caching the cgroup paths for 5min?

fearful-symmetry · 2023-10-31T16:38:06Z

caching the paths seems reasonable, I think? Particularly for V2, when we're just reading everything out of the same unified hierarchy.

ycombinator · 2023-11-03T00:52:22Z

metric/system/cgroup/util.go

+				cacheEntry, ok := tmp.(pathListWithTime)
+				if ok {
+					// If the cached entry for controllerPath is not older than 5 minutes,
+					// return the cached entry.
+					if time.Since(cacheEntry.added) < time.Duration(5*time.Minute) {
+						cPaths.V2 = cacheEntry.pathList.V2
+						continue
+					}
+				}
+				// Consider the existing entry for controllerPath invalid. The entry can
+				// (1) not be casted to pathListWithTime or (2) is older than 5 minutes.
+				r.v2ControllerPathCache.Delete(controllerPath)


Instead of using a sync.Map, why not use a struct wrapping a regular map and a mutex? This would give us better type safety and we wouldn't need the type assertion in this code. It should also make the logic simpler, as it would just be about whether the cache entry has expired or not.

I introduced the synchronization mechanism because it is not guaranteed that Reader is only used sequentially. Since multiple goroutines can use Reader and a regular Go map is not safe for concurrent use, I went with sync.Map. I chose sync.Map over a struct with a regular map and a mutex because it is harder to misuse and therefore safer. While I agree that sync.Map adds complexity, it prevents direct access to a map in a struct without holding the mutex.
Is this ok for you?

Totally get the need for synchronization here! It's just that the documentation of sync.Map itself doesn't recommend using it over a regular map with a mutex, citing type safety as one of the reasons :)

https://pkg.go.dev/sync#Map

The Map type is specialized. Most code should use a plain Go map instead, with separate locking or coordination, for better type safety and to make it easier to maintain other invariants along with the map content.

I also searched through the existing elastic-agent-system-metrics and elastic-agent codebases and I'm not seeing any uses of sync.Map, probably for this same reason. So I'm hesitant to break that pattern here.

While sync.Map is not used in the two named repositories, other beats and Elastic projects are using it - https://github.com/search?type=code&q=org%3Aelastic+lang%3Ago+sync.Map

I updated the PR to keep the consistency of elastic-agent-system-metrics

## What does this PR do?

Reintroduce the improvements from #103. That PR was reverted by #113 because of #109. The issue reported in #109 was fixed by #116, so it's time to bring back the performance improvements by reintroducing the cache for cgroup v2.

## Why is it important?

## Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] ~~I have added tests that prove my fix is effective or that my feature works~~
- [ ] ~~I have added an entry in `CHANGELOG.md`~~

---------

Signed-off-by: Florian Lehner <[email protected]>

florianl added the enhancement New feature or request label Oct 26, 2023

florianl requested a review from a team as a code owner October 26, 2023 08:48

florianl requested review from ycombinator and belimawr and removed request for a team October 26, 2023 08:48

pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Oct 26, 2023

pierrehilbert requested a review from fearful-symmetry October 26, 2023 11:48

kruskall previously approved these changes Oct 26, 2023

florianl force-pushed the flo-optimyze branch from 0cb8e6e to 80dde66 Compare October 31, 2023 08:34

kruskall approved these changes Oct 31, 2023

belimawr reviewed Oct 31, 2023

belimawr approved these changes Oct 31, 2023

ycombinator reviewed Nov 3, 2023

florianl added 4 commits November 6, 2023 13:28

Go: preallocate memory

ff5dfc9

Reduce GC load by preallocating memory. Signed-off-by: Florian Lehner <[email protected]>

metric/system/cgroup: add cache for v2 entries

7ac91fe

As multiple PIDs can use the same v2 cgroup, cache the result to reduce CPU usage and number of syscalls.

metric/system/cgroup: add timestamp to cache for v2 entries

9cb7cc2

Make sure v2 entries do not stay forever in the cache. Signed-off-by: Florian Lehner <[email protected]>

replace sync.Map with custom struct with sync.RWMutex

210c199

Signed-off-by: Florian Lehner <[email protected]>

florianl force-pushed the flo-optimyze branch from de8db9b to 210c199 Compare November 6, 2023 12:48

florianl merged commit 7e588f5 into main Nov 9, 2023
1 of 2 checks passed

leehinman mentioned this pull request Nov 16, 2023

broken on darwin & windows #109

Closed

belimawr mentioned this pull request Nov 16, 2023

Run unit-tests on CI and block PRs if tests fails #110

Closed

belimawr added a commit that referenced this pull request Nov 16, 2023

Revert "reduce resource usage (#103)"

792e231

This reverts commit 7e588f5.

belimawr mentioned this pull request Nov 16, 2023

Revert "reduce resource usage" #113

Merged

florianl mentioned this pull request Nov 24, 2023

metric/system/cgroup: introduce cache for cgroup v2 #117

Merged

4 tasks

codefromthecrypt deleted the flo-optimyze branch October 3, 2024 00:25

