Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce resource usage #103

Merged
merged 4 commits into from
Nov 9, 2023
Merged

reduce resource usage #103

merged 4 commits into from
Nov 9, 2023

Conversation

florianl
Copy link
Member

What does this PR do?

Reduce resource usage by preallocating memory and using caches.

Why is it important?

elastic/elastic-agent-system-metrics is a direct dependency of beats like metricbeat.

20231026-095255

As shown in the above screenshot, metricbeat spends most of its time in elastic/elastic-agent-system-metrics doing syscalls to walk directories.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md

@florianl florianl added the enhancement New feature or request label Oct 26, 2023
@florianl florianl requested a review from a team as a code owner October 26, 2023 08:48
@florianl florianl requested review from ycombinator and belimawr and removed request for a team October 26, 2023 08:48
@florianl
Copy link
Member Author

The impact of this proposed change is shown in the following screenshot:

20231026-130827

The "regular" metricbeat (left side) represents the current status that is shipped with https://snapshots.elastic.co/8.12.0-f9668451/summary-8.12.0-SNAPSHOT.html. On the right sight of the screenshot the proposed change with me metricbeat-own - notice here the less purple vmlinux frames. This translates directly to less syscalls and therefore less CPU usage of metricbeat-own.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Oct 26, 2023
@florianl
Copy link
Member Author

florianl commented Oct 26, 2023

Differential function view comparing metricbeat (as of https://snapshots.elastic.co/8.12.0-f9668451/summary-8.12.0-SNAPSHOT.html) with metricbeat-own (metricbeat with this patch applied):
20231026-142318

kruskall
kruskall previously approved these changes Oct 26, 2023
metric/system/cgroup/util.go Outdated Show resolved Hide resolved
@kruskall kruskall dismissed their stale review October 26, 2023 20:46

approved by mistake

Copy link
Contributor

@belimawr belimawr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'm not an expert in the system metrics, while I do not see any issue in caching the information for 5min, I'd like to have a second option.

@fearful-symmetry do you see any issues with caching the cgroup paths for 5min?

@fearful-symmetry
Copy link
Contributor

caching the paths seems reasonable, I think? Particularly for V2, when we're just reading everything out of the same unified hierarchy.

Comment on lines 282 to 293
cacheEntry, ok := tmp.(pathListWithTime)
if ok {
// If the cached entry for controllerPath is not older than 5 minutes,
// return the cached entry.
if time.Since(cacheEntry.added) < time.Duration(5*time.Minute) {
cPaths.V2 = cacheEntry.pathList.V2
continue
}
}
// Consider the existing entry for controllerPath invalid. The entry can
// (1) not be casted to pathListWithTime or (2) is older than 5 minutes.
r.v2ControllerPathCache.Delete(controllerPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a sync.Map why not use a struct wrapping regular map and a mutex? This would give us better type safety and we wouldn't need the type assertion in this code here. It should also make the logic simpler here, as it would be just about whether the cache entry has expired or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did introduce the synchronization mechanism as it is not guaranteed that Reader is only used in a sequential order. And as multiple Go routine can use Reader and a regular map is not Go thread safe, I went with sync.Map. I did choose sync.Map over a struct with a regular map and a mutex, as it is harder to abuse and therefore more safe. While I agree on the added complexity around sync.Map handling, it prevents direct accessing a map in a struct, without holding the sync mutex.
Is this ok for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally get the need for a synchronization here! It's just that the documentation of sync.Map itself doesn't recommend using it over a regular map with a mutex, citing type safety as one of the reasons :)

https://pkg.go.dev/sync#Map

The Map type is specialized. Most code should use a plain Go map instead, with separate locking or coordination, for better type safety and to make it easier to maintain other invariants along with the map content.

I also searched through the existing elastic-agent-system-metrics and elastic-agent codebases and I'm not seeing any uses of sync.Map, probably for this same reason. So I'm hesitant to break that pattern here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While sync.Map is not used in the two named repositories, other beats and Elastic projects are using it - https://github.com/search?type=code&q=org%3Aelastic+lang%3Ago+sync.Map

I updated the PR to keep the consistency of elastic-agent-system-metrics

Reduce GC load by preallocating memory.

Signed-off-by: Florian Lehner <[email protected]>
As multiple PIDs can use the same v2 cgroup, cache the result to reduce CPU usage and number of syscalls.
Make sure v2 entries do not stay forever in the cache.

Signed-off-by: Florian Lehner <[email protected]>
@florianl florianl merged commit 7e588f5 into main Nov 9, 2023
1 of 2 checks passed
belimawr added a commit that referenced this pull request Nov 16, 2023
florianl added a commit that referenced this pull request Dec 21, 2023
## What does this PR do?

Reintroduce the improvements from
#103. This
PR got reverted with
#113 because
of #109.
The issue, that got reported in
#109, was
fixed with
#116.

So its time to bring back the performance improvements by reintroducing
the cache for cgroup v2.

## Why is it important?

<!-- Mandatory
Explain here the WHY, or the rationale/motivation for the changes.
-->

## Checklist

<!-- Mandatory
Add a checklist of things that are required to be reviewed in order to
have the PR approved

List here all the items you have verified BEFORE sending this PR. Please
DO NOT remove any item, striking through those that do not apply. (Just
in case, strikethrough uses two tildes. ~~Scratch this.~~)
-->

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] ~~I have added tests that prove my fix is effective or that my
feature works~~
- [ ] ~~I have added an entry in `CHANGELOG.md`~~

---------

Signed-off-by: Florian Lehner <[email protected]>
@codefromthecrypt codefromthecrypt deleted the flo-optimyze branch October 3, 2024 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants