reporter: don't expire actively used executables #247

Merged: 4 commits, Nov 21, 2024
5 changes: 2 additions & 3 deletions go.mod
@@ -3,12 +3,12 @@ module go.opentelemetry.io/ebpf-profiler
go 1.22.2

require (
+ github.com/aws/aws-sdk-go-v2 v1.30.5
github.com/aws/aws-sdk-go-v2/config v1.27.35
- github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.17.21
github.com/aws/aws-sdk-go-v2/service/s3 v1.62.0
github.com/cespare/xxhash/v2 v2.3.0
github.com/cilium/ebpf v0.16.0
- github.com/elastic/go-freelru v0.15.0
+ github.com/elastic/go-freelru v0.16.0
github.com/elastic/go-perf v0.0.0-20241016160959-1342461adb4a
github.com/google/uuid v1.6.0
github.com/jsimonetti/rtnetlink v1.4.2
@@ -29,7 +29,6 @@ require (
)

require (
- github.com/aws/aws-sdk-go-v2 v1.30.5 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.4 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.17.33 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.13 // indirect
8 changes: 2 additions & 6 deletions go.sum
@@ -8,8 +8,6 @@ github.com/aws/aws-sdk-go-v2/credentials v1.17.33 h1:lBHAQQznENv0gLHAZ73ONiTSkCt
github.com/aws/aws-sdk-go-v2/credentials v1.17.33/go.mod h1:MBuqCUOT3ChfLuxNDGyra67eskx7ge9e3YKYBce7wpI=
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.13 h1:pfQ2sqNpMVK6xz2RbqLEL0GH87JOwSxPV2rzm8Zsb74=
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.13/go.mod h1:NG7RXPUlqfsCLLFfi0+IpKN4sCB9D9fw/qTaSB+xRoU=
- github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.17.21 h1:sV0doPPsRT7gMP0BnDPwSsysVTV/nKpB/nFmMnz8goE=
- github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.17.21/go.mod h1:ictvfJWqE2gkUFDRJVp5VU/TrytuzK88DYcpan7UYuA=
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.17 h1:pI7Bzt0BJtYA0N/JEC6B8fJ4RBrEMi1LBrkMdFYNSnQ=
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.17/go.mod h1:Dh5zzJYMtxfIjYW+/evjQ8uj2OyR/ve2KROHGHlSFqE=
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.17 h1:Mqr/V5gvrhA2gvgnF42Zh5iMiQNcOYthFYwCyrnuWlc=
@@ -43,10 +41,8 @@ github.com/cilium/ebpf v0.16.0/go.mod h1:L7u2Blt2jMM/vLAVgjxluxtBKlz3/GWjB0dMOEn
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
- github.com/elastic/go-freelru v0.13.0 h1:TKKY6yCfNNNky7Pj9xZAOEpBcdNgZJfihEftOb55omg=
- github.com/elastic/go-freelru v0.13.0/go.mod h1:bSdWT4M0lW79K8QbX6XY2heQYSCqD7THoYf82pT/H3I=
- github.com/elastic/go-freelru v0.15.0 h1:Jo1aY8JAvpyxbTDJEudrsBfjFDaALpfVv8mxuh9sfvI=
- github.com/elastic/go-freelru v0.15.0/go.mod h1:bSdWT4M0lW79K8QbX6XY2heQYSCqD7THoYf82pT/H3I=
+ github.com/elastic/go-freelru v0.16.0 h1:gG2HJ1WXN2tNl5/p40JS/l59HjvjRhjyAa+oFTRArYs=
+ github.com/elastic/go-freelru v0.16.0/go.mod h1:bSdWT4M0lW79K8QbX6XY2heQYSCqD7THoYf82pT/H3I=
github.com/elastic/go-perf v0.0.0-20241016160959-1342461adb4a h1:ymmtaN4bVCmKKeu4XEf6JEWNZKRXPMng1zjpKd+8rCU=
github.com/elastic/go-perf v0.0.0-20241016160959-1342461adb4a/go.mod h1:Nt+pnRYvf0POC+7pXsrv8ubsEOSsaipJP0zlz1Ms1RM=
github.com/go-quicktest/qt v1.101.0 h1:O1K29Txy5P2OK0dGo59b7b0LR6wKfIhttaAhHUyn7eI=
2 changes: 1 addition & 1 deletion main.go
@@ -124,7 +124,7 @@ func mainWithExitCode() exitCode {
GRPCStartupBackoffTime: intervals.GRPCStartupBackoffTime(),
GRPCConnectionTimeout: intervals.GRPCConnectionTimeout(),
ReportInterval: intervals.ReportInterval(),
- ExecutablesCacheElements: 4096,
+ ExecutablesCacheElements: 16384,
// Next step: Calculate FramesCacheElements from numCores and samplingRate.
FramesCacheElements: 65536,
CGroupCacheElements: 1024,
10 changes: 8 additions & 2 deletions reporter/otlp_reporter.go
@@ -29,6 +29,10 @@ import (
"go.opentelemetry.io/ebpf-profiler/libpf/xsync"
)

+ const (
+     executableCacheLifetime = 1 * time.Hour
+ )

// Assert that we implement the full Reporter interface.
var _ Reporter = (*OTLPReporter)(nil)

@@ -147,7 +151,7 @@ func NewOTLP(cfg *Config) (*OTLPReporter, error) {
if err != nil {
return nil, err
}
- executables.SetLifetime(1 * time.Hour) // Allow GC to clean stale items.
+ executables.SetLifetime(executableCacheLifetime) // Allow GC to clean stale items.

Member

Shouldn't we also increase the size of this cache? 4096 seems small, given that we still haven't solved the core issue here: items may be dropped from this cache (whether through time-based expiration or because the LRU is full), and there is no control or guarantee as to when they'll be re-inserted.

I'd set the cache size to 16384 until we really solve the problem.

Contributor

No strong opinion on the cache size (4096 seems quite large to me).
Ideally, we should have metrics for expiration and eviction, then get numbers from production systems.

We can also make the cache sizes and lifetimes configurable.

And we can create an LRU wrapper that automatically resizes the LRUs, as suggested at #248 (comment)

It may be better to continue this discussion at #244 (comment)

Contributor

We should also not forget that, currently, the reporter implementation in this repository is just for demo/example purposes.

Contributor Author

No strong opinion on my side either - went ahead and bumped it to 16384 in bf83645


frames, err := lru.NewSynced[libpf.FileID,
*xsync.RWMutex[map[libpf.AddressOrLineno]sourceInfo]](
@@ -574,7 +578,9 @@ func (r *OTLPReporter) getProfile() (profile *profiles.Profile, startTS, endTS u
fileIDtoMapping[traceInfo.files[i]] = idx
locationMappingIndex = idx

- execInfo, exists := r.executables.Get(traceInfo.files[i])
+ // Ensure that actively used executables do not expire.
+ execInfo, exists := r.executables.GetAndRefresh(traceInfo.files[i],
+     executableCacheLifetime)
Comment on lines -577 to +583
Contributor

What is the difference here? Doesn't Get automatically reset the TTL? If not, the other usages of the LRU should be reviewed, and this same change probably applies to other (potentially multiple) locations.

Contributor

Get updates the recentness of an item, which covers eviction when the LRU is full.
The lifetime is only set or updated by Add; Get does not touch it.
The lifetime is a concept that allows unused items to be purged from an LRU (via a call to PurgeExpired) to avoid long-term resource leaks.

GetAndRefresh updates both the recentness and the lifetime of an item.

FreeLRU does not actively purge expired items. For the reporter caches, we recently added a ticker that calls PurgeExpired regularly. For other caches, we don't have active purging.
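
The difference described above can be illustrated with a toy model. This is a minimal sketch, not FreeLRU's implementation: `toyLRU`, `simulate`, and the injected clock are hypothetical names, the cache is not thread-safe, and treating an expired item as a miss in `Get` is an assumption of this sketch. Only the method names `Add`, `Get`, `GetAndRefresh`, and `PurgeExpired` are taken from the discussion.

```go
package main

import (
	"container/list"
	"fmt"
	"time"
)

// entry is one cached item together with its expiry deadline.
type entry struct {
	key, value string
	expires    time.Time
}

// toyLRU is a simplified, non-thread-safe LRU with per-item lifetimes.
type toyLRU struct {
	capacity int
	lifetime time.Duration
	now      func() time.Time // injectable clock, so the demo needs no sleeping
	order    *list.List       // front = most recently used
	items    map[string]*list.Element
}

func newToyLRU(capacity int, lifetime time.Duration, now func() time.Time) *toyLRU {
	return &toyLRU{capacity: capacity, lifetime: lifetime, now: now,
		order: list.New(), items: make(map[string]*list.Element)}
}

// Add inserts or overwrites key and (re)sets its lifetime.
func (c *toyLRU) Add(key, value string) {
	if el, ok := c.items[key]; ok {
		c.order.Remove(el)
		delete(c.items, key)
	}
	// Evict the least recently used item when the cache is full.
	if oldest := c.order.Back(); oldest != nil && c.order.Len() >= c.capacity {
		delete(c.items, oldest.Value.(*entry).key)
		c.order.Remove(oldest)
	}
	e := &entry{key: key, value: value, expires: c.now().Add(c.lifetime)}
	c.items[key] = c.order.PushFront(e)
}

// Get marks the item as recently used but leaves its deadline untouched.
func (c *toyLRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok || !c.now().Before(el.Value.(*entry).expires) {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// GetAndRefresh behaves like Get and additionally restarts the lifetime.
func (c *toyLRU) GetAndRefresh(key string, lifetime time.Duration) (string, bool) {
	v, ok := c.Get(key)
	if ok {
		c.items[key].Value.(*entry).expires = c.now().Add(lifetime)
	}
	return v, ok
}

// PurgeExpired removes every item whose deadline has passed.
func (c *toyLRU) PurgeExpired() int {
	removed := 0
	for key, el := range c.items {
		if !c.now().Before(el.Value.(*entry).expires) {
			c.order.Remove(el)
			delete(c.items, key)
			removed++
		}
	}
	return removed
}

// simulate looks up one key every 20 minutes for two hours and reports
// whether the item is still cached at the end.
func simulate(refresh bool) bool {
	clock := time.Unix(0, 0)
	c := newToyLRU(4, time.Hour, func() time.Time { return clock })
	c.Add("libc.so.6", "exec info")
	for i := 0; i < 6; i++ {
		clock = clock.Add(20 * time.Minute)
		c.PurgeExpired() // stands in for the periodic purge ticker
		if refresh {
			c.GetAndRefresh("libc.so.6", time.Hour)
		} else {
			c.Get("libc.so.6")
		}
	}
	_, ok := c.Get("libc.so.6")
	return ok
}

func main() {
	fmt.Println("Get only:", simulate(false))     // prints "Get only: false"
	fmt.Println("GetAndRefresh:", simulate(true)) // prints "GetAndRefresh: true"
}
```

With a one-hour lifetime and a lookup every 20 minutes, the Get-only run loses the item once the purge reaches the one-hour mark, even though it was used continuously. The GetAndRefresh run keeps it cached, which is the behavior this PR aims for.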


// Next step: Select a proper default value,
// if the name of the executable is not known yet.