Replace string interner with an LRU and per-origin cache up top. #20943
Conversation
Force-pushed from 3b69508 to 50b0fa7.
Go Package Import Differences (Baseline: 3f5d700)
package cache

import (
	"fmt"
I don't think that's how we usually sort imports in this project
I just let the IDE do it. Is there a doc for what the preference is?
https://github.com/DataDog/datadog-agent/blob/main/docs/dev/imports.md?plain=1
Note that this is not well-known and not enforced (but setting up your IDE to do it automatically is quick and helps keep a consistent import order).
@@ -188,7 +189,7 @@ func (p *parser) parseMetricSample(message []byte) (dogstatsdMetricSample, error
	}

	return dogstatsdMetricSample{
-		name: p.interner.LoadOrStore(name),
+		name: p.interner.LoadOrStore(name, "", nil),
If LoadOrStore() is always called with ("", nil) as the last two arguments, it's probably worth adding a helper that doesn't require them.
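A minimal sketch of the kind of wrapper being suggested (the method name is hypothetical, and it assumes the KeyedInterner type introduced in this PR):

```go
// LoadOrStoreString is a hypothetical convenience wrapper for call sites that
// have no origin or retainer to attribute the allocation to; it just fills in
// the defaults used at the dogstatsd parse call site today.
func (i *KeyedInterner) LoadOrStoreString(key []byte) string {
	return i.LoadOrStore(key, "", nil)
}
```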
The third PR in the series has the plumbing that includes useful values for all 3 args. Use the lally/exp-mem-metrefs branch for reference: https://github.com/DataDog/datadog-agent/blob/lally/exp-mem-metrefs/comp/dogstatsd/server/parse.go#L190
I think it'd be better to only add it when we need it, no?
Actually, we could start passing in the origin IDs now (I'll have to start integrating the plumbing for this bit) so that we get per-origin (e.g. per-container) tracking now with this PR. How does that sound?
// backingBytesPerInCoreEntry is the number of bytes to allocate in the mmap file per
// element in our LRU. E.g., some value of initialInternerSize * POW(growthFactor, N).
const backingBytesPerInCoreEntry = 4096
We want this to be configurable, no?
This is set to match the page size on the platform. For the MMUs in x86 and ARM, that's 4k (hugepages go bigger).
FYI, os.Getpagesize() is available.
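For reference, a minimal sketch of deriving the value from the page size at startup (an assumption, not code from this PR; the trade-off is that the constant becomes a var):

```go
package cache

import "os"

// backingBytesPerInCoreEntry is sized to one OS page rather than a hard-coded
// 4096; os.Getpagesize() returns 4096 on typical x86-64/arm64 systems unless
// hugepages are in play.
var backingBytesPerInCoreEntry = os.Getpagesize()
```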
)

// initialInternerSize is the size of the LRU cache (in #strings). This is HEAP, so
// don't let this get too big compared to the MMAP region.
don't mention "MMAP" here?
It's a series of 3 PRs, staged. Do you expect me to rewrite the first two as if there are no successors?
comp/dogstatsd/server/parse.go (Outdated)
@@ -38,7 +39,7 @@ var (
// parser parses dogstatsd messages
// not safe for concurent use
type parser struct {
-	interner *stringInterner
+	interner *cache.KeyedInterner
If we don't remove the old interner implementation, do we instead want to add an interface that will (for a time) let us choose between one implementation or the other? That would make sure all this code stays optional until we know it makes performance better.
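A rough sketch of what that seam could look like (the interface name is hypothetical; InternRetainer is the type from this PR, and the legacy interner would need an adapter or a widened signature to satisfy it):

```go
// stringInternerBackend is a hypothetical interface that both the legacy
// *stringInterner and the new *cache.KeyedInterner could implement, letting a
// config option pick the implementation when the parser is constructed.
type stringInternerBackend interface {
	LoadOrStore(key []byte, origin string, retainer InternRetainer) string
}
```

The parser would then hold a stringInternerBackend instead of a concrete type, keeping the new code path optional until the performance picture is clear.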
We will want to provide a write-up of how this change impacts low- and high-end use cases; see this notebook as an example. If we are going to swap out the implementation like this, I'd be interested in seeing the regression detector run here, but I also think that if the swap were configurable it'd be easier to rig experiments to understand the user implications of the work being proposed.
If someone can pass me appropriate API keys (the back-to-back summits have made this tricky) for ddev, I can get the benchmark to start uploading data to the benchmark notebook.
I can help you run this in the aml benchmark cluster, but I'd also like to see this run in the SMP regression detector. Once this passes enough of the lint/test/package build steps, the regression detector should run automatically on this PR.
const noFileCache = ""

// OriginTimeSampler marks allocations to the Time Sampler.
const OriginTimeSampler = "!Timesampler"
Should this really be defined here?
These are for internal diagnostic use. I don't care where they go, but it's easier to see the full list if they're all in one place. I can move them to their use sites if preferred.
)

// MaxValueSize is the largest possible value we can store.
const MaxValueSize = 4080
Where does it come from?
Copied from the mmap_hash_linux, which is based on hardware constraints & datastructure decisions that I don't expect to change across platforms.
High-level comment before going further: can you remove all the mmap code and just keep the in-memory part? I think we want to validate performance and behavior with the in-memory version before adding the mmap complexity. Also, I don't think we want to shard by origin with origin being
// LoadOrStore interns a byte-array to a string, for an origin
func (i *KeyedInterner) LoadOrStore(key []byte, origin string, retainer InternRetainer) string {
	sGlobalQueryCount.Add(1)
It's not clear to me that this needs to be atomic. We increment here only -- and you've got it atomic to skip taking the lock that's implicit in loadOrStore -- but the sum is only used in a debug log. I guess, how accurate does this need to be? This is going to be a sequentially consistent operation: we're forcing all CPUs to sync, then we skip some takes of the mutex, then we force all the CPUs to sync again (on x86, better on ARM).
I'm using a guideline of atomic operations on MESIF 'E' states being ~8ns in the common (L1, uncontested) case: https://arxiv.org/pdf/2010.09852.pdf. I think a good part of that 8ns is still the regular unexclusive op. Generally, I just wanted rough statistics without too much bother :)
Fair, but in a large piece of software like the Agent, as core counts ramp it's not always clear that atomics cheap in the micro are cheap in the macro. A nit flag, since we don't have data one way or the other yet and there are other low-hanging sync issues in the Agent.
	if Check(s) {
		return i.LoadOrStore(unsafe.Slice(unsafe.StringData(s), len(s)), origin, retainer)
	}
	sFailedInternalCount.Add(1)
Flagging another potential seqcst here. Although presumably rare if most strings are valid, pushing known invalid strings does ramp cost.
The only way they fail the Check() is if there's a fairly severe bug. I've been catching them with this function. sFailedInternalCount is zero in my internal testing. But I thought to leave the diagnostics in for any future changes that might need them. I'm happy to take them out, or to hide them behind a config flag.
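For what the config-flag option could look like, a minimal sketch (the flag name is hypothetical, and this is not code from the PR):

```go
package cache

import "sync/atomic"

var (
	// internerDebugStats would be read once at startup from a hypothetical
	// config key such as "dogstatsd_interner_debug_stats".
	internerDebugStats bool

	sFailedInternalCount atomic.Uint64
)

// countFailedIntern only touches the atomic when diagnostics are enabled, so
// the hot path pays a predictable branch instead of a cross-core sync.
func countFailedIntern() {
	if internerDebugStats {
		sFailedInternalCount.Add(1)
	}
}
```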
Nah, both are nit flags and this instance is less likely to fire than the other.
This is the in-memory part. The mmap code is (a) disabled by default and (b) stubbed out here. I reserved the
Hi, I still have to figure out what has happened in this
Force-pushed from ff93ef4 to 6468dc7.
AFAIK the OOM killer uses Working Set Memory for decisions (at least in
Sorry, can you point to any docs where working set includes
I don't have extensive knowledge of how it treats the different types. I do know the
This fixes an unhappy dep: we don't depend on a comp dep within pkg/util.
Ah! I think that's where we're missing each other. The regression detector is a performance regression detector - AFAIK it's a series of benchmarks to detect negative changes in performance.
It's all in https://github.com/DataDog/datadog-agent/tree/lally/exp-mem-metrefs - I've developed it against the stress-test benchmark. I agree it's necessary to verify the behavior against the memory-heavy situations. Especially when the current agent OOMs. I've been, frankly, bashing my head against the wall on the
The mmap changes aren't in this PR. There are some hooks for the mmap changes to go in, but the mmap code itself isn't there. See pkg/util/cache/mmap_hash.go - it's a stub that'll be used for non-Linux platforms. The Linux impl isn't in this PR.
That's a fine question. Generally, we want to find ways to throttle the component making us use excessive amounts of memory. I know nothing about the python checks. For them, the questions are: what controls do we have on them? If the checks' results linger in memory for a while, we can try reducing their frequency. Or we can try to find ways to contain their usage (e.g., move parallel ops to serial, split up the work into batches, run in a separate container with a mem limit and control over an IPC bridge, etc.).
Force-pushed from 3992888 to c7a9567.
Yes, I also consider this scenario the main candidate for benefits of using mmap as a backend for string interning. One of the main gains (subject to PoC'ing) is avoiding OOMs caused by k8s hard memory limits and Go's GC behavior. For reference: in System-Probe, we are using this pkg for string interning (in heap, no LRU).
What does this PR do?
This is 1 of 3 PRs for moving strings out of the heap and into MMAP'd files.
Together they will provide explicit tracking of primary (strings in contexts
and metrics) memory use per container. Additionally, each container
will have a separate allocation pool drawing from temporary files on disk.
This should alleviate primary OOM concerns, as the primary driver for memory
use is taken out of the process's RSS. Memory pressure will result in IOPs
instead of agent termination. Each container's usage is separated and can be used
as a basis for throttling / backpressure decisions when IOPs go above desired
levels.
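To make the interning ideas concrete, below is a rough, hypothetical sketch of an interner that evicts gradually through an LRU and attributes interned bytes to an origin (e.g. a container ID). Names, structure, and the absence of locking and mmap backing are illustrative only; the real implementation lives under pkg/util/cache in this PR.

```go
package cache

import "container/list"

type lruEntry struct {
	origin string
	value  string
}

// keyedLRUInterner is a sketch: a bounded interner that evicts one least
// recently used string at a time and tracks interned bytes per origin.
type keyedLRUInterner struct {
	maxEntries int
	order      *list.List               // front = most recently used
	entries    map[string]*list.Element // interned string -> LRU node
	perOrigin  map[string]int64         // origin -> bytes currently attributed
}

func newKeyedLRUInterner(maxEntries int) *keyedLRUInterner {
	return &keyedLRUInterner{
		maxEntries: maxEntries,
		order:      list.New(),
		entries:    make(map[string]*list.Element),
		perOrigin:  make(map[string]int64),
	}
}

// loadOrStore returns the shared copy of key, evicting at most one old entry
// instead of flushing the whole map. Not safe for concurrent use as written.
func (i *keyedLRUInterner) loadOrStore(key []byte, origin string) string {
	if el, ok := i.entries[string(key)]; ok {
		i.order.MoveToFront(el)
		return el.Value.(*lruEntry).value
	}
	s := string(key) // the single allocation all callers will share
	i.entries[s] = i.order.PushFront(&lruEntry{origin: origin, value: s})
	i.perOrigin[origin] += int64(len(s))
	if i.order.Len() > i.maxEntries {
		oldest := i.order.Back()
		ent := oldest.Value.(*lruEntry)
		i.order.Remove(oldest)
		delete(i.entries, ent.value)
		i.perOrigin[ent.origin] -= int64(len(ent.value))
	}
	return s
}
```

The per-origin byte counts are the kind of signal that would feed the throttling / backpressure decisions described above.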
Motivation
This PR in isolation provides a better interner than we previously had, via gradual eviction instead of dumping the entire map. It also provides tracking mechanisms for per-container usage, plus a stub implementation of the mmap backend to show how the actual one would integrate. The next PR will show a Linux implementation, although it should be fully usable for a macOS implementation as well. The interface exposed should allow a Windows CreateFileMapping()-based implementation if desired.
Additional Notes
The agent should scale to handle higher workloads, but it has the same deployment for tiny and gigantic machines. Those machines are almost exclusively idle, and their ample resources are left unused: the agent is held to a constant memory cap, and workloads avoid the machine for fear of overwhelming the agent.
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Test cases were provided, and the full test suite (inv test) was run on a linux-arm64 VM.
Reviewer's Checklist
- Triage milestone is set.
- major_change label if your change either has a major impact on the code base, is impacting multiple teams, or is changing important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a release note.
- changelog/no-changelog label has been applied.
- qa/skip-qa label is not applied.
- team/.. label has been applied, indicating the team(s) that should QA this change.
- need-change/operator and need-change/helm labels have been applied.
- k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature.