tests: for kv0 overload test, make jemalloc release memory to OS more aggressively #126930
Conversation
kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64 on AWS. It is possible that this is due to higher memory usage in the cgo allocations. This change reduces the difference between jemalloc resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to "9.2 GiB/9.8 GiB CGO alloc/total".

Informs cockroachdb#125769

Epic: none

Release note: None
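As an aside, here is a minimal sketch (not part of this PR) of exporting the same `MALLOC_CONF` tuning when launching a `cockroach` process by hand; the roachtest change below does the equivalent through `settings.Env`, and the comments describe what each knob does in jemalloc 5.x:

```go
// Sketch only: run cockroach with the jemalloc tuning described above.
package main

import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("./cockroach", "start-single-node", "--insecure")
	cmd.Env = append(os.Environ(),
		// background_thread:true purges unused pages from background threads
		// rather than on the allocation path.
		// dirty_decay_ms:2000 shortens how long dirty (freed but still
		// resident) pages linger before being returned to the OS.
		// muzzy_decay_ms:0 returns muzzy pages immediately.
		"MALLOC_CONF=background_thread:true,dirty_decay_ms:2000,muzzy_decay_ms:0")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```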
```diff
@@ -96,6 +99,10 @@ func registerKV(r registry.Registry) {
 	if opts.globalMVCCRangeTombstone {
 		settings.Env = append(settings.Env, "COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE=true")
 	}
+	if opts.jemallocReleaseFaster {
+		settings.Env = append(settings.Env,
+			"MALLOC_CONF=background_thread:true,dirty_decay_ms:2000,muzzy_decay_ms:0")
```
Should we maybe cap the number of background threads? By default it's the number of CPUs.
```go
// from the 80KB size class in jemalloc. So the allocated bytes from jemalloc
// by the block cache are 20% higher than configured. By setting this flag to true,
// we reduce the (resident-allocated) size in jemalloc.
{nodes: 1, cpus: 8, readPercent: 0, concMultiplier: 4096, blockSize: 1 << 16 /* 64 KB */, jemallocReleaseFaster: true},
```
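To make the size-class arithmetic in that comment concrete (my own illustration, not from the PR): jemalloc rounds each allocation up to a size class, and the next class above 64 KiB is 80 KiB, so a cached block that comes out slightly larger than 64 KiB is charged the whole 80 KiB, roughly 20-25% more than the configured block size.

```go
// Illustrative round-up only; not jemalloc's actual implementation.
package main

import "fmt"

// roundUpToClass returns the smallest size class that fits the request.
func roundUpToClass(size int, classes []int) int {
	for _, c := range classes {
		if size <= c {
			return c
		}
	}
	return size
}

func main() {
	// A subset of jemalloc's size classes around 64 KiB.
	classes := []int{56 << 10, 64 << 10, 80 << 10, 96 << 10}
	// A 64 KiB block plus a hypothetical bit of per-block overhead.
	requested := 64<<10 + 512
	fmt.Printf("requested %d B -> allocated %d B\n",
		requested, roundUpToClass(requested, classes))
}
```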
Do we expect the throughput to drop? Either way, a roachperf annotation would probably be insightful.
Would be nice to test this change by running this particular kv0 benchmark 10 or 20 times on this branch -- it was failing at least once (generally twice) every time when run with a `--count 10` flag.

Meta question: if it turns out that the way for this test to pass is to tweak malloc, do we want to continue testing this particular configuration?

I'm also curious if we ever found out what started causing these failures. The Pebble bump mentioned in the original issue does seem to have been the point in which they started to happen.
> Would be nice to test this change by running this particular kv0 benchmark 10 or 20 times on this branch -- it was failing at least once (generally twice) every time when run with a `--count 10` flag.
Done. Zero failures in 10 runs with this change, and 7 failures in 20 runs without.
> Meta question: if it turns out that the way for this test to pass is to tweak malloc, do we want to continue testing this particular configuration?
Yes, this overload configuration is a test for admission control. Node liveness used to suffer pre-AC, so it tests something.
> I'm also curious if we ever found out what started causing these failures. The Pebble bump mentioned in the original issue does seem to have been the point in which they started to happen.
The Pebble bump is not quite explainable (see the third bullet in #125769 (comment)), but there are so many layers involved that it remains guesswork.
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/kv.go
line 104 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Should we maybe cap the number of background threads? By default it's the number of CPUs.
I have no experience with tuning the number of such threads, so I'd prefer to just start with this simple thing.
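For reference, if a cap ever becomes necessary, jemalloc 5.x exposes a `max_background_threads` option in the same `MALLOC_CONF` string. A hypothetical variant of the quoted diff (untested here; the cap of 2 is an arbitrary placeholder, not something this PR does) would look like:

```go
// Hypothetical only: the PR's env tweak plus a cap on jemalloc's background
// purge threads.
if opts.jemallocReleaseFaster {
	settings.Env = append(settings.Env,
		"MALLOC_CONF=background_thread:true,max_background_threads:2,"+
			"dirty_decay_ms:2000,muzzy_decay_ms:0")
}
```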
pkg/cmd/roachtest/tests/kv.go
line 208 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Do we expect the throughput to drop? Either way, a roachperf annotation would probably be insightful.
Good point. I forgot to check that, though this test is not CPU bound, so a drop would be surprising. I can merge this and add an annotation if I see a throughput drop over the next couple of days.
bors r=srosenberg,renatolabs

I checked roachperf and there was no throughput drop.

blathers backport 24.2