
tests: for kv0 overload test, make jemalloc release memory to OS more aggressively #126930

Merged
1 commit merged into cockroachdb:master on Jul 12, 2024

Conversation

sumeerbhola
Collaborator


kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64 on AWS. It is possible that this is due to higher memory usage in the cgo allocations. This change reduces the difference between jemalloc resident and allocated bytes by releasing memory to the OS more aggressively.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to "9.2 GiB/9.8 GiB CGO alloc/total".

Informs #125769

Epic: none

Release note: None
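
For context, the MALLOC_CONF value introduced here tells jemalloc to run background purging threads (background_thread:true), to purge dirty pages after roughly 2 seconds rather than its much longer default (dirty_decay_ms:2000), and to return muzzy pages to the OS immediately (muzzy_decay_ms:0). The sketch below shows one way such an environment variable could be passed to a jemalloc-backed binary from Go; the binary path and flags are illustrative only, and the actual roachtest wiring is the kv.go diff quoted later in this conversation.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Hypothetical invocation; the actual change appends this env var to the
	// roachtest cluster settings instead (see the kv.go diff below).
	cmd := exec.Command("./cockroach", "start-single-node", "--insecure")
	cmd.Env = append(os.Environ(),
		// background_thread:true lets jemalloc purge from background threads,
		// dirty_decay_ms:2000 shortens how long dirty pages stay resident,
		// and muzzy_decay_ms:0 returns muzzy pages to the OS immediately.
		"MALLOC_CONF=background_thread:true,dirty_decay_ms:2000,muzzy_decay_ms:0")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "start failed:", err)
		os.Exit(1)
	}
}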

@sumeerbhola sumeerbhola requested a review from renatolabs July 10, 2024 14:12
@sumeerbhola sumeerbhola requested a review from a team as a code owner July 10, 2024 14:12
@sumeerbhola sumeerbhola requested review from herkolategan and removed request for a team July 10, 2024 14:12
@cockroach-teamcity
Member

This change is Reviewable

@@ -96,6 +99,10 @@ func registerKV(r registry.Registry) {
	if opts.globalMVCCRangeTombstone {
		settings.Env = append(settings.Env, "COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE=true")
	}
	if opts.jemallocReleaseFaster {
		settings.Env = append(settings.Env,
			"MALLOC_CONF=background_thread:true,dirty_decay_ms:2000,muzzy_decay_ms:0")
	}
@srosenberg
Member

Should we maybe cap the number of background threads? By default it's number of cpus.
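
For reference, jemalloc also exposes a max_background_threads option that could be combined with the settings above to cap those threads. A minimal sketch of what a capped configuration string might look like follows; the cap of 2 is purely illustrative and is not something this PR adopts.

package main

import "fmt"

// cappedMallocConf is an illustrative variant of the MALLOC_CONF above that
// also caps jemalloc's background purging threads (jemalloc otherwise allows
// up to one background thread per CPU).
const cappedMallocConf = "MALLOC_CONF=background_thread:true," +
	"max_background_threads:2,dirty_decay_ms:2000,muzzy_decay_ms:0"

func main() {
	fmt.Println(cappedMallocConf)
}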

// from the 80KB size class in jemalloc. So the allocated bytes from jemalloc
// by the block cache are 20% higher than configured. By setting this flag to true,
// we reduce the (resident-allocated) size in jemalloc.
{nodes: 1, cpus: 8, readPercent: 0, concMultiplier: 4096, blockSize: 1 << 16 /* 64 KB */, jemallocReleaseFaster: true},
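
The size-class reasoning in the diff comment above can be illustrated with a quick calculation. The size-class spacing between 64 KiB and 128 KiB (80, 96, 112, 128 KiB) and the 512-byte per-block overhead used below are assumptions for illustration, not values taken from the jemalloc or Pebble sources.

package main

import "fmt"

func main() {
	// Assumed jemalloc size classes between 64 KiB and 128 KiB.
	classes := []int{64 << 10, 80 << 10, 96 << 10, 112 << 10, 128 << 10}

	// A 64 KiB block-cache block plus a small amount of per-block overhead
	// no longer fits the 64 KiB class and rounds up to the next class.
	requested := 64<<10 + 512 // hypothetical overhead
	for _, c := range classes {
		if requested <= c {
			wasted := float64(c-64<<10) / float64(c)
			fmt.Printf("requested %d B -> size class %d B (%.0f%% of the allocation is beyond the configured 64 KiB)\n",
				requested, c, 100*wasted)
			break
		}
	}
}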
@srosenberg
Member

Do we expect the throughput to drop? Either way, a roachperf annotation would probably be insightful.

@renatolabs
Contributor

Would be nice to test this change by running this particular kv0 benchmark 10 or 20 times on this branch -- it was failing at least once (generally twice) every time when run with a --count 10 flag.

Meta question: if it turns out that the way for this test to pass is to tweak malloc, do we want to continue testing this particular configuration?

I'm also curious if we ever found out what started causing these failures. The Pebble bump mentioned in the original issue does seem to have been the point in which they started to happen.

@sumeerbhola sumeerbhola requested a review from srosenberg July 11, 2024 13:26
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Would be nice to test this change by running this particular kv0 benchmark 10 or 20 times on this branch -- it was failing at least once (generally twice) every time when run with a --count 10 flag.

Done. Zero failures in 10 runs with this change, and 7 failures in 20 runs without.

Meta question: if it turns out that the way for this test to pass is to tweak malloc, do we want to continue testing this particular configuration?

Yes, this overload configuration is a test for admission control. Node liveness used to suffer pre-AC, so it tests something.

I'm also curious if we ever found out what started causing these failures. The Pebble bump mentioned in the original issue does seem to have been the point in which they started to happen.

The Pebble bump is not fully explained (see the third bullet in #125769 (comment)), but there are so many layers involved that some guesswork remains.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/tests/kv.go line 104 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Should we maybe cap the number of background threads? By default it's number of cpus.

I have no experience with tuning the number of such threads, so I'd prefer to just start with this simple thing.


pkg/cmd/roachtest/tests/kv.go line 208 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Do we expect the throughput to drop? Either way, a roachperf annotation would probably be insightful.

Good point. I forgot to check that, though this test is not CPU bound, so a drop would be surprising. I can merge this and add an annotation if I see throughput drop over the next couple of days.

@sumeerbhola
Collaborator Author

bors r=srosenberg,renatolabs

@craig craig bot merged commit 23ab279 into cockroachdb:master Jul 12, 2024
22 checks passed
@sumeerbhola
Collaborator Author

I can merge this and add an annotation if I see throughput drop over the next couple of days.

I checked roachperf and there was no throughput drop.

@renatolabs
Contributor

blathers backport 24.2
