roachtest: kv0/enc=false/nodes=1/size=64kb/conc=4096 failed #125769
roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 67b4af76ba20ed2f6935da31eda7dfe8fc0b63e2:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 065b94022a26d223b0a19809e510a6dca337f155:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
These failures are OOMs of the (single) node in this test. Logs are composed entirely of the following error:
Doesn't seem to reproduce easily on master or the SHA of the last failure, but we have some memory profiles in the artifacts that should help. Tagging @cockroachdb/kv.
Spoke with @arulajmani in person. Quick update: this test is designed to overload the cluster (see a previous failure of this test with similar discussion: #106570). Another similarity with previous failures is that this seems to only be happening on AWS (and, in this case, only in ARM builds). I'll try to reproduce this specifically using AWS + ARM64 to see if we have more success with a reproduction. The artifacts alone don't provide enough information to debug this further (the latest memory profile is taken, on average, ~10 minutes before the OOM).
@renatolabs I'll remove the T-KV label for now so that it doesn't show up in our test triage queue. If this ends up needing some KV involvement, feel free to ping the issue again and add the label.
I'll reopen this one because there was an issue being investigated with this test before the flakes.
Update -- I've confirmed that the OOM is reproducible by running this test on ARM64 on AWS. That said, I also saw failures on GCE, so it's not cloud specific (it didn't fail on GCE previously because it never runs in the nightlies -- #126425). The test doesn't seem to fail when running amd64 binaries.

A bisection points to a Pebble bump (#125541) as the first commit where this test started to fail. Note that it's not deterministic -- it generally only fails ~20% of the time.

I could debug this further, but I'm also juggling a few other items at the moment, so I don't know when I'll have time to come back to this. I'm tagging @cockroachdb/storage for further analysis. Let me know if I can assist with anything.

For reference, this is how you run the test on ARM64:
I'm using
roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 7461a85f839c50edbdea502fe09794739a99632e:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
#125769 (comment) doesn't seem to be an OOM. (the latest failure) #125769 (comment) is an OOM, but the heap profile is not helpful.
There is nothing in the logs about what the Pebble block cache was configured to be, but there doesn't seem to be an override in
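For context on what a block cache override would look like: Pebble's block cache is sized explicitly when a store is opened. The sketch below is a minimal, standalone example of that mechanism using the public pebble API; it is not the CockroachDB code path, and the 128 MiB figure is purely illustrative, not what this test was running with.

```go
package main

import "github.com/cockroachdb/pebble"

func main() {
	// A Pebble block cache is created with an explicit byte size and handed
	// to the store via Options.Cache; absent an override, whatever default
	// the caller picks is what the store runs with.
	cache := pebble.NewCache(128 << 20) // 128 MiB, illustrative only
	defer cache.Unref()

	db, err := pebble.Open("demo-data", &pebble.Options{Cache: cache})
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```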
#125769 (comment) is also an OOM
There is a big difference between Go alloc and total in this case.
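For readers unfamiliar with the runtime_log numbers: "Go alloc" and "total" roughly correspond to fields of Go's runtime.MemStats. The mapping below is an assumption for illustration (the exact accounting in server/status/runtime_log.go may differ slightly); it just shows where a large alloc/total gap can come from.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// "Go alloc" roughly corresponds to HeapAlloc: bytes in live objects
	// plus dead objects not yet collected.
	fmt.Printf("alloc: %d MiB\n", ms.HeapAlloc>>20)

	// "total" is closer to what the runtime has obtained from the OS and not
	// yet returned; Sys minus HeapReleased is a common approximation.
	fmt.Printf("total (approx): %d MiB\n", (ms.Sys-ms.HeapReleased)>>20)

	// The gap between the two is heap the runtime is holding on to:
	// idle spans, fragmentation, reserved address space.
	fmt.Printf("idle but not released: %d MiB\n", (ms.HeapIdle-ms.HeapReleased)>>20)
}
```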
There were a couple of flakes that affected this test. Marked them as off-topic. The remaining failures in this thread should be the OOM.
roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ c557fb59f6aec659d364e9002fc083c59c6392b6:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
#125769 (comment) is also an OOM

[Sat Jul 6 06:36:06 2024] Memory cgroup out of memory: Killed process 3857 (cockroach) total-vm:44398208kB, anon-rss:30453076kB, file-rss:147584kB, shmem-rss:0kB, UID:1000 pgtables:79868kB oom_score_adj:0

I240706 06:24:27.899766 423 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 210 runtime stats: 27 GiB RSS, 16301 goroutines (stacks: 824 MiB), 10 GiB/16 GiB Go alloc/total (heap fragmentation: 1017 MiB, heap reserved: 4.1 GiB, heap released: 2.3 GiB), 9.1 GiB/11 GiB CGO alloc/total (10103.9 CGO/sec), 286.5/89.7 %(u/s)time, 0.0 %gc (67x), 320 MiB/2.0 MiB (r/w)net

I240706 06:35:58.011119 423 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 360 runtime stats: 28 GiB RSS, 16563 goroutines (stacks: 836 MiB), 6.0 GiB/16 GiB Go alloc/total (heap fragmentation: 1.1 GiB, heap reserved: 7.6 GiB, heap released: 3.5 GiB), 9.1 GiB/13 GiB CGO alloc/total (9121.7 CGO/sec), 301.0/101.9 %(u/s)time, 0.0 %gc (107x), 182 MiB/1.7 MiB (r/w)net
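A back-of-the-envelope check on the last runtime stats line before the kill, assuming the Go and CGO totals are roughly disjoint accountings of resident memory (a simplification):

```go
package main

import "fmt"

func main() {
	// Figures copied from the 06:35:58 runtime stats line above (GiB, approximate).
	const (
		goTotal  = 16.0 // "Go alloc/total" denominator
		cgoTotal = 13.0 // "CGO alloc/total" denominator (jemalloc resident)
		rss      = 28.0 // process RSS from the same line
	)
	// 16 + 13 = 29 GiB, in the same ballpark as the 28 GiB RSS, so the two
	// allocators together account for essentially all resident memory.
	fmt.Printf("Go total + CGO total = %.0f GiB vs RSS = %.0f GiB\n", goTotal+cgoTotal, rss)
	// The CGO side alone is holding ~4 GiB it hasn't handed out (9.1 GiB
	// allocated vs 13 GiB resident), which is the resident-vs-allocated gap
	// the jemalloc change below targets.
}
```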
… aggressively

kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64 on AWS. It is possible that this is due to higher memory usage in the cgo allocations. This change reduces the difference between jemalloc resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to "9.2 GiB/9.8 GiB CGO alloc/total".

Informs cockroachdb#125769

Epic: none
Release note: None
126520: logical: add eval logical replication options r=Jeremyyang920 a=Jeremyyang920

This commit adds the ability for the plan hook to evaluate the incoming options from the SQL statement. As of right now, the function will only evaluate the options, but nothing will be done with them. If there is an error in the format of an option, the associated error will be returned.

See individual commits for details.

Epic: None
Release note: None

126930: tests: for kv0 overload test, make jemalloc release memory to OS more… r=srosenberg,renatolabs a=sumeerbhola

… aggressively

kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64 on AWS. It is possible that this is due to higher memory usage in the cgo allocations. This change reduces the difference between jemalloc resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to "9.2 GiB/9.8 GiB CGO alloc/total".

Informs #125769

Epic: none
Release note: None

127041: roachtest: fix multi-store-remove r=itsbilal a=nicktrav

The intention of the test is to compare the number of ranges (multiplied by the replication factor) to the _sum_ of replicas across all stores. The current implementation is incorrect, as it compares range count to store count.

Fix the test by using a `sum` of replicas across each store, rather than a `count`, which will return the number of stores.

Fix #123989.

Release note: None.

Co-authored-by: Jeremy Yang <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
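The mechanism behind "releasing more aggressively to the OS" is jemalloc's decay behavior; one common way to adjust it is via the MALLOC_CONF environment variable, which jemalloc reads at process startup. The sketch below launches cockroach with such a setting. The concrete option values, and whether #126930 used this exact mechanism rather than a baked-in configuration, are assumptions for illustration only.

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Lowering dirty_decay_ms / muzzy_decay_ms makes jemalloc return unused
	// (dirty) pages to the OS sooner, shrinking the gap between resident and
	// allocated bytes. NOTE: values here are illustrative, not those from #126930,
	// and this assumes the cockroach binary's jemalloc honors MALLOC_CONF.
	cmd := exec.Command("./cockroach", "start-single-node", "--insecure", "--store=/mnt/data1")
	cmd.Env = append(os.Environ(),
		"MALLOC_CONF=background_thread:true,dirty_decay_ms:2000,muzzy_decay_ms:0")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```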
fixed by #126930
roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ efc894bfc167fe48af2ab99d1e1ecbedf8110a10:
Parameters:
ROACHTEST_arch=arm64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
Jira issue: CRDB-39600