
roachtest: kv0/enc=false/nodes=1/size=64kb/conc=4096 failed #125769

Closed
cockroach-teamcity opened this issue Jun 17, 2024 · 16 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-storage Storage Team T-testeng TestEng Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 17, 2024

roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ efc894bfc167fe48af2ab99d1e1ecbedf8110a10:

(cluster.go:2400).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv0/enc=false/nodes=1/size=64kb/conc=4096/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-39600

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Jun 17, 2024
@cockroach-teamcity
Member Author

roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 67b4af76ba20ed2f6935da31eda7dfe8fc0b63e2:

(cluster.go:2400).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv0/enc=false/nodes=1/size=64kb/conc=4096/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 065b94022a26d223b0a19809e510a6dca337f155:

(cluster.go:2400).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv0/enc=false/nodes=1/size=64kb/conc=4096/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@renatolabs
Contributor

These failures are OOMs of the (single) node in this test. Logs are composed entirely of the following error:

W240621 08:38:24.896240 1807344 kv/kvclient/kvcoord/dist_sender.go:2785 ⋮ [T1,Vsystem,n1,client=10.12.15.200:60062,hostssl,user=‹roachprod›] 335621 slow replica RPC: have been waiting 13.61s (0 attempts) for RPC Put [/Table/106/1/‹-4475947314600438026›/‹0›], EndTxn(parallel commit) [/Table/106/1/‹-4475947314600438026›/‹0›], [txn: bcdcd569], [can-forward-ts] to replica (n1,s1):1; resp: ‹(err: <nil>), *kvpb.PutResponse, *kvpb.EndTxnResponse›

Doesn't seem to reproduce easily on master or the SHA of the last failure, but we have some memory profiles in the artifacts that should help. Tagging @cockroachdb/kv.

@renatolabs renatolabs added the T-kv KV Team label Jun 25, 2024
@renatolabs
Contributor

Spoke with @arulajmani in person. Quick update: this test is designed to overload the cluster (see a previous failure of this test with similar discussion: #106570). Another similarity with previous failures is that this seems to be happening only on AWS (and, in this case, only in ARM builds).

I'll try to reproduce this specifically using AWS + ARM64 to see if we have more success with a reproduction. The artifacts alone don't provide enough to debug this further (the latest memory profile is taken, on average, ~10 minutes before the OOM).

@arulajmani
Collaborator

@renatolabs I'll remove the T-KV label for now so that it doesn't show up in our test triage queue. If this ends up needing some KV involvement, feel free to ping the issue again and add the label.

@arulajmani arulajmani removed the T-kv KV Team label Jun 26, 2024
@cockroach-teamcity

This comment was marked as off-topic.

@cockroach-teamcity

This comment was marked as off-topic.

@DarrylWong DarrylWong marked this as a duplicate of #126378 Jun 28, 2024
@renatolabs
Contributor

I'll reopen this one because there was an issue being investigated with this test before the flakes.

@renatolabs renatolabs reopened this Jun 28, 2024
@renatolabs
Contributor

renatolabs commented Jun 28, 2024

Update -- I've confirmed that the OOM is reproducible by running this test on ARM64 on AWS. That said, I also saw failures on GCE, so it's not cloud-specific (it didn't fail on GCE previously only because it never runs in the nightlies -- #126425). The test doesn't seem to fail when running amd64 binaries.

A bisection points to a Pebble bump (#125541) as the first commit where this test started to fail. Note that it's not deterministic -- it generally only fails ~20% of the time.

I could debug this further but I'm also juggling a few other items at the moment so I don't know when I'll have time to come back to this. I'm tagging @cockroachdb/storage for further analysis. Let me know if I can assist with anything.

For reference, this is how you run the test on ARM64:

roachtest run --cloud aws --metamorphic-arm64-probability 1.0 --count 10 'kv0/enc=false/nodes=1/size=64kb/conc=4096'

I'm using --count 10 here because, as I said, we typically only see a 20% failure rate.
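The choice of --count 10 can be sanity-checked with a quick back-of-the-envelope calculation (an illustrative sketch, not part of the roachtest tooling):

```python
# Probability of seeing at least one failure in n independent runs of a
# flaky test with per-run failure rate p.
def detection_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# At the ~20% failure rate observed here, 10 runs catch the regression
# roughly 9 times out of 10:
print(round(detection_probability(0.2, 10), 2))  # 0.89
```

With only 5 runs the detection probability drops to about 67%, which is why a flake rate this low needs a relatively high --count.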

@renatolabs renatolabs added the T-storage Storage Team label Jun 28, 2024
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 28, 2024
@nicktrav nicktrav moved this from Incoming to Tests (failures, skipped, flakes) in [Deprecated] Storage Jul 2, 2024
@nicktrav nicktrav added the P-2 Issues/test failures with a fix SLA of 3 months label Jul 2, 2024
@cockroach-teamcity
Member Author

roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ 7461a85f839c50edbdea502fe09794739a99632e:

(cluster.go:2417).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv0/enc=false/nodes=1/size=64kb/conc=4096/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@sumeerbhola
Collaborator

#125769 (comment) doesn't seem to be an OOM.

(the latest failure) #125769 (comment) is an OOM, but the heap profile is not helpful.
Around the time of the OOM we have:

I240703 06:29:23.943507 376 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 273  runtime stats: 29 GiB RSS, 16560 goroutines (stacks: 948 MiB), 14 GiB/16 GiB Go alloc/total (heap fragmentation: 522 MiB, heap reserved: 593 MiB, heap released: 2.6 GiB), 9.1 GiB/13 GiB CGO alloc/total (11743.2 CGO/sec), 243.8/96.9 %(u/s)time, 0.0 %gc (88x), 185 MiB/1.8 MiB (r/w)net
[Wed Jul  3 06:29:34 2024] memory: usage 30641436kB, limit 30641432kB, failcnt 86758565
...
[Wed Jul  3 06:29:34 2024] Memory cgroup out of memory: Killed process 4165 (cockroach) total-vm:42140324kB, anon-rss:30478136kB, file-rss:146688kB, shmem-rss:0kB, UID:1000 pgtables:75824kB oom_score_adj:0

There is nothing in the logs about what the Pebble block cache was configured to be, but there doesn't seem to be an override in kv.go, so I'll assume the usual 25% of memory: 0.25 * 29.22 GiB = 7.3 GiB.
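The block cache arithmetic works out as follows (a sketch; the 25% cache fraction is an assumed default inferred from kv.go, not something confirmed in the logs):

```python
# Estimate the Pebble block cache size, assuming the default of 25% of
# system memory (no override was found in kv.go).
system_mem_gib = 29.22    # instance memory inferred from the cgroup limit
cache_fraction = 0.25     # assumed default cache fraction
cache_gib = cache_fraction * system_mem_gib
print(f"{cache_gib:.1f} GiB")  # 7.3 GiB
```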

@sumeerbhola
Collaborator

#125769 (comment) is also an OOM

I240621 08:39:27.556121 316 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 347  runtime stats: 28 GiB RSS, 16564 goroutines (stacks: 831 MiB), 6.2 GiB/16 GiB Go alloc/total (heap fragmentation: 1.0 GiB, heap reserved: 7.4 GiB, heap released: 3.6 GiB), 9.1 GiB/13 GiB CGO alloc/total (14820.0 CGO/sec), 229.3/97.1 %(u/s)time, 0.0 %gc (108x), 172 MiB/1.7 MiB (r/w)net

There is a big difference between Go alloc and total in this case.

@sumeerbhola sumeerbhola removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 3, 2024
@renatolabs
Contributor

#125769 (comment) doesn't seem to be an OOM.

There were a couple of flakes that affected this test. Marked them as off-topic. The remaining failures in this thread should be the OOM.

@cockroach-teamcity
Member Author

roachtest.kv0/enc=false/nodes=1/size=64kb/conc=4096 failed with artifacts on master @ c557fb59f6aec659d364e9002fc083c59c6392b6:

(cluster.go:2417).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv0/enc=false/nodes=1/size=64kb/conc=4096/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@sumeerbhola
Collaborator

#125769 (comment) is also an OOM

[Sat Jul 6 06:36:06 2024] Memory cgroup out of memory: Killed process 3857 (cockroach) total-vm:44398208kB, anon-rss:30453076kB, file-rss:147584kB, shmem-rss:0kB, UID:1000 pgtables:79868kB oom_score_adj:0

I240706 06:24:27.899766 423 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 210 runtime stats: 27 GiB RSS, 16301 goroutines (stacks: 824 MiB), 10 GiB/16 GiB Go alloc/total (heap fragmentation: 1017 MiB, heap reserved: 4.1 GiB, heap released: 2.3 GiB), 9.1 GiB/11 GiB CGO alloc/total (10103.9 CGO/sec), 286.5/89.7 %(u/s)time, 0.0 %gc (67x), 320 MiB/2.0 MiB (r/w)net

I240706 06:35:58.011119 423 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n1] 360 runtime stats: 28 GiB RSS, 16563 goroutines (stacks: 836 MiB), 6.0 GiB/16 GiB Go alloc/total (heap fragmentation: 1.1 GiB, heap reserved: 7.6 GiB, heap released: 3.5 GiB), 9.1 GiB/13 GiB CGO alloc/total (9121.7 CGO/sec), 301.0/101.9 %(u/s)time, 0.0 %gc (107x), 182 MiB/1.7 MiB (r/w)net
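When triaging these OOMs, it helps to pull the relevant numbers out of the runtime stats log lines programmatically. Below is a hypothetical helper keyed to the log format quoted in this thread; the regex is an assumption and may need adjusting for other CockroachDB versions:

```python
import re

# Minimal parser for CockroachDB `runtime stats` lines, extracting the
# numbers relevant to this OOM: process RSS and jemalloc (CGO) alloc/total.
STATS_RE = re.compile(
    r"runtime stats: (?P<rss>[\d.]+) GiB RSS.*?"
    r"(?P<cgo_alloc>[\d.]+) GiB/(?P<cgo_total>[\d.]+) GiB CGO alloc/total"
)

def parse_stats(line: str) -> dict:
    m = STATS_RE.search(line)
    if not m:
        raise ValueError("not a runtime stats line")
    return {k: float(v) for k, v in m.groupdict().items()}

line = ("runtime stats: 28 GiB RSS, 16563 goroutines (stacks: 836 MiB), "
        "6.0 GiB/16 GiB Go alloc/total (heap fragmentation: 1.1 GiB, "
        "heap reserved: 7.6 GiB, heap released: 3.5 GiB), "
        "9.1 GiB/13 GiB CGO alloc/total (9121.7 CGO/sec)")
print(parse_stats(line))  # {'rss': 28.0, 'cgo_alloc': 9.1, 'cgo_total': 13.0}
```

Plotting these values over the run (rather than eyeballing two samples) makes the growth in the CGO total, relative to CGO alloc, easy to spot.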

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 10, 2024
… aggressively

kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64
on AWS. It is possible that this is due to higher memory usage in the
cgo allocations. This change reduces the difference between jemalloc
resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to
"9.2 GiB/9.8 GiB CGO alloc/total".

Informs cockroachdb#125769

Epic: none

Release note: None
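To put the commit's before/after numbers in perspective: the gap between jemalloc's resident (total) and allocated bytes is memory the allocator holds but the process isn't using, and the change shrinks that gap substantially (simple arithmetic on the figures quoted above):

```python
# Memory jemalloc retains beyond what is actually allocated, before and
# after the fix, using the CGO alloc/total figures from the commit message.
def jemalloc_overhead(alloc_gib: float, total_gib: float) -> float:
    return total_gib - alloc_gib

before = jemalloc_overhead(9.2, 11.0)  # ~1.8 GiB retained before the fix
after = jemalloc_overhead(9.2, 9.8)    # ~0.6 GiB retained after
print(round(before - after, 1))  # 1.2, i.e. ~1.2 GiB returned to the OS
```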
craig bot pushed a commit that referenced this issue Jul 12, 2024
126520: logical: add eval logical replication options r=Jeremyyang920 a=Jeremyyang920

This commit adds the ability for the plan hook to be able to evaluate the incoming options from the sql statement. As of right now, the function will only evaluate the options, but nothing will be done with them. If there is an error in the format of an option, the associated error will be returned.

See individual commits for details.

Epic: None
Release note: None

126930: tests: for kv0 overload test, make jemalloc release memory to OS more… r=srosenberg,renatolabs a=sumeerbhola

… aggressively

kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64 on AWS. It is possible that this is due to higher memory usage in the cgo allocations. This change reduces the difference between jemalloc resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to "9.2 GiB/9.8 GiB CGO alloc/total".

Informs #125769

Epic: none

Release note: None

127041: roachtest: fix multi-store-remove r=itsbilal a=nicktrav

The intention of the test is to compare the number of ranges (multiplied by the replication factor) to the _sum_ of replicas across all stores. The current implementation is incorrect, as it compares range count to store count.

Fix the test by using a `sum` of replicas across each store, rather than a `count`, which will return the number of stores.

Fix #123989.

Release note: None.

Co-authored-by: Jeremy Yang <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
@sumeerbhola
Collaborator

fixed by #126930

@github-project-automation github-project-automation bot moved this from Tests (failures, skipped, flakes) to Done in [Deprecated] Storage Jul 12, 2024
blathers-crl bot pushed a commit that referenced this issue Jul 29, 2024
… aggressively

kv0/enc=false/nodes=1/size=64kb/conc=4096 is sometimes OOMing on arm64
on AWS. It is possible that this is due to higher memory usage in the
cgo allocations. This change reduces the difference between jemalloc
resident and allocated bytes by releasing more aggressively to the OS.

Testing this change reduced "9.2 GiB/11 GiB CGO alloc/total" to
"9.2 GiB/9.8 GiB CGO alloc/total".

Informs #125769

Epic: none

Release note: None
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024