roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #97019
Error is
I believe this is already part of the default retryable errors. We don't set custom timeouts or HTTP clients anywhere in that code. We can look into adding a custom HTTP client in s3_storage with longer timeouts, but I'm going to remove the release blocker for now.
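For illustration, here's a minimal sketch of what a custom HTTP client with longer timeouts for the S3 client could look like with aws-sdk-go. The package name, function name, and the specific timeout values are made up for the example; this is not the actual s3_storage code.

```go
package s3timeouts

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// newS3ClientWithTimeouts builds an S3 client backed by an HTTP client with
// explicit timeouts. The values below are placeholders, not recommendations.
func newS3ClientWithTimeouts() (*s3.S3, error) {
	httpClient := &http.Client{
		// Overall per-request timeout; generous to allow large object reads.
		Timeout: 5 * time.Minute,
		Transport: &http.Transport{
			ResponseHeaderTimeout: 30 * time.Second,
			IdleConnTimeout:       90 * time.Second,
		},
	}
	sess, err := session.NewSession(&aws.Config{HTTPClient: httpClient})
	if err != nil {
		return nil, err
	}
	return s3.New(sess), nil
}
```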
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 64a867acd25c0a214209957eefb6483d1158b4f0:
Parameters:
note that in last night's run, I accidentally changed the test to restore from 48 incremental backups instead of 11. It still may be interesting to investigate this timeout. I'm fixing that here:
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 3d054f37c7c87f53cb56fac4e5500f0d1130d09a:
Parameters:
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ dd2749ae4ab61eed2f99238acb74e8d3c6b4cb1d:
Parameters:
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:
Parameters:
Seeing a bunch of range splits to accommodate the import, up to 16:41. At 16:40, a stream of SST ingestions started: 239 messages between 16:40 and 16:48 (when the crash happened), like this:
Shortly after, these errors started showing up in the logs:
The default uncommitted entry size limit is 16 MiB, which is also soft, so a single SST always fits, but here we see that an attempt to add another one exceeds the limit and gets dropped. This indicates that there is probably a build-up of uncommitted entries.
Given that there are ~4400 replicas ingesting, and each SST is ~8.3 MiB (3.2 MiB also appears in the logs, less frequently), the build-up can easily reach tens of GiB.
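To make the soft-limit behavior and the build-up estimate above concrete, here is a small self-contained Go sketch. It is illustrative only (not the etcd/raft or CockroachDB implementation); it just replays the described behavior with the numbers from these comments.

```go
package main

import "fmt"

// group models just enough of a raft group to show the soft limit: a proposal
// is always admitted when nothing is outstanding (so a single ~8.3 MiB SST
// fits under a 16 MiB limit), but a second SST on top of the first pushes
// past the limit and is dropped.
type group struct {
	uncommittedSize, maxUncommittedSize int64
}

func (g *group) tryPropose(size int64) bool {
	if g.uncommittedSize > 0 && g.uncommittedSize+size > g.maxUncommittedSize {
		return false // dropped; the proposer sees a dropped proposal
	}
	g.uncommittedSize += size
	return true
}

func main() {
	const mib = int64(1) << 20
	sst := int64(8.3 * float64(mib))

	g := &group{maxUncommittedSize: 16 * mib}
	fmt.Println(g.tryPropose(sst)) // true: the first SST always fits (soft limit)
	fmt.Println(g.tryPropose(sst)) // false: ~16.6 MiB would exceed the 16 MiB limit

	// Back-of-the-envelope build-up across the cluster: ~4400 ingesting
	// replicas, each holding on the order of one ~8.3 MiB SST.
	totalMiB := 4400 * 8.3
	fmt.Printf("~%.1f GiB of potential build-up\n", totalMiB/1024) // ~35.7 GiB
}
```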
Looks like #73376.
It's also reporting slow heartbeats (I also saw the corresponding messages in the logs). Liveness heartbeats do a local store write, which is probably slow here, given that the log commit latency is on the same order, O(seconds).
Draft notes while investigating with @tbg:
@pavelkalinnikov this is a new test that added much better restore workload coverage: it uses a much better workload (tpce >>>> bank) and restores more data (8 TB > 2 TB). @rhu713 landed a few PRs that significantly changed restore, summarized in this backport: #97210, the latest of which merged last week. My only guess at a correlation between his PRs and this failure is that they may make restore more prone to unevenly distributing work. @rhu713 any thoughts here?
@msbutler @rhu713 One of the key observations was this: You can barely see it here (the yellow line), but:
You can also see that nodes have relatively similar loads. But for some reason only this node is capped at 250 MiB/s. Can this be related to the AWS node/disk types provisioned for this test? What type of instance is this?
Not a failing run, but there's a good correlation between unstable log build-up and memory use, though it doesn't explain all of the memory usage. Looking at the last peak, for example, allocbytes (middle) peaked at 3.47 GiB, and the corresponding peaks in the raft receive queue (top) and the unstable log (bottom) explain 170 MiB and 485 MiB of that, respectively. Unfortunately this still leaves a large chunk unaccounted for, and there could be measurement artifacts, like the metrics for top and bottom not displaying the "true" peaks because of the 10s poll interval.
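Spelling out the accounting above (a trivial sketch; the numbers are just the peaks quoted in this comment, read off the graphs and therefore approximate):

```go
package main

import "fmt"

// The subtraction behind "a large chunk unaccounted for".
func main() {
	const (
		allocPeakMiB  = 3.47 * 1024 // allocbytes peak, ~3.47 GiB
		raftRecvQueue = 170.0       // MiB attributed to the raft receive queue
		unstableLog   = 485.0       // MiB attributed to the unstable log
	)
	unaccounted := allocPeakMiB - raftRecvQueue - unstableLog
	fmt.Printf("unaccounted: ~%.2f GiB\n", unaccounted/1024) // ≈ 2.83 GiB
}
```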
Relabeling as GA-blocker for now. Known issue, and it's unclear if there's something we can/should do here for 23.1.
This is a WIP because the behavior when machine types with local SSDs are used is unclear. For example, on AWS, roachtest prefers the c5d family, which all come with local SSD storage. But looking into `awsStartupScriptTemplate`, it seems unclear how to make sure that the EBS disk(s) get mounted as /mnt/data1 (which is probably what the default should be).

We could also entertain straight-up preventing combinations that would lead to an inhomogeneous RAID0. I imagine we'd have to take a round of failures to find all of the places in which it happens, but perhaps a "snitch" can be inserted instead so that we can detect all such callers and fix them up before arming the check.

By the way, EBS disks on AWS come with a default of 125 MB/s, which is less than this RAID0 gets "most of the time", so we can expect some tests to behave differently after this change. I still believe this is worth it: debugging is so much harder when you're on top of storage that's hard to predict and doesn't resemble any production deployment.

----

I wasted weeks of my life on this before, and it almost happened again! When you run a roachtest that asks for an AWS cXd machine (i.e. compute optimized with a local NVMe disk), and you specify a VolumeSize, you also get an EBS volume. Prior to this commit, these would be RAID0'ed together. This isn't sane: the resulting gp3 EBS volume is very different from the local NVMe volume in every way, and it led to hard-to-understand write throughput behavior. This commit defaults to *not* using RAID0.

Touches cockroachdb#98767.
Touches cockroachdb#98576.
Touches cockroachdb#97019.

Epic: none
Release note: None
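As a sketch of the "snitch" idea from the commit message above (detect or refuse heterogeneous RAID0 combinations before arming a hard check), here is a hypothetical helper. The types, names, and field layout are invented for illustration and are not the actual roachprod code.

```go
package roachprodutil

import "fmt"

// disk is a hypothetical, simplified view of what the provisioning code knows
// about an attached volume; the real code is more involved.
type disk struct {
	Device string // e.g. /dev/nvme1n1
	Kind   string // e.g. "local-ssd" or "ebs-gp3"
}

// checkHomogeneousRAID0 reports an error when the volumes about to be striped
// together are of different kinds, e.g. local NVMe mixed with EBS gp3.
// Callers could log this ("snitch" mode) before it becomes a hard failure.
func checkHomogeneousRAID0(disks []disk) error {
	if len(disks) < 2 {
		return nil // nothing to stripe
	}
	kind := disks[0].Kind
	for _, d := range disks[1:] {
		if d.Kind != kind {
			return fmt.Errorf(
				"refusing to RAID0 heterogeneous disks: %s (%s) vs %s (%s)",
				disks[0].Device, kind, d.Device, d.Kind)
		}
	}
	return nil
}
```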
Wrapping up my investigation of this. Main findings:
I'm removing the GA-blocker label since there isn't a "new" issue here (but feel free to re-add it should the non-OOM problems here warrant it). OOMs in this test are expected until #98783 is addressed or the hacky version specific to this test in #98767 is merged (it's the DR team's call to make, please let me know). But after that, the test is going to continue to occasionally fail with the two problems pointed out in #99206, so DR should look into those as well.
@tbg thank you for the great investigation and for the analysis of a potential new restore perf bug. We're going to merge this hack to the roachtest while @srosenberg works on the long-term fix.
…tests A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance. Informs cockroachdb#98783 Fixes cockroachdb#97019 Release note: none
98509: sql: unskip TestExecBuild_sql_activity_stats_compaction r=ericharmeling a=ericharmeling

This commit unskips TestExecBuild_sql_activity_stats_compaction in local configuration. 0 failures after 15000+ runs.

Fixes #91600.
Epic: none
Release note: None

99723: backupccl: avoid RAID0ing local NVMe and GP3 storage in restore roachtests r=srosenberg a=msbutler

A long restore roachtest perf investigation revealed that roachprod can RAID0 local storage and AWS GP3 storage, a configuration that does not mix well with CRDB and does not reflect a reasonable customer environment. This patch avoids this RAID0ing in the restore roachtests, stabilizing test performance.

Informs #98783
Fixes #97019

Release note: none

99843: kvserver: Add a metric for in-progress snapshots r=kvoli a=andrewbaptist

Fixes: #98242

Knowing how many delegate snapshot requests are currently in-progress will be useful for detecting problems. This change adds a metric for this. It also updates the names of the previous stats to have the prefix `range.snapshots` vs `range.snapshot` to be consistent with other stats.

Epic: none
Release note: None

99867: backupccl: lower the buffer size of doneScatterCh in gen split and scatter r=rhu713 a=rhu713

Previously, doneScatterCh in GenerativeSplitAndScatterProcessor had a large enough buffer size to never block, which was equal to the number of import spans in the restore job. This can cause restore to buffer all restore span entries in memory at the same time. Lower the limit to be numNodes * maxConcurrentRestoreWorkers, which is the max number of entries that can be processed in parallel downstream.

Release note: None

100099: leaktest: ignore the opencensus worker r=pavelkalinnikov,herkolategan a=knz

Fixes #100098.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
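The doneScatterCh fix in #99867 above boils down to how the channel buffer is sized. Below is a simplified sketch of the before/after sizing; the names numNodes and maxConcurrentRestoreWorkers follow the PR description, but the concrete values and everything else are made up for illustration and are not the actual GenerativeSplitAndScatterProcessor code.

```go
package main

import "fmt"

// Before: buffer sized to the total number of import spans, so every restore
// span entry could be buffered in memory at once. After: capped by how many
// entries can actually be processed in parallel downstream.
func doneScatterChBufferSize(numImportSpans, numNodes, maxConcurrentRestoreWorkers int) (before, after int) {
	before = numImportSpans
	after = numNodes * maxConcurrentRestoreWorkers
	return before, after
}

type entry struct{} // stand-in for a restore span entry

func main() {
	// Hypothetical numbers: a large restore with many import spans on a
	// 10-node cluster.
	before, after := doneScatterChBufferSize(400000, 10, 32)
	fmt.Printf("buffer size before: %d, after: %d\n", before, after)

	// The channel itself would then be created with the smaller buffer.
	doneScatterCh := make(chan entry, after)
	_ = doneScatterCh
}
```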
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2a7edbeb0737b1309064c25c641a309c2980d9ba:
Parameters:
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-24470