roachtest: disk-stalled/fuse/log=false,data=false failed #99372
This is almost identical to #99215, which also saw n1 unexpectedly disk stall during a test run that's not supposed to disk stall. (cc #97968)
There's a goroutine dump from 13:59:27 (~17s prior to the fatal). There are some goroutines with I/O in progress at the time: the WAL log writer, and a SyncData during a flush.
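For context on the mechanism being discussed, here is a minimal, hypothetical sketch (not CockroachDB's or Pebble's actual code) of how a sync that stays in progress too long turns into a fatal error: run the sync under a watchdog and report it if it exceeds a threshold. The function name and the 20s threshold below are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// syncWithWatchdog runs f (e.g. an fsync) and reports a stall if it has not
// returned within maxSyncDuration. Illustrative only; not the real detector.
func syncWithWatchdog(name string, maxSyncDuration time.Duration, f func() error) error {
	done := make(chan error, 1)
	start := time.Now()
	go func() { done <- f() }()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		// In the real system this is where the process fatals, citing the
		// operation that has been in progress too long.
		log.Fatalf("disk stall detected: %s in progress for %s", name, time.Since(start))
		return nil
	}
}

func main() {
	f, err := os.CreateTemp("", "wal-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("log entry\n"); err != nil {
		log.Fatal(err)
	}
	if err := syncWithWatchdog("fsync", 20*time.Second, f.Sync); err != nil {
		log.Fatal(err)
	}
	fmt.Println("sync completed without stalling")
}
```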
Some questions: In #99215 you say:
> I understand we have a goroutine dump from before the stall that shows we are doing a …

Related: for all the test failures we have seen, do we always see a goroutine dump that shows we are blocked in the expected operation some time before the stall is detected?

Are we seeing a relatively high rate of failure in tests that don't use the FUSE filesystem? I figure we are seeing failures without the FUSE filesystem, based on the fact that #97968 is open. Do we only see disk stalls on GCP? Do we run this test on AWS? I guess there are four possibilities that I can think of: …
I notice that we are using local SSDs, not proper block storage, in all these tests, and that most of the time calls to …
#98202 was also failing similarly, and it didn't use the FUSE filesystem.
Yeah, it's true we'd reasonably expect to see in-progress syncs. The cockroach logs show the effects of the stall in advance of the fatal (e.g., node liveness heartbeats timing out), indicating that for some reason writes aren't completing. The fact that the goroutine count increased, triggering the goroutine dump, is also an indication that things are queueing as a result of the stall.
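To illustrate that symptom in isolation (a toy model, not the actual cockroach write path), the sketch below stalls a shared "write path" behind a mutex and prints the goroutine count climbing while new work queues behind it; the durations and worker counts are arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex // stands in for a stalled write path (e.g. a WAL sync)

	// Simulate the stall: the first "writer" holds the lock for several seconds.
	mu.Lock()
	go func() {
		time.Sleep(3 * time.Second) // the stall
		mu.Unlock()
	}()

	// Work keeps arriving and queues behind the stall, so the goroutine
	// count climbs; in a real node this growth is what can trigger a
	// goroutine dump.
	for tick := 0; tick < 5; tick++ {
		for i := 0; i < 25; i++ {
			go func() {
				mu.Lock() // blocks until the stall clears
				mu.Unlock()
			}()
		}
		fmt.Println("goroutines:", runtime.NumGoroutine())
		time.Sleep(time.Second)
	}
}
```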
Ack! So more likely to be either a real write stall or a crazy bug in …
I ran what is below 30 times & didn't reproduce a stall during the period of the test where we don't expect a stall. Then I ran out of CPU quota. Would like to dig more but oncall tomorrow and next week...
One more note: I looked at all issues linked from #97968. They are all either syncdata or sync (which calls fsync, IIUC), or have no info about the write type since the o11y change hadn't landed yet. At least five syncs! This increases my suspicion that infra is at least partly at fault… but it's just a hunch.
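For reference on the two write types mentioned above: roughly, a SyncData-style write corresponds to fdatasync(2) on Linux, while a plain sync corresponds to fsync(2). A minimal, Linux-only sketch of issuing both from Go (the temp file is just a placeholder):

```go
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.CreateTemp("", "sync-demo-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	if _, err := f.WriteString("payload\n"); err != nil {
		log.Fatal(err)
	}

	// SyncData-style: flush file data only (metadata such as mtime may be skipped).
	if err := syscall.Fdatasync(int(f.Fd())); err != nil {
		log.Fatal(err)
	}

	// Sync-style: flush data and metadata.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```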
Currently, the `disk-stall` tests use local SSDs. When run on GCE VMs, a higher test flake rate is observed due to known issues with fsync latency for local SSDs. Switch the test to use persistent disks instead. Touches: cockroachdb#99372. Release note: None.
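As a rough way to sanity-check the fsync-latency claim on a given disk, one could time a loop of small appends plus fsync and compare tail latencies on a local SSD mount versus a persistent-disk mount. This is a sketch under assumptions: the path, write size, and iteration count are placeholders, and it is not how the roachtest itself measures anything.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"sort"
	"time"
)

func main() {
	// Placeholder path: create the file on the disk under test
	// (e.g. a local SSD mount vs. a persistent-disk mount).
	f, err := os.CreateTemp(".", "fsync-bench-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	const iters = 1000
	latencies := make([]time.Duration, 0, iters)
	buf := make([]byte, 4096)

	for i := 0; i < iters; i++ {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			log.Fatal(err)
		}
		latencies = append(latencies, time.Since(start))
	}

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	fmt.Printf("p50=%s p99=%s max=%s\n",
		latencies[iters/2], latencies[iters*99/100], latencies[iters-1])
}
```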
…99752 #99774

99433: opt: fixup CTE stats on placeholder queries r=cucaroach a=cucaroach
During the optbuilder phase we copy the initial expression's stats into the fake-rel, but this value can change when placeholders are assigned, so add code in AssignPlaceholders to rebuild the CTE if the stats change.
Fixes: #99389 Epic: none Release note: none

99516: metrics: improve ux around _status/vars output r=aadityasondhi a=dhartunian
Previously, the addition of the `tenant` metric label was applied uniformly and could result in confusion for customers who never enable multi-tenancy or c2c. The `tenant="system"` label carries little meaning when there's no tenancy in use. This change modifies the system tenant label application to only happen when a non-system in-process tenant is created. Additionally, an environment variable, `COCKROACH_DISABLE_NODE_AND_TENANT_METRIC_LABELS`, can be set to `false` to disable the new `tenant` and `node_id` labels. This can be used on single-process tenants to disable the `tenant` label.
Resolves: #94668 Epic: CRDB-18798
Release note (ops change): The `COCKROACH_DISABLE_NODE_AND_TENANT_METRIC_LABELS` env var can be used to disable the newly introduced metric labels in the `_status/vars` output if they conflict with a customer's scrape configuration.

99522: jobsprofiler: store DistSQL diagram of jobs in job info r=dt a=adityamaru
This change teaches import, cdc, backup and restore to store their DistSQL plans in the job_info table under a timestamped info key. The generation and writing of the plan diagram is done asynchronously so as to not slow down the execution of the job. A new plan will be stored every time the job sets up its DistSQL flow.
Release note: None Epic: [CRDB-8964](https://cockroachlabs.atlassian.net/browse/CRDB-8964) Informs: #99729

99574: streamingccl: skip acceptance/c2c on remote cluster setup r=stevendanna a=msbutler
acceptance/c2c currently fails when run on a remote cluster. This patch ensures the test gets skipped when run on a remote cluster. There's no need to run the test on a remote cluster because the other c2c roachtests provide sufficient coverage.
Fixes #99553 Release note: none

99691: codeowners: update sql obs to cluster obs r=maryliag a=maryliag
Update mentions of `sql-observability` to `cluster-observability`.
Epic: none Release note: None

99712: ui: connect metrics provider to metrics timescale object r=xinhaoz a=dhartunian
Previously, the `MetricsDataProvider` component queried the redux store for the `TimeScale` object, which contained details of the currently active time window. This piece of state was assumed to update to account for the "live" moving window that metrics show when pre-set lookback time windows are selected. A recent PR, #98331, removed the feature that polled new data from SQL pages, which also disabled polling on metrics pages due to the re-use of `TimeScale`. This commit modifies the `MetricsDataProvider` to instead read the `metricsTime` field of the `TimeScaleState` object. This object was constructed for use by the `MetricsDataProvider` but was not wired up to the component.
Resolves #99524 Epic: None Release note: None

99733: telemetry: add FIPS-specific channel r=knz a=rail
Previously, all official builds were reporting using the same telemetry channel. This PR adds a new telemetry channel for the FIPS build target.
Fixes: CC-24110 Epic: DEVINF-478 Release note: None

99745: spanconfigsqlwatcher: deflake TestSQLWatcherOnEventError r=arulajmani a=arulajmani
Previously, this test was setting the no-op checkpoint duration to be every hour to effectively disable checkpoints. Doing so is integral to what the test is testing. However, this was a lie, given how `util.Every` works -- a call to `ShouldProcess` returns true the very first time. This patch achieves the original goal by introducing a new testing knob. Previously, the test would fail in < 40 runs locally. This has been running strong for ~1000 runs.
Fixes #76765 Release note: None

99747: roachtest: use persistent disks for disk-stall tests r=jbowens a=nicktrav
Currently, the `disk-stall` tests use local SSDs. When run on GCE VMs, a higher test flake rate is observed due to known issues with fsync latency for local SSDs. Switch the test to use persistent disks instead.
Touches: #99372. Release note: None. Epic: CRDB-20293

99752: kvserver: bump tolerance more r=ajwerner a=ajwerner
I'm not digging into this more, but the test is flaky.
Epic: none https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/9161972?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true
Release note: None

99774: *: identify remaining uses of TODOSQLCodec r=stevendanna a=knz
The `TODOSQLCodec` was a bug waiting to happen. The only reasonable remaining purpose is for use in tests. As such, this change moves its definition to a test-only package (we have a linter that verifies that `testutils` is not included in non-test code). This change then identifies the one non-reasonable remaining purpose and identifies it properly as a bug linked to #48123.
Release note: None Epic: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Merged #99747 and associated backports that should help here, using PDs instead of SSDs. I'm going to close this for now.
roachtest.disk-stalled/fuse/log=false,data=false failed with artifacts on master @ 5493fdfec4e1762c4502fb2f5d42fd28292c9c9d:
Parameters:
- ROACHTEST_cloud=gce
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=true
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-25851