roachtest: disk-stalled/dmsetup failed #102946
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ 6c6d6657a21dd94d98e4c99c8d6c64f6353a5774:
Parameters: |
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ 6ceddbd9dc6b987add91ea93a665088e7928cb88:
Parameters: |
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ bc953426b8cc479406c3b54b99154eb15fa4107b:
Parameters: |
We don't get any artifacts during these failures, because the stalled disk prevents collection of artifacts:
It looks like the deferred
Previously, if the disk-stalled/* roachtests failed while the disk was stalled, a defer would attempt to unstall the disk before completing the test. This could fail if the context was cancelled. The stalled disk would then prevent collection of artifacts. This change updates the deferred Unstall call to use a background context that will never be cancelled.

Epic: none
Informs cockroachdb#102946
Release note: none
I have not been able to get a reproduction on
In n1's logs, the first sign of a stall is a failure to update node liveness:
Things appear to still be stalled ~15-17s after the stall was induced:
The test failed at
The failure unstalled the disk (this test was run with #103198). Shortly afterwards, the logs show the node's writes progressing:
I'm going to edit the roachtest to wait longer before failing the test so that I can get a stack dump during the stall.
It appears eventually we did trigger the emergency func:
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ b4533bdbc4b478f0ad311bad80b62bd072cf61cf:
Parameters: |
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ b4533bdbc4b478f0ad311bad80b62bd072cf61cf:
Parameters: |
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ 27ea509a79d5cdbcf21c53649146d893139e631a:
Parameters: |
```
9f426d5c Revert "vfs: include size of write in DiskSlowInfo"
949d9808 Revert "vfs: mark file basename as safe to avoid log redaction"
```

Fixes: cockroachdb#103185
Fixes: cockroachdb#102946
Fixes: cockroachdb#102944
Fixes: cockroachdb#102940
Release note: none
When addressing an interface change in a recent vendor bump (cockroachdb#102882), a closure over the existing event listener (passed in via `opts`) was removed in favor of instantiating Pebble's event listener directly. This had the effect of dropping disk slow events.

Revert to using a function that closed over the `opts`, making use of the existing event listener.

Fixes: cockroachdb#103185
Fixes: cockroachdb#102946
Fixes: cockroachdb#102944
Fixes: cockroachdb#102940
Release note: None.
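To illustrate why replacing the closure dropped events, here is a rough sketch using simplified stand-in types. `DiskSlowInfo`, `EventListener`, `Options`, `freshListener`, and `closedOverListener` are all hypothetical and defined locally for illustration; they are not Pebble's or CockroachDB's real definitions, and the real wiring is assumed to be more involved.

```go
package main

import "fmt"

// DiskSlowInfo and EventListener are simplified, hypothetical stand-ins.
type DiskSlowInfo struct{ Path string }

type EventListener struct {
	DiskSlow func(DiskSlowInfo)
}

// Options carries the caller's existing listener, as `opts` does in the text.
type Options struct {
	EventListener *EventListener
}

// freshListener mimics the regression: it builds a brand-new listener and
// never consults opts.EventListener, so the caller's handler is never invoked.
func freshListener(opts *Options) *EventListener {
	return &EventListener{DiskSlow: func(DiskSlowInfo) {}}
}

// closedOverListener mimics the fix: the returned handler closes over the
// listener that was already present in opts and forwards events to it.
func closedOverListener(opts *Options) *EventListener {
	prev := opts.EventListener // capture before anything replaces it
	return &EventListener{DiskSlow: func(info DiskSlowInfo) {
		if prev != nil && prev.DiskSlow != nil {
			prev.DiskSlow(info)
		}
	}}
}

func main() {
	var seen []string
	opts := &Options{EventListener: &EventListener{
		DiskSlow: func(i DiskSlowInfo) { seen = append(seen, i.Path) },
	}}

	freshListener(opts).DiskSlow(DiskSlowInfo{Path: "dropped"})
	closedOverListener(opts).DiskSlow(DiskSlowInfo{Path: "delivered"})

	fmt.Println(seen) // prints: [delivered]
}
```

Capturing `prev` up front is a design choice in this sketch: it avoids the wrapper calling itself if `opts.EventListener` is later reassigned to the wrapper.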
Done in #103339.
103198: roachtest: always unstall on disk-stall failure r=RaduBerinde a=jbowens

Previously, if the disk-stalled/* roachtests failed while the disk was stalled, a defer would attempt to unstall the disk before completing the test. This could fail if the context was cancelled. The stalled disk would then prevent collection of artifacts. This change updates the deferred Unstall call to use a background context that will never be cancelled.

Epic: none
Informs #102946
Release note: none

103246: kv: tolerate write skew under weak isolation levels r=arulajmani a=nvanbenschoten

_or "support kv-level snapshot isolation"._

Fixes #100131.

This commit adds support for weak transaction isolation levels (Snapshot and Read Committed) to commit even when their read and write timestamps are skewed. Thanks to prior cleanup and refactoring, this change is limited to an update to the transaction commit condition and a clarification of the one-phase commit requirements, along with a collection of updates to tests.

Release note: None

103548: roachtest: add point-tombstone/heterogeneous-value-sizes roachtest r=jbowens a=jbowens

Introduce a new roachtest that exercises a current gap in Pebble's point tombstone heuristics. Rows with a heterogeneous size distribution can be problematic, because Pebble's existing heuristics rely on average value sizes in order to encourage disk-space reclaiming compactions. With heterogeneous value sizes, these heuristics can dramatically over- or under-estimate the amount of disk space reclaimed.

This new roachtest runs a kv0 workload for a fixed number of rows, all with 1MiB values. It then runs another kv0 workload for 4x the number of rows, all with 4KiB values. Then it deletes all the 1MiB-valued rows, with a reduced TTL, and expects that a reasonable amount of disk space is reclaimed.

Currently, this roachtest is skipped. With a recent nightly build of Cockroach, the test times out, stalled with the approximate disk-bytes size of the `kv` table stagnant at 92 GiB, despite the MVCC logical size of the table totalling just 1.4 GiB.

```
databaseID: 104, tableID: 106, rangeCount: 3003, approxDiskBytes: 92 GiB, liveBytes: 1.4 GiB, totalBytes: 1.4 GiB, livePercentage: 1.0
```

Examining one store's sstable properties reveals the point tombstones remain uncompacted in levels L3, L4 and L5.

```
                                 L0       L1       L2       L3       L4       L5       L6    TOTAL
count                             0        0        0       23      122      545     1192     1882
seq num smallest                  0        0        0  2973624  2291108   460760   145525   145525
        largest                   0        0        0  6014495  4886656  4063619  2833606  6014495
size data                       0 B      0 B      0 B     62 M    480 M    3.6 G     26 G     31 G
     blocks                       0        0        0     2513     7859     8292    29255    47919
     index                      0 B      0 B      0 B    123 K    344 K    356 K    1.2 M    2.0 M
     blocks                       0        0        0       23      122      545     1192     1882
     top-level                  0 B      0 B      0 B      0 B      0 B      0 B      0 B      0 B
     filter                     0 B      0 B      0 B    116 K    104 K    123 K    164 K    508 K
     raw-key                    0 B      0 B      0 B    5.1 M    2.6 M    2.5 M    3.2 M     13 M
     raw-value                  0 B      0 B      0 B     74 M    485 M    3.6 G     26 G     31 G
     pinned-key                 0 B      0 B      0 B      0 B      0 B      0 B      0 B      0 B
     pinned-value               0 B      0 B      0 B      0 B      0 B      0 B      0 B      0 B
records set                       0        0        0     33 K     67 K     51 K     91 K    243 K
        delete                    0        0        0    119 K    8.2 K    9.1 K        0    136 K
        range-delete              0        0        0        0        0        0        0        0
        range-key-sets            0        0        0        0        0        0        0        0
        range-key-unsets          0        0        0        0        0        0        0        0
        range-key-deletes         0        0        0        0        0        0        0        0
        merge                     0        0        0    6.8 K    6.8 K     14 K    6.7 K     34 K
        pinned                    0        0        0        0        0        0        0        0
```

<img width="729" alt="Screenshot 2023-05-17 at 4 27 17 PM" src="https://github.com/cockroachdb/cockroach/assets/867352/d8e3188a-75fb-4670-9c61-e5ff8b369894">
<img width="719" alt="Screenshot 2023-05-17 at 4 27 08 PM" src="https://github.com/cockroachdb/cockroach/assets/867352/cd787935-9ea0-4515-a314-2a279243ac0f">

Informs cockroachdb/pebble#2340.
Epic: CRDB-25405
Release note: none

Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
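As a back-of-the-envelope illustration of the heterogeneous-value-size problem described in the point-tombstone roachtest above, the following toy model estimates reclaimable bytes as "tombstone count times average value size". This is an assumed simplification for illustration only, not Pebble's actual heuristic; `valueClass`, `averageValueSize`, and `estimateReclaimed` are hypothetical names, and the row count of 1000 is arbitrary.

```go
package main

import "fmt"

// valueClass describes a group of rows that share one value size (toy model).
type valueClass struct {
	rows      int64
	valueSize int64 // bytes
}

// averageValueSize returns the mean value size across all classes.
func averageValueSize(classes []valueClass) int64 {
	var rows, bytes int64
	for _, c := range classes {
		rows += c.rows
		bytes += c.rows * c.valueSize
	}
	if rows == 0 {
		return 0
	}
	return bytes / rows
}

// estimateReclaimed assumes every tombstone reclaims one average-sized value.
func estimateReclaimed(tombstones int64, classes []valueClass) int64 {
	return tombstones * averageValueSize(classes)
}

func main() {
	const n = 1000 // arbitrary row count, chosen only for the example
	classes := []valueClass{
		{rows: n, valueSize: 1 << 20},     // 1 MiB values
		{rows: 4 * n, valueSize: 4 << 10}, // 4x as many rows with 4 KiB values
	}
	estimate := estimateReclaimed(n, classes) // deleting all the 1 MiB rows
	actual := int64(n) * (1 << 20)
	fmt.Printf("estimated=%d actual=%d\n", estimate, actual)
}
```

With this mix the average value is 208 KiB, so the toy estimate comes out at roughly one fifth of the bytes actually held by the deleted 1 MiB rows, which is the kind of under-estimate the test is meant to expose.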
roachtest.disk-stalled/dmsetup failed with artifacts on release-23.1 @ 5376e479204d6e8243f67c30aea3d031df529afd:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-27751