roachtest: clearrange/checks=true failed #44845
(roachtest).clearrange/checks=true failed on release-19.1@ffbadbb6e8ac7d7376611e9487f505428a24d90d:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@1fcf7104d19c5c7634cfb52c4302bc9e70c4b9ea:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@ca235a18adac0241b4e3baf144c7ff7689d952c9:
Artifacts: /clearrange/checks=true See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@c406bb10543ca97010c64cc230a3c45690a7eb6c:
Artifacts: /clearrange/checks=true See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@d556976a57c52e188157469ec9a64d8f388a79e9:
Artifacts: /clearrange/checks=true See this test on roachdash |
In the most recent failure, node 7 died with:
Looks like this happened during the import phase of the test, which is surprising. The last compaction stats output to the logs show:
That seems reasonable, and not terribly different from another node:
Not sure what happened here. Perhaps a lot of disk space is being used elsewhere.
(roachtest).clearrange/checks=true failed on release-19.1@cd9ecd90d2ce0f5caf362d6ffa6f782e91640837:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@73a373fb8c138c8ef6e4a05d7c1757207efa0a8d:
Artifacts: /clearrange/checks=true
See this test on roachdash |
Similar to what was reported in #44845 (comment), one of the nodes died during the import due to being out of space:
@jlinder what was the plan to deal with out-of-disk errors?
Was a plan ever discussed? We might just be pushing too close to the cluster capacity with this setup. We could reduce the size of the import, or switch to using EBS with larger volumes, to provide more breathing room.
I don't remember discussion of such a plan. The obvious fixes to me are to increase the disk size for the tests in question, or to change the tests to be more considerate of how they use disk (if that's an option). Since roachprod can be told the machine type and the amount of disk to use for cluster nodes, would updating roachtest to use different machine types or more disk work?
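As a rough illustration of the sizing question above, here is a back-of-the-envelope headroom check. This is not actual roachtest or roachprod code; the fixture size, node count, disk size, and headroom fraction below are all assumptions to be adjusted for the real cluster:

```go
package main

import "fmt"

func main() {
	const (
		importedLogicalGB = 2000 // assumed rough logical size of the imported fixture
		replicationFactor = 3
		nodes             = 10  // assumed cluster size
		diskPerNodeGB     = 500 // assumed disk size for the machine type
		headroomFraction  = 0.3 // keep ~30% free for compactions, rebalancing, sideloaded ssts
	)
	perNodeGB := float64(importedLogicalGB) * replicationFactor / nodes
	budgetGB := diskPerNodeGB * (1 - headroomFraction)
	fmt.Printf("~%.0f GB per node vs. a %.0f GB budget; enough headroom: %v\n",
		perNodeGB, budgetGB, perNodeGB <= budgetGB)
}
```

Shrinking the import or growing the disks in a calculation like this is effectively what the two options discussed above (a smaller fixture vs. larger EBS volumes or machine types) amount to.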
Reducing the size of the import used by the test is another option. Cc @dt in case you know of a recent change that could have affected disk imbalances during IMPORTs.
I don't know of anything that has changed there -- I don't think we've touched anything on the bulk side. How recent is "as of late"? Pebble compaction differences could have changed it, or, going back a lot further, the switch to larger ranges could be relevant.

In IMPORT we issue a split, and scatter the resulting range, any time the data producer process has sent 48 MB of data without hitting a range boundary, i.e. when it has sent that much to a single range. That threshold was picked back when the range size was 64 MB, since it meant the range was 75% full. We left it that way with the move to larger ranges and just let merges clean up afterwards, since we were already fighting hotspots and inverted LSMs and didn't want to make anything worse at the time. The normal KV background splitting and rebalancing is also enabled throughout the IMPORT, on ranges that fill bit-by-bit over time from separate small flushes.

That said, we've seen frequent cases of the allocator doing nothing when we ask it to scatter a range, even when disk space or load is not balanced, sometimes because it looks at MVCC byte counts and not actual storage bytes.
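A minimal sketch of the split-and-scatter heuristic described above. The names here are illustrative only, not the actual IMPORT code; `splitAndScatter` and the byte accounting stand in for the real AdminSplit/AdminScatter plumbing:

```go
package main

const splitThreshold = 48 << 20 // 48 MB; 75% of the old 64 MB default range size

type producer struct {
	sentToCurrentRange int64
}

// addKV is called for each key/value pair the data producer emits.
func (p *producer) addKV(key []byte, valueLen int, crossedRangeBoundary bool) {
	if crossedRangeBoundary {
		// Hitting a natural range boundary resets the byte count.
		p.sentToCurrentRange = 0
	}
	p.sentToCurrentRange += int64(len(key) + valueLen)
	if p.sentToCurrentRange >= splitThreshold {
		// Split at the current key and scatter the resulting range so that
		// subsequent data lands elsewhere.
		splitAndScatter(key)
		p.sentToCurrentRange = 0
	}
}

// splitAndScatter is a stand-in for issuing AdminSplit + AdminScatter requests.
func splitAndScatter(key []byte) {}

func main() {}
```

With the larger default range size the same 48 MB threshold produces many more small ranges up front and, as described above, relies on merges to clean up afterwards.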
The failures predate the switch to using Pebble as the default. For example, my message on May 3 was before that switch. It may not be safe to assume the first failure on this issue was also due to out-of-disk, but if it was, that bounds when the problem started. The switch to larger ranges landed on Feb 19, and the first failure on this issue was Feb 7, though many more failures have occurred since then.
@jbowens You've been running the |
(roachtest).clearrange/checks=true failed on release-19.1@0c04a92ba19eedd4762ca7feb8361433682f3ded:
Artifacts: /clearrange/checks=true
See this test on roachdash |
From the debug.zip, node 7's last reported capacity used is 381.9 GB with only 1.34 GB available, and the compactor queue shows:
I'll try to reproduce this with instrumentation on the release-19.1 branch tomorrow. |
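For reference, those figures put the store at roughly 99.65% full. A quick way to compute the fill fraction from the reported numbers:

```go
package main

import "fmt"

func main() {
	// Capacity figures reported above for node 7, in GB.
	used, avail := 381.9, 1.34
	fmt.Printf("store is %.2f%% full\n", 100*used/(used+avail))
}
```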
(roachtest).clearrange/checks=true failed on release-19.1@8ecf958ac06ee10391ceb108ba11a745de8ff4b1:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@7c03505d8daa19dee7f5f0268c9e728e38d4ba6d:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@86b7271623ad797e9c42d5f7900a5cb424fed436:
Artifacts: /clearrange/checks=true
See this test on roachdash |
(roachtest).clearrange/checks=true failed on release-19.1@efeb30fcc83c76819a832e7f12c91c891dbe0e68:
Artifacts: /clearrange/checks=true
See this test on roachdash |
@itsbilal got a reproduction
@itsbilal noticed this test failing often on AWS and never on GCP while trying to reproduce #52720. I never noticed that all the failures were specifically on AWS, and I only tried to reproduce it on GCP. Oops. None of the nodes had very much disk space headroom around when n1 ran out of space.
On AWS, this test uses a …

On the dead n1:
The earliest of these sideloaded sstables, for r2113, appears in the logs here:
The node's last log line before panicking was at 14:32:06.691529. Is it expected for a sideloaded sstable to be sitting around for > 20 minutes?
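One way to look for lingering sideloaded sstables like these is to walk the store's sideloading directory and report files older than a cutoff. This is a sketch, not CockroachDB tooling; the path below is an assumption about the store layout and should be adjusted for the actual data directory:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func main() {
	dir := "/mnt/data1/cockroach/auxiliary/sideloading" // assumed location of sideloaded ssts
	cutoff := time.Now().Add(-20 * time.Minute)
	var total int64
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".sst") {
			return nil
		}
		if info.ModTime().Before(cutoff) {
			// Report each sideloaded sstable older than the cutoff.
			total += info.Size()
			fmt.Printf("%s\t%d bytes\t%s\n", path, info.Size(), info.ModTime().Format(time.RFC3339))
		}
		return nil
	})
	fmt.Printf("total bytes in stale sideloaded sstables: %d\n", total)
}
```

Comparing the total against the node's remaining free space gives a sense of whether lingering sideloaded files account for the missing headroom.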
Fixed in #53572.
(roachtest).clearrange/checks=true failed on release-19.1@407017cad14dfa63f19578055082dc10f3283cc4:
Artifacts: /clearrange/checks=true
Related:
See this test on roachdash
powered by pkg/cmd/internal/issues