server: write directly to fs in disk stall detection, pull RocksDB stats #34224
The node crashed
Thanks @awoods187. cc @dt/@petermattis -- could this be stalled writes because of L0 growing too large? I don't think so because I can't see this at all in the logs (though the 60s log interval means I might've missed it):
(This is just one I picked; the numbers vary, but the last line is always all zeroes. Note that I didn't get the one from the node that died, because all I have is the debug zip, which doesn't contain dead nodes' logs.) Are these stats bad at all? L0 seems to be empty, which is good. What's
We should also rework the disk stall detection to distinguish between RocksDB slowness and I/O slowness: at the very least by including the RocksDB log in the crash report, but perhaps it's also worth not crashing at all when RocksDB is blocking writes while the disk still works.
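As a very rough illustration of the "write directly to the fs" idea, the liveness probe could bypass the engine entirely and sync a small file on the store directory, so that a slow probe implicates the disk itself rather than an engine-internal write stall. Everything below (package, names, thresholds) is a sketch under that assumption, not the actual implementation:

```go
// Sketch of a standalone disk-liveness check: write and fsync a small file
// directly on the store's filesystem, bypassing RocksDB, so that a hung probe
// points at the disk rather than at engine-internal write stalls.
package diskprobe

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// ProbeDisk writes and fsyncs a small marker file under dir.
func ProbeDisk(dir string) error {
	f, err := os.OpenFile(filepath.Join(dir, ".disk-probe"),
		os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write([]byte("probe")); err != nil {
		return err
	}
	return f.Sync()
}

// MonitorDisk runs ProbeDisk on every tick and fatals only when a probe
// exceeds maxSyncDuration, i.e. when the filesystem itself is unresponsive.
func MonitorDisk(dir string, interval, maxSyncDuration time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		done := make(chan error, 1)
		go func() { done <- ProbeDisk(dir) }()
		select {
		case err := <-done:
			if err != nil {
				log.Printf("disk probe failed on %s: %v", dir, err)
			}
		case <-time.After(maxSyncDuration):
			log.Fatalf("disk stall detected: probe on %s exceeded %s", dir, maxSyncDuration)
		}
	}
}
```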
I'm not sure what that is. I don't see anything problematic in these stats. No stalls. No L0 files. Most files in L6 (as expected with dynamic_level_bytes).
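For reference, a hedged sketch of what periodically pulling and logging these per-level stats could look like on the Go side. The Engine interface here is a hypothetical stand-in for the real engine handle; "rocksdb.stats" is RocksDB's standard human-readable stats property:

```go
// Sketch of periodically dumping engine stats into the node log so that
// per-level file counts and stall counters are still visible after a crash.
package statslog

import (
	"log"
	"time"
)

// Engine is a hypothetical handle to the underlying RocksDB instance.
type Engine interface {
	GetProperty(name string) (string, error)
}

// LogStatsEvery dumps the stats on a fixed interval; an interval shorter than
// the default 60s log period makes transient L0 buildup easier to catch.
func LogStatsEvery(e Engine, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			stats, err := e.GetProperty("rocksdb.stats")
			if err != nil {
				log.Printf("could not pull rocksdb stats: %v", err)
				continue
			}
			log.Printf("rocksdb stats:\n%s", stats)
		case <-stop:
			return
		}
	}
}
```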
The RocksDB event listener stuff includes an
Seems worthwhile to log this info at the very least.
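A hedged sketch of the kind of logging that could hang off such a listener; the hook and struct names below are hypothetical stand-ins, not the actual RocksDB or libroach API:

```go
// Illustrative sketch of surfacing engine write-stall transitions in the logs,
// so a later crash report can tell engine-internal stalls from a dead disk.
package enginelog

import "log"

// WriteStallInfo is a hypothetical summary of a stall-state transition.
type WriteStallInfo struct {
	ColumnFamily string
	Prev, Cur    string // e.g. "normal", "delayed", "stopped"
}

// EngineEventListener is a hypothetical hook invoked by the storage engine.
type EngineEventListener interface {
	OnWriteStallChanged(info WriteStallInfo)
}

// loggingListener records each transition so the disk-stall check (and the
// crash report) can distinguish a stalled engine from an unresponsive disk.
type loggingListener struct{}

func (loggingListener) OnWriteStallChanged(info WriteStallInfo) {
	log.Printf("engine write stall on %s: %s -> %s", info.ColumnFamily, info.Prev, info.Cur)
}
```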
Repurposing the issue as retitled.
Make it a solid 3/3.
Even with running it on 3 of my 4 clusters, I still hit this.
Discussed in private: I gave the wrong env var - it needs to be COCKROACH_ENGINE_..., not COCKROACH_LOG....
So I largely got this working today (7 successes out of 8) on the restore, but I hit it one more time even with
What was the crash? It couldn't be the same one, though it could be a similar-looking one from the log directory. What provider/machine was that on?
I hit this again today (but I didn't use -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h).
Describe the problem
I set up a six-node TPC-C cluster with load-based splitting turned off, and it resulted in a dead node during import.
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
Expected behavior
I expect the import to complete without resulting in a dead node.
Additional data / screenshots
Environment:
Jira issue: CRDB-4660