roachtest: synctest failed #31948
@petermattis (cockroach/pkg/cli/debug_synctest.go, line 201 in 01d6bda):
I guess this is mostly just good to know. But this indicates a bug in RocksDB (if it isn't a bug in the test). RocksDB shouldn't explode on I/O errors, right?
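For readers, a sketch of the expectation being tested, assuming an error-injecting layer of the kind the synctest simulates (hypothetical illustration, not the actual debug_synctest.go code): when the "disk" starts failing, the engine should surface an ordinary error, never crash.

```go
package main

import (
	"errors"
	"fmt"
)

// errInjected stands in for an I/O error surfaced by a flaky device.
var errInjected = errors.New("injected I/O error")

// faultyWAL simulates a log on a disk that starts failing after N writes.
type faultyWAL struct {
	writesLeft int
	synced     []string
}

// Append either records rec or returns the injected error. The property
// under test: callers get an error back, never a process crash.
func (w *faultyWAL) Append(rec string) error {
	if w.writesLeft == 0 {
		return errInjected
	}
	w.writesLeft--
	w.synced = append(w.synced, rec)
	return nil
}

func main() {
	wal := &faultyWAL{writesLeft: 3}
	for i := 0; ; i++ {
		if err := wal.Append(fmt.Sprintf("record-%d", i)); err != nil {
			// Expected outcome once injection kicks in: a clean error.
			fmt.Println("write failed as expected:", err)
			break
		}
	}
	fmt.Println("records written before failure:", wal.synced)
}
```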
For posterity (when the TeamCity artifacts get deleted):
RocksDB should not explode on I/O errors. It could be a bug in our C++ code, though.
This is interesting:
An address that is in the first page of the process's address space usually indicates a NULL pointer dereference.
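To make the first-page heuristic concrete, a minimal Go illustration of my own (not from the thread): dereferencing a field through a nil struct pointer faults at an address equal to the field's offset from the struct base, which lands within the first page.

```go
package main

import "fmt"

// replica is a stand-in struct; the blank field pads state to offset 24.
type replica struct {
	_     [24]byte
	state int64
}

func main() {
	var r *replica // nil
	// Loading r.state dereferences address 0+24, i.e. 0x18: well inside
	// the first 4 KiB page. Go reports it as:
	//   panic: runtime error: invalid memory address or nil pointer dereference
	//   [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 ...]
	fmt.Println(r.state)
}
```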
We probably want to get a core dump then. I think this is easy to set up on the Go side (…)
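A minimal sketch of one way to arrange that on the Go side, assuming the truncated parenthetical refers to Go's traceback knob (running with GOTRACEBACK=crash in the environment has the same effect):

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to GOTRACEBACK=crash: after printing goroutine
	// tracebacks, the runtime raises SIGABRT so the OS writes a core
	// file (provided core dumps are enabled, e.g. `ulimit -c unlimited`).
	debug.SetTraceback("crash")

	// Any unrecovered panic or fatal signal now ends in a core dump
	// that gdb or dlv can inspect.
	panic("boom")
}
```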
SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog
I was able to reproduce this by creating a single-node cluster and running …
Reproduced again on the 30th run (~5h), this time with a core. I'm not seeing anything interesting here.
I've done serious debugging with gdb and core files before, but never with Go. I'm not sure what I should be looking for. The above stack is for the thread which caught the signal, which should be the thread that generated the signal. But the stack is corrupt? I looked at the stacks of the other threads (18 in total) and nothing jumped out as suspicious. About half seemed to be threads created by the Go runtime and half threads created by RocksDB. For example:
Hmm, maybe …
Ok, this time with …
Just as we saw from the panic, the crash occurred during a flush:
This only shows the Go side. I think the segfault happened in C++ land. Looking at the backtrace of the thread that caught the signal shows something similar to …
It's late. I'll pick this up again tomorrow.
Seems likely this is a bug in RocksDB. Worth tracking down, though not super urgent. I'm going to put this on the back burner.
This reverts commit a321424. We have a report with credible testing that using FlushWAL instead of SyncWAL causes data loss in disk full situations. Presumably there is some error that is not being propagated correctly. Possibly related to cockroachdb#31948. See cockroachdb#25173. Possibly unrelated, but the symptom is the same. Release note (bug fix): Fix a node data loss bug that occurs when a disk becomes temporarily full.
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis This reverts commit a321424. We have a report with credible testing that using FlushWAL instead of SyncWAL causes data loss in disk full situations. Presumably there is some error that is not being propagated correctly. Possibly related to #31948. See #25173. Possibly unrelated, but the symptom is the same. Release note (bug fix): Fix a node data loss bug that occurs when a disk becomes temporarily full. Co-authored-by: Peter Mattis <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/e6cb0c5c329617b560eee37527248171b5e06382
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1038478&tab=buildLog
The failure above was expected, given that I was able to reproduce a failure on top of the revert of … It's interesting that all of the failures have the same stack trace, ending with …
SHA: https://github.com/cockroachdb/cockroach/commits/06d2222fd9010f01a8cdf6a6c24597bbed181f36
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1044826&tab=buildLog
The previous failure was before #32931. I've since tried to reproduce on top of #32931 and have had 118 successful runs and 0 failures over the course of almost 20 hours. I'm going to leave this running for a while longer, and will declare success later today if I don't see a failure. The theory is that whatever crash was occurring in RocksDB was fixed between 5.13 and 5.17.2.
182 runs and 0 failures over the course of 30 hours. Previously I was able to see a failure every 5-6 hours.
SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog