
roachtest: synctest failed #31948

Closed
cockroach-teamcity opened this issue Oct 27, 2018 · 15 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,synctest.go:68: /home/agent/work/.go/bin/roachprod run teamcity-990073-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		0480ba0, 0xd)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc420480b90)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc420480a80)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:656 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:644 +0x418
		
		goroutine 23 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x31a9de0, 0xc420714090)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202ae640, 0xc4201cec60, 0xc4207872c0)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad
		Error:  exit status 2
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 27, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 27, 2018
@tbg commented Oct 27, 2018

@petermattis a db.Flush straight-up panicked, likely due to an error that was inserted.

if err := db.Flush(); err != nil {

I guess this is mostly just good to know. But this indicates a bug in RocksDB (if it isn't a bug in the test). RocksDB shouldn't explode on I/O errors, right?

@petermattis

For posterity (when the teamcity artifacts get deleted):

non-Go function
	pc=0x7f485d12dcc0
runtime.cgocall(0x2108c30, 0xc4208636a0, 0x0)
	/usr/local/go/src/runtime/cgocall.go:128 +0x64 fp=0xc420863670 sp=0xc420863638 pc=0x6b38d4
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush(0x7f485c659070, 0x0, 0x0)
	_cgo_gotypes.go:571 +0x49 fp=0xc4208636a0 sp=0xc420863670 pc=0x11ab279
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1(0x7f485c659070, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x5b fp=0xc4208636e8 sp=0xc4208636a0 pc=0x11c312b
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush(0xc420480a80, 0xc4203060f8, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x32 fp=0xc420863728 sp=0xc4208636e8 pc=0x11b1722
github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4(0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201 +0x25f fp=0xc420863868 sp=0xc420863728 pc=0x20f0b7f
github.com/cockroachdb/cockroach/pkg/cli.runSyncer(0x31a9da0, 0xc4207d14a0, 0xc42038c020, 0x1d, 0x0, 0x3188ce0, 0xc420082e80, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:220 +0x7ad fp=0xc420863bd0 sp=0xc420863868 pc=0x20b96ad
github.com/cockroachdb/cockroach/pkg/cli.runDebugSyncTest(0x444b3a0, 0xc42037a1e0, 0x2, 0x2, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:97 +0x21c fp=0xc420863ca8 sp=0xc420863bd0 pc=0x20b8ddc
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute(0x444b3a0, 0xc42037a1a0, 0x2, 0x2, 0x444b3a0, 0xc42037a1a0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:698 +0x46d fp=0xc420863d50 sp=0xc420863ca8 pc=0x1df57dd
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x4448b40, 0x8, 0x0, 0xc420863ec8)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:783 +0x2e4 fp=0xc420863e80 sp=0xc420863d50 pc=0x1df5f54
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute(0x4448b40, 0x18, 0xc420863ee0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:736 +0x2b fp=0xc420863eb0 sp=0xc420863e80 pc=0x1df5c4b
github.com/cockroachdb/cockroach/pkg/cli.Run(0xc42004c150, 0x4, 0x4, 0xc420052088, 0xc42008d980)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:193 +0x6d fp=0xc420863ed8 sp=0xc420863eb0 pc=0x20ad32d
github.com/cockroachdb/cockroach/pkg/cli.Main()
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:69 +0x223 fp=0xc420863f78 sp=0xc420863ed8 pc=0x20acc53
main.main()
	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:30 +0x20 fp=0xc420863f88 sp=0xc420863f78 pc=0x2108180
runtime.main()
	/usr/local/go/src/runtime/proc.go:198 +0x212 fp=0xc420863fe0 sp=0xc420863f88 pc=0x6deeb2
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420863fe8 sp=0xc420863fe0 pc=0x70cd01

But this indicates a bug in RocksDB (if it isn't a bug in the test). RocksDB shouldn't explode on I/O errors, right?

RocksDB should not explode on I/O errors. It could be a bug in our C++ code, though.

@petermattis

This is interesting:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7f485d12dcc0]

An address that is in the first page of the process's address space usually indicates a NULL pointer dereference.

@tbg commented Oct 30, 2018

We probably want to get a core dump then. I think this is easy to set up on the Go side (GOTRACEBACK=crash) but will involve the usual juggle to actually get the OS to dump the core correctly. Bothersome, but straightforward. And we've solved that issue on our roachprod machines already if memory serves correctly.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
	test.go:639,test.go:651: newCluster interrupted

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
	test.go:639,test.go:651: newCluster interrupted

@petermattis

I was able to reproduce this by creating a single node cluster and running roachtest run -c <cluster> --count=100 synctest. Each synctest run takes ~10m and the failure occurred on the 13th run. Almost exactly the same symptoms:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7ff132d2ecc0]
...

goroutine 1 [syscall]:
non-Go function
        pc=0x7ff132d2ecc0
runtime.cgocall(0x21b8db0, 0xc420b8f698, 0x0)
        /usr/local/go/src/runtime/cgocall.go:128 +0x64 fp=0xc420b8f668 sp=0xc420b8f630 pc=0x6b3a04
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush(0x7ff12765f7d0, 0x0, 0x0)
        _cgo_gotypes.go:573 +0x49 fp=0xc420b8f698 sp=0xc420b8f668 pc=0x11be969
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1(0x7ff12765f7d0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x5b fp=0xc420b8f6e0 sp=0xc420b8f698
 pc=0x11d78db
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush(0xc420286000, 0xc4202b7ca8, 0x3)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x32 fp=0xc420b8f720 sp=0xc420b8f6e0
 pc=0x11c5062
github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201 +0x25f fp=0xc420b8f868 sp=0xc420b8f720 pc
=0x21a0b9f
...

@petermattis

Reproduced again on the 30th run (~5h), this time with a core. I'm not seeing anything interesting here.

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#1  0x00000000006f321b in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:422
#2  0x00000000006f37d8 in runtime.sigfwdgo (ctx=0xc420009ac0, info=0xc420009bf0, sig=6, ~r3=false)
    at /usr/local/go/src/runtime/signal_unix.go:631
#3  0x00000000006f2ad8 in runtime.sigtrampgo (ctx=0xc420009ac0, info=0xc420009bf0, sig=6)
    at /usr/local/go/src/runtime/signal_unix.go:287
#4  0x000000000070e163 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:352
#5  0x00000000021baa01 in x_cgo_callers (sig=6, info=0xc420009bf0, context=0x20, cgoTraceback=<optimized out>,
    cgoCallers=<optimized out>, sigtramp=0xc420009918) at gcc_traceback.c:30
#6  <signal handler called>
#7  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#8  0x00000000006f321b in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:422
#9  0x00000000006f33e9 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:520
#10 0x00000000006ddd22 in runtime.dopanic_m (gp=0x4b43640 <runtime.g0>, pc=7198577, sp=140732356293904)
    at /usr/local/go/src/runtime/panic.go:751
#11 0x0000000000708bbc in runtime.dopanic.func1 () at /usr/local/go/src/runtime/panic.go:598
#12 0x00000000006dd69a in runtime.dopanic (unused=0) at /usr/local/go/src/runtime/panic.go:597
#13 0x00000000006dd771 in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:616
#14 0x00000000006f31ce in runtime.sigpanic () at /usr/local/go/src/runtime/signal_unix.go:372
#15 0x00007f73a55b8cc0 in _IO_vfprintf_internal (s=0x20, format=<optimized out>, ap=0x0) at vfprintf.c:1632
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I've done serious debugging with gdb and core files before, but never with Go, so I'm not sure what I should be looking for. The above stack is for the thread that caught the signal, which should also be the thread that generated it. But the stack is corrupt? I looked at the stacks of the other threads (18 in total) and nothing jumped out as suspicious. About half seemed to be threads created by the Go runtime and half threads created by RocksDB. For example:

(gdb) thread 11
[Switching to thread 11 (Thread 0x7f739dbfe700 (LWP 15508))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) where
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000262642e in __gthread_cond_wait(pthread_cond_t*, pthread_mutex_t*) ()
#2  0x00000000026264ec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#3  0x00000000025092ef in rocksdb::ThreadPoolImpl::Impl::BGThread ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/threadpool_imp.cc:195
#4  0x00000000025096cf in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/threadpool_imp.cc:302
#5  0x000000000267c300 in execute_native_thread_routine ()
#6  0x00007f73a5e666ba in start_thread (arg=0x7f739dbfe700) at pthread_create.c:333
#7  0x00007f73a567141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Hmm, maybe dlv will be better...

@petermattis

Ok, this time with dlv. The process had 19 goroutines when it crashed:

(dlv) goroutines
[19 goroutines]
* Goroutine 1 - User: _cgo_gotypes.go:573 github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush (0x11be969)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 5 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x6c1d32)
  Goroutine 6 - User: /usr/local/go/src/runtime/sigqueue.go:139 os/signal.signal_recv (0x6f4056)
  Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 8 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 19 - User: /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1203 github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon (0xceaeb1)
  Goroutine 20 - User: /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:626 github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher (0xce836b)
  Goroutine 29 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 35 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 38 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 50 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x6c1d32)
  Goroutine 51 - User: /go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:144 github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view.(*worker).start (0x133e4dd)
  Goroutine 235 - User: /usr/local/go/src/runtime/sema.go:510 sync.runtime_notifyListWait (0x6f0efb)
  Goroutine 236 - User: /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3 (0x21a07a4)

Just as we saw from the panic, the crash occurred during a flush:

(dlv) bt
 0  0x000000000070bad2 in runtime.asmcgocall
    at /usr/local/go/src/runtime/asm_amd64.s:674
 1  0x00000000006b3a28 in runtime.cgocall
    at /usr/local/go/src/runtime/cgocall.go:131
 2  0x00000000011be969 in github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush
    at _cgo_gotypes.go:573
 3  0x00000000011d78db in github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1
    at /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922
 4  0x00000000011c5062 in github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush
    at /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922
 5  0x00000000021a0b9f in github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201
 6  0x00000000021691ad in github.com/cockroachdb/cockroach/pkg/cli.runSyncer
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:220
 7  0x00000000021688dc in github.com/cockroachdb/cockroach/pkg/cli.runDebugSyncTest
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:97
 8  0x0000000001f668bd in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:698
 9  0x0000000001f67034 in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:783
10  0x0000000001f66d2b in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:736
11  0x000000000215cdad in github.com/cockroachdb/cockroach/pkg/cli.Run
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:194
12  0x000000000215c6d3 in github.com/cockroachdb/cockroach/pkg/cli.Main
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:69
13  0x00000000021b8300 in main.main
    at /go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:30
14  0x00000000006defe2 in runtime.main
    at /usr/local/go/src/runtime/proc.go:198
15  0x000000000070ce31 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:2361

This only shows the Go side. I think the segfault happened in C++ land. Looking at the backtrace of the thread that caught the signal shows something similar to gdb:

(dlv) bt
0  0x000000000070dec4 in runtime.raise
   at /usr/local/go/src/runtime/sys_linux_amd64.s:159
1  0x00000000006f321b in runtime.dieFromSignal
   at /usr/local/go/src/runtime/signal_unix.go:422
2  0x00000000006f37d8 in runtime.sigfwdgo
   at /usr/local/go/src/runtime/signal_unix.go:631
3  0x00000000006f2ad8 in runtime.sigtrampgo
   at /usr/local/go/src/runtime/signal_unix.go:287
4  0x000000000070e163 in runtime.sigtramp
   at /usr/local/go/src/runtime/sys_linux_amd64.s:352
5  0x00000000021baa01 in C.x_cgo_callers
   at /usr/local/go/src/runtime/cgo/gcc_traceback.c:30
6  0x894848246c8d4848 in ???
   at ?:-1
   error: EOF
(truncated)

It's late. I'll pick this up again tomorrow.

@petermattis

Seems likely this is a bug in RocksDB. Worth tracking down, though not super urgent. I'm going to put this on the back burner.

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 26, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
craig bot pushed a commit that referenced this issue Nov 27, 2018
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis


Co-authored-by: Peter Mattis <[email protected]>
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 27, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/e6cb0c5c329617b560eee37527248171b5e06382

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1038478&tab=buildLog

The test failed on master:
	test.go:630,cluster.go:1141,synctest.go:67: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1038478-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		47d1a0, 0x2c)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc42047d190)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc42047d080)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:656 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:644 +0x418
		
		goroutine 25 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x32c7700, 0xc4207ac300)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202f8390, 0xc4201cec60, 0xc42078a240)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad
		Error:  exit status 2
		: exit status 1

@petermattis

The failure above was expected, given that I was able to reproduce a failure on top of the revert of FlushWAL. That is, the FlushWAL bug is not the culprit here.

It's interesting that all of the failures have the same stack trace, ending with _Cfunc_DBFlush. That almost certainly means something.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/06d2222fd9010f01a8cdf6a6c24597bbed181f36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1044826&tab=buildLog

The test failed on master:
	test.go:630,cluster.go:1139,synctest.go:67: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1044826-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		7df20, 0x3a)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc42047df10)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc42047de00)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:655 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:643 +0x418
		
		goroutine 116 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x336eee0, 0xc4208171a0)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:175 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4203162c0, 0xc420616360, 0xc4207bdaa0)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xad
		Error:  exit status 2
		: exit status 1

@petermattis

The previous failure was before #32931. I've since tried to reproduce on top of #32931 and have had 118 successful runs and 0 failures over the course of almost 20 hours. I'm going to leave this running for a while longer, and will declare success later today if I don't see a failure. The theory is that whatever crash was occurring in RocksDB was fixed between 5.13 and 5.17.2.

@petermattis

182 runs and 0 failures over the course of 30 hours. Previously I was able to see a failure every 5-6 hours.
