
roachtest: synctest failed #31948

Closed
cockroach-teamcity opened this issue Oct 27, 2018 · 15 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,synctest.go:68: /home/agent/work/.go/bin/roachprod run teamcity-990073-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		0480ba0, 0xd)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc420480b90)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc420480a80)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:656 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:644 +0x418
		
		goroutine 23 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x31a9de0, 0xc420714090)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202ae640, 0xc4201cec60, 0xc4207872c0)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad
		Error:  exit status 2
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 27, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 27, 2018
@tbg commented Oct 27, 2018

@petermattis a db.Flush straight-up panicked, likely due to an error that was inserted.

if err := db.Flush(); err != nil {

I guess this is mostly just good to know. But this indicates a bug in RocksDB (if it isn't a bug in the test). RocksDB shouldn't explode on I/O errors, right?

@petermattis

For posterity (when the teamcity artifacts get deleted):

non-Go function
	pc=0x7f485d12dcc0
runtime.cgocall(0x2108c30, 0xc4208636a0, 0x0)
	/usr/local/go/src/runtime/cgocall.go:128 +0x64 fp=0xc420863670 sp=0xc420863638 pc=0x6b38d4
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush(0x7f485c659070, 0x0, 0x0)
	_cgo_gotypes.go:571 +0x49 fp=0xc4208636a0 sp=0xc420863670 pc=0x11ab279
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1(0x7f485c659070, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x5b fp=0xc4208636e8 sp=0xc4208636a0 pc=0x11c312b
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush(0xc420480a80, 0xc4203060f8, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x32 fp=0xc420863728 sp=0xc4208636e8 pc=0x11b1722
github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4(0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201 +0x25f fp=0xc420863868 sp=0xc420863728 pc=0x20f0b7f
github.com/cockroachdb/cockroach/pkg/cli.runSyncer(0x31a9da0, 0xc4207d14a0, 0xc42038c020, 0x1d, 0x0, 0x3188ce0, 0xc420082e80, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:220 +0x7ad fp=0xc420863bd0 sp=0xc420863868 pc=0x20b96ad
github.com/cockroachdb/cockroach/pkg/cli.runDebugSyncTest(0x444b3a0, 0xc42037a1e0, 0x2, 0x2, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:97 +0x21c fp=0xc420863ca8 sp=0xc420863bd0 pc=0x20b8ddc
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute(0x444b3a0, 0xc42037a1a0, 0x2, 0x2, 0x444b3a0, 0xc42037a1a0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:698 +0x46d fp=0xc420863d50 sp=0xc420863ca8 pc=0x1df57dd
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x4448b40, 0x8, 0x0, 0xc420863ec8)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:783 +0x2e4 fp=0xc420863e80 sp=0xc420863d50 pc=0x1df5f54
github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute(0x4448b40, 0x18, 0xc420863ee0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:736 +0x2b fp=0xc420863eb0 sp=0xc420863e80 pc=0x1df5c4b
github.com/cockroachdb/cockroach/pkg/cli.Run(0xc42004c150, 0x4, 0x4, 0xc420052088, 0xc42008d980)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:193 +0x6d fp=0xc420863ed8 sp=0xc420863eb0 pc=0x20ad32d
github.com/cockroachdb/cockroach/pkg/cli.Main()
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:69 +0x223 fp=0xc420863f78 sp=0xc420863ed8 pc=0x20acc53
main.main()
	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:30 +0x20 fp=0xc420863f88 sp=0xc420863f78 pc=0x2108180
runtime.main()
	/usr/local/go/src/runtime/proc.go:198 +0x212 fp=0xc420863fe0 sp=0xc420863f88 pc=0x6deeb2
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420863fe8 sp=0xc420863fe0 pc=0x70cd01

But this indicates a bug in RocksDB (if it isn't a bug in the test). RocksDB shouldn't explode on I/O errors, right?

RocksDB should not explode on I/O errors. It could be a bug in our C++ code, though.

@petermattis

This is interesting:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7f485d12dcc0]

An address that is in the first page of the process's address space usually indicates a NULL pointer dereference.

@tbg commented Oct 30, 2018

We probably want to get a core dump then. I think this is easy to set up on the Go side (GOTRACEBACK=crash) but will involve the usual juggle to actually get the OS to dump the core correctly. Bothersome, but straightforward. And we've solved that issue on our roachprod machines already if memory serves correctly.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
	test.go:639,test.go:651: newCluster interrupted

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
	test.go:639,test.go:651: newCluster interrupted

@petermattis

I was able to reproduce this by creating a single node cluster and running roachtest run -c <cluster> --count=100 synctest. Each synctest run takes ~10m and the failure occurred on the 13th run. Almost exactly the same symptoms:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x7ff132d2ecc0]
...

goroutine 1 [syscall]:
non-Go function
        pc=0x7ff132d2ecc0
runtime.cgocall(0x21b8db0, 0xc420b8f698, 0x0)
        /usr/local/go/src/runtime/cgocall.go:128 +0x64 fp=0xc420b8f668 sp=0xc420b8f630 pc=0x6b3a04
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush(0x7ff12765f7d0, 0x0, 0x0)
        _cgo_gotypes.go:573 +0x49 fp=0xc420b8f698 sp=0xc420b8f668 pc=0x11be969
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1(0x7ff12765f7d0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x5b fp=0xc420b8f6e0 sp=0xc420b8f698
 pc=0x11d78db
github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush(0xc420286000, 0xc4202b7ca8, 0x3)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922 +0x32 fp=0xc420b8f720 sp=0xc420b8f6e0
 pc=0x11c5062
github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201 +0x25f fp=0xc420b8f868 sp=0xc420b8f720 pc
=0x21a0b9f
...

@petermattis

Reproduced again on the 30th run (~5h), this time with a core. I'm not seeing anything interesting here.

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#1  0x00000000006f321b in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:422
#2  0x00000000006f37d8 in runtime.sigfwdgo (ctx=0xc420009ac0, info=0xc420009bf0, sig=6, ~r3=false)
    at /usr/local/go/src/runtime/signal_unix.go:631
#3  0x00000000006f2ad8 in runtime.sigtrampgo (ctx=0xc420009ac0, info=0xc420009bf0, sig=6)
    at /usr/local/go/src/runtime/signal_unix.go:287
#4  0x000000000070e163 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:352
#5  0x00000000021baa01 in x_cgo_callers (sig=6, info=0xc420009bf0, context=0x20, cgoTraceback=<optimized out>,
    cgoCallers=<optimized out>, sigtramp=0xc420009918) at gcc_traceback.c:30
#6  <signal handler called>
#7  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:159
#8  0x00000000006f321b in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:422
#9  0x00000000006f33e9 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:520
#10 0x00000000006ddd22 in runtime.dopanic_m (gp=0x4b43640 <runtime.g0>, pc=7198577, sp=140732356293904)
    at /usr/local/go/src/runtime/panic.go:751
#11 0x0000000000708bbc in runtime.dopanic.func1 () at /usr/local/go/src/runtime/panic.go:598
#12 0x00000000006dd69a in runtime.dopanic (unused=0) at /usr/local/go/src/runtime/panic.go:597
#13 0x00000000006dd771 in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:616
#14 0x00000000006f31ce in runtime.sigpanic () at /usr/local/go/src/runtime/signal_unix.go:372
#15 0x00007f73a55b8cc0 in _IO_vfprintf_internal (s=0x20, format=<optimized out>, ap=0x0) at vfprintf.c:1632
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I've done serious debugging with gdb and core files before, but never with Go, so I'm not sure what I should be looking for. The above stack is for the thread that caught the signal, which should also be the thread that generated it. But the stack is corrupt? I looked at the stacks of the other threads (18 in total) and nothing jumped out as suspicious. About half seemed to be threads created by the Go runtime and half threads created by RocksDB. For example:

(gdb) thread 11
[Switching to thread 11 (Thread 0x7f739dbfe700 (LWP 15508))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) where
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000262642e in __gthread_cond_wait(pthread_cond_t*, pthread_mutex_t*) ()
#2  0x00000000026264ec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#3  0x00000000025092ef in rocksdb::ThreadPoolImpl::Impl::BGThread ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/threadpool_imp.cc:195
#4  0x00000000025096cf in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper ()
    at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/threadpool_imp.cc:302
#5  0x000000000267c300 in execute_native_thread_routine ()
#6  0x00007f73a5e666ba in start_thread (arg=0x7f739dbfe700) at pthread_create.c:333
#7  0x00007f73a567141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Hmm, maybe dlv will be better...

@petermattis

Ok, this time with dlv. The process had 19 goroutines when it crashed:

(dlv) goroutines
[19 goroutines]
* Goroutine 1 - User: _cgo_gotypes.go:573 github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush (0x11be969)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 5 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x6c1d32)
  Goroutine 6 - User: /usr/local/go/src/runtime/sigqueue.go:139 os/signal.signal_recv (0x6f4056)
  Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 8 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 19 - User: /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1203 github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon (0xceaeb1)
  Goroutine 20 - User: /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:626 github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher (0xce836b)
  Goroutine 29 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 35 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 38 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x6df43a)
  Goroutine 50 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x6c1d32)
  Goroutine 51 - User: /go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:144 github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view.(*worker).start (0x133e4dd)
  Goroutine 235 - User: /usr/local/go/src/runtime/sema.go:510 sync.runtime_notifyListWait (0x6f0efb)
  Goroutine 236 - User: /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3 (0x21a07a4)

Just as we saw from the panic, the crash occurred during a flush:

(dlv) bt
 0  0x000000000070bad2 in runtime.asmcgocall
    at /usr/local/go/src/runtime/asm_amd64.s:674
 1  0x00000000006b3a28 in runtime.cgocall
    at /usr/local/go/src/runtime/cgocall.go:131
 2  0x00000000011be969 in github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBFlush
    at _cgo_gotypes.go:573
 3  0x00000000011d78db in github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush.func1
    at /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922
 4  0x00000000011c5062 in github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).Flush
    at /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:922
 5  0x00000000021a0b9f in github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func4
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:201
 6  0x00000000021691ad in github.com/cockroachdb/cockroach/pkg/cli.runSyncer
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:220
 7  0x00000000021688dc in github.com/cockroachdb/cockroach/pkg/cli.runDebugSyncTest
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:97
 8  0x0000000001f668bd in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:698
 9  0x0000000001f67034 in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:783
10  0x0000000001f66d2b in github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
    at /go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:736
11  0x000000000215cdad in github.com/cockroachdb/cockroach/pkg/cli.Run
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:194
12  0x000000000215c6d3 in github.com/cockroachdb/cockroach/pkg/cli.Main
    at /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:69
13  0x00000000021b8300 in main.main
    at /go/src/github.com/cockroachdb/cockroach/pkg/cmd/cockroach/main.go:30
14  0x00000000006defe2 in runtime.main
    at /usr/local/go/src/runtime/proc.go:198
15  0x000000000070ce31 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:2361

This only shows the Go side. I think the segfault happened in C++ land. Looking at the backtrace of the thread that caught the signal shows something similar to gdb:

(dlv) bt
0  0x000000000070dec4 in runtime.raise
   at /usr/local/go/src/runtime/sys_linux_amd64.s:159
1  0x00000000006f321b in runtime.dieFromSignal
   at /usr/local/go/src/runtime/signal_unix.go:422
2  0x00000000006f37d8 in runtime.sigfwdgo
   at /usr/local/go/src/runtime/signal_unix.go:631
3  0x00000000006f2ad8 in runtime.sigtrampgo
   at /usr/local/go/src/runtime/signal_unix.go:287
4  0x000000000070e163 in runtime.sigtramp
   at /usr/local/go/src/runtime/sys_linux_amd64.s:352
5  0x00000000021baa01 in C.x_cgo_callers
   at /usr/local/go/src/runtime/cgo/gcc_traceback.c:30
6  0x894848246c8d4848 in ???
   at ?:-1
   error: EOF
(truncated)

It's late. I'll pick this up again tomorrow.

@petermattis

Seems likely this is a bug in RocksDB. Worth tracking down, though not super urgent. I'm going to put this on the back burner.

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 26, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
craig bot pushed a commit that referenced this issue Nov 27, 2018
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis


Co-authored-by: Peter Mattis <[email protected]>
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 27, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/e6cb0c5c329617b560eee37527248171b5e06382

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1038478&tab=buildLog

The test failed on master:
	test.go:630,cluster.go:1141,synctest.go:67: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1038478-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		47d1a0, 0x2c)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc42047d190)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc42047d080)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:656 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:644 +0x418
		
		goroutine 25 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x32c7700, 0xc4207ac300)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:176 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202f8390, 0xc4201cec60, 0xc42078a240)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad
		Error:  exit status 2
		: exit status 1

@petermattis

The failure above was expected, given that I was able to reproduce a failure on top of the revert of FlushWAL. That is, the FlushWAL bug is not the culprit here.

It's interesting that all of the failures have the same stack trace, ending with _Cfunc_DBFlush. That almost certainly means something.

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/06d2222fd9010f01a8cdf6a6c24597bbed181f36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=synctest PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1044826&tab=buildLog

The test failed on master:
	test.go:630,cluster.go:1139,synctest.go:67: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1044826-synctest:1 -- ./cockroach debug synctest {store-dir}/faulty ./nemesis returned:
		stderr:
		
		stdout:
		7df20, 0x3a)
			/usr/local/go/src/runtime/sema.go:510 +0x10b
		sync.(*Cond).Wait(0xc42047df10)
			/usr/local/go/src/sync/cond.go:56 +0x80
		github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).syncLoop(0xc42047de00)
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:655 +0x72
		created by github.com/cockroachdb/cockroach/pkg/storage/engine.(*RocksDB).open
			/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:643 +0x418
		
		goroutine 116 [chan receive]:
		github.com/cockroachdb/cockroach/pkg/cli.runSyncer.func3(0x336eee0, 0xc4208171a0)
			/go/src/github.com/cockroachdb/cockroach/pkg/cli/debug_synctest.go:175 +0xc4
		github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4203162c0, 0xc420616360, 0xc4207bdaa0)
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0xe9
		created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xad
		Error:  exit status 2
		: exit status 1

@petermattis

The previous failure was before #32931. I've since tried to reproduce on top of #32931 and have had 118 successful runs and 0 failures over the course of almost 20 hours. I'm going to leave this running for a while longer, and will declare success later today if I don't see a failure. The theory is that whatever crash was occurring in RocksDB was fixed between 5.13 and 5.17.2.

@petermattis

182 runs and 0 failures over the course of 30 hours. Previously I was able to see a failure every 5-6 hours.
