
kv: general test flakiness due to Pebble close error #51544

Closed
knz opened this issue Jul 17, 2020 · 19 comments
Labels
C-test-failure (Broken test, automatically or manually discovered) · S-3-productivity (Severe issues that impede the productivity of CockroachDB developers) · T-storage (Storage Team)

Comments

@knz (Contributor) commented Jul 17, 2020

Describe the problem

`make stress PKG=./pkg/kv/kvclient/kvcoord` is failing reliably with the following stack trace:

panic: pebble: closed [recovered]
        panic: pebble: closed [recovered]
        panic: pebble: closed

goroutine 9311 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc001803b00, 0x1702240, 0xc000431680)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:183 +0x11f
panic(0x9b3820, 0xc0002931a0)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send.func1(0xc000b1af58, 0xc000b1afe0, 0xc000b1afd8, 0xc000476e00)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:103 +0x1e6
panic(0x9b3820, 0xc0002931a0)
        /usr/local/go/src/runtime/panic.go:975 +0x3e3
github.com/cockroachdb/pebble.(*DB).newIterInternal(0xc000441500, 0x177b5a0, 0xc000bf8280, 0x0, 0x0, 0x0, 0xc00095c370, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:668 +0xd94
github.com/cockroachdb/pebble.(*Batch).NewIter(0xc00043b180, 0xc00095c370, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:675 +0x1df
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).init(0xc00095c368, 0x16e6680, 0xc00043b180, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:125 +0x46b
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).NewIterator(0xc00095c340, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:207 +0x14f
github.com/cockroachdb/cockroach/pkg/storage.MVCCGet(0x1702300, 0xc001d389c0, 0x82ece9408, 0xc001204780, 0xc0004461c0, 0x26, 0x40, 0x0, 0x0, 0xc000000000, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:754 +0x9b
github.com/cockroachdb/cockroach/pkg/storage.MVCCGetProto(0x1702300, 0xc001d389c0, 0x82ece9408, 0xc001204780, 0xc0004461c0, 0x26, 0x40, 0x0, 0x0, 0x173e060, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:630 +0xd7
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.EndTxn(0x1702300, 0xc001d389c0, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0x7b, 0x3, 0x100000001, 0x1, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_end_transaction.go:200 +0x309
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand(0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x0, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0xc000734000, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_evaluate.go:471 +0x235
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch(0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_evaluate.go:241 +0x3c2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, 0xc0012043c0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:557 +0x144
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, 0xc0012043c0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:526 +0x135
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatch(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:349 +0x212
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0xc000aa5600, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:735 +0x127
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:855 +0x8e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0xc000744000, 0xc000b1a330, 0x0, 0x0, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:73 +0xee
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0x0, 0x0, 0xc0004b2da0, 0x100000001, 0x1, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:133 +0x769
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0xdfed58, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:275 +0x3e7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc00038b000, 0x1702300, 0xc001d38990, 0x2, 0xc000bf8080, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:95 +0x6b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc00038b000, 0x1702300, 0xc001d38990, 0x7b, 0x3, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x91
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc000476e00, 0x1702300, 0xc001d38810, 0x7b, 0x3, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5a2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000caf300, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0xed
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*senderTransport).SendNext(0xc000b04000, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:299 +0x21e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*localTestClusterTransport).SendNext(0xc000f200a0, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/local_test_cluster_util.go:44 +0x8f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1795 +0x6b1
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1459 +0x305
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1099 +0x18ef
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:739 +0x8e4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc00014cd38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x11c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc00014cd00, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x8d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc00014ccd0, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:190 +0x53b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc00014cc38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:247 +0x9b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc00014cc38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:182 +0x180
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc00014cb78, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:252 +0x159
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc00014cb58, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x20c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc00014cab8, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:172 +0x1a9
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc00014c900, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:499 +0x3cc
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000caf400, 0x17022c0, 0xc001204360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:742 +0x122
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0016e6360, 0x17022c0, 0xc001204360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:911 +0x11e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback.func1.1(0x17022c0, 0xc001204360, 0xb2d05e00, 0x17022c0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:733 +0x98
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout(0x17022c0, 0xc001204360, 0xc8b302, 0x12, 0xb2d05e00, 0xc000bcee68, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:135 +0x9e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback.func1(0x1702240, 0xc000431680)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:732 +0x14b
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc001803b00, 0x1702240, 0xc000431680, 0xc000d34020, 0x16, 0x0, 0x0, 0xc001d78140)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:323 +0xee
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:318 +0x131

To Reproduce

make stress PKG=./pkg/kv/kvclient/kvcoord

Jira issue: CRDB-4032

@blathers-crl (bot) commented Jul 17, 2020

Hi @knz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz added the C-test-failure and S-3-productivity labels Jul 17, 2020
@knz (Contributor, Author) commented Jul 17, 2020

cc @tbg @petermattis for triage.

@knz (Contributor, Author) commented Jul 17, 2020

I am going to bisect this and see where it came from.

@knz (Contributor, Author) commented Jul 17, 2020

This commit seems OK: 12b58af

Will bisect between now and then.

edit: that selection was incorrect

@knz (Contributor, Author) commented Jul 17, 2020

I have tried very hard to use git bisect to get to the root cause, but every time git bisect reaches a conclusion, the outputs are contradictory.

It may be that our Makefile / Go dependency tracking is not right, so that when I switch branches the vendor dir is not rebuilt properly.

@knz (Contributor, Author) commented Jul 17, 2020

I will not investigate this further myself.

@knz (Contributor, Author) commented Jul 17, 2020

For reference this one is very probably OK, after running from a clean repo:

[91ae9bc] Merge #51232 #51299

@knz (Contributor, Author) commented Jul 17, 2020

(I'm re-running my bisect session with 'make clean' in-between each step, just to be sure)

@knz (Contributor, Author) commented Jul 17, 2020

Here's my latest bisect log:

# bad: [af0031e3004327b8d09e23e99eb9659abf7d82de] Merge #51375 #51520
# good: [753fe8f51291aff12e4dad5a6fc850988c25dd82] Merge #51235
git bisect start 'master' '753fe8f51291aff12e4dad5a6fc850988c25dd82'
# good: [d0c79625eda85d3aa38afad5b0254d419a9bc4cd] Merge #51023 #51154 #51256 #51303
git bisect good d0c79625eda85d3aa38afad5b0254d419a9bc4cd
# good: [91ae9bc70d00868e46c83139c0e5621e4e4971f3] Merge #51232 #51299
git bisect good 91ae9bc70d00868e46c83139c0e5621e4e4971f3
# good: [2f05eb02306768de6d8c35310851c5d4684d55a5] geoviz: add improvements
git bisect good 2f05eb02306768de6d8c35310851c5d4684d55a5
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [f00c3ff59db097746011ffb1586fd4ce67f5596d] Merge #51225 #51311
git bisect good f00c3ff59db097746011ffb1586fd4ce67f5596d
# bad: [9062cb0fb03e7af73745601e5ad44f3b8b87d519] Merge #50771
git bisect bad 9062cb0fb03e7af73745601e5ad44f3b8b87d519
# bad: [51398fb4a822bb9257ddba87682380ec0745f6ae] Merge #51327
git bisect bad 51398fb4a822bb9257ddba87682380ec0745f6ae
# bad: [0f36c8df07b6d7f948989a68926141b1e9866368] Merge #51351
git bisect bad 0f36c8df07b6d7f948989a68926141b1e9866368
# bad: [a215af7dd3a7387a4454082c27c1528ff54b1698] Merge #51319
git bisect bad a215af7dd3a7387a4454082c27c1528ff54b1698
# good: [d86e265eb443e9eeac8a743ad558b5bfa58e3720] kv: consolidate RangeInfos in RPC responses
git bisect good d86e265eb443e9eeac8a743ad558b5bfa58e3720
# good: [3a299690b26b528a1fdbf28865e1bc81ae697264] Merge #51310
git bisect good 3a299690b26b528a1fdbf28865e1bc81ae697264
# good: [dfdd8bb563dd7f6899c96a963a3906257784c135] geoviz: expose --geo_libs flag
git bisect good dfdd8bb563dd7f6899c96a963a3906257784c135
# bad: [a9e9c9cdf75f07c51dfca3773305d4af379c5fb7] Merge #51168
git bisect bad a9e9c9cdf75f07c51dfca3773305d4af379c5fb7
# first bad commit: [a9e9c9cdf75f07c51dfca3773305d4af379c5fb7] Merge #51168

This points to #51168 as the culprit, but I don't understand how it causes the problem.

Maybe someone else needs to re-run the bisect to confirm.

@knz (Contributor, Author) commented Jul 17, 2020

The problem with this bisect result is the following:

  • the merge commit a9e9c9c triggers the bug
  • however the only commit inside that branch, d86e265, does not.

(This is why I wrote it's inconsistent.)

@tbg (Member) commented Jul 17, 2020 via email

@tbg (Member) commented Jul 17, 2020

Looking at this with Andrei. Like you suggested, it's probably https://github.com/cockroachdb/cockroach/pull/51310/files exposing an existing problem.

@andreimatei (Contributor) commented

Hopefully this was fixed enough by #51413

@ajwerner (Contributor) commented Aug 8, 2020

@ajwerner reopened this Aug 8, 2020
@ajwerner (Contributor) commented Aug 8, 2020

One thing I'll note is that we seem to lack synchronization ensuring that all of our outstanding gRPC requests have concluded when the stopper stops. This seems like a problem.

It seems like we should add something to the server shutdown code that prevents new requests and waits for outstanding requests on certain gRPC services, in particular Batch and RangeFeed from Internal, and others like PerReplicaClient.
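
As a rough sketch of that suggestion (not the actual server code; the type and method names are hypothetical), the shutdown path could gate the relevant gRPC handlers behind something like this, refusing new requests once draining starts and waiting for the ones already admitted:

```go
package server

import (
	"errors"
	"sync"
)

// requestGate is a hypothetical illustration of the synchronization described
// above: once Close is called, new requests are rejected, and Close blocks
// until every request that was admitted has finished.
type requestGate struct {
	mu      sync.Mutex
	closed  bool
	pending sync.WaitGroup
}

var errShuttingDown = errors.New("server is shutting down")

// Enter admits a request. The caller must invoke the returned func when the
// request completes. Enter fails once the gate has been closed.
func (g *requestGate) Enter() (done func(), _ error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return nil, errShuttingDown
	}
	g.pending.Add(1)
	return g.pending.Done, nil
}

// Close rejects future requests and waits for admitted ones to drain.
func (g *requestGate) Close() {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()
	g.pending.Wait()
}
```

A Batch or RangeFeed handler would call Enter at the top and defer the returned done func, and the shutdown sequence would call Close before the storage engine is released.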

@andreimatei (Contributor) commented

I'll put new energy into #51566

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 14, 2020
This patch makes the Store reject requests once its stopper is
quiescing.

Before this patch, we didn't seem to have good protection against
requests running after the stopper has been stopped. We've seen this
in some tests, where requests were racing with the engine closing.
Running after the stopper has stopped is generally pretty undefined
behavior, so let's avoid it.
I think the reason why we didn't see a lot of errors from such races is
that we're stopping the gRPC server even before we start quiescing, so
at least for requests coming from remote nodes we had some draining
window.

This is a lighter version of cockroachdb#51566. That patch was trying to run
requests as tasks, so that they properly synchronize with server
shutdown. This patch still allows races between requests that started
evaluating before the server started quiescing and server shutdown.

Touches cockroachdb#51544

Release note: None
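
For illustration, a minimal sketch of the guard this patch describes, assuming the quiescing signal is exposed as a channel that is closed when shutdown begins (the Store type, field, and helper names below are illustrative, not the real kvserver code):

```go
package kvserver

import (
	"context"
	"errors"
)

// Store is a pared-down stand-in for the real Store; only what this sketch
// needs is shown. quiescing is assumed to be a channel that is closed once
// the server's stopper begins quiescing.
type Store struct {
	quiescing chan struct{}
}

var errQuiescing = errors.New("store is draining; rejecting request")

// checkQuiescing bails out if shutdown has begun, so a request never starts
// evaluating against an engine that may already be closed.
func (s *Store) checkQuiescing() error {
	select {
	case <-s.quiescing:
		return errQuiescing
	default:
		return nil
	}
}

// Send shows where the guard would sit: at the very top, before any engine
// access happens.
func (s *Store) Send(ctx context.Context, req interface{}) (interface{}, error) {
	if err := s.checkQuiescing(); err != nil {
		return nil, err
	}
	// ... evaluate the request as before ...
	return nil, nil
}
```

As the commit message notes, this only narrows the window: a request that passes the check just before quiescing begins can still race with engine close, which is what the heavier task-based approach in #51566 aims to address.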
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 14, 2020
This patch runs some infrequent operations that might use the storage
engine as tasks, and thus synchronizes them with server shutdown.
In cockroachdb#51544 we've seen one of these cause a crash when executing after
Pebble was shut down.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 17, 2020
…tasks

Prior to this change, it was possible for a rangefeed request to be issued
concurrently with shutting down, which could lead to an iterator being
constructed after the engine has been closed.

Touches cockroachdb#51544

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 17, 2020
…eation

This commit optimizes the Stopper for task creation by ripping out the
existing heavyweight task tracking in production builds. I realized that
my biggest concern with most of the proposals (cockroachdb#52843 and cockroachdb#51566) being
floated to address cockroachdb#51544 was that they bought more into the inefficient
tracking in the Stopper, not that they were doing anything inherently
wrong themselves.

Before this change, creating a task acquired an exclusive mutex and then
wrote to a hashmap. At high levels of concurrency, this would have
become a performance chokepoint. After this change, the cost of
launching a Task is three atomic increments – one to acquire a read
lock, one to register with a WaitGroup, and one to release the read
lock. When no one is draining the Stopper, these are all wait-free
operations, which means that task creation becomes wait-free.

With a change like this, I would feel much more comfortable pushing on
Stopper tasks to solve cockroachdb#51544.
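
A minimal sketch of that fast path, using generic Go primitives rather than the real Stopper internals (names are illustrative): task creation takes a read lock, registers with a WaitGroup, and releases the read lock, while stopping takes the write lock once to cut off new tasks and then waits for the counter to drain.

```go
package stopsketch

import (
	"errors"
	"sync"
)

// lightStopper sketches the lightweight tracking described above: no per-task
// map, just a WaitGroup counter guarded by an RWMutex so that draining can
// atomically cut off new tasks.
type lightStopper struct {
	mu       sync.RWMutex // read-locked by tasks, write-locked by Stop
	draining bool
	numTasks sync.WaitGroup
}

var errDraining = errors.New("stopper is draining")

// RunTask is the fast path: RLock, WaitGroup.Add, RUnlock. When nothing is
// draining, none of these block.
func (s *lightStopper) RunTask(f func()) error {
	s.mu.RLock()
	if s.draining {
		s.mu.RUnlock()
		return errDraining
	}
	s.numTasks.Add(1)
	s.mu.RUnlock()

	defer s.numTasks.Done()
	f()
	return nil
}

// Stop flips the draining flag under the write lock, so no new task can slip
// in afterwards, then waits for the tasks that were admitted to finish.
func (s *lightStopper) Stop() {
	s.mu.Lock()
	s.draining = true
	s.mu.Unlock()
	s.numTasks.Wait()
}
```

This is also essentially the shape the later tracking-removal commit below settles on: keep only a running count of tasks and wait until it drops to zero.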
craig bot pushed a commit that referenced this issue Aug 20, 2020
52844: kvserver,rangefeed: ensure that iterators are only constructed under tasks r=andreimatei a=ajwerner


Prior to this change, it was possible for a rangefeed request to be issued
concurrently with shutting down, which could lead to an iterator being
constructed after the engine has been closed.

Touches #51544

Release note: None

52996: partialidx: prove implication for comparisons with two variables r=RaduBerinde a=mgartner

This commit adds support for proving partial index predicates are
implied by query filters when they contain comparison expressions with
two variables and they are not identical expressions.

Below are some examples where the left expression implies (=>) the right
expression. The right is guaranteed to contain the left despite both
expressions having no constant values.

    a > b  =>  a >= b
    a = b  =>  a >= b
    b < a  =>  a >= b
    a > b  =>  a != b

Release note: None

53113: roachprod: introduce --skip-init to `roachprod start` r=irfansharif a=irfansharif

..and `roachprod init`. I attempted to originally introduce this flag in
#51329, and ultimately abandoned it. I still think it's a good idea to
have such a thing, especially given now we're writing integration tests
that want to control `init` behaviour. It's much easier to write them
with this --skip-init flag than it is to work around roachprod's magical
auto-init behavior.

To perform the steps that are skipped when using --skip-init, we introduce a `roachprod init` sub command.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@ajwerner (Contributor) commented Oct 2, 2020

The vast majority of these seem to have to do with range merges happening after shutdown.

tbg added a commit to tbg/cockroach that referenced this issue Feb 1, 2021
We are likely going to invest more in the stopper-conferred
observability in the near future as part of initiatives such as cockroachdb#58164,
but the task tracking that has been a part of the stopper since near
its conception has not proven to be useful in practice, while at the
same time raising concern about stopper use in hot paths.

When shutting down a running server, we don't particularly care about
leaking goroutines (as the process will end anyway). In tests, we want
to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to
look at the stacks to find out why than to consult the task map.

Together, this left little reason to do anything more complicated than
what's left after this commit: we keep track of the running number of
tasks, and wait until this drops to zero.

With this change in, we should feel comfortable using the stopper
extensively and, for example, ensuring that any CRDB goroutine is
anchored in a Stopper task; this is the right approach for test flakes
such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in
issue cockroachdb#58164 as well.

In a future change, we should make the Stopper more configurable and,
through this configurability, we could in principle bring a version of
the task map back (in debug builds) without baking it into the stopper,
though I don't anticipate that we'll want to.

Closes cockroachdb#52894.

Release note: None
craig bot pushed a commit that referenced this issue Feb 3, 2021
59647: stop: rip out expensive task tracking r=knz a=tbg

First commit was put up for PR separately, ignore it here.

----

We are likely going to invest more in the stopper-conferred
observability in the near future as part of initiatives such as #58164,
but the task tracking that has been a part of the stopper since near
its conception has not proven to be useful in practice, while at the
same time raising concern about stopper use in hot paths.

When shutting down a running server, we don't particularly care about leaking
goroutines (as the process will end anyway). In tests, we want to ensure
goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at
the stacks to find out why than to consult the task map.

Together, this left little reason to do anything more complicated than
what's left after this commit: we keep track of the running number of
tasks, and wait until this drops to zero.

With this change in, we should feel comfortable using the stopper
extensively and, for example, ensuring that any CRDB goroutine is
anchored in a Stopper task; this is the right approach for test flakes
such as in #51544 and makes sense for all of the reasons mentioned in
issue #58164 as well.

In a future change, we should make the Stopper more configurable and,
through this configurability, we could in principle bring a version of
the task map back (in debug builds) without baking it into the stopper,
though I don't anticipate that we'll want to.

Closes #52894.

Release note: None


59732: backupccl: add an owner column behind the WITH PRIVILEGES option r=pbardea a=Elliebababa

Previously, when users performed RESTORE, they were ignorant of the original owner.

This PR gives ownership data as a column behind privileges.

Resolves: #57906.

Release note: None.

59746: opt: switch checks to use CrdbTestBuild instead of RaceEnabled r=RaduBerinde a=RaduBerinde

The RaceEnabled flag is not very useful for checks; e.g. apparently
execbuilder tests aren't run routinely in race mode. These checks are
now "live" in any test build, using the crdb_test build tag.

Release note: None

59747: tree: correct StatementTag of ALTER TABLE ... LOCALITY r=ajstorm a=otan

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: elliebababa <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
@jlinder added the T-storage (Storage Team) label Jun 16, 2021
@jbowens (Collaborator) commented Aug 15, 2022

Closing out as unactionable. I think most instances have been fixed throughout the cockroach repo.

@jbowens closed this as completed Aug 15, 2022