Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Pebble block cache infinite loop #70154

Closed
jbowens opened this issue Sep 13, 2021 · 18 comments
Closed

storage: Pebble block cache infinite loop #70154

jbowens opened this issue Sep 13, 2021 · 18 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@jbowens
Copy link
Collaborator

jbowens commented Sep 13, 2021

Here's a couple of recent failures:

https://teamcity.cockroachdb.com//viewLog.html?buildId=3434287&buildTypeId=Cockroach_MergedExtendedCi
https://teamcity.cockroachdb.com//viewLog.html?buildId=3432973&buildTypeId=Cockroach_MergedExtendedCi

I didn't notice anything obvious in the test log.

Jira issue: CRDB-9964

@jbowens jbowens added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery labels Sep 13, 2021
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 13, 2021
Refs: cockroachdb#70154

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
@ajwerner ajwerner added branch-release-21.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 13, 2021
craig bot pushed a commit that referenced this issue Sep 13, 2021
70155: ccl/backupccl: skip TestRestoreOldVersions r=irfansharif a=jbowens

Refs: #70154

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Jackson Owens <[email protected]>
@adityamaru
Copy link
Contributor

adityamaru commented Sep 14, 2021

Stack can be found at the end of:

full_output.txt

@adityamaru
Copy link
Contributor

Looks like its the multi-region-restore subtest. Still investigating.

@adityamaru
Copy link
Contributor

adityamaru commented Sep 14, 2021

The restore itself seems to succeed:
I210913 20:07:03.421930 712384 1@util/log/event_log.go:32 ⋮ [n1,job=‹693020805582487553›] 11097 ={"Timestamp":1631563621861928092,"EventType":"restore","JobID":693020805582487553,"JobType":"RESTORE","Description":"‹RESTORE DATABASE multi_region_db FROM 'nodelocal://0/external_backup_dir'›","User":"‹root›","DescriptorIDs":[53,54,52,58,55,56,57],"Status":"succeeded"}

The test doesn't do much after the restore except run a bunch of select statements to verify the restored state.

@adityamaru
Copy link
Contributor

adityamaru commented Sep 14, 2021

It looks like the test is stuck when trying to tear the test servers down?

goroutine 657059 [semacquire, 33 minutes]:
sync.runtime_Semacquire(0xc01754bb88)
	/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc01754bb80)
	/usr/local/go/src/sync/waitgroup.go:130 +0xe5
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc004888f00, 0xa70dd40, 0xc0001a6018)
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:113 +0x26d
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:346 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc000c85b70)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:108 +0x35
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc0060ac5a0, 0xa70dd40, 0xc0001a6010)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:511 +0x2fe
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runOldVersionMultiRegionTest.func1(0xc004173e00)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_old_versions_test.go:414 +0x1b12
testing.tRunner(0xc004173e00, 0xc0084b1290)
	/usr/local/go/src/testing/testing.go:1193 +0x203
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1238 +0x5d8

I see several goroutines waiting to acquire mutexes. It smells of a deadlock but I'm going to have to rope in someone from KV to help understand better. @irfansharif pinging you as L2, I'm going to try and see if this is reliably reproducible in the meantime:

goroutine 658511 [semacquire, 33 minutes]:
sync.runtime_SemacquireMutex(0xc007f6f5b4, 0x900000000, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc007f6f5b0)
	/usr/local/go/src/sync/mutex.go:138 +0x1d4
sync.(*Mutex).Lock(0xc007f6f5b0)
	/usr/local/go/src/sync/mutex.go:81 +0x85
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*Mutex).Lock(0xc007f6f5b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/mutex_sync_race.go:29 +0x3b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose.func3()
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:226 +0x85
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc007f6f500, 0xa70ddb0, 0xc0029c86f0, 0xc01e419500, 0xc00fc39ae0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:313 +0x1ac3
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc007f6f500, 0xa70ddb0, 0xc0029c86f0, 0xc01e419500, 0x75ba2b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:377 +0x487
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc007f6f500, 0xa70ddb0, 0xc0029c86c0, 0x4, 0xc01e419500, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:107 +0x68e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:43
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc003136000, 0xa70ddb0, 0xc0029c8690, 0x16a46e452ea3365f, 0x0, 0x1, 0x100000001, 0x1, 0x0, 0x4, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:190 +0x5d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc00866a480, 0xa70ddb0, 0xc0029c8690, 0x0, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x4, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:191 +0x1a5
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0xa70ddb0, 0xc0029c8690, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:912 +0x298
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0060ac6c0, 0xa70ddb0, 0xc0029c8690, 0x6f48d01, 0x10, 0xc016c5afa8, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:324 +0x12a
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc00a662e00, 0xa70ddb0, 0xc0029c8690, 0xc01e419180, 0x0, 0x0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:900 +0x267
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc00a662e00, 0xa70ddb0, 0xc0029c8660, 0xc01e419180, 0xc016c5b240, 0x5b61dc, 0xc00d068e70)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1007 +0x125
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:469
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc014096eb0, 0xa70ddb0, 0xc0029c8660, 0x1, 0xa774118, 0xc002a16ab0, 0x0, 0x0, 0x0, 0x100000001, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:203 +0x266
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc014096eb0, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x4, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:185 +0x298
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.raceTransport.SendNext(0xa75baf8, 0xc014096eb0, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport_race.go:78 +0x45d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc0048e1680, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1943 +0xbb5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0048e1680, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1537 +0x45a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0048e1680, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1155 +0x1fbf
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0048e1680, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:789 +0x7c5
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc00866a358, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:219 +0x102
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc00866a300, 0xa70ddb0, 0xc00d068e70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:854 +0x1b5
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:837
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0xa70ddb0, 0xc00d068e70, 0xc016c5f6f0, 0xc007c36f00, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:765 +0x1f5
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc00866a300, 0xa70ddb0, 0xc00d068e70, 0xc007c36f00, 0x1, 0xc016c5f7e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:788 +0xd7
github.com/cockroachdb/cockroach/pkg/ts.(*DB).storeKvs(0xc009838750, 0xa70ddb0, 0xc00d068e70, 0xc0366a2000, 0x45b, 0x500, 0xc00a409500, 0x32)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:307 +0x295
github.com/cockroachdb/cockroach/pkg/ts.(*DB).tryStoreData(0xc009838750, 0xa70ddb0, 0xc00d068e70, 0x1, 0xc014c1a000, 0x45b, 0x45b, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:245 +0x377
github.com/cockroachdb/cockroach/pkg/ts.(*DB).StoreData(0xc009838750, 0xa70ddb0, 0xc00d068e70, 0x1, 0xc014c1a000, 0x45b, 0x45b, 0xc009df0f00, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:210 +0x20b
github.com/cockroachdb/cockroach/pkg/ts.(*poller).poll.func1(0xa70ddb0, 0xc009a80c60)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:193 +0x1db
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc0060ac6c0, 0xa70ddb0, 0xc009a80c60, 0x6f45aed, 0xf, 0xc016c5fd10, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:307 +0x12a
github.com/cockroachdb/cockroach/pkg/ts.(*poller).poll(0xc00dfb6060)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:184 +0x195
github.com/cockroachdb/cockroach/pkg/ts.(*poller).start.func1(0xa70dd40, 0xc0001a6018)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/db.go:168 +0xdb
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2(0xc0060ac6c0, 0xa70dd40, 0xc0001a6018, 0x0, 0x0, 0xc007fef380)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:442 +0x162
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:434 +0x28e
goroutine 657574 [semacquire, 33 minutes]:
sync.runtime_SemacquireMutex(0xc0078f9634, 0x4100000000, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0078f9630)
	/usr/local/go/src/sync/mutex.go:138 +0x1d4
sync.(*Mutex).Lock(0xc0078f9630)
	/usr/local/go/src/sync/mutex.go:81 +0x85
github.com/cockroachdb/pebble.(*commitPipeline).prepare(0xc0078f8600, 0xc01a9292c0, 0x431c01, 0xc007bfe7df, 0xc007bfe7d0, 0x10)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/commit.go:364 +0x107
github.com/cockroachdb/pebble.(*commitPipeline).Commit(0xc0078f8600, 0xc01a9292c0, 0x1, 0x0, 0xc0096436f0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/commit.go:253 +0xc5
github.com/cockroachdb/pebble.(*DB).Apply(0xc004643400, 0xc01a9292c0, 0xca8d61e, 0x177e365460b, 0xcdcb860)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:602 +0x21d
github.com/cockroachdb/pebble.(*Batch).Commit(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:732
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).Commit(0xc00373cd80, 0xc001a66001, 0x0, 0xa70ddb0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:504 +0xd2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc01400ca80, 0xa70ddb0, 0xc006529590, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:777 +0xcb9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc01400ca80, 0xa70ddb0, 0xc006529590, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:485 +0x191
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc003136000, 0xa70ddb0, 0xc005ccde30, 0x1b)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:515 +0x1e5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc0071ca960, 0xa70ddb0, 0xc005ccde30)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:284 +0x3e6
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2(0xc0060ac6c0, 0xa70ddb0, 0xc005ccde30, 0x0, 0x0, 0xc00917ad70)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:442 +0x162
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:434 +0x28e

@blathers-crl blathers-crl bot added the T-kv KV Team label Sep 14, 2021
@tbg
Copy link
Member

tbg commented Sep 15, 2021

The salient deadlock is the last one, in pebble (the rest is a consequence of it). I'm not sure what is going on there. Perhaps the fact that a shutdown is going on has to do with it, but I'm not sure, and even if so pebble shouldn't just deadlock - handing it back to storage. This might be helpful:

cat full_output.txt | awk -vRS= '(/pebble/) {print $0"\n"}' | gh gist create

https://gist.github.com/e3c6bdce1940bcae6d5febbc24090237

It's only the goroutines containing "pebble" in the stack. I personally think this one here is up to no good, since it probably has a "shard" related mutex which is what everyone else seems to be blocking on. I could be wrong.

goroutine 720904 [runnable]:
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:432 +0x2ef
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
[tons more frames repeating the pattern]
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).evict(0xc0067e1050)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:402 +0x3a
github.com/cockroachdb/pebble/internal/cache.(*shard).Reserve(0xc0067e1050, 0x400000)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:254 +0x9c
github.com/cockroachdb/pebble/internal/cache.(*Cache).Reserve(0xc00147adc0, 0x4000000, 0x8000000)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:717 +0x125
github.com/cockroachdb/pebble.(*DB).newMemTable(0xc004643400, 0x4f, 0x44cbd, 0xc0072d2000, 0xc0114344e0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:1367 +0x185
github.com/cockroachdb/pebble.(*DB).makeRoomForWrite(0xc004643400, 0x0, 0xc0190bf1d0, 0xc00af07440)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:1635 +0x75c
github.com/cockroachdb/pebble.(*DB).Ingest.func1()
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:572 +0x37c
github.com/cockroachdb/pebble.(*commitPipeline).AllocateSeqNum(0xc0078f8600, 0x6, 0xc027f8b608, 0xc027f8b640)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/commit.go:334 +0x422
github.com/cockroachdb/pebble.(*DB).Ingest(0xc004643400, 0xc01de47aa0, 0x6, 0x6, 0xf9fd990, 0xf9fd990)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:605 +0x73c
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).IngestExternalFiles(0xc007224100, 0xa70ddb0, 0xc00af06840, 0xc0052caa80, 0x6, 0x8, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1268 +0x6e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).applySnapshot(0xc007ec1500, 0xa70ddb0, 0xc00af06840, 0xa147550f461d5efc, 0xc815441864b3fdb7, 0xc0148f5dc0, 0x0, 0x0, 0x0, 0xc0015e2480, ...)

@blathers-crl blathers-crl bot added the T-storage Storage Team label Sep 15, 2021
@jbowens jbowens changed the title ccl/backupccl: TestRestoreOldVersions flaky storage: Pebble table cache deadlock Sep 15, 2021
@jbowens
Copy link
Collaborator Author

jbowens commented Sep 15, 2021

Thanks y'all, I'd forgotten the right dance to get the full output.

@bananabrick
Copy link
Contributor

bananabrick commented Sep 15, 2021

Seems like these mutually recursive functions never hit a base case:

github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:420 +0x428
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc0067e1050)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7

It's mostly repeating line 428 -> 434 -> 456 -> 428 in clockpro.go. This sequence starts with a call to evict in clockpro.go.

Seems like there's precedence for evict infinite looping: https://github.com/cockroachdb/pebble/blob/crl-release-21.2/internal/cache/clockpro.go#L268

This was the pr which previously fixed a similar issue: cockroachdb/pebble@c96d043

@jbowens jbowens changed the title storage: Pebble table cache deadlock storage: Pebble block cache deadlock Sep 15, 2021
@jbowens jbowens removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 20, 2021
@jbowens
Copy link
Collaborator Author

jbowens commented Sep 20, 2021

There have been no recent changes within the cache implementation. We suspect this is a long dormant bug that we've so far been unable to reproduce. Removed the release-blocker tag for now.

@petermattis
Copy link
Collaborator

The title is a bit misleading. This isn't a deadlock, but an infinite loop in the cache eviction code. Nothing has changed in this code recently, so I agree with removing the release-blocker tag.

jbowens added a commit to jbowens/pebble that referenced this issue Sep 21, 2021
Add hot, cold and test counts to block cache shard. Use `coldCount` to
assert that `coldSize` matches expectations.

This is intended to help gain more insight out of any future
reproductions of the infinite loop encountered in
cockroachdb/cockroach#70154
jbowens added a commit to jbowens/pebble that referenced this issue Sep 21, 2021
Add sizeCold as a parameter to runHandCold so that stack traces from any
future cockroachdb/cockroach#70154 reproductions will include it.
@bananabrick
Copy link
Contributor

I looked at the code to check if sizes are being updated correctly recently, and didn't find anything either.

An entry could be removed from the entry.blockLink list without updating shard.sizeCold. I don't see a code path this can happen.

There is one code path where this can happen. If someone calls cache.Unref which calls shard.Free, the linked list gets freed, but the sizes aren't updated. But the same code path makes sure that, all of the hands are set to nil. So, a subsequent call to Reserve, if such a call exists, wouldn't run forever, since all of the hands are nil.

So, the loop can also happen if there's a call to shard.Free, followed by a shard.Set which adds a node again, followed by a Reserve. I checked if this was possible, and it turns out that shard.Set panics if it's called after shard.Free.

The reason I think the call to shard.Free could be the culprit is because the unit test fails during shutdown.

@petermattis
Copy link
Collaborator

shard.Free clears the shards.blocks map which will cause a subsequent shard.Set call to panic. I agree that this is a potential path to the corruption we've seen, but I don't see how it could be a problem in practice.

jbowens added a commit to jbowens/pebble that referenced this issue Sep 22, 2021
Add hot, cold and test counts to block cache shard. Use `countCold` to
assert that `sizeCold` matches expectations.

This is intended to help gain more insight out of any future
reproductions of the infinite loop encountered in
cockroachdb/cockroach#70154
jbowens added a commit to jbowens/pebble that referenced this issue Sep 22, 2021
…races

Add countCold and sizeCold as parameters to runHandCold so that stack
traces from any future cockroachdb/cockroach#70154 reproductions will
include their values.
jbowens added a commit to cockroachdb/pebble that referenced this issue Sep 23, 2021
Add hot, cold and test counts to block cache shard. Use `countCold` to
assert that `sizeCold` matches expectations.

This is intended to help gain more insight out of any future
reproductions of the infinite loop encountered in
cockroachdb/cockroach#70154
jbowens added a commit to cockroachdb/pebble that referenced this issue Sep 23, 2021
…races

Add countCold and sizeCold as parameters to runHandCold so that stack
traces from any future cockroachdb/cockroach#70154 reproductions will
include their values.
@jbowens
Copy link
Collaborator Author

jbowens commented Oct 25, 2021

Another reproduction, on the release-21.2 branch.
https://teamcity.cockroachdb.com//viewLog.html?buildId=3631246&buildTypeId=Cockroach_MergedExtendedCi

Unfortunately, the release-21.2 branch doesn't have the debugging context added in cockroachdb/pebble#1284.

goroutine 838927 [runnable]:
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:432 +0x2ef
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:420 +0x428
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandTest(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:456 +0x296
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandHot(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:434 +0x297
github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:428 +0xe7
github.com/cockroachdb/pebble/internal/cache.(*shard).evict(0xc004844450)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:402 +0x3a
github.com/cockroachdb/pebble/internal/cache.(*shard).Reserve(0xc004844450, 0x400000)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:254 +0x9c
github.com/cockroachdb/pebble/internal/cache.(*Cache).Reserve(0xc001d28280, 0x4000000, 0x8000000)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:717 +0x125
github.com/cockroachdb/pebble.(*DB).newMemTable(0xc003fcfc00, 0x117, 0xc705, 0xc000eb0f00, 0xc00dd147e0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:1367 +0x185
github.com/cockroachdb/pebble.(*DB).makeRoomForWrite(0xc003fcfc00, 0x0, 0xc031ed2460, 0xc01d67f200)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:1635 +0x75c
github.com/cockroachdb/pebble.(*DB).Ingest.func1()
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:572 +0x37c
github.com/cockroachdb/pebble.(*commitPipeline).AllocateSeqNum(0xc00383e600, 0x6, 0xc054abf608, 0xc00dbd3640)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/commit.go:334 +0x422
github.com/cockroachdb/pebble.(*DB).Ingest(0xc003fcfc00, 0xc012e9b380, 0x6, 0x6, 0x26f183d3, 0x26f183d3)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:605 +0x73c
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).IngestExternalFiles(0xc00161a800, 0xa78a410, 0xc01d67e9c0, 0xc002d65d80, 0x6, 0x8, 0x0, 0x0)

@mwang1026
Copy link

@bananabrick @jbowens thoughts on closing vs. keeping this open? still reproducible?

@mwang1026 mwang1026 assigned nicktrav and unassigned jbowens May 16, 2022
@nicktrav
Copy link
Collaborator

nicktrav commented May 16, 2022

Side note: I'm going to attempt to unskip the skipped test that was linked upthread.

edit: false alarm - it was unskipped in #70501.

@jbowens
Copy link
Collaborator Author

jbowens commented Oct 17, 2022

Closing this out. We haven't observed it in many, many months.

@jbowens jbowens closed this as not planned Won't fix, can't repro, duplicate, stale Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team
Projects
None yet
Development

No branches or pull requests

9 participants