Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestStoreAddRemoveRanges failed #96931

Closed
cockroach-teamcity opened this issue Feb 10, 2023 · 7 comments · Fixed by #97381
Closed

kv/kvserver: TestStoreAddRemoveRanges failed #96931

cockroach-teamcity opened this issue Feb 10, 2023 · 7 comments · Fixed by #97381
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Feb 10, 2023

kv/kvserver.TestStoreAddRemoveRanges failed with artifacts on master @ 9ea0ef2f6a8e31e22c7350de5cacd9c5992d07a0:


goroutine 118907905 lock 0xc014e24460
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907733 lock 0xc01bdc1ea0
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907850 lock 0xc01c898d20
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907884 lock 0xc02497c1e0
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907754 lock 0xc01f42fcc0
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907898 lock 0xc02105f180
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907674 lock 0xc01db9d360
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

goroutine 118907736 lock 0xc01db9d680
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???


Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-24414

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 10, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 10, 2023
@tbg
Copy link
Member

tbg commented Feb 20, 2023

POTENTIAL DEADLOCK: Recursive locking:
=== RUN   TestStoreAddRemoveRanges
08:52:10         test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestStoreAddRemoveRanges1568558436
08:52:10         test_log_scope.go:79: use -show-logs to present logs inline
08:52:10     POTENTIAL DEADLOCK: Recursive locking:
08:52:10     current goroutine 268923878 lock 0xc001c6d7f8
08:52:10     FAIL: //pkg/kv/kvserver:kvserver_test (see /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvserver/kvserver_test/test.log)
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica.go:964 kvserver.(*Replica).Desc ??? <<<<<
08:52:10     [3,983 / 3,984] 1 / 1 tests, 1 failed; Testing //pkg/kv/kvserver:kvserver_test; 3247s processwrapper-sandbox
08:52:10     Target //pkg/kv/kvserver:kvserver_test up-to-date:
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica.go:963 kvserver.(*Replica).Desc ???
08:52:10       _bazel/bin/pkg/kv/kvserver/kvserver_test_/kvserver_test
08:52:10     INFO: Elapsed time: 3663.220s, Critical Path: 3369.58s
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:131 kvserver.(*storeReplicaBTree).VisitKeyRange.func1 ???
08:52:10     INFO: 3084 processes: 53 internal, 3031 processwrapper-sandbox.
08:52:10     INFO: Build completed, 1 test FAILED, 3084 total actions
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:130 kvserver.(*storeReplicaBTree).VisitKeyRange.func1 ???
08:52:10     INFO: Build completed, 1 test FAILED, 3084 total actions
08:52:10     I230210 08:52:10.768540 1 util/log/log_bridge.go:81  [T] 1  bad log format: failed parent test "TestStoreAddRemoveRanges" (no subtests)
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:225 kvserver.(*storeReplicaBTree).descendLessOrEqual.func1 ???
08:52:10     github.com/google/btree/external/com_github_google_btree/btree.go:556 btree.(*node).iterate ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:228 kvserver.(*storeReplicaBTree).descendLessOrEqual ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:224 kvserver.(*storeReplicaBTree).descendLessOrEqual ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:213 kvserver.(*storeReplicaBTree).mustDescendLessOrEqual ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_replica_btree.go:128 kvserver.(*storeReplicaBTree).VisitKeyRange ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:2491 kvserver.(*Store).getOverlappingKeyRangeLocked ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:263 kvserver.(*Store).addToReplicasByKeyLockedReplicaRLocked ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/helpers_test.go:73 kvserver.(*Store).AddReplica ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_test.go:589 kvserver.TestStoreAddRemoveRanges ???
08:52:10     
08:52:10     Previous place where the lock was grabbed (same goroutine)
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/helpers_test.go:70 kvserver.(*Store).AddReplica ??? <<<<<
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/helpers_test.go:69 kvserver.(*Store).AddReplica ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_test.go:589 kvserver.TestStoreAddRemoveRanges ???
08:52:10     
08:52:10     Other goroutines holding locks:
08:52:10     goroutine 118907863 lock 0xc01c898dc0
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:2110 kvserver.(*TenantsStorageMetrics).acquireTenant ??? <<<<<
08:52:10     github.com/sasha-s/go-deadlock/external/com_github_sasha_s_go_deadlock/deadlock.go:84 go-deadlock.(*Mutex).Lock ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:79 kvserver.TestTenantsStorageMetricsConcurrency.func2 ???
08:52:10     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics_test.go:91 kvserver.TestTenantsStorageMetricsConcurrency.func3 ???

@tbg
Copy link
Member

tbg commented Feb 20, 2023

@pavelkalinnikov this looks like a problem in

// addToReplicasByKeyLockedReplicaRLocked adds the replica to the replicasByKey
// btree. The replica must already be in replicasByRangeID. Returns an error if
// a different replica with the same range ID, or an overlapping replica or
// placeholder exists in this Store. Replica.mu must be at least read-locked.
func (s *Store) addToReplicasByKeyLockedReplicaRLocked(repl *Replica) error {
desc := repl.descRLocked()
if !desc.IsInitialized() {
return errors.Errorf("%s: attempted to add uninitialized replica %s", s, repl)
}
if got := s.GetReplicaIfExists(repl.RangeID); got != repl { // NB: got can be nil too
return errors.Errorf("%s: replica %s not in replicasByRangeID; got %s", s, repl, got)
}
if it := s.getOverlappingKeyRangeLocked(desc); it.item != nil {
return errors.Errorf(
"%s: cannot add to replicasByKey: range %s overlaps with %s", s, repl, it.Desc())
}
if it := s.mu.replicasByKey.ReplaceOrInsertReplica(context.Background(), repl); it.item != nil {
return errors.Errorf(
"%s: cannot add to replicasByKey: key %v already exists in the btree", s, it.item.key())
}
return nil
}

In that method, we hold Replica.mu. But then we call .Desc() here which gets a recursive RLock:

There are other calls to Desc() from the replicasByKey btree so this is a more general issue. Possibly this was tickled with recent refactors we made in that area and previously flew under the radar.

@tbg
Copy link
Member

tbg commented Feb 20, 2023

(The reason recursive RLock() is bad is because if someone else is trying to Lock() while we have the first RLock() but not the second, we'll be deadlocked)

@tbg tbg added the GA-blocker label Feb 20, 2023
@tbg
Copy link
Member

tbg commented Feb 20, 2023

Marking as GA blocker to make sure we investigate before release. Deadlocks = bad grey failures, so this occurring in the wild could have a large blast radius.

@pav-kv pav-kv self-assigned this Feb 20, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Feb 21, 2023

it.Desc is called for the result of getOverlappingKeyRangeLocked, which shouldn't return our replica. Maybe it does sometimes. Note that this is a "return error" flow which is supposed to not happen.

@pav-kv
Copy link
Collaborator

pav-kv commented Feb 21, 2023

Ah, there are more Desc calls inside getOverlappingKeyRangeLocked, I see.

@pav-kv
Copy link
Collaborator

pav-kv commented Feb 21, 2023

Culprit: #96870

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants