
roachtest: failover/chaos/read-only failed #106108

Closed
cockroach-teamcity opened this issue Jul 4, 2023 · 26 comments


cockroach-teamcity commented Jul 4, 2023

roachtest.failover/chaos/read-only failed with artifacts on master @ 428dc9da6a320de218460de6c6c8807caa4ded98:

(test_runner.go:1075).runTest: test timed out (1h0m0s)
(cluster.go:2282).Run: output in run_115254.651623685_n10_cockroach-workload-i: ./cockroach workload init kv --splits 1000 --insert-count 100000 {pgurl:1} returned: COMMAND_PROBLEM: exit status 137
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-29407

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jul 4, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 4, 2023
@erikgrinaker erikgrinaker self-assigned this Jul 4, 2023
@erikgrinaker (Contributor) commented:

Something's pretty messed up here. Looks like we're losing r1 meta availability somehow. n1 rebalances away its r1 replica at 11:53:20:

I230704 11:53:20.598003 49673 13@kv/kvserver/allocator/plan/replicate.go:900 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4044  rebalancing ‹voter› n1,s1 to n6,s6: ‹[1*:14752, 6:14750, 3:14750, 4:14750, 5:14742]›
I230704 11:53:20.598819 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4045  change replicas (add [(n6,s6):7LEARNER] remove []): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, next=7, gen=12]
I230704 11:53:20.607565 49673 13@kv/kvserver/replica_raft.go:396 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4046  proposing SIMPLE(l7) [(n6,s6):7LEARNER]: after=[(n1,s1):1 (n7,s7):6 (n4,s4):3 (n9,s9):4 (n8,s8):5 (n6,s6):7LEARNER] next=8
I230704 11:53:20.688360 49700 13@kv/kvserver/store_snapshot.go:1596 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4047  streamed INITIAL snapshot e9d45910 at applied index 14759 to (n6,s6):7LEARNER with 1.1 MiB in 0.04s @ 31 MiB/s: kvs=9203 rangeKVs=0, rate-limit: 32 MiB/s, queued: 0.04s
I230704 11:53:20.709957 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4048  change replicas (add [(n6,s6):7VOTER_INCOMING] remove [(n1,s1):1VOTER_DEMOTING_LEARNER]): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7LEARNER, next=8, gen=13]
I230704 11:53:20.714175 49673 13@kv/kvserver/replica_raft.go:396 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4049  proposing ENTER_JOINT(r1 l1 v7) [(n6,s6):7VOTER_INCOMING], [(n1,s1):1VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1VOTER_DEMOTING_LEARNER (n7,s7):6 (n4,s4):3 (n9,s9):4 (n8,s8):5 (n6,s6):7VOTER_INCOMING] next=8
I230704 11:53:20.735057 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4050  change replicas (add [] remove []): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7VOTER_INCOMING, next=8, gen=14]
I230704 11:53:20.755340 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4051  change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1LEARNER, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7, next=8, gen=15]
I230704 11:53:20.776110 185 kv/kvserver/store_remove_replica.go:151 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›,raft] 579  removing replica r1/1

Then about a minute later we start seeing the first signs of r1 unavailability:

W230704 11:54:10.395697 39032 kv/kvclient/kvcoord/dist_sender.go:1808 ⋮ [T1,n1,replicaGC,s1,r393/1:‹/Table/106/1/5{88784…-92470…}›] 767  slow range RPC: have been waiting 60.00s (1 attempts) for RPC Scan [/Meta2/Table/106/1/‹5887846884665536192›/‹NULL›,/System‹/›‹"›‹"›) to r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, next=7, gen=12]; resp: ‹(err: context deadline exceeded)›

Notice that the 60-second wait did not start when the replica was removed at 11:53:20: the request began around 11:53:10, ten seconds before the removal. The only suspect log entry we see at that time is this one:

W230704 11:53:09.815397 38115 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 506  failed to push during intent resolution: {<nil> 0 {‹0xc007ae1748›} ‹index:0 › 1688471589.815357569,0}

We keep seeing signs of r1 unavailability throughout the cluster until the test times out an hour later. However, r1 is alive and well with leader/leaseholder on n6:

    "raft_state": {
      "replica_id": 7,
      "hard_state": {
        "term": 7,
        "vote": 7,
        "commit": 17837
      },
      "lead": 7,
      "state": "StateLeader",
      "applied": 17837,
      "progress": {
        "3": {
          "match": 17837,
          "next": 17838,
          "state": "StateReplicate"
        },
        "4": {
          "match": 17837,
          "next": 17838,
          "state": "StateReplicate"
        },
        "5": {
          "match": 17837,
          "next": 17838,
          "state": "StateReplicate"
        },
        "6": {
          "match": 17837,
          "next": 17838,
          "state": "StateReplicate"
        },
        "7": {
          "match": 17837,
          "next": 17838,
          "state": "StateReplicate"
        }
      }
    },

      "state": {
        "raft_applied_index": 17837,
        "lease_applied_index": 16610,
        "desc": {
          "range_id": 1,
          "end_key": "BABsaXZlbmVzcy0=",
          "internal_replicas": [
            {
              "node_id": 6,
              "store_id": 6,
              "replica_id": 7,
              "type": 0
            },
            {
              "node_id": 7,
              "store_id": 7,
              "replica_id": 6,
              "type": 0
            },
            {
              "node_id": 4,
              "store_id": 4,
              "replica_id": 3,
              "type": 0
            },
            {
              "node_id": 9,
              "store_id": 9,
              "replica_id": 4,
              "type": 0
            },
            {
              "node_id": 8,
              "store_id": 8,
              "replica_id": 5,
              "type": 0
            }
          ],
          "next_replica_id": 8,
          "generation": 16,
          "sticky_bit": {}
        },

        "lease": {
          "start": {
            "wall_time": 1688471600725422293
          },
          "expiration": {
            "wall_time": 1688475527720524070
          },
          "replica": {
            "node_id": 6,
            "store_id": 6,
            "replica_id": 7,
            "type": 0
          },
          "deprecated_start_stasis": {
            "wall_time": 1688475527720524070
          },
          "proposed_ts": {
            "wall_time": 1688475521720524070
          },
          "sequence": 2,
          "acquisition_type": 2
        },

One theory is that we somehow stopped gossiping r1's descriptor. However, we see fresh gossip info for r1 pointing to n6, here from n1:

    "first-range": {
      "value": {
        "raw_bytes": "7ujJvAMIARIAGgsEAGxpdmVuZXNzLSIICAYQBhgHIAAiCAgHEAcYBiAAIggIBBAEGAMgACIICAkQCRgEIAAiCAgIEAgYBSAAKAgwEDoA",
        "timestamp": {
          "wall_time": 1688475521220175881
        }   
      },  
      "orig_stamp": 1688475521220177273,
      "ttl_stamp": 9223372036854775807,
      "hops": 2,
      "node_id": 6,
      "peer_id": 3
    },  

@erikgrinaker (Contributor) commented:

The intent resolution error appears to be a bare kvpb.Error for which decoding failed, so GoError() converted it to an internalError, which simply called kvpb.Error.String() and dispatched to the default Protobuf formatter:

// Error is a generic representation including a string message
// and information about retryability.
message Error {
  option (gogoproto.goproto_stringer) = false;
  // An optional updated transaction. This is to be used by the client in case
  // of retryable errors.
  //
  // Not to be accessed directly - use Error.GetTxn() and Error.SetTxn().
  optional roachpb.Transaction unexposed_txn = 4;
  // Node at which the error was generated (zero if does not apply).
  optional int32 origin_node = 5 [(gogoproto.nullable) = false,
    (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"];
  // encoded_error is the Go error that caused this Error.
  optional errorspb.EncodedError encoded_error = 9 [(gogoproto.nullable) = false];
  // The index, if given, contains the index of the request (in the batch)
  // whose execution caused the error.
  optional ErrPosition index = 7;
  // now is the current time at the node sending the response,
  // which can be used by the receiver to update its local HLC.
  optional util.hlc.Timestamp now = 8 [(gogoproto.nullable) = false,
    (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/util/hlc.ClockTimestamp"];
  reserved 1, 2, 3, 6;
}

return (*internalError)(e)

Not clear if we can recover the original error, or if it's significant here.

@erikgrinaker (Contributor) commented:

On n6 (the new r1 leaseholder), we're seeing 9 goroutines stuck on intent resolution. None of the other nodes have this. One of them has been stuck for 65 minutes, since the outage began:

Intent resolution goroutine stack
goroutine 202 [select, 65 minutes]:
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00fc5e5a0)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:328 +0x7c
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:343
google.golang.org/grpc.(*csAttempt).recvMsg(0xc005745a00, {0x5f06e40?, 0xc00fb3b680}, 0xc00fc5e240?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1046 +0xc5
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:900 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc00fc5e360, 0xc00fe162e0, 0xc00fe162b0)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:751 +0x144
google.golang.org/grpc.(*clientStream).RecvMsg(0xc00fc5e360, {0x5f06e40?, 0xc00fb3b680?})
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:899 +0x12e
google.golang.org/grpc.invoke({0x77c61f0?, 0xc00fb37bf0?}, {0x6195e7a?, 0x20?}, {0x6068740, 0xc00fc5e240}, {0x5f06e40, 0xc00fb3b680}, 0x0?, {0xc0056f7e00, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:73 +0xd7
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x77c61f0, 0xc00fb37bf0}, {0x6195e7a, 0x21}, {0x6068740, 0xc00fc5e240}, {0x5f06e40, 0xc00fb3b680}, 0x7e122e71a5cf6b77?, 0x64ede00, ...)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:227 +0x155
google.golang.org/grpc.(*ClientConn).Invoke(0x3bffde935ab10b16?, {0x77c61f0?, 0xc00fb37bf0?}, {0x6195e7a?, 0x0?}, {0x6068740?, 0xc00fc5e240?}, {0x5f06e40?, 0xc00fb3b680?}, {0x0, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:35 +0x223
github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalClient).Batch(0xc00a4f09a0, {0x77c61f0, 0xc00fb37bf0}, 0x0?, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:9761 +0xc9
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.TracingInternalClient.Batch({{0x7828570?, 0xc00a4f09a0?}}, {0x77c61f0, 0xc00fb37bf0}, 0xc00fc5e120, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:282 +0x1f0
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc00758c180, {0x77c61f0, 0xc00fb37bf0}, 0x34?, {0x77c36b0, 0xc002eb8040?}, 0xc00fc5e120)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:210 +0x103
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc00758c180, {0x77c61f0, 0xc00fb37bf0}, 0xc00758c180?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:189 +0x92
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc0006b4000, {0x77c61f0, 0xc00fb37bf0}, 0xc00fc5e000?, {0xc00087da90, 0xc006d52000, 0x0, 0xffffffff, 0x0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2283 +0x124c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0006b4000, {0x77c61f0?, 0xc00fb37bf0}, 0xc00fc5e000, {{0xc00fc4cc40, 0xb, 0x10}, {0xc00fc4cc40, 0xc, 0x10}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1801 +0x845
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0006b4000, {0x77c61f0, 0xc00fb37bf0}, 0xc00fc5e000, {{0xc00fc4cc40, 0xb, 0x10}, {0xc00fc4cc40, 0xc, 0x10}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1372 +0x3e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0006b4000, {0x77c61f0, 0xc009402ed0}, 0xc00fc5e000)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:992 +0x666
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc001ff1f58, {0x77c61f0, 0xc009402ed0}, 0xc00fc5e000)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:223 +0xa6
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc001ff1ef0, {0x77c61f0, 0xc009402ed0}, 0xc00fc5e000, {0x7772a60, 0xc001ff1f58})
	github.com/cockroachdb/cockroach/pkg/kv/db.go:1011 +0xe7
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(0xc00cb6b1a0?, {0x77c61f0?, 0xc009402ed0?}, 0xc003c2a2c0?)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:994 +0x30
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x77c61f0, 0xc009402ed0}, 0xc003c2a2f8, 0xc00cb6b180)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:841 +0xfe
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc001ff1ef0, {0x77c61f0, 0xc009402ed0}, 0x19?)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:864 +0x74
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).MaybePushTransactions(0xc0024acc00, {0x77c61f0, 0xc009402ed0}, _, {{0x176ea7fe62358069, 0x0, 0x0}, 0x0, {0x0, 0x0, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:422 +0x6fb
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).PushTransaction(0x0?, {0x77c61f0, 0xc009402ed0}, _, {{0x176ea7fe62358069, 0x0, 0x0}, 0x0, {0x0, 0x0, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:327 +0x165
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).pushLockTxn(0xc004fd8be0, {0x77c61f0, 0xc009402ed0}, {0xc00d23f7a0, {0x176ea7fe5c3b6e04, 0x0, 0x0}, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:520 +0x54f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn.func3({0x77c61f0?, 0xc009402ed0?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:380 +0x2bc
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc004fd8be0, {0x77c61f0, 0xc009402ed0}, {0xc00d23f7a0, {0x176ea7fe5c3b6e04, 0x0, 0x0}, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:432 +0x4e6
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc004fd8b90, {0x77c61f0, 0xc009402ed0}, 0xc004c69ae0, 0x2?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:345 +0x9cf
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0x0?, {0x77c61f0, 0xc009402ed0}, 0xc001f0b040?, {0xc00d23f7a0, {0x176ea7fe5c3b6e04, 0x0, 0x0}, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:239 +0x2cc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc005c3cc80, {0x77c61f0, 0xc009402ed0}, 0xc00d23f9e0, 0x64e49c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:466 +0x310
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc005c3cc80, {0x77c61f0?, 0xc009402ea0?}, 0xc00d23f9e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:183 +0x6f1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes(0xc002c77500, {0x77c61f0?, 0xc009402e70?}, 0xc00d23f9e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:193 +0x66a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes(0x0?, {0x77c61f0, 0xc009402e70}, 0xc00d23f9e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:202 +0xef
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000277400, {0x77c61f0?, 0xc009402e40?}, {0xc001c41620?}, 0xc00d23f9e0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1280 +0x490
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000277400, {0x77c61f0, 0xc009402de0}, 0xc00d23f9e0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1408 +0x2f5
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func1({0x77c61f0?, 0xc009402de0?}, {0x6068740?, 0xc00d23f9e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:637 +0x4b
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1({0x77c61f0, 0xc009402de0}, {0x6068740, 0xc00d23f9e0}, 0xc0026796a0, 0xc002c41830)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:96 +0x254
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc009402de0?}, {0x6068740?, 0xc00d23f9e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3({0x77c61f0, 0xc009402de0}, {0x6068740, 0xc00d23f9e0}, 0xc00d3abf58?, 0xc0026796c0)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:167 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc009402de0?}, {0x6068740?, 0xc00d23f9e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1({0x77c61f0?, 0xc009402de0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:134 +0x39
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000d1aa00, {0x77c61f0, 0xc009402de0}, {0x0?, 0x176ea5105c78e000?}, 0xc00d3ac020)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:336 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1({0x77c61f0?, 0xc009402de0?}, {0x6068740?, 0xc00d23f9e0?}, 0x0?, 0x0?)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:132 +0x95
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc009402de0?}, {0x6068740?, 0xc00d23f9e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func2({0x77c61f0?, 0xc009402de0?}, {0xc009402de0?, 0x4?}, {0x6068740?, 0xc00d23f9e0?}, {0x5f06e40?, 0xc00c921b80}, 0x203002?, {0x0, ...})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:647 +0x54
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x77c61f0, 0xc009402de0}, {0x6195e7a, 0x21}, {0x6068740, 0xc00d23f9e0}, {0x5f06e40, 0xc00c921b80}, 0x57ca000?, 0xc00085b660, ...)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:227 +0x155
github.com/cockroachdb/cockroach/pkg/rpc.getChainUnaryInvoker.func1({0x77c61f0, 0xc009402de0}, {0x6195e7a, 0x21}, {0x6068740, 0xc00d23f9e0}, {0x5f06e40, 0xc00c921b80}, 0x2000100000000?, {0x0, ...})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:832 +0x13e
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func3({0x77c61f0, 0xc009402d50}, 0xc00d23f8c0, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:717 +0x329
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:840
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc0075be840, {0x77c61f0, 0xc009402d50}, 0xa7?, {0x77c3650, 0xc00242c5a0?}, 0xc00d23f8c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:210 +0x103
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc0075be840, {0x77c61f0, 0xc009402d50}, 0xc0075be840?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:189 +0x92
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc0006b4000, {0x77c61f0, 0xc009402d50}, 0xc00d23f560?, {0xc00087da90, 0xc0095aa5b0, 0xc0095aa620, 0x0, 0x0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2283 +0x124c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0006b4000, {0x77c61f0?, 0xc009402d50}, 0xc00d23f560, {{0xaeb99ad, 0x1, 0x1}, {0xaeb99ab, 0x1, 0x1}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1801 +0x845
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0006b4000, {0x77c61f0, 0xc009402d50}, 0xc00d23f560, {{0xaeb99ad, 0x1, 0x1}, {0xaeb99ab, 0x1, 0x1}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1372 +0x3e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0006b4000, {0x77c61f0, 0xc009402d20}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:992 +0x666
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc00cb68aa0, {0x77c61f0, 0xc009402d20}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x1e2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc00cb68a68, {0x77c61f0?, 0xc009402d20?}, 0x0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:47 +0xe2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc00cb68988, {0x77c61f0, 0xc009402d20}, 0xc00d23f560, 0x5)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:220 +0x263
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc00cb68988, {0x77c61f0, 0xc009402d20}, 0x319280bd7c287742?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:148 +0xb3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc00cb68950, {0x77c61f0, 0xc009402d20}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:145 +0x551
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc00cb68820, {0x77c61f0, 0xc009402d20}, 0x1?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:293 +0x125
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc00cb68800?, {0x77c61f0?, 0xc009402d20?}, 0xc00d23f560?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:110 +0x82
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc00cb68750, {0x77c61f0, 0xc009402d20}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:246 +0x4a6
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc00cb68580, {0x77c6148, 0xc002a46800}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:533 +0x5c2
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc001ff1ef0, {0x77c6148, 0xc002a46800}, 0xc00d23f560, {0x7f615b148670, 0xc00cb68580})
	github.com/cockroachdb/cockroach/pkg/kv/db.go:1011 +0xe7
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc00d277cc0, {0x77c6148, 0xc002a46800}, 0xc00d23f560)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1087 +0x215
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x77c6148, 0xc002a46800}, 0xc00cc550a0, 0xc00cb68b00)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:841 +0xfe
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc00d277cc0, {0x77c6148, 0xc002a46800}, 0x5ddf6c0?)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:697 +0x74
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*meta2RangeIter).readBatch(0xc00d27edc0, {0x77c6148, 0xc002a46800})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:746 +0x27a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*meta2RangeIter).Next(0xc00d27edc0, {0x77c6148, 0xc002a46800})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:699 +0x1fc
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.visitRanges({0x77c6148, 0xc002a46800}, {0x77950a8, 0xc00d27edc0}, 0xc000f3ec00?, {0xc00fe1f7b8, 0x3, 0x3})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:582 +0x17a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*Reporter).update(0xc001016bd0, {0x77c6148, 0xc002a46800}, 0x0?, 0x0?, 0x1?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:235 +0x8c5
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*Reporter).Start.func2({0x77c61f0?, 0xc000c03620?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:156 +0x1e5
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

We also see these failed pushes:

n1 W230704 11:53:09.815397 38115 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 506  failed to push during intent resolution: {<nil> 0 {‹0xc007ae1748›} ‹index:0 › 1688471589.815357569,0}
n4 W230704 11:53:10.189270 15507 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 482  failed to push during intent resolution: {<nil> 0 {‹0xc0028a10f8›} ‹index:0 › 1688471590.189218613,0}
n6 W230704 11:53:21.992761 27477 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 619  failed to push during intent resolution: {<nil> 0 {‹0xc00a456820›} ‹index:0 › 1688471601.992707689,0}
n7 W230704 11:52:58.578073 4545 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 145  failed to push during intent resolution: {<nil> 0 {‹0xc003fc4508›} ‹index:3 › 1688471578.578010243,0}
n8 W230704 11:52:53.391008 2876 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 114  failed to push during intent resolution: {<nil> 0 {‹0xc000a53be8›} ‹index:0 › 1688471573.390978604,0}
n9 W230704 11:53:00.142659 5163 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 166  failed to push during intent resolution: {<nil> 0 {‹0xc0085dd198›} ‹index:0 › 1688471580.142571490,0}


erikgrinaker commented Jul 5, 2023

I'm also seeing a bunch of slow latch acquisitions, but only on n2, across 66 ranges. This began right around the time we started having issues with r1. Unclear if it's causal or just fallout.

I230704 11:53:01.411830 9172 kv/kvserver/replica_command.go:439 ⋮ [T1,n2,s2,r160/6:‹/Table/106/1/5{57456…-64827…}›,*kvpb.AdminSplitRequest] 314  initiating a split of this range at key /Table/106/1/‹5611422148296262192› [r352] (‹manual›)
W230704 11:53:20.527384 9316 kv/kvserver/spanlatch/manager.go:568 ⋮ [T1,n2,s2,r352/1:‹/Table/106/1/56{11422…-48278…}›] 410  have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0

n2 does have 34 write commands that have been replicating for 65 minutes, all of which appear to be txn heartbeats:

executeWriteBatch stack
goroutine 12405 [select, 65 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc00879ac80, {0x77c61f0, 0xc007215830}, 0xc0086379e0, 0xc007e87540)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:201 +0x8f1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00879ac80, {0x77c61f0, 0xc007215830}, 0xc0086379e0, 0x64e49c8)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:508 +0x383
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc00879ac80, {0x77c61f0?, 0xc007215800?}, 0xc0086379e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:187 +0x756
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes(0xc002ce0a80, {0x77c61f0?, 0xc0072157d0?}, 0xc0086379e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:193 +0x66a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes(0x0?, {0x77c61f0, 0xc0072157d0}, 0xc0086379e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:202 +0xef
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000417c00, {0x77c61f0?, 0xc007215770?}, {0xc00201d560?}, 0xc0086379e0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1280 +0x490
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000417c00, {0x77c61f0, 0xc007215710}, 0xc0086379e0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1408 +0x2f5
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func1({0x77c61f0?, 0xc007215710?}, {0x6068740?, 0xc0086379e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:637 +0x4b
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1({0x77c61f0, 0xc007215710}, {0x6068740, 0xc0086379e0}, 0xc000d61720, 0xc000b20750)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:96 +0x254
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc007215710?}, {0x6068740?, 0xc0086379e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3({0x77c61f0, 0xc007215710}, {0x6068740, 0xc0086379e0}, 0xc008ee5ce8?, 0xc000d61740)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:167 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc007215710?}, {0x6068740?, 0xc0086379e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1({0x77c61f0?, 0xc007215710?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:134 +0x39
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc002aa30e0, {0x77c61f0, 0xc007215710}, {0x176ea7f5dd4b618e?, 0x0?}, 0xc008ee5db0)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:336 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1({0x77c61f0?, 0xc007215710?}, {0x6068740?, 0xc0086379e0?}, 0x0?, 0x1?)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:132 +0x95
github.com/cockroachdb/cockroach/pkg/rpc.bindUnaryServerInterceptorToHandler.func1({0x77c61f0?, 0xc007215710?}, {0x6068740?, 0xc0086379e0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:748 +0x3a
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func2({0x77c61f0?, 0xc007215710?}, {0xc007215710?, 0x4?}, {0x6068740?, 0xc0086379e0?}, {0x5f06e40?, 0xc00787ef00}, 0x203001?, {0x0, ...})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:647 +0x54
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x77c61f0, 0xc007215710}, {0x6195e7a, 0x21}, {0x6068740, 0xc0086379e0}, {0x5f06e40, 0xc00787ef00}, 0x57ca000?, 0xc000fa5150, ...)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:227 +0x155
github.com/cockroachdb/cockroach/pkg/rpc.getChainUnaryInvoker.func1({0x77c61f0, 0xc007215710}, {0x6195e7a, 0x21}, {0x6068740, 0xc0086379e0}, {0x5f06e40, 0xc00787ef00}, 0x2000100000000?, {0x0, ...})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:832 +0x13e
github.com/cockroachdb/cockroach/pkg/rpc.makeInternalClientAdapter.func3({0x77c61f0, 0xc007215680}, 0xc0086378c0, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:717 +0x329
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:840
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc0084d5800, {0x77c61f0, 0xc007215680}, 0xbc?, {0x77c3650, 0xc0021112c0?}, 0xc0086378c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:210 +0x103
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc0084d5800, {0x77c61f0, 0xc007215680}, 0xc0084d5800?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:189 +0x92
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc0008bc500, {0x77c61f0, 0xc007215680}, 0xc008637680?, {0xc00100f3b0, 0xc0089509c0, 0xc008950a30, 0x0, 0x0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2283 +0x124c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0008bc500, {0x77c61f0?, 0xc007215680}, 0xc008637680, {{0xc00277a393, 0xb, 0x11}, {0xc00277a393, 0xc, 0x11}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1801 +0x845
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0008bc500, {0x77c61f0, 0xc007215680}, 0xc008637680, {{0xc00277a393, 0xb, 0x11}, {0xc00277a393, 0xc, 0x11}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1372 +0x3e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0008bc500, {0x77c61f0, 0xc007215620}, 0xc008637680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:992 +0x666
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc002db4aa0, {0x77c61f0, 0xc007215620}, 0xc008637680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x1e2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).heartbeatLocked(0xc002db4750, {0x77c61f0, 0xc007215620})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:449 +0x31b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).heartbeat(0xc002db4750, {0x77c61f0, 0xc007215620})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:416 +0xe5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).heartbeatLoop(0xc002db4750, {0x77c61f0, 0xc007215620})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:378 +0xd3
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc002aa30e0, {0x77c61f0, 0xc007215620}, {0x619a4bb?, 0x0?}, 0xc0079b5fb8)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319 +0xd1
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).startHeartbeatLoopLocked.func1()
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:339 +0xf9
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x32

No replica circuit breakers were tripped.

@erikgrinaker
Contributor

All of the stuck ranges on n2 follow the same pattern. They have a couple of pending proposals as well as read/write latches. Here's r352's range info from n2:

Details

  {
    "span": {
      "start_key": "/Table/106/1/5611422148296262192",
      "end_key": "/Table/106/1/5648278779812165392"
    },
    "raft_state": {
      "replica_id": 1,
      "hard_state": {
        "term": 6,
        "vote": 1,
        "commit": 19
      },
      "lead": 1,
      "state": "StateLeader",
      "applied": 19,
      "progress": {
        "1": {
          "match": 19,
          "next": 20,
          "state": "StateReplicate"
        },
        "2": {
          "match": 19,
          "next": 20,
          "state": "StateReplicate"
        },
        "3": {
          "match": 19,
          "next": 20,
          "state": "StateReplicate"
        },
        "4": {
          "match": 19,
          "next": 20,
          "state": "StateReplicate"
        },
        "5": {
          "next": 19,
          "state": "StateProbe",
          "paused": true
        }
      }
    },
    "state": {
      "state": {
        "raft_applied_index": 19,
        "lease_applied_index": 7,
        "desc": {
          "range_id": 352,
          "start_key": "8on9Td/Gtpn1QjA=",
          "end_key": "8on9TmK3n9cUtxA=",
          "internal_replicas": [
            {
              "node_id": 2,
              "store_id": 2,
              "replica_id": 1,
              "type": 0
            },
            {
              "node_id": 6,
              "store_id": 6,
              "replica_id": 2,
              "type": 0
            },
            {
              "node_id": 3,
              "store_id": 3,
              "replica_id": 3,
              "type": 0
            },
            {
              "node_id": 4,
              "store_id": 4,
              "replica_id": 4,
              "type": 0
            },
            {
              "node_id": 8,
              "store_id": 8,
              "replica_id": 5,
              "type": 1
            }
          ],
          "next_replica_id": 6,
          "generation": 67,
          "sticky_bit": {
            "wall_time": 9223372036854775807,
            "logical": 2147483647
          }
        },
        "lease": {
          "start": {
            "wall_time": 1688471581088063261
          },
          "replica": {
            "node_id": 2,
            "store_id": 2,
            "replica_id": 1,
            "type": 0
          },
          "proposed_ts": {
            "wall_time": 1688471581148598109
          },
          "epoch": 1,
          "sequence": 13,
          "acquisition_type": 2
        },
        "truncated_state": {
          "index": 10,
          "term": 5
        },
        "gc_threshold": {},
        "stats": {
          "contains_estimates": 0,
          "last_update_nanos": 1688471582558396732,
          "intent_age": 0,
          "gc_bytes_age": 0,
          "live_bytes": 6600,
          "live_count": 200,
          "key_bytes": 5000,
          "key_count": 200,
          "val_bytes": 1600,
          "val_count": 200,
          "intent_bytes": 0,
          "intent_count": 0,
          "separated_intent_count": 0,
          "range_key_count": 0,
          "range_key_bytes": 0,
          "range_val_count": 0,
          "range_val_bytes": 0,
          "sys_bytes": 623,
          "sys_count": 5,
          "abort_span_bytes": 0
        },
        "version": {
          "major": 23,
          "minor": 1,
          "patch": 0,
          "internal": 10
        },
        "raft_closed_timestamp": {
          "wall_time": 1688471579558611788
        },
        "raft_applied_index_term": 6,
        "gc_hint": {
          "latest_range_delete_timestamp": {}
        }
      },
      "last_index": 19,
      "num_pending": 2,
      "raft_log_size": 3114,
      "approximate_proposal_quota": 8388109,
      "proposal_quota_base_index": 19,
      "range_max_bytes": 536870912,
      "active_closed_timestamp": {
        "wall_time": 1688471585467455751
      },
      "tenant_id": 1,
      "closed_timestamp_sidetransport_info": {
        "replica_closed": {
          "wall_time": 1688471585467455751
        },
        "replica_lai": 7,
        "central_closed": {}
      }
    },
    "source_node_id": 2,
    "source_store_id": 2,
    "lease_history": [
      {
        "start": {
          "wall_time": 1688471581088063261
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 1,
          "type": 0
        },
        "proposed_ts": {
          "wall_time": 1688471581148598109
        },
        "epoch": 1,
        "sequence": 13,
        "acquisition_type": 2
      }
    ],
    "problems": {
      "underreplicated": true
    },
    "stats": {
      "queries_per_second": 0.0032683483833489094,
      "requests_per_second": 0.0032683483839620617,
      "cpu_time_per_second": 605.5896575170474
    },
    "lease_status": {
      "lease": {
        "start": {
          "wall_time": 1688471581088063261
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 1,
          "type": 0
        },
        "proposed_ts": {
          "wall_time": 1688471581148598109
        },
        "epoch": 1,
        "sequence": 13,
        "acquisition_type": 2
      },
      "now": {
        "wall_time": 1688475524035369955
      },
      "request_time": {
        "wall_time": 1688475524035369955
      },
      "state": 1,
      "liveness": {
        "node_id": 2,
        "epoch": 1,
        "expiration": {
          "wall_time": 1688475529219325917,
          "logical": 0
        }
      },
      "min_valid_observed_timestamp": {
        "wall_time": 1688471581088063261
      }
    },
    "ticking": true,
    "read_latches": 4,
    "write_latches": 2,
    "top_k_locks_by_wait_queue_waiters": null,
    "locality": {
      "tiers": [
        {
          "key": "cloud",
          "value": "gce"
        },
        {
          "key": "region",
          "value": "us-east1"
        },
        {
          "key": "zone",
          "value": "us-east1-b"
        }
      ]
    },
    "is_leaseholder": true,
    "lease_valid": true
  },

Interestingly, I even found a few of these that were quiesced:

Details

  {
    "span": {
      "start_key": "/Table/106/1/6717121093773358192",
      "end_key": "/Table/106/1/6753977725289261392"
    },
    "raft_state": {
      "replica_id": 5,
      "hard_state": {
        "term": 7,
        "vote": 5,
        "commit": 37
      },
      "lead": 5,
      "state": "StateLeader",
      "applied": 37,
      "progress": {
        "3": {
          "match": 37,
          "next": 38,
          "state": "StateReplicate"
        },
        "4": {
          "match": 37,
          "next": 38,
          "state": "StateReplicate"
        },
        "5": {
          "match": 37,
          "next": 38,
          "state": "StateReplicate"
        }
      }
    },
    "state": {
      "state": {
        "raft_applied_index": 37,
        "lease_applied_index": 22,
        "desc": {
          "range_id": 255,
          "start_key": "8on9XTgCC8Ok9HA=",
          "end_key": "8on9Xbry9QDEaVA=",
          "internal_replicas": [
            {
              "node_id": 2,
              "store_id": 2,
              "replica_id": 5,
              "type": 0
            },
            {
              "node_id": 5,
              "store_id": 5,
              "replica_id": 4,
              "type": 0
            },
            {
              "node_id": 3,
              "store_id": 3,
              "replica_id": 3,
              "type": 0
            }
          ],
          "next_replica_id": 6,
          "generation": 60,
          "sticky_bit": {
            "wall_time": 9223372036854775807,
            "logical": 2147483647
          }
        },
        "lease": {
          "start": {
            "wall_time": 1688471580416803028
          },
          "replica": {
            "node_id": 2,
            "store_id": 2,
            "replica_id": 5,
            "type": 2
          },
          "proposed_ts": {
            "wall_time": 1688471580433292029
          },
          "epoch": 1,
          "sequence": 13,
          "acquisition_type": 2
        },
        "truncated_state": {
          "index": 22,
          "term": 6
        },
        "gc_threshold": {},
        "stats": {
          "contains_estimates": 0,
          "last_update_nanos": 1688471581911434236,
          "intent_age": 0,
          "gc_bytes_age": 0,
          "live_bytes": 6600,
          "live_count": 200,
          "key_bytes": 5000,
          "key_count": 200,
          "val_bytes": 1600,
          "val_count": 200,
          "intent_bytes": 0,
          "intent_count": 0,
          "separated_intent_count": 0,
          "range_key_count": 0,
          "range_key_bytes": 0,
          "range_val_count": 0,
          "range_val_bytes": 0,
          "sys_bytes": 1284,
          "sys_count": 6,
          "abort_span_bytes": 0
        },
        "version": {
          "major": 23,
          "minor": 1,
          "patch": 0,
          "internal": 10
        },
        "raft_closed_timestamp": {
          "wall_time": 1688471578911824630
        },
        "raft_applied_index_term": 7,
        "gc_hint": {
          "latest_range_delete_timestamp": {}
        }
      },
      "last_index": 37,
      "num_pending": 2,
      "raft_log_size": 5934,
      "approximate_proposal_quota": 8388025,
      "proposal_quota_base_index": 37,
      "range_max_bytes": 536870912,
      "active_closed_timestamp": {
        "wall_time": 1688472975379811219
      },
      "tenant_id": 1,
      "closed_timestamp_sidetransport_info": {
        "replica_closed": {
          "wall_time": 1688472975379811219
        },
        "replica_lai": 22,
        "central_closed": {}
      }
    },
    "source_node_id": 2,
    "source_store_id": 2,
    "lease_history": [
      {
        "start": {
          "wall_time": 1688471577723787541
        },
        "replica": {
          "node_id": 8,
          "store_id": 8,
          "replica_id": 1,
          "type": 0
        },
        "proposed_ts": {
          "wall_time": 1688471577750593265
        },
        "epoch": 1,
        "sequence": 11,
        "acquisition_type": 2
      },
      {
        "start": {
          "wall_time": 1688471580416803028
        },
        "expiration": {
          "wall_time": 1688471586415413325
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 5,
          "type": 2
        },
        "proposed_ts": {
          "wall_time": 1688471580415413325
        },
        "sequence": 12,
        "acquisition_type": 1
      },
      {
        "start": {
          "wall_time": 1688471580416803028
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 5,
          "type": 2
        },
        "proposed_ts": {
          "wall_time": 1688471580433292029
        },
        "epoch": 1,
        "sequence": 13,
        "acquisition_type": 2
      }
    ],
    "problems": {
      "underreplicated": true
    },
    "stats": {},
    "lease_status": {
      "lease": {
        "start": {
          "wall_time": 1688471580416803028
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 5,
          "type": 2
        },
        "proposed_ts": {
          "wall_time": 1688471580433292029
        },
        "epoch": 1,
        "sequence": 13,
        "acquisition_type": 2
      },
      "now": {
        "wall_time": 1688475524035369955
      },
      "request_time": {
        "wall_time": 1688475524035369955
      },
      "state": 1,
      "liveness": {
        "node_id": 2,
        "epoch": 1,
        "expiration": {
          "wall_time": 1688475529219325917,
          "logical": 0
        }
      },
      "min_valid_observed_timestamp": {
        "wall_time": 1688471580416803028
      }
    },
    "quiescent": true,
    "read_latches": 2,
    "write_latches": 2,
    "top_k_locks_by_wait_queue_waiters": null,
    "locality": {
      "tiers": [
        {
          "key": "cloud",
          "value": "gce"
        },
        {
          "key": "region",
          "value": "us-east1"
        },
        {
          "key": "zone",
          "value": "us-east1-b"
        }
      ]
    },
    "is_leaseholder": true,
    "lease_valid": true
  },

At this point, I'm starting to suspect proposal issues. Wdyt @tbg?

@tbg tbg assigned tbg and unassigned erikgrinaker Jul 5, 2023
@tbg tbg added T-kv-replication and removed T-kv KV Team labels Jul 5, 2023
@blathers-crl

blathers-crl bot commented Jul 5, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

erikgrinaker commented Jul 5, 2023

Copying over this summary from Slack, to avoid having to piece it together from the above.

What seems to be happening is that about 60 ranges on n2 end up with stuck proposals holding latches, without tripping the circuit breakers. Some of these latches are for transaction records owning intents on meta1. When other callers try to scan meta1, they hit the intents, and r1 on n6 attempts to resolve the intents on n2, hitting the latches owned by the stuck proposals and stalling forever.

@tbg
Member

tbg commented Jul 6, 2023

Starting this investigation from scratch to get my bearings.

First step: look at the combined logs and find the first slow latch ("have been waiting" message), which is usually a good way to figure out when replication-related trouble began. This happens on r352, at 11:53:20 on n2, and the write latch that has been held for 15s at that point is the range descriptor (‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0).

15s before that point, we see an AdminScatter. Okay, we tried to scatter the range, and that borked it. So far, so good. From that point on there are various more slow latches, etc, so ok, it doesn't unbork and is likely wedged until the end of the test. But why?

The new thing we see at 11:57:23 is raft receive queue for r352 is full, again on n2, and this repeats frequently throughout. This is unexpected - we don't expect any traffic on this range.

Spot checking a few other ranges, there is a pattern.

85 ranges logged slow latches in the test.
$ grep 'have been waiting' combined.log  | grep -Eo 'r[0-9]+' | sort | uniq | tee affected.txt
r1
r106
r114
r116
r145
r154
r155
r157
r159
r160
r163
r168
r174
r177
r178
r180
r181
r182
r183
r187
r190
r194
r205
r214
r217
r218
r220
r230
r231
r255
r257
r258
r292
r308
r310
r311
r313
r326
r327
r344
r345
r346
r347
r348
r349
r350
r351
r352
r353
r368
r380
r384
r385
r386
r387
r388
r389
r390
r391
r392
r393
r414
r415
r417
r418
r419
r420
r421
r422
r433
r493
r64
r652
r68
r704
r74
r78
r80
r824
r84
r87
r88
r90
r91
r93

When you look at the distribution of when they got stuck, there is a very strong correlation. These are not independent events - almost all of the slow latches are logged at 11:53:2[01]. Something got borked. In addition to this, all of the first events occur on n2. So, something happened on n2 that affected only the ranges it held the lease for.

Details

$ for rid in $(cat affected.txt); do grep -E "$rid.*have been waiting.*latch" combined.log | head -n 1; done | grep -Eo 'W230704 ........' > first_affected.txt
$ cat first_affected.txt | sort | uniq -c
  12 W230704 11:53:20
  17 W230704 11:53:21
   3 W230704 11:53:23
   3 W230704 11:53:24
   1 W230704 11:54:30
   1 W230704 11:56:30
   1 W230704 11:57:30
   1 W230704 11:58:30
   1 W230704 11:59:30
   1 W230704 12:00:30
   1 W230704 12:02:30
   1 W230704 12:04:30
   1 W230704 12:07:30
   1 W230704 12:09:30
   1 W230704 12:12:30
   1 W230704 12:13:30
   1 W230704 12:14:30
   1 W230704 12:16:30
   1 W230704 12:17:30
   1 W230704 12:19:30
   1 W230704 12:20:30
   1 W230704 12:21:30
   1 W230704 12:22:30
   1 W230704 12:23:30
   1 W230704 12:24:30
   1 W230704 12:28:30
   1 W230704 12:30:30
   1 W230704 12:33:30
   1 W230704 12:47:30
   1 W230704 12:49:30
   1 W230704 12:50:30
   1 W230704 12:51:30
   1 W230704 12:52:30
   1 W230704 12:53:30
   1 W230704 12:55:30
   1 W230704 12:56:30

@tbg
Member

tbg commented Jul 6, 2023

Flipping through the raft state for the affected ranges, for most of them n2 is the leader+leaseholder and all followers are in StateReplicate.

Details

for rid in $(cat affected.txt | sed 's/r//g'); do cat debug/nodes/2/ranges.json  | jq ".[] | select(.state.state.desc.range_id == $rid) | .raft_state"; done | less
[...]

We see 63 ranges report "raft receive queue is full". Notably, this is fewer than the 85 ranges we have in total, so maybe there is more than one way in which things broke down. Again, there doesn't seem to be a reason why any receive queue should fill up. After all, looking at e.g. r352, its last index is 19. There simply isn't much traffic on these ranges, and they're stuck anyway.

@erikgrinaker
Contributor

erikgrinaker commented Jul 6, 2023

Couldn't the receive queues filling up simply be heartbeat responses and such?

Normally I'd suspect some kind of mutex deadlock here, but I didn't immediately see anything in the stacks. Could this be some kind of livelock scenario?

(EDIT: I originally looked at n1 and n6 stacks, where r1 was stuck, not n2)

@erikgrinaker

This comment was marked as outdated.

@erikgrinaker

This comment was marked as outdated.

@tbg
Member

tbg commented Jul 6, 2023

This smelled like some sort of deadlock, so I looked through the stacks. This one caught my eye right away:

goroutine 155 [select, 65 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaScanner).RemoveReplica(...)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scanner.go:185
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).removeInitializedReplicaRaftMuLocked(0xc002ce0a80, {0x77c61f0, 0xc0051e2240}, 0xc008d69200, 0x7, {0x0?, 0x0?})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_remove_replica.go:227 +0xa45
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleChangeReplicasResult(0xc008d69200, {0x77c61f0, 0xc0051e2240}, 0xc007989c80)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:619 +0x29e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc008d69308, {0x77c61f0, 0xc0051e2240}, 0xc0078f11d8)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:340 +0x670
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc008d69308, {0x77c61f0, 0xc0051e2240}, {0x780c0c0?, 0xc003158508})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:186 +0x46a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f6f381ff818?, 0xc008d696d8}, 0xc0062ca560)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:214 +0x158
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc00779ea78, {0x77c61f0, 0xc0051e2240}, {0x77f9b40, 0xc008d69678})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:295 +0x1f8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0062caa78, {0x77c61f0, 0xc0051e2240})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1052 +0x1545
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:751 +0x19b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc002ce0a80, 0x77c61f0?)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:660 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc000f6d1f0, {0x77c61f0, 0xc00102cab0}, {0x77a9ac0, 0xc002ce0a80}, 0xc002ce0000)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418 +0x1f2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x77c61f0?, 0xc00102cab0?})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

We seem to have deadlocked on replica removal here, which wedges that replica's raftMu. It's not obvious why removing from the scanner would get stuck like this, but all sorts of bad things result from it.

Let's find out what the scanner is doing.

We're trying to send to this channel

func (rs *replicaScanner) RemoveReplica(repl *Replica) {
	select {
	case rs.removed <- repl:

The consumer is this:

func (rs *replicaScanner) waitAndProcess(ctx context.Context, start time.Time, repl *Replica) bool {
	waitInterval := rs.paceInterval(start, timeutil.Now())
	rs.waitTimer.Reset(waitInterval)
	if log.V(6) {
		log.Infof(ctx, "wait timer interval set to %s", waitInterval)
	}
	for {
		select {
		case <-rs.waitTimer.C:
			if log.V(6) {
				log.Infof(ctx, "wait timer fired")
			}
			rs.waitTimer.Read = true
			if repl == nil {
				return false
			}
			if log.V(2) {
				log.Infof(ctx, "replica scanner processing %s", repl)
			}
			for _, q := range rs.queues {
				q.MaybeAddAsync(ctx, repl, rs.clock.NowAsClockTimestamp())
			}
			return false
		case repl := <-rs.removed:
			rs.removeReplica(repl)
		case <-rs.stopper.ShouldQuiesce():
			return true
		}
	}
}

In the stacks we see that this goroutine... isn't there? That can't be good. Instead, we see replicaScanner.scanLoop deadlocked:

goroutine 263 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0x7794ef0?, 0x10?, 0x0?)
        GOROOT/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
        GOROOT/src/sync/rwmutex.go:71
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc001021410, 0xc002f93cc0)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:463 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaScanner).scanLoop.func1({0x77c61f0?, 0xc000cb9950})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scanner.go:284 +0x23c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

where it's trying to acquire a Replica mutex here:

for _, repl := range rs.repls {
	// TODO(tschottdorf): let the visitor figure out if something's been
	// destroyed once we return errors from mutexes (#9190). After all, it
	// can still happen with this code.
	rs.visited++
	repl.mu.RLock()

TLDR: there's at least one deadlocked replica mu on n2.

An immediate consequence of this is possible raft scheduler starvation. We see lots of the scheduler's goroutines stuck in the same removeReplica deadlock, since everyone who calls RemoveReplica on the scanner gets stuck on the same glue trap. This likely explains the majority of badness in this cluster (though not why we have quiesced replicas with pending proposals; that is likely an independent bug we ought to look into). We also see scheduler threads deadlocked on tryGetReplica, so it's safe to say there isn't a free scheduler thread on this node.

Ok, so can we find out where the mutex leaked? I don't see an obvious way. I went through all goroutines containing Replica in their stack, and they all seem to be suffering from the mutex being held, but nobody holding it.

The deadlock reminds me of what we saw in https://github.com/cockroachlabs/support/issues/2387#issuecomment-1602346347 but could never track down.

Since this now seems to be at least somewhat reproducible, we can annotate the mutex and hack together a version of #66765.

I'll get started on that today.

@tbg
Member

tbg commented Jul 6, 2023

Kicking off the repro machine, on #106254. Let's hope something shakes out soon; if it does, we'll know who leaked the mutex, and that should pretty much settle it. If it doesn't repro... uhm, well, I am willing to try pretty hard to have it repro anyway. It happened once, it will happen again.

@cockroach-teamcity

This comment was marked as resolved.

@tbg
Member

tbg commented Jul 6, 2023

I ran this on master and something else seems to have broken that's way worse and very frequent. Now the test fails most of the time, with the initial splits getting stuck in an infinite loop, where it's continuously seeing a changed descriptor and retrying with the same descriptor; that descriptor looks OK, but somehow the one it sees in the split txn is "old". Something got borked. I suspect it's #106104, which, I think, can cause replica change txns to succeed on the range but roll back on meta2. This is being reverted now, in #106267.

@tbg
Copy link
Member

tbg commented Jul 6, 2023

Well, the good news is that with #106267 we seem to actually be able to run this test. The bad news is, it doesn't seem to want to reproduce the original issue here. I ran all of the 🔴 🟢 in #106285 with #106254, i.e. it was doing only the splits and then exiting early, and this never reproduced the problem we are seeing here, over a total of, I'm guessing, 500 runs, including 300 on the SHA of this issue (which predates the other bad commit anyway, so didn't need the revert).

Going to try for 9999 runs overnight, on this issue's SHA.

@tbg

tbg commented Jul 7, 2023

All 1643 nightly runs have passed.
I am going to kick off a few hundred runs on the vanilla original SHA (428dc9d) with an early exit to see if it's the mutex instrumentation preventing the deadlock.

@tbg

tbg commented Jul 7, 2023

Just for posterity: I didn't understand why I saw AdminScatter in the original logs (and a ton of it):

teamcity-10771334-1688449519-65-n10cpu2-0001> I230704 11:52:56.660750 16288 13@kv/kvserver/allocator/plan/replicate.go:900 ⋮ [T1,n1,s1,r17/1:‹/Table/1{5-6}›,kvpb.AdminScatterRequest] 2156 rebalancing ‹voter› n2,s2 to n6,s6: ‹[1:180, 2:180, 3:180, 4:180, 5:180]›,

I was confused by this because we do not see this line printed in the logs

t.L().Printf("scattering table")

but we do see the "creating workload database" line above the workload init. The workload init does not pass --scatter either, and that flag defaults to false.

But I found out by code inspection that when the workload creates splits, it also scatters each split:

fmt.Fprintf(&buf, `ALTER TABLE %s SCATTER FROM (%s) TO (%s)`,
	tree.NameString(table.Name), split, split)

So everything checks out and the repro attempts, sadly, actually made sense. I was hoping that I had missed something that was present in the original execution.

@tbg

tbg commented Jul 7, 2023

I am going to kick off a few hundred runs on the vanilla original SHA (428dc9d) with an early exit to see if it's the mutex instrumentation preventing the deadlock.

Predictably all 100 passed. I have since run hundreds more and counting.

@tbg

tbg commented Jul 8, 2023

The full test is passing repeatedly too, just as you'd think it would. There is one flake I have hit three times (out of 181 runs of the test):

--- FAIL: failover/chaos/read-only#107 (2599.08s)
(assertions.go:333).Fail: 
        Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1605
                                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:327
                                                main/pkg/cmd/roachtest/monitor.go:105
                                                golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
                                                GOROOT/src/runtime/asm_amd64.s:1594
        Error:          Received unexpected error:
                        pq: setting updated but timed out waiting to read new value
        Test:           failover/chaos/read-only

I've spot checked the logs for one instance and it doesn't look like the deadlock.

@tbg

tbg commented Jul 8, 2023

I also ran 3151 more iterations of the "shortened" test overnight. They all passed.
I'll give this another 24h and then it might be time to cut our losses.

edit, 2 days later:

I'll give this another 24h and then it might be time to cut our losses.

didn't start due to a dumb error on my part. 24h timer starts now.

@tbg

tbg commented Jul 10, 2023

😢 PASS failover/chaos/read-only (run 2071)

@tbg

tbg commented Jul 11, 2023

Okay, time to throw in the towel:

failover/chaos/read-only (run 3701) PASS

@tbg

tbg commented Jul 11, 2023

I filed #106568 instead (with release-blocker label for now, even though it's not a new bug).

@tbg

tbg commented Jul 17, 2023

#106574 hopefully fixes this.

@tbg tbg closed this as completed Jul 17, 2023
@andrewbaptist andrewbaptist added the X-duplicate Closed as a duplicate of another issue. label Jul 19, 2023