roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588

Closed
cockroach-teamcity opened this issue Jun 8, 2023 · 21 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity (Member) commented Jun 8, 2023

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ c7a7e423506c88f655c08d6ba6acd6e907e1dae5:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(sql_runner.go:203).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': pq: replica unavailable: (n3,s3):5 unable to serve request to r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n6,s6):4, (n3,s3):5, next=7, gen=10]: closed timestamp: 1686231735.000116914,0 (2023-06-08 13:42:15); raft status: {"id":"5","term":8,"vote":"5","commit":6636,"lead":"5","raftState":"StateLeader","applied":6636,"progress":{"2":{"match":6636,"next":6637,"state":"StateReplicate"},"3":{"match":6636,"next":6637,"state":"StateReplicate"},"4":{"match":6636,"next":6637,"state":"StateReplicate"},"5":{"match":6636,"next":6637,"state":"StateReplicate"},"1":{"match":6636,"next":6637,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"@0,0
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-28616

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 8, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 8, 2023
@pav-kv (Collaborator) commented Jun 9, 2023

encountered poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"@0,0

while querying r1

@pav-kv (Collaborator) commented Jun 9, 2023

Could be an instance of #102936

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. labels Jun 9, 2023
@pav-kv (Collaborator) commented Jun 9, 2023

The first error mentioning r548 is a circuit breaker tripping while n3 is still in a demoting state:

13:41:43.916584 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] breaker: tripped with error: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":50,"next":51,"state":"StateReplicate"},"3":{"match":50,"next":52,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal ConditionalPut [/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor›,/Min), [txn: 97e8d7ec]

Exactly the same poisoned latch (key ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0) pops up in logs a few times shortly after:

13:41:43.930897 kv/kvserver/reports/reporter.go:159 ⋮ [T1,n3,replication-reporter] failed to generate replication reports: failed to compute constraint conformance report: <same latch>
13:41:43.931478 kv/txn.go:878 ⋮ [T1,n4] async rollback failed: <same latch>: "change-replica" meta={id=97e8d7ec key=/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor› iso=Serializable pri=0.03170581 epo=0 ts=1686231642.299152484,0 min=1686231642.299152484,0 seq=1} lock=true stat=PENDING rts=1686231642.299152484,0 wto=false gul=1686231642.799152484,0

Between 13:41:44-13:42:15, there are a few "poisoned latch" errors with the same key logged on nodes 1, 2, 4, 5 (from line kv/kvserver/queue.go:1146):

[T1,n4,replicaGC,s4,r155/3:‹/Table/106/1/21{294673-552784}›]
[T1,n5,replicaGC,s5,r155/5:‹/Table/106/1/21{294673-552784}›]
[T1,n1,replicaGC,s1,r3/1:‹/System/{NodeLive…-tsd}›]
[T1,n2,replicaGC,s2,r514/4:‹/Table/106/1/110{3429…-6010…}›]
[T1,n5,replicaGC,s5,r163/5:‹/Table/106/1/23{359561-617672}›]
[T1,n2,replicaGC,s2,r502/4:‹/Table/106/1/107{2456…-5037…}›]
[T1,n1,replicaGC,s1,r182/1:‹/Table/106/1/28{263670-521781}›]
[T1,n5,replicaGC,s5,r85/5:‹/Table/106/1/3{121855-381967}›]
[T1,n2,replicaGC,s2,r199/5:‹/Table/106/1/32{651557-909668}›]

@kvoli kvoli removed this from the 23.1 milestone Jun 20, 2023
@tbg (Member) commented Jun 27, 2023

@erikgrinaker (Contributor) commented:

@tbg says:

It's weird to me that we're seeing someone take out that latch for range=548 but while on r1. Maybe some weird interaction where r548 got merged into r1 while the latch was held, but how would that work (merges acquire that latch too)?

@erikgrinaker says:

What if it was synchronously resolving an intent it found on r1, which was written by txn 97e8d7ec anchored on r548? If the synchronous intent resolution hit a poisoned latch on the abort span on r548 (which it would, because that txn was the one that tripped the breaker and got aborted), the poisoned latch error might have been propagated back up to the caller and reported as a poisoned latch on r1.
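A toy sketch of the propagation path described above (all names and types here are hypothetical, not CockroachDB APIs): an error raised while synchronously resolving an intent against the txn's anchor range (r548) is wrapped and returned to the caller serving r1, so it surfaces as an error "on r1" even though the poisoned latch lives on r548.

```go
package main

import (
	"errors"
	"fmt"
)

// errPoisonedLatch stands in for the poisoned-latch error signalled on the
// abort span of the txn's anchor range (r548 in this failure).
var errPoisonedLatch = errors.New(`poisoned latch /Local/RangeID/548/r/AbortSpan/"97e8d7ec-..."`)

// resolveIntentOnAnchorRange models synchronous intent resolution: it touches
// the abort span of the anchor range, whose latch is poisoned because txn
// 97e8d7ec tripped the circuit breaker and was aborted.
func resolveIntentOnAnchorRange() error {
	return errPoisonedLatch
}

// readRange models serving a read on r1 that stumbles over an intent written
// by the txn anchored on r548 and resolves it synchronously; the resolution
// error is propagated back to the caller, so it is reported against r1.
func readRange(rangeID int) error {
	if err := resolveIntentOnAnchorRange(); err != nil {
		return fmt.Errorf("r%d: unable to serve request: %w", rangeID, err)
	}
	return nil
}

func main() {
	err := readRange(1)
	fmt.Println(err)                              // mentions r1, but carries r548's latch key
	fmt.Println(errors.Is(err, errPoisonedLatch)) // true: the root cause is r548's poisoned latch
}
```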

@pav-kv (Collaborator) commented Jun 29, 2023

The contention events log has something to say about the aborted transaction 97e8d7ec:

debug/nodes/1/crdb_internal.node_contention_events.json:

{
   "count" : "1",
   "cumulative_contention_time" : "00:00:00.250195",
   "index_id" : "106",
   "key" : "\\x03f289f9071d8bd5",
   "num_contention_events" : "2",
   "table_id" : "4294967163",
   "txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c"
}

debug/crdb_internal.transaction_contention_events.json:

{
   "blocking_txn_fingerprint_id" : "\\x0000000000000000",
   "blocking_txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c",
   "collection_ts" : "2023-06-08 13:42:19.554013+00",
   "contending_key" : "\\x03f289f9071d8bd5",
   "contending_pretty_key" : "/Meta2/Table/106/1/119376853",
   "contention_duration" : "00:00:00.052876",
   "database_name" : "",
   "index_name" : "",
   "schema_name" : "",
   "table_name" : "",
   "waiting_stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
   "waiting_stmt_id" : "1766b2d540926e3c0000000000000001",
   "waiting_txn_fingerprint_id" : "\\x0000000000000000",
   "waiting_txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb"
}

Execution insights on the contending transaction 55eed750 (looks like a failed consistency check):

{
   "app_name" : "",
   "causes" : "{}",
   "contention" : "00:00:00",
   "cpu_sql_nanos" : "6072049",
   "database_name" : "defaultdb",
   "end_time" : "2023-06-08 13:42:19.556586",
   "error_code" : "XXUUU",
   "exec_node_ids" : "{1}",
   "full_scan" : "f",
   "implicit_txn" : "t",
   "index_recommendations" : "{}",
   "last_retry_reason" : "NULL",
   "plan_gist" : "AgICABoMAwUIBgg=",
   "priority" : "normal",
   "problem" : "FailedExecution",
   "query" : "SELECT t.range_id, t.start_key_pretty, t.status, t.detail FROM ROWS FROM (crdb_internal.check_consistency(_, '_', '_')) AS t WHERE t.status NOT IN ('_', '_')",
   "retries" : "0",
   "rows_read" : "0",
   "rows_written" : "0",
   "session_id" : "1766b2d53cfd91a80000000000000001",
   "start_time" : "2023-06-08 13:42:19.480564",
   "status" : "Failed",
   "stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
   "stmt_id" : "1766b2d540926e3c0000000000000001",
   "txn_fingerprint_id" : "\\x3ff3476808ae6c64",
   "txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb",
   "user_name" : "root"
}

@pav-kv (Collaborator) commented Jun 29, 2023

r548 circuit breaker tripped with the following error:

"circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded"`

More details on r548 on n3:

Details
"span": {
  "start_key": "/Table/106/1/119118742",
  "end_key": "/Table/106/1/119376853"
},
"raft_state": {
  "replica_id": 2,
  "hard_state": {
    "term": 7,
    "vote": 2,
    "commit": 29
  },
  "lead": 2,
  "state": "StateLeader",
  "applied": 29,
  "progress": {
    "1": {
      "next": 1,
      "state": "StateSnapshot",
      "paused": true,
      "pending_snapshot": 29
    },
    "2": {
      "match": 161,
      "next": 162,
      "state": "StateReplicate"
    },
    "3": {
      "match": 161,
      "next": 162,
      "state": "StateReplicate"
    },
    "4": {
      "next": 1,
      "state": "StateSnapshot",
      "paused": true,
      "pending_snapshot": 26
    }
  }
},
"state": {
  "state": {
    "raft_applied_index": 29,
    "lease_applied_index": 9,
    "desc": {
      "range_id": 548,
      "start_key": "8on5Bxmblg==",
      "end_key": "8on5Bx2L1Q==",
      "internal_replicas": [
        {
          "node_id": 1,
          "store_id": 1,
          "replica_id": 1,
          "type": 0
        },
        {
          "node_id": 3,
          "store_id": 3,
          "replica_id": 2,
          "type": 4
        },
        {
          "node_id": 4,
          "store_id": 4,
          "replica_id": 3,
          "type": 0
        },
        {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 4,
          "type": 2
        }
      ],
      "next_replica_id": 5,
      "generation": 475,
      "sticky_bit": {}
    },
    "lease": {
      "start": {
        "wall_time": 1686231642153434301
      },
      "expiration": {
        "wall_time": 1686231648153352643
      },
      "replica": {
        "node_id": 3,
        "store_id": 3,
        "replica_id": 2,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1686231642153352643
      },
      "sequence": 5,
      "acquisition_type": 1
    },
    "truncated_state": {
      "index": 10,
      "term": 5
    },
    "gc_threshold": {},
    "stats": {
      "contains_estimates": 0,
      "last_update_nanos": 1686231615586276112,
      "intent_age": 0,
      "gc_bytes_age": 0,
      "live_bytes": 33554430,
      "live_count": 258111,
      "key_bytes": 5420331,
      "key_count": 258111,
      "val_bytes": 28134099,
      "val_count": 258111,
      "intent_bytes": 0,
      "intent_count": 0,
      "separated_intent_count": 0,
      "range_key_count": 0,
      "range_key_bytes": 0,
      "range_val_count": 0,
      "range_val_bytes": 0,
      "sys_bytes": 866,
      "sys_count": 8,
      "abort_span_bytes": 0
    },
    "version": {
      "major": 23,
      "minor": 1,
      "patch": 0,
      "internal": 8
    },
    "raft_closed_timestamp": {
      "wall_time": 1686231639296513274
    },
    "raft_applied_index_term": 7,
    "gc_hint": {
      "latest_range_delete_timestamp": {}
    }
  },
  "last_index": 161,
  "num_pending": 4,
  "raft_log_size": 43070,
  "approximate_proposal_quota": 8388183,
  "proposal_quota_base_index": 29,
  "range_max_bytes": 67108864,
  "active_closed_timestamp": {
    "wall_time": 1686231639296513274
  },
  "tenant_id": 1,
  "closed_timestamp_sidetransport_info": {
    "replica_closed": {
      "wall_time": 1686231639279698235
    },
    "replica_lai": 7,
    "central_closed": {}
  },
  "circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded"
}
| Replica | Type | State | Node |
|---|---|---|---|
| 1 | `VOTER_FULL` | `StateSnapshot` @ 29 | 1 |
| 2 (leader) | `VOTER_DEMOTING_LEARNER` | `StateReplicate` @ 161 | 3 |
| 3 | `VOTER_FULL` | `StateReplicate` @ 161 | 4 |
| 4 | `VOTER_INCOMING` | `StateSnapshot` @ 26 | 2 |

We see that the incoming configuration (replicas 1, 3, 4) has two replicas in StateSnapshot, so we're unable to append/commit to it until the snapshots go through. Notably, this is a "large range" test, so the snapshots can be big and slow to get through.
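For illustration, a minimal joint-quorum check over this descriptor (a sketch of the quorum rule only, not CockroachDB or raft code): with replicas 1 and 4 stuck in StateSnapshot, the outgoing config {1, 2, 3} still has a majority of caught-up members, but the incoming config {1, 3, 4} does not, so the joint config cannot commit anything.

```go
package main

import "fmt"

// hasQuorum reports whether a majority of cfg is in the caught-up set.
func hasQuorum(cfg []int, caughtUp map[int]bool) bool {
	n := 0
	for _, id := range cfg {
		if caughtUp[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

func main() {
	outgoing := []int{1, 2, 3} // (n1,s1):1, (n3,s3):2 (demoting), (n4,s4):3
	incoming := []int{1, 3, 4} // (n1,s1):1, (n4,s4):3, (n2,s2):4 (incoming)
	// Replicas 1 and 4 are in StateSnapshot (match=0); only 2 and 3 replicate.
	caughtUp := map[int]bool{2: true, 3: true}

	fmt.Println("outgoing quorum:", hasQuorum(outgoing, caughtUp)) // true
	fmt.Println("incoming quorum:", hasQuorum(incoming, caughtUp)) // false
	// A joint config needs a majority in *both* halves to commit, so r548 is
	// stuck until a snapshot catches up replica 1 or 4.
	fmt.Println("joint quorum:", hasQuorum(outgoing, caughtUp) && hasQuorum(incoming, caughtUp))
}
```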

@pav-kv (Collaborator) commented Jun 29, 2023

# INITIAL snapshot at index 25 arrived to replica 4 from replica 2
I230608 13:40:42.240333 34335 13@kv/kvserver/store_snapshot.go:1591 ⋮ [T1,n4,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 739  streamed INITIAL snapshot 75ebb1f1 at applied index 25 to (n2,s2):4LEARNER with 32 MiB in 0.39s @ 81 MiB/s: kvs=258120 rangeKVs=0, rate-limit: 512 MiB/s, queued: 0.00s
# A config change promoting replica 4 to voter is being proposed
# likely at index 26
I230608 13:40:42.288917 34370 13@kv/kvserver/replica_command.go:2349 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 740  change replicas (add [(n2,s2):4VOTER_INCOMING] remove [(n3,s3):2VOTER_DEMOTING_LEARNER]): existing descriptor r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4LEARNER, next=5, gen=474]
I230608 13:40:42.300814 34370 13@kv/kvserver/replica_command.go:2349 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 741  change replicas (add [] remove []): existing descriptor r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]

# could not acquire latches
W230608 13:40:57.310779 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0
W230608 13:40:57.741162 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/{Local-Meta1/}›@0,0, held by write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0
W230608 13:40:57.741162 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by read latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0
W230608 13:40:57.802914 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:57.852312 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:58.303021 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:40:58.309024 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:41:21.518763 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/Local/RangeID/548/r/AbortSpan/"97e8d7ec-2a38-4880-9df6-3b51abc9086c"›@0,0, held by write latch ‹/{Local-Meta1/}›@0,0
W230608 13:41:27.741000 [n3,r548/2:‹/Table/106/1/119{1187…-3768…}›] have been waiting 15s to acquire write latch ‹/{Local-Meta1/}›@0,0, held by write latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0


# 1 minute later the proposal reports a timeout
I230608 13:41:41.838878 34370 13@kv/kvserver/replicate_queue.go:751 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 778  error processing replica: change replicas of r548 failed: txn exec: context deadline exceeded
E230608 13:41:41.838985 34370 kv/kvserver/queue.go:1146 ⋮ [T1,n4,replicate,s4,r548/3:‹/Table/106/1/119{1187…-3768…}›] 423  operation ‹"replicate queue process replica 548"› timed out after 1m0.002s (given timeout 1m0s): change replicas of r548 failed: txn exec: context deadline exceeded

# 2 seconds later the poisoned latch is discovered
I230608 13:41:43.931478 38260 kv/txn.go:878 ⋮ [T1,n4] 426  async rollback failed: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":52,"next":53,"state":"StateReplicate"},"3":{"match":52,"next":53,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch ‹/Local/Range/Table/106/1/119118742/RangeDescriptor›@0,0: "change-replica" meta={id=97e8d7ec key=/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor› iso=Serializable pri=0.03170581 epo=0 ts=1686231642.299152484,0 min=1686231642.299152484,0 seq=1} lock=true stat=PENDING rts=1686231642.299152484,0 wto=false gul=1686231642.799152484,0

@pav-kv (Collaborator) commented Jun 29, 2023

There is a build-up in the snapshot queue on n3. The second snapshot@26 could be blocked behind other snapshots being sent, because the number of pending snapshots at 13:41:30 is still high and the receiver node processes them one at a time.

[Screenshots: pending Raft snapshot metrics, captured 2023-06-29 and 2023-06-30]
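A minimal sketch of the head-of-line blocking this implies (a hypothetical model, not the actual snapshot reservation code): if the receiver admits one snapshot at a time, a single multi-gigabyte transfer delays every queued small snapshot behind it.

```go
package main

import (
	"fmt"
	"time"
)

// receiveSnapshots models a receiver that reserves a single slot at a time,
// so snapshots are processed strictly in arrival order.
func receiveSnapshots(durations []time.Duration) []time.Duration {
	var waited []time.Duration
	var clock time.Duration
	for _, d := range durations {
		waited = append(waited, clock) // time spent queued behind earlier snapshots
		clock += d
	}
	return waited
}

func main() {
	// One ~39s snapshot (like r470's 3.6 GiB) followed by small ~0.4s snapshots.
	queue := []time.Duration{39 * time.Second, 400 * time.Millisecond, 400 * time.Millisecond}
	for i, w := range receiveSnapshots(queue) {
		fmt.Printf("snapshot %d waited %s before starting\n", i, w)
	}
	// Every small snapshot inherits the full delay of the large one ahead of it,
	// which matches the gap seen in the "applied ... snapshot" log lines below.
}
```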

@pav-kv (Collaborator) commented Jun 30, 2023

$ cockroach debug merge-logs --filter="initiating a split" --format=crdb-v1 *.unredacted/*log
...

# The range is initially 16 GiB; at 13:37:24 it got reconfigured for a smaller
# range size and started splitting. The splits happen sequentially, chopping off 64 MiB
# from the left each time. The new RHS is then split again, and so on.
I230608 13:37:24.016304 21504 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r64/2:‹/{Table/106-Max}›] 162  initiating a split of this range at key /Table/106/1/‹260623› [r74] (‹16 GiB above threshold size 64 MiB›)‹; r64/6 is being probed (may or may not need a Raft snapshot)›
I230608 13:37:24.484801 21598 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r74/2:‹/{Table/106/1/…-Max}›] 166  initiating a split of this range at key /Table/106/1/‹520735› [r75] (‹16 GiB above threshold size 64 MiB›)‹; r74/4 is waiting for a Raft snapshot›
I230608 13:37:24.938527 21659 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r75/2:‹/{Table/106/1/…-Max}›] 171  initiating a split of this range at key /Table/106/1/‹780847› [r76] (‹16 GiB above threshold size 64 MiB›)‹; r75/4 is waiting for a Raft snapshot›
I230608 13:37:25.365611 21859 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r76/2:‹/{Table/106/1/…-Max}›] 174  initiating a split of this range at key /Table/106/1/‹1040959› [r77] (‹16 GiB above threshold size 64 MiB›)‹; r76/4 is waiting for a Raft snapshot›

# The above splits print a "hint" saying something like
#  "r76/4 is waiting for a Raft snapshot"
# After a while, the hint started printing two replicas:
#  "r128/4 is waiting for a Raft snapshot; r128/1 is being probed (may or may not need a Raft snapshot"

I230608 13:37:41.251934 23611 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r128/2:‹/{Table/106/1/…-Max}›] 231  initiating a split of this range at key /Table/106/1/‹14566783› [r129] (‹14 GiB above threshold size 64 MiB›)‹; r128/4 is waiting for a Raft snapshot; r128/1 is being probed (may or may not need a Raft snapshot)›
I230608 13:37:41.562954 23687 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r129/2:‹/{Table/106/1/…-Max}›] 232  initiating a split of this range at key /Table/106/1/‹14826895› [r130] (‹14 GiB above threshold size 64 MiB›)‹; r129/1 is being probed (may or may not need a Raft snapshot); r129/4 is waiting for a Raft snapshot›
I230608 13:37:41.853105 23740 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r130/2:‹/{Table/106/1/…-Max}›] 233  initiating a split of this range at key /Table/106/1/‹15087007› [r131] (‹14 GiB above threshold size 64 MiB›)‹; r130/1 is being probed (may or may not need a Raft snapshot); r130/4 is waiting for a Raft snapshot›
I230608 13:37:42.137711 23729 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r131/2:‹/{Table/106/1/…-Max}›] 234  initiating a split of this range at key /Table/106/1/‹15347119› [r132] (‹14 GiB above threshold size 64 MiB›)‹; r131/1 is being probed (may or may not need a Raft snapshot); r131/4 is waiting for a Raft snapshot›

# There are also occurrences of
#  "r463/4 is being probed (may or may not need a Raft snapshot); r463/1 is being probed (may or may not need a Raft snapshot)"
# and
#  "r470/1 is waiting for a Raft snapshot; r470/4 is waiting for a Raft snapshot"

# At 13:39:53, splits started happening on node 4 (before that they all were on node 3).
I230608 13:39:53.548165 26443 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r470/3:‹/{Table/106/1/…-Max}›] 177  initiating a split of this range at key /Table/106/1/‹102857749› [r484] (‹3.6 GiB above threshold size 64 MiB›)‹; r470/1 is waiting for a Raft snapshot; r470/4 is waiting for a Raft snapshot›
I230608 13:39:53.999167 26432 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r484/3:‹/{Table/106/1/…-Max}›] 179  initiating a split of this range at key /Table/106/1/‹103115860› [r485] (‹3.5 GiB above threshold size 64 MiB›)‹; r484/1 is waiting for a Raft snapshot›
I230608 13:39:54.162182 26457 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r484/3:‹/{Table/106/1/…-Max}›] 181  initiating a split of this range at key /Table/106/1/‹103115860› [r486] (‹3.5 GiB above threshold size 64 MiB›)‹; r484/1 is waiting for a Raft snapshot›
I230608 13:39:54.577882 26549 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r486/3:‹/{Table/106/1/…-Max}›] 182  initiating a split of this range at key /Table/106/1/‹103373971› [r487] (‹3.5 GiB above threshold size 64 MiB›)‹; r486/1 is waiting for a Raft snapshot›
I230608 13:39:54.951504 26604 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r487/3:‹/{Table/106/1/…-Max}›] 184  initiating a split of this range at key /Table/106/1/‹103632082› [r488] (‹3.5 GiB above threshold size 64 MiB›)‹; r487/1 is waiting for a Raft snapshot›

# All the way until the last split at 13:40:27.
I230608 13:40:27.662935 32236 kv/kvserver/replica_command.go:436 ⋮ [T1,n4,split,s4,r596/3:‹/{Table/106/1/…-Max}›] 373  initiating a split of this range at key /Table/106/1/‹131766181› [r597] (‹80 MiB above threshold size 64 MiB›)‹; r596/1 is waiting for a Raft snapshot›

@pav-kv (Collaborator) commented Jun 30, 2023

There is a large 3.6 GiB snapshot for r470 that node 4 was streaming to node 1 for 38.62s (from ~13:39:47 till 13:40:25):

I230608 13:40:25.460219 25275 13@kv/kvserver/store_snapshot.go:1591 ⋮ [T1,n4,s4,r470/3:‹/Table/106/1/102{5996…-8577…}›] 471  streamed VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 at applied index 10 to (n1,s1):1 with 3.6 GiB in 38.62s @ 95 MiB/s: kvs=29553207 rangeKVs=0, rate-limit: 512 MiB/s, queued: 0.40s
I230608 13:40:28.088780 357748 13@kv/kvserver/replica_raftstorage.go:503 ⋮ [T1,n1,s1,r470/1:{-}] 3218  applying VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10
I230608 13:40:28.103633 357748 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r470/1:‹/{Table/106/1/…-Max}›] 1060  applied VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10 (total=15ms data=3.2 GiB ingestion=6@13ms)

This was blocking all other snapshots incoming to node 1, which we see in the logs by filtering for "applied .* snapshot" messages:

# steady pace
I230608 13:39:46.429279 357687 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r213/1:‹/Table/106/1/36{265111-523222}›] 1054  applied VIA_SNAPSHOT_QUEUE snapshot 7fc2365c from (n3,s3):2 at applied index 48 (total=4ms data=29 MiB ingestion=6@2ms)
I230608 13:39:46.834671 357741 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r214/1:‹/Table/106/1/36{523222-781333}›] 1055  applied VIA_SNAPSHOT_QUEUE snapshot 3edf8a9a from (n3,s3):2 at applied index 47 (total=7ms data=29 MiB ingestion=6@4ms)

# big gap in timestamps
I230608 13:40:28.103633 357748 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r470/1:‹/{Table/106/1/…-Max}›] 1060  applied VIA_SNAPSHOT_QUEUE snapshot 6a8ead93 from (n4,s4):3 at applied index 10 (total=15ms data=3.2 GiB ingestion=6@13ms)

# steady pace again
I230608 13:40:28.570132 360929 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r216/1:‹/Table/106/1/37{039444-297555}›] 1064  applied VIA_SNAPSHOT_QUEUE snapshot 66a15d5d from (n3,s3):2 at applied index 55 (total=32ms data=29 MiB ingestion=6@17ms)
I230608 13:40:28.953381 362898 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n1,s1,r217/1:‹/Table/106/1/37{297555-555666}›] 1070  applied VIA_SNAPSHOT_QUEUE snapshot a5f938e9 from (n3,s3):2 at applied index 60 (total=6ms data=29 MiB ingestion=6@4ms)
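As a quick sanity check on those timestamps (plain arithmetic, nothing CockroachDB-specific), the gap in n1's applied-snapshot log lines is almost entirely accounted for by the 38.62s it took to stream the 3.6 GiB snapshot:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	lastSmall, _ := time.Parse("15:04:05.000000", "13:39:46.834671")  // r214 applied on n1
	bigApplied, _ := time.Parse("15:04:05.000000", "13:40:28.103633") // r470 applied on n1
	gap := bigApplied.Sub(lastSmall)
	stream := 38620 * time.Millisecond // r470 streamed in 38.62s per the sender's log

	fmt.Println("gap in applied-snapshot log lines:", gap) // ~41.3s
	fmt.Println("time spent streaming r470:        ", stream)
	fmt.Println("unaccounted:", gap-stream) // a few seconds of queueing/apply overhead
}
```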

@pav-kv (Collaborator) commented Jun 30, 2023

This r470 range is the first one that node 4 split (see the log above); before that, splits were driven by node 3. Likely there was a shift in leadership/leaseholding.

@pav-kv (Collaborator) commented Jun 30, 2023

We're likely seeing the cascading effect described in this code comment:

// split). Without this code, the split delay helper may end up
// returning early for each split in the following sequence:
//
// - r1=[a,z) has members [s1, s2]
// - r1 splits into r1=[a,b) and r2=[b,z)
// - s1 applies the split, r2@s1 reaches out to r2@s2 (MsgApp)
// - r2@s2 (not initialized yet) drops the MsgApp thanks to `maybeDropMsgApp`
// - r2@s1 marks r2@s1 as probing, will only contact again in 1s
// - r2 splits again into r2=[b,c), r3=[c,z)
// - r2@s1 applies the split, r3@s1 reaches out to r3@s2 (which is not initialized)
// - r3@s2 drops MsgApp due to `maybeDropMsgApp`
// - r3@s1 marks r3@s2 as probing, will only contact again in 1s
// - ...
// - r24@s1 splits into r25=[x,y) and r26=[y,z)
// - r24@s1 reaches out to r24@s2 (not inited and drops MsgApp)
//
// Note that every step here except the fourth one is almost guaranteed.
// Once an MsgApp has been dropped, the next split is also going to have
// the same behavior, since the dropped MsgApp prevents the next split
// from applying on that follower in a timely manner. The issue thus
// becomes self-sustaining.
//
// At some point during this cascade, s2 will likely apply the first split
// trigger on its replica of r1=[a,z), which will initialize r2@s2. However,
// since r2@s1 has already marked r2@s2 as probing, it won't contact it, on
// average, for another 500ms. When it does, it will append the next split to
// the log which can then be applied, but then there is another ~500ms wait
// until r3@s2 will be caught up by r3@s1 to learn about the next split. This
// means that on average, split N is delayed by ~N*500ms. `maybeDropMsgApp` on
// deeply nested ranges on s2 will eventually give up and this will lead to,
// roughly, snapshots being requested across most of the ranges, but none
// of these snapshots can apply because the keyspace is always occupied by
// one of the descendants of the initial range (however far the splits have
// made it on s2). On top of log spam and wasted work, this prevents the
// Raft snapshot queue from doing useful work that may also be necessary.
//
// The main contribution of the helper to avoiding this cascade is to wait
// for the replicas of the right-hand side to be initialized. This breaks
// the above history because a split will only commence once all prior
// splits in the chain have applied on all members.

Notably, though, splits driven by the split queue are not "delayable" (see here and the other 2 invocations below it), i.e. this workaround is not active.

We should probably enable these delays.
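A back-of-the-envelope sketch of the scale of the cascade, assuming roughly 500 sequential size-based splits (r74 through r597 in this run) and the ~N*500ms average delay model from the quoted comment:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const splits = 500                           // roughly r74..r597 in this run (assumption)
	const perSplitDelay = 500 * time.Millisecond // ~N*500ms model from the split delay comment

	// Without the delay helper, split N lags by ~N*500ms on the slow follower,
	// so the last split is minutes behind, and each lagging split tends to
	// turn into yet another Raft snapshot request.
	lastSplitLag := time.Duration(splits) * perSplitDelay
	fmt.Println("approximate lag of the last split on the slow follower:", lastSplitLag) // ~4m10s
}
```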

@pav-kv pav-kv removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 30, 2023
@tbg (Member) commented Jul 12, 2023

Synopsis:

  • raft snapshot when there shouldn't be one (kvserver: avoid unnecessary raft snapshots #97971)
  • no protection against performing a replication change that causes unavailability when followers are behind (we ended up rebalancing into a joint config where one side of the quorum had two followers needing snapshots; i.e., we had a follower that needed a snapshot, but didn't realize that a rebalance was now not a good idea, I think)
  • manual splits are not "delayable", so they tend to cause more snapshots, and these snapshots are all dependent on each other when you do many splits, so they go through only in a specific order, which causes even more delay

@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Jul 18, 2023
@blathers-crl (bot) commented Jul 18, 2023

cc @cockroachdb/replication

@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg assigned tbg and unassigned pav-kv Jul 24, 2023
@tbg (Member) commented Jul 25, 2023

Going through the logs again just to button things up.

The initial large range is r64.

First, r64 gets hit by the randomized backup in the roachtest. Oops! The range is 16 GiB at this point. We apply a 5min timeout to this export somewhere. So if you ever have a large range like this, backups are not going to work. [1]

Details
W230608 13:24:12.697833 142988 kv/kvclient/kvcoord/dist_sender.go:1710 ⋮ [T1,n1,job=872152117891235841] 614  slow range RPC: have been waiting 533.30s (1 attempts) for RPC Export [/Table/106/1,/Table/106/2), [wait-policy: Error] to r64:‹/{Table/106-Max}› [(n1,s1):1, next=2, gen=1]; resp: ‹(err: context deadline exceeded)›
W230608 13:24:12.698705 142988 kv/kvclient/kvcoord/dist_sender.go:1720 ⋮ [T1,n1,job=872152117891235841] 615  slow RPC response: slow RPC finished after 533.36s (1 attempts)
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616  failed export request for span ‹/Table/106/{1-2}›
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 + trace:
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 +     0.000ms      0.000ms    === operation:backupccl.ExportRequest
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 +     0.000ms      0.000ms    [dist sender send: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 +     0.000ms      0.000ms    [/cockroach.roachpb.Internal/Batch: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 +     0.000ms      0.000ms    [evalExport: {count: 1, duration 8m53s}]
E230608 13:24:12.698774 142988 ccl/backupccl/backup_processor.go:498 ⋮ [T1,n1,job=872152117891235841] 616 +     0.000ms      0.000ms    [storage.MVCCExportToSST: {count: 1, duration 8m53s}]
W230608 13:24:12.706125 142984 ccl/backupccl/backup_processor.go:372 ⋮ [T1,n1,job=872152117891235841] 617  failed to flush SST sink: context canceled
W230608 13:24:12.722142 142984 ccl/backupccl/backup_processor.go:662 ⋮ [T1,n1,job=872152117891235841] 618  failed to close SST sink: ‹context canceled›
E230608 13:24:12.779877 142171 jobs/registry.go:1599 ⋮ [T1,n1] 619  BACKUP job 872152117891235841: stepping through state reverting with error: failed to run backup: exporting 16 ranges: export request timeout: operation ‹"ExportRequest for span /Table/106/{1-2}"› timed out after 8m53.36s (given timeout 5m0s): context deadline exceeded

First, r64 successfully adds a replica on n3 (the snapshot takes ~260s and streams at ~63MB/s, i.e. right at the max configured rate). Next, n1 adds a learner, but then that learner gets removed by n3, which apparently now holds the lease. We can infer that there's likely still a LEARNER snapshot inflight from n1->n2 in what follows below. n3 then tries to add its own LEARNER to n2 instead (replID 4):

I230608 13:32:39.491927 11658 13@kv/kvserver/replica_raft.go:387 ⋮ [T1,n3,s3,r64/2:‹/{Table/106-Max}›] 26 proposing SIMPLE(l4) [(n2,s2):4LEARNER]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):4LEARNER] next=5

We see in the logs that this overlaps with another attempt at exporting r64 because ~2 minutes later, we see this log line:

W230608 13:34:00.041299 275528 kv/kvclient/kvcoord/dist_sender.go:1710 ⋮ [T1,n1,f‹2866f611›,job=872155068781232130,distsql.gateway=2,distsql.appname=‹$ internal-resume-job-872155068781232130›] 792 slow range RPC: have been waiting 230.44s (1 attempts) for RPC Export [/Table/106/1,/Table/106/2), [wait-policy: Error] to r64:‹/{Table/106-Max}› [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=2]; resp: ‹(err: ), *kvpb.ExportResponse›

Less than a minute later, at around 13:34:43, we see n3 roll back the learner because sending the snapshot failed:

giving up during snapshot reservation due to "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded

This is congruent with our earlier observation that there might still be a snapshot n1->n2 inflight - that snapshot would delay n3's.

Finally, at 13:36:49, n1's snapshot fails with "Raft group deleted"; makes sense since that Replica is long stale.

There is some more upreplication, and it looks like the lease transfers (?) again from n3 to n1, and they step on each other's toes more:

I230608 13:39:43.315732 20329 kv/kvserver/replica_command.go:1109 ⋮ [T1,n3,replicate,s3,r64/2:‹/Table/106{-/1/2606…}›] 586 could not promote [n5,s5] to voter, rolling back: change replicas of r64 failed: descriptor changed: [expected] r64:‹/{Table/106-Max}› [(n1,s1):1, (n3,s3):2, (n4,s4):5, (n5,s5):6LEARNER, next=7, gen=10] != [actual] r64:‹/Table/106{-/1/260623}› [(n1,s1):1, (n3,s3):2, (n4,s4):5, (n5,s5):6LEARNER, next=7, gen=11]

In particular, we hit this helpful logging:

I230608 13:37:24.016304 21504 kv/kvserver/replica_command.go:436 ⋮ [T1,n3,split,s3,r64/2:‹/{Table/106-Max}›] 162 initiating a split of this range at key /Table/106/1/‹260623› [r74] (‹16 GiB above threshold size 64 MiB›)‹; r64/6 is being probed (may or may not need a Raft snapshot)›
...
I230608 13:39:43.453866 20487 kv/kvserver/replica_gc_queue.go:307 ⋮ [T1,n5,replicaGC,s5,r64/6:‹/{Table/106-Max}›] 189 removing replica with pending split; will incur Raft snapshot for right hand side

Thanks to 2018 Tobias, we have a helpful comment. [2] Basically, if we split a range and then remove a replica that hasn't applied the split yet, removing it wipes out the would-be right-hand side of the split. In other words, there's now a replica in the RHS that needs a snapshot - in our case, r74.

The problem is now that we're carrying out tons of splits due to range size, and they're not delayable:

if maxBytes > 0 && size > maxBytes {
	if _, err := r.adminSplitWithDescriptor(
		ctx,
		kvpb.AdminSplitRequest{},
		desc,
		false, /* delayable */
		fmt.Sprintf("%s above threshold size %s", humanizeutil.IBytes(size), humanizeutil.IBytes(maxBytes)),
		false, /* findFirstSafeSplitKey */
	); err != nil {
		return false, err
	}
	sq.metrics.SizeBasedSplitCount.Inc(1)
	return true, nil
}

Details

The way splits work is that they carve chunks off the LHS. So r74 will be split again, its RHS will be split again, etc., and each such split causes another snapshot. It would have helped here if these splits had been delayable, but since they delay by at most ~20s and the snapshots take "minutes", it likely would've slowed the problem down but not eliminated it. [3][4]

One might try to argue that aggressively replicaGC'ing r64/6 was the issue. But note that r64/6 had been removed from the range, so it wouldn't necessarily even be caught up to the split in the first place, but it will block snapshots to its keyspace. So it does need to get removed ASAP. What could have been delayed is removing r64/6 from the descriptor; but how was the replicate queue on another node supposed to know that that replica was still pending a split? It could in theory have known - the split and the removal of the learner are both driven by the leaseholder, so, similarly to how we coordinate snapshots between the replicate and raft snapshot queues, the replicate queue could back off if there's an ongoing split attempt (and the split queue would have to wait until all followers have applied the split). I am not advocating for actually doing this, since all of these dependencies between queues are a real problem in and of themselves. But there is a compelling argument that the independence of these queues is a problem.

At this point we're flooding the raft snapshot queue with requests that it will be very slow to process. They are also following a pathological pattern due to the sequence of splits: each inflight snapshot will, by the time it arrives, create a replica that requires a ton more splits in order to be up to date, and which will block all but the snapshot for the very first split in that wide range. This means most of the time the raft snapshot queue will just cycle through replicas, getting rejections for all but ~1 replica, loop around, etc., each time wasting the 10min scanner interval. In other words, the backlog will clear out very slowly.
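To put a rough number on how slowly such a backlog drains, here is a sketch under the assumptions stated above (about one useful snapshot per pass and a ~10 minute scanner interval; the real queue is more nuanced than this):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const pendingReplicas = 500 // assumed number of replicas queued for Raft snapshots
	const usefulPerPass = 1     // all but ~1 get rejected on each pass of the queue
	const scannerInterval = 10 * time.Minute

	passes := pendingReplicas / usefulPerPass
	fmt.Println("passes needed:", passes)
	fmt.Println("rough time to drain the backlog:", time.Duration(passes)*scannerInterval) // ~83h
}
```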

But we haven't lost quorum at this point. This happens only later, when we make this ill-fated conf change:

I230608 13:40:42.296384 72105 13@kv/kvserver/replica_raft.go:387 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] 1651 proposing ENTER_JOINT(r2 l2 v4) [(n2,s2):4VOTER_INCOMING], [(n3,s3):2VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1 (n3,s3):2VOTER_DEMOTING_LEARNER (n4,s4):3 (n2,s2):4VOTER_INCOMING] next=5

The config is (s1, s3, s4) && (s1, s2, s4), but when the replica circuit breaker trips a minute later, we see that the replicas on s1 and s2 both need a snapshot. s1's need for a snapshot was likely created in the way described above.

E230608 13:41:43.916584 90026 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n3,s3,r548/2:‹/Table/106/1/119{1187…-3768…}›] 841  breaker: tripped with error: replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:‹/Table/106/1/119{118742-376853}› [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {"id":"2","term":7,"vote":"2","commit":29,"lead":"2","raftState":"StateLeader","applied":29,"progress":{"1":{"match":0,"next":1,"state":"StateSnapshot"},"2":{"match":50,"next":51,"state":"StateReplicate"},"3":{"match":50,"next":52,"state":"StateReplicate"},"4":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal ConditionalPut [/Local/Range/Table/106/1/‹119118742›/‹RangeDescriptor›,/Min), [txn: 97e8d7ec]

s2 is just being promoted, and this is what causes the loss of quorum. But the promotion was carried out by the replicate queue, and didn't it just add the learner and send it a snapshot? How is s2 snapshotless now (note the Match of zero above)? It did apply it:

I230608 13:40:42.284060 34141 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r548/4:‹/Table/106/1/119{1187…-3768…}›] 303 applied INITIAL snapshot 75ebb1f1 from (n4,s4):3 at applied index 25 (total=9ms data=29 MiB ingestion=6@3ms)

The debug.zip also shows this replica present on s2.

So this appears to be a failure in the replication layer - the data is there, but Raft is sticking to StateSnapshot even though it could be replicating to the follower.

I've written elsewhere [5] about the oddities of raft snapshot handling. It's just generally a little brittle, and in particular it assumes much closer coupling between raft and the snapshots than is present in CRDB.

Pure conjecture, but possibly the following happened here (a toy sketch follows the list):

  • follower gets probed before it has the snapshot applied
  • it responds with an MsgAppResp rejection (which is not delivered yet)
  • snapshot applies, follower's emitted MsgApp gets dropped
  • leader calls ReportSnapshot, transitions out of StateSnapshot (follower is now in StateProbe)
  • MsgAppResp rejection arrives, moves follower back to StateSnapshot before the first probe is sent (if a probe got sent, things would resolve) [5]
  • we sit here forever and usually this doesn't get noticed, but here it causes a faux loss of quorum so we do notice.
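A toy walk-through of that interleaving (purely illustrative; the state names mirror raft's follower-tracking states, but the transitions below only encode the conjecture, not actual etcd/raft code):

```go
package main

import "fmt"

// Simplified view of the leader's per-follower state, mirroring raft's names.
type followerState int

const (
	StateProbe followerState = iota
	StateReplicate
	StateSnapshot
)

func (s followerState) String() string {
	return [...]string{"StateProbe", "StateReplicate", "StateSnapshot"}[s]
}

func main() {
	// Leader's view of the follower that was just sent the INITIAL snapshot.
	state := StateSnapshot

	// 1. The follower is probed before the snapshot applies and emits an
	//    MsgAppResp rejection; delivery of that rejection is delayed.
	// 2. The snapshot applies on the follower (the data is now there).
	// 3. ReportSnapshot(success) reaches the leader: the follower leaves
	//    StateSnapshot and becomes probeable.
	state = StateProbe
	fmt.Println("after ReportSnapshot:", state)

	// 4. The stale MsgAppResp rejection now arrives. Per the conjecture, the
	//    leader concludes the follower needs (another) snapshot and moves it
	//    back before the first probe is ever sent.
	state = StateSnapshot
	fmt.Println("after stale rejection:", state)

	// With no new snapshot forthcoming, the follower sits in StateSnapshot
	// indefinitely: the "faux" loss of quorum observed on r548.
}
```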

Footnotes

  1. Filed https://github.com/cockroachdb/cockroach/issues/107519.

  2. https://github.com/cockroachdb/cockroach/blob/a851fe95f9715fd95fe25907e770514a7157d802/pkg/kv/kvserver/replica_gc_queue.go#L283-L308

  3. https://github.com/cockroachdb/cockroach/blob/de9b2b2879ec37f427f76d375fad920dd6a2538c/pkg/base/config.go#L659-L666

  4. Besides, if the range is very large that might cause problems in itself, like stalling writes, and it may not be wise to delay splits.

  5. I think it would move it back, but need to check.

@tbg (Member) commented Jul 25, 2023

Pure conjecture, but possible the following happened here:

I think this might be fixed as of https://github.com/cockroachdb/cockroach/pull/106793/files#diff-1e9884dfbcc4eb6d90a152e1023fd2a84dcda68f04a877be8a01c16df0fe0331R2888. When the MsgAppResp is received by the leader, this ought to be almost enough, and that PR achieves that. The one remaining edge case is the one fixed by etcd-io/raft#84, so once we have that one in as well (tracked in #106813), I think we can consider closing this test failure issue.

@tbg (Member) commented Jul 25, 2023

Filed #107520 to track the general poor interplay between splits and snapshots as triggered by replica movement in this test. It's something we (collectively) know about and this has been "a thing" for years, but it's hard to track since it lives in the spaces in between components. That issue will have to do.

@tbg tbg changed the title roachtest: splits/largerange/size=32GiB,nodes=6 failed roachtest: splits/largerange/size=32GiB,nodes=6 failed [needs #106813] Jul 25, 2023
@tbg tbg changed the title roachtest: splits/largerange/size=32GiB,nodes=6 failed [needs #106813] roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] Jul 25, 2023
@dt (Member) commented Jul 25, 2023

We apply a 5min timeout to this export somewhere. So if you ever have a large range like this, backups are not going to work

I don't follow this assertion. Individual ExportRequests, which are what have the 5min timeout, are paginated to a fixed 16 MiB, regardless of the underlying range size, so I don't think there is an automatic assumption that a bigger range should expect to hit this timeout. The only mechanism that comes to mind right now would be very few keys in the range above the requested timestamp for an incremental ExportRequest, so the range iterates extensively without finding keys that match the timestamp predicate to return, meaning it doesn't hit the 16 MiB limit. We can discuss that more over in #107519 if we think that's what's up, but I certainly don't think we expect a large range to need a longer per-req timeout.

@tbg (Member) commented Jul 25, 2023

You just know more than I do; feel free to re-title #107519 or close it if this doesn't capture anything new! I only skimmed the surface when looking at the export and didn't dig deeper into how it works under the hood.

@tbg tbg assigned pav-kv and unassigned tbg Jul 25, 2023
@tbg (Member) commented Jul 26, 2023

Closing; see #104588 (comment) and #106813. This is on the assumption that #106793 practically removes the erroneous StateSnapshot-despite-having-received-a-snapshot behavior.
