Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: make poisoned latch errors less confusing #105798

Closed
erikgrinaker opened this issue Jun 29, 2023 · 3 comments · Fixed by #105816
Closed

kvserver: make poisoned latch errors less confusing #105798

erikgrinaker opened this issue Jun 29, 2023 · 3 comments · Fixed by #105816
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jun 29, 2023

In #104709 (comment), we see a tripped circuit breaker (the one described in #105797) poisoning a meta2 latch, seemingly outside of the replica's range bounds:

I230627 12:51:40.785010 124952 kv/txn.go:878 ⋮ [T1,n6] 30843 async rollback failed: replica unavailable: (n6,s6):7 unable to serve request to r1039:‹/Table/106/1/65{32837936193842192-51266251951793792}› [(n3,s3):13, (n6,s6):7, (n8,s8):11, (n4,s4):14, (n9,s9):5, next=15, gen=92, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687870297.597281357,0 (2023-06-27 12:51:37); raft status: {"id":"7","term":12,"vote":"e","commit":135,"lead":"e","raftState":"StateFollower","applied":135,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Meta2/Table/106/1/6551266251951793792›@1687870207.598066812,0: "change-replica" meta={id=494e1e1e key=/Local/Range/Table/106/1/‹6532837936193842192›/‹RangeDescriptor› iso=Serializable pri=0.06378614 epo=0 ts=1687870207.598066812,0 min=1687870207.598066812,0 seq=2} lock=true stat=PENDING rts=1687870207.598066812,0 wto=false gul=1687870208.098066812,0
I230627 12:51:43.285782 124887 kv/txn.go:878 ⋮ [T1,n6] 30848 async rollback failed: replica unavailable: (n6,s6):6 unable to serve request to r307:‹/Table/106/1/-68{27690988321067808-09262672563116208}› [(n9,s9):7, (n7,s7):8, (n4,s4):4, (n3,s3):11, (n6,s6):6, next=12, gen=52, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687870300.196827941,0 (2023-06-27 12:51:40); raft status: {"id":"6","term":12,"vote":"b","commit":127,"lead":"b","raftState":"StateFollower","applied":127,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Meta2/Table/106/1/-6809262672563116208›@1687870210.148120953,0: "change-replica" meta={id=321479c5 key=/Local/Range/Table/106/1/‹-6827690988321067808›/‹RangeDescriptor› iso=Serializable pri=0.02487111 epo=0 ts=1687870210.148120953,0 min=1687870210.148120953,0 seq=2} lock=true stat=PENDING rts=1687870210.148120953,0 wto=false gul=1687870210.648120953,0
I230627 12:51:44.285303 124854 kv/txn.go:878 ⋮ [T1,n6] 30855 async rollback failed: replica unavailable: (n6,s6):6 unable to serve request to r885:‹/Table/106/1/53{16569096169036592-34997411926988192}› [(n8,s8):11, (n4,s4):9, (n7,s7):12, (n5,s5):4, (n6,s6):6, next=13, gen=68, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687870301.210269924,0 (2023-06-27 12:51:41); raft status: {"id":"6","term":13,"vote":"9","commit":122,"lead":"9","raftState":"StateFollower","applied":122,"progress":{},"leadtransferee":"0"}: encountered poisoned latch ‹/Meta2/Table/106/1/5334997411926988192›@1687870210.000120522,0: "change-replica" meta={id=336523e8 key=/Local/Range/Table/106/1/‹5316569096169036592›/‹RangeDescriptor› iso=Serializable pri=0.00570728 epo=0 ts=1687870210.000120522,0 min=1687870210.000120522,0 seq=2} lock=true stat=PENDING rts=1687870210.000120522,0 wto=false gul=1687870210.500120522,0
[...]
I230627 13:00:07.222273 366536 13@kv/kvserver/replicate_queue.go:751 ⋮ [T1,n6,replicate,s6,r1039/7:‹/Table/106/1/65{32837…-51266…}›] 5349 error processing replica: [n6,s6,r&{1039/7:‹/Table/106/1/65{32837…-51266…}› ‹1039/7:/Table/106/1/65{32837…-51266…}› ‹1039/7›}]: unable to transfer lease to s3: replica unavailable: (n6,s6):7 unable to serve request to r1039:‹/Table/106/1/65{32837936193842192-51266251951793792}› [(n3,s3):13, (n6,s6):7, (n8,s8):11, (n4,s4):14, (n9,s9):5, next=15, gen=92, sticky=9223372036.854775807,2147483647]: closed timestamp: 1687870671.147739129,0 (2023-06-27 12:57:51); raft status: {"id":"7","term":13,"vote":"7","commit":623,"lead":"7","raftState":"StateLeader","applied":623,"progress":{"7":{"match":623,"next":624,"state":"StateReplicate"},"b":{"match":623,"next":624,"state":"StateReplicate"},"d":{"match":623,"next":624,"state":"StateReplicate"},"e":{"match":623,"next":624,"state":"StateReplicate"},"5":{"match":623,"next":624,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch ‹/Meta2/Table/106/1/6551266251951793792›@1687870207.598066812,0

Jira issue: CRDB-29203

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Jun 29, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 29, 2023

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor Author

I think this is because EndTxn declares latches for all locks that it's holding:

// The spans may extend beyond this Range, but it's ok for the
// purpose of acquiring latches. The parts in our Range will
// be resolved eagerly.
for _, span := range et.LockSpans {
latchSpans.AddMVCC(spanset.SpanReadWrite, span, minTxnTS)
}

That seems pretty excessive, and likely causes some additional overhead. I'll try to limit them to the range's spans.

@erikgrinaker erikgrinaker self-assigned this Jun 29, 2023
@erikgrinaker
Copy link
Contributor Author

We saw something similar over on #104588, where we reported a poisoned latch on r548 as coming from r1. Probably because intent resolution on r1 tried to accessed the abort span of a txn anchored on r548, whose latches were poisoned.

I started out by trying to reduce the EndTxn latch spans, which can't easily be done because we don't have access to the range end key during latch declaration (there may be a concurrent range merge which modifies it). We could truncate it to the start key, but this would be incomplete.

I think the better solution here is to improve the error handling of poisoned latches to include the originating range ID where the latch was poisoned, instead of the range where the error is processed. Repurposing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant