kvserver: fix 'observed raft log position' assertion #107412
Conversation
Fixes cockroachdb#107336. Fixes cockroachdb#106123. Fixes cockroachdb#107156. Fixes cockroachdb#106589. It's possible to hit this assertion under --stress --race when the proposing replica is starved enough for raft ticks that it loses leadership right as it steps proposals through raft. We're relying on undocumented API semantics in the etcd raft library, whereby it mutates stepped entries with the term+index they'll end up at. But that only applies when the entries are stepped through a leader. Simply relax this assertion instead. Release note: None
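For intuition only, here is a minimal Go sketch of what "relaxing" such an assertion looks like. The names (`logPosition`, `checkObservedLogPosition`, `isLeader`) are hypothetical and not the actual kvserver code; the point is that the observed term+index is only guaranteed to match when the entries were stepped through a leader, so a mismatch is tolerated otherwise.

```go
package main

import "fmt"

// logPosition is a hypothetical (term, index) pair used only for this sketch.
type logPosition struct {
	term, index uint64
}

// checkObservedLogPosition sketches the relaxed assertion: the observed
// position must match the expected one only while the proposer is still the
// raft leader, since etcd/raft stamps entries with their final term+index
// only on the leader's step path.
func checkObservedLogPosition(isLeader bool, observed, expected logPosition) error {
	if observed == expected {
		return nil
	}
	if !isLeader {
		// Leadership was lost mid-proposal; the entries were never stamped
		// by a leader, so the mismatch is expected rather than a bug.
		return nil
	}
	return fmt.Errorf("observed raft log position %+v != expected %+v", observed, expected)
}

func main() {
	// Replica lost leadership: the stale observation is tolerated.
	fmt.Println(checkObservedLogPosition(false, logPosition{}, logPosition{term: 5, index: 42}))
}
```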
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
can you also get a quick look from someone in repl-team? I have limited understanding of etcd/raft implementation details.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
I think this is fine, but I'll have a closer look tomorrow.
This seems OK at a glance.
This means that AC is effectively disabled when the leader and leaseholder aren't colocated. That should be rare enough that we can disregard it, at least initially.
How do we handle leader changes and such here? Let's say the leader proposes 1000 entries, and deducts tokens for them, but is unable to commit the entries. A new leader is elected which replaces the old leader's uncommitted tail. Will we return all of the tokens in that case? Otherwise, if the old leader reacquires leadership, can it deadlock because it will never commit the log entries AC is waiting for? Conversely, if we reset them, will flapping leadership effectively reset AC? We expect to see such flapping precisely under overload, if leaders are unable to heartbeat in time.
Thanks for looking! bors r+
Yes. We’ll err on the side of over-admission to avoid token leaks/write stalls.
Yes, for IO/flow tokens, not for CPU slots. I suspect CPU overload comes into play more when we're unable to tick raft groups in time. Perhaps also if IO latencies are severely degraded (>10ms p99s under IOPS/bandwidth saturation), but AC doesn't monitor IO latencies directly today.
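To make the "err on the side of over-admission" point concrete, a hypothetical Go sketch (not the actual flow-control code; `tokenTracker`, `deduct`, and `onLeadershipLostOrTailReplaced` are made-up names): when leadership is lost or the uncommitted tail is replaced, all outstanding deductions are returned so tokens can't leak and stall writes.

```go
package main

import "fmt"

// tokenTracker is a hypothetical sketch of per-proposal flow-token tracking.
type tokenTracker struct {
	available int64            // tokens left in the bucket
	deducted  map[uint64]int64 // raft index -> tokens deducted for that entry
}

func (t *tokenTracker) deduct(index uint64, tokens int64) {
	t.available -= tokens
	t.deducted[index] = tokens
}

// onLeadershipLostOrTailReplaced returns every outstanding deduction. Erring
// on the side of over-admission avoids token leaks (and the write stalls they
// would cause) if the old uncommitted tail never commits.
func (t *tokenTracker) onLeadershipLostOrTailReplaced() {
	for idx, tokens := range t.deducted {
		t.available += tokens
		delete(t.deducted, idx)
	}
}

func main() {
	tr := &tokenTracker{available: 1 << 20, deducted: map[uint64]int64{}}
	tr.deduct(100, 4096)
	tr.deduct(101, 8192)
	tr.onLeadershipLostOrTailReplaced()
	fmt.Println(tr.available) // back to 1048576: nothing leaked
}
```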
Build succeeded.