Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: fix 'observed raft log position' assertion #107412

Merged
merged 1 commit into from
Jul 24, 2023

Conversation

irfansharif
Copy link
Contributor

Fixes #107336.
Fixes #106123.
Fixes #107156.
Fixes #106589.

It's possible to hit this assertion under --stress --race when the proposing replica is starved enough for raft ticks that it loses leadership right when it steps proposals through raft. We're relying on undocumented API semantics in the etcd raft library whereby it mutates stepped entries with the term+index its to end up in. But that's only applicable if stepping through entries as a leader. Simply relax this assertion instead.

Release note: None

Fixes cockroachdb#107336.
Fixes cockroachdb#106123.
Fixes cockroachdb#107156.
Fixes cockroachdb#106589.

It's possible to hit this assertion under --stress --race when the
proposing replica is starved enough for raft ticks that it loses
leadership right when it steps proposals through raft. We're relying on
undocumented API semantics in the etcd raft library whereby it mutates
stepped entries with the term+index its to end up in. But that's only
applicable if stepping through entries as a leader. Simply relax this
assertion instead.

Release note: None
@irfansharif irfansharif requested review from sumeerbhola and a team July 22, 2023 06:03
@blathers-crl
Copy link

blathers-crl bot commented Jul 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also get a quick look from someone in repl-team? I have limited understanding of etcd/raft implementation details.

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

@erikgrinaker
Copy link
Contributor

I think this is fine, but I'll have a closer look tomorrow.

@erikgrinaker erikgrinaker self-requested a review July 23, 2023 17:25
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems OK at a glance.

This means that AC is effectively disabled when the leader and leaseholder isn't colocated. That should be rare enough that we can disregard it, at least initially.

How do we handle leader changes and such here? Let's say the leader proposes 1000 entries, and deducts tokens for them, but is unable to commit the entries. A new leader is elected which replaces the old leader's uncommitted tail. Will we return all of the tokens in that case? Otherwise, if the old leader reacquires leadership, can it deadlock because it will never commit the log entries AC is waiting for? Conversely, if we reset them, will flapping leadership effectively reset AC? We expect to see such flapping precisely under overload, if leaders are unable to heartbeat in time.

@irfansharif
Copy link
Contributor Author

Thanks for looking!

bors r+

Will we return all of the tokens in that case?

Yes. We’ll err on the side of over-admission to avoid token leaks/write stalls.

will flapping leadership effectively reset AC?

Yes, for IO/flow tokens, not for CPU slots. I suspect CPU overload comes into play more so when we’re unable to tick raft groups in time. Perhaps also if IO latencies are severely degraded (>10ms p99s under IOPS/bandwidth saturation) but AC doesn’t monitor IO latencies directly today.

@craig
Copy link
Contributor

craig bot commented Jul 24, 2023

Build succeeded:

@craig craig bot merged commit 24603c9 into cockroachdb:master Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment