kvserver: don't repropose probes #102956

tbg · 2023-05-09T14:56:52Z

We saw this likely contribute to OOMs during log application in
https://github.com/cockroachlabs/support/issues/2287.

Touches #98563.

Epic: CRDB-25503
Release note: None

blathers-crl · 2023-05-09T14:56:56Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg · 2023-05-10T10:50:05Z

With the change as is, a probe gets an ambiguous result after 3s, so a new probe can start after 3s instead of after 60s. That avoids the quadratic build-up in the log, but we get 20x more in-memory proposals and the log grows at 20x the rate. I'm not sure this is a great trade-off.
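To make the trade-off concrete, here is a toy model of log growth during an outage. It is only a sketch: `simulate` and its parameters are hypothetical names, and it assumes that nothing applies during the outage, that every pending probe is appended again on each reproposal tick, and that one new probe starts per probe interval.

```go
package main

import "fmt"

// simulate is a toy model, not kvserver code. It steps through an outage one
// second at a time and returns the number of probes held in memory and the
// number of log entries appended. reproposeEverySec == 0 disables reproposals.
func simulate(outageSec, probeEverySec, reproposeEverySec int) (pending, entries int) {
	for t := 0; t < outageSec; t++ {
		if reproposeEverySec > 0 && t > 0 && t%reproposeEverySec == 0 {
			entries += pending // every pending probe is appended again
		}
		if t%probeEverySec == 0 {
			pending++ // a new probe enters the in-memory proposals
			entries++ // and is appended to the log once
		}
	}
	return pending, entries
}

func main() {
	for _, outage := range []int{600, 1200} {
		// Assumed old behavior: a probe every 60s, reproposed every 3s.
		_, quadratic := simulate(outage, 60, 3)
		// This change: no reproposals, but a new probe every 3s.
		pending, linear := simulate(outage, 3, 0)
		// Reference point: a probe every 60s, never reproposed.
		_, baseline := simulate(outage, 60, 0)
		fmt.Printf("outage=%ds reproposing=%d no-repropose=%d (pending=%d) baseline=%d\n",
			outage, quadratic, linear, pending, baseline)
	}
}
```

In this model, doubling the outage roughly quadruples the entries in the reproposing case, while the no-repropose case stays linear at 60s/3s = 20 times the baseline.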

We should probably repropose a probe exactly once, and give it an ambiguous result after 60s.

We should also be able to avoid reproposing in the common case. If the local raft instance is the leader (as it mostly is) and has unapplied log entries, we might as well wait for those to be applied before reproposing. Also, commands that were proposed to a local leader and not dropped (i.e. if we handled ErrProposalDropped) are known to be in the log and could only be replaced if local leadership were lost, for which we get an event.
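The conditions above could be folded into a single predicate, roughly as follows. This is a sketch under stated assumptions: `raftState`, `proposalState`, and `shouldRepropose` are hypothetical names, and the fields stand in for information the replica would have to track (including the leadership-loss event).

```go
package main

import "fmt"

// raftState is a hypothetical summary of the local raft instance.
type raftState struct {
	isLeader         bool
	unappliedEntries int
}

// proposalState is a hypothetical record of what is known about one command.
type proposalState struct {
	proposedToLocalLeader bool // handed to raft while this replica was leader
	dropped               bool // raft reported ErrProposalDropped
	leadershipLostSince   bool // a leadership-loss event arrived after proposing
}

// shouldRepropose reports whether reproposing is worthwhile under the
// reasoning in this comment. It is deliberately conservative: when nothing is
// known, it falls back to reproposing.
func shouldRepropose(r raftState, p proposalState) bool {
	if !r.isLeader {
		return true // no local knowledge of the log; keep today's behavior
	}
	if r.unappliedEntries > 0 {
		return false // wait for the backlog to apply first
	}
	if p.proposedToLocalLeader && !p.dropped && !p.leadershipLostSince {
		return false // known to be in the log and not replaceable
	}
	return true
}

func main() {
	leader := raftState{isLeader: true}
	inLog := proposalState{proposedToLocalLeader: true}
	dropped := proposalState{proposedToLocalLeader: true, dropped: true}
	fmt.Println(shouldRepropose(leader, inLog))
	fmt.Println(shouldRepropose(leader, dropped))
	fmt.Println(shouldRepropose(raftState{isLeader: true, unappliedEntries: 5}, dropped))
}
```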

Long story short, what we're doing now is extremely inefficient. Log build-up during outages is totally avoidable.

In the same vein, we keep commands in the command map (r.mu.proposals) even after the callers have given up. It is true that we need to keep the latches in place until we have proven that the command can no longer apply. But here too we could be a lot more effective: instead of reproposing the exact command multiple times (and doing so for possibly many in-flight commands), we could send a single probe. This would cause more spurious failures when the commands never actually made it into the log, so we would need much better tracking of such cases to avoid flakes during the early life of ranges (while elections are happening, etc).
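One possible reading of the single-probe idea, as a sketch. The `tracker` type and its methods are hypothetical. The key assumption, which the real system would have to guarantee, is that once the probe applies, no command proposed before it can still apply, so its latches can be released.

```go
package main

import (
	"fmt"
	"sort"
)

// tracker is a hypothetical stand-in for the command map. Sequence numbers
// record proposal order; pending holds commands whose latches are still held.
type tracker struct {
	nextSeq int
	pending map[int]string // seq -> command ID
}

func newTracker() *tracker { return &tracker{pending: map[int]string{}} }

// propose registers a command and returns its sequence number.
func (t *tracker) propose(id string) int {
	t.nextSeq++
	t.pending[t.nextSeq] = id
	return t.nextSeq
}

// proposeProbe reserves a sequence number for one probe, sent in place of
// reproposing every pending command.
func (t *tracker) proposeProbe() int {
	t.nextSeq++
	return t.nextSeq
}

// applied marks a command as applied, which releases its latches.
func (t *tracker) applied(seq int) { delete(t.pending, seq) }

// probeApplied releases every still-pending command proposed before the
// probe and returns their IDs. Commands that never reached the log show up
// here too, which is the source of the spurious failures mentioned above.
func (t *tracker) probeApplied(probeSeq int) []string {
	var released []string
	for seq, id := range t.pending {
		if seq < probeSeq {
			released = append(released, id)
			delete(t.pending, seq)
		}
	}
	sort.Strings(released)
	return released
}

func main() {
	t := newTracker()
	a := t.propose("cmd-a")
	t.propose("cmd-b")
	probe := t.proposeProbe()
	t.propose("cmd-c")
	t.applied(a)
	fmt.Println(t.probeApplied(probe), len(t.pending))
}
```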

But all of this can be done, if we are in the mood for a larger rework.

tbg · 2023-07-03T13:22:44Z

#105896 goes the other way and removes the timeout on probes to avoid the quadratic build-up. We can still discuss optimizing our reproposal behavior, but it's independent of probes at this point.

kvserver: don't repropose probes

8c9c44e


tbg mentioned this pull request May 10, 2023

storage: Handle raft.ErrProposalDropped #21849

Open

erikgrinaker mentioned this pull request May 10, 2023

kvserver: distribute COCKROACH_SCHEDULER_CONCURRENCY across stores #102859

Merged

erikgrinaker mentioned this pull request May 29, 2023

kvserver: avoid quadratic growth of raft log under unavailability #103908

Closed

tbg closed this Jul 3, 2023

tbg deleted the no-repropose-probe branch July 3, 2023 13:22
