
kvserver: more aggressive reproposals for lease requests #104024

Closed
erikgrinaker opened this issue May 28, 2023 · 8 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

erikgrinaker (Contributor) commented May 28, 2023

In #98124 we saw that lease requests would sometimes get reproposed, typically due to leader changes and often following range splits. Since lease requests are latency-sensitive, especially with expiration lease extensions, we should consider more aggressively reproposing them (e.g. after 1 second).

Jira issue: CRDB-28319

Epic CRDB-25199
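As a sketch of the idea (not CockroachDB's actual implementation), a time-based check like the following could re-submit pending lease requests after roughly one second, while ordinary proposals keep the usual multi-second, tick-based threshold. All type and field names here are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// proposal is a minimal stand-in for a pending Raft proposal. The real
// kvserver proposal type carries far more state; isLeaseRequest and
// proposedAt are hypothetical fields used only for this sketch.
type proposal struct {
	id             string
	isLeaseRequest bool
	proposedAt     time.Time
}

// Hypothetical thresholds: lease requests get the aggressive ~1s reproposal
// suggested above, everything else keeps a multi-second, tick-like delay.
const (
	leaseReproposalDelay   = 1 * time.Second
	defaultReproposalDelay = 3 * time.Second
)

// shouldRepropose reports whether a pending proposal has waited long enough
// to be re-submitted to Raft, using the shorter threshold for lease requests.
func shouldRepropose(p proposal, now time.Time) bool {
	delay := defaultReproposalDelay
	if p.isLeaseRequest {
		delay = leaseReproposalDelay
	}
	return now.Sub(p.proposedAt) >= delay
}

func main() {
	now := time.Now()
	pending := []proposal{
		{id: "lease-req", isLeaseRequest: true, proposedAt: now.Add(-1500 * time.Millisecond)},
		{id: "write", isLeaseRequest: false, proposedAt: now.Add(-1500 * time.Millisecond)},
	}
	for _, p := range pending {
		// Only the lease request crosses its (shorter) threshold here.
		fmt.Printf("%s: repropose=%t\n", p.id, shouldRepropose(p, now))
	}
}
```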

erikgrinaker added the C-enhancement, A-kv-replication, and T-kv-replication labels on May 28, 2023
blathers-crl bot commented May 28, 2023

cc @cockroachdb/replication

erikgrinaker (Contributor, Author) commented

Here's an example trace:

     0.000ms      0.000ms    === operation:request-lease-async _verbose:‹1› node:‹1› store:‹1› range:‹285/1:/Table/1{09/1/728…-10/1}›
     0.053ms      0.053ms    event:‹kv/kvserver/replica_send.go:185 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] read-write path›
     0.073ms      0.020ms    event:‹kv/kvserver/concurrency/concurrency_manager.go:195 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] sequencing request›
     0.084ms      0.011ms    event:‹kv/kvserver/concurrency/concurrency_manager.go:239 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] not acquiring latches›
     0.100ms      0.016ms    event:‹kv/kvserver/replica_write.go:164 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] applied timestamp cache›
     0.114ms      0.014ms    event:‹kv/kvserver/replica_write.go:403 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] executing read-write batch›
     0.402ms      0.288ms    event:‹kv/kvserver/replica_evaluate.go:496 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] evaluated RequestLease command header:<kvnemesis_seq:<> key:"\365\211\367\034x\022\272\226\323\226\225\203H\000\377\200\000\377\000\377\000\377\r\010yk\000\001" > lease:<start:<wall_time:1685299706457607449 > expiration:<wall_time:1685299712457607449 > replica:<node_id:1 store_id:1 r..., txn=<nil> : resp=header:<> , err=<nil>›
     0.424ms      0.022ms    event:‹kv/kvserver/replica_proposal.go:771 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] need consensus on write batch with op count=2›
     1.232ms      0.808ms    event:‹kv/kvserver/replica_raft.go:125 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] evaluated request›
     1.278ms      0.046ms    event:‹kv/kvserver/replica_raft.go:169 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] proposing command 14043b6a9f42d6df to write 0 new keys, 0 new values, 0 new intents, write batch size=177 bytes›
     1.299ms      0.021ms    event:‹kv/kvserver/replica_raft.go:274 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquiring proposal quota›
     1.310ms      0.011ms    event:‹kv/kvserver/replica_raft.go:280 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquired proposal quota›
     1.320ms      0.010ms    event:‹kv/kvserver/replica_raft.go:460 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] proposing command›
     1.344ms      0.024ms    event:‹kv/kvserver/replica_proposal_buf.go:239 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquiring proposal buffer read lock for command 14043b6a9f42d6df›
     1.356ms      0.012ms    event:‹kv/kvserver/replica_proposal_buf.go:254 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] allocating proposal buffer index for command 14043b6a9f42d6df›
     1.372ms      0.015ms    event:‹kv/kvserver/replica_proposal_buf.go:263 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] submitting proposal for command 14043b6a9f42d6df at proposal buffer index 0›
    23.672ms     22.300ms    event:‹kv/kvserver/replica_proposal_buf.go:544 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] flushing proposal for command 14043b6a9f42d6df to Raft›
  2563.758ms   2540.087ms    event:‹kv/kvserver/replica_raft.go:1457 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] re-submitting command 31343034336236613966343264366466 (MLI 0, CT <nil>): reasonTicks›
  2563.798ms      0.040ms    event:‹kv/kvserver/replica_proposal_buf.go:544 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] flushing proposal for command 14043b6a9f42d6df to Raft›
  2954.006ms    390.208ms    event:‹kv/kvserver/replica_raft.go:1457 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] re-submitting command 31343034336236613966343264366466 (MLI 0, CT <nil>): reasonNewLeader›
  2954.233ms      0.228ms    event:‹kv/kvserver/replica_send.go:200 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] replica.Send got error: [NotLeaseHolderError] refusing to acquire lease on follower; r285: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=0 start=0,0 exp=<nil>›

erikgrinaker (Contributor, Author) commented

We should also be careful to avoid quadratic log growth under unavailability, since the Raft scheduler will attempt to reacquire leases every 3 seconds. This is similar to the buildup we've seen with replica circuit breaker probes. See #103908.
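One way to keep that buildup bounded (a minimal sketch under the assumption that a replica only ever needs one outstanding lease acquisition; the names are hypothetical, not CockroachDB's) is to coalesce repeated triggers into a single in-flight attempt:

```go
package main

import (
	"fmt"
	"sync"
)

// leaseAcquisitionGuard coalesces repeated lease-acquisition triggers into a
// single in-flight attempt, so a periodic trigger (like the ~3s scheduler
// pass mentioned above) cannot append a fresh proposal to the log each time
// while the range is unavailable.
type leaseAcquisitionGuard struct {
	mu       sync.Mutex
	inFlight bool
}

// tryStart reports whether the caller should begin a new lease acquisition.
// It returns false while a previous attempt has not yet resolved.
func (g *leaseAcquisitionGuard) tryStart() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight {
		return false
	}
	g.inFlight = true
	return true
}

// finish marks the current attempt as resolved so the next trigger can start
// a fresh acquisition.
func (g *leaseAcquisitionGuard) finish() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inFlight = false
}

func main() {
	var g leaseAcquisitionGuard
	fmt.Println(g.tryStart()) // true: the first trigger starts an acquisition
	fmt.Println(g.tryStart()) // false: later triggers are coalesced, no log growth
	g.finish()
	fmt.Println(g.tryStart()) // true again once the attempt has resolved
}
```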

tbg (Member) commented Jul 3, 2023

I've looked at the above trace and the code, and while no thought experiment survives contact with reality, at least in theory our reproposal behavior should already be near-optimal for reducing latency.

When no leader is known, we don't repropose (actually we do, but Raft just drops the proposals). The first thing a new leader does is append an empty entry, and when anyone sees an empty entry being appended, we repropose everything [1][2]. So the mechanism is reactive, and it should be really snappy in practice. (If a follower is behind on the Raft log, it might as well catch up a bit before adding more work to the system; strictly speaking it is not quite as reactive as it could be here, since it learns of the leader before seeing the empty entry.)
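For illustration, here is a minimal, self-contained Go sketch of that reactive mechanism. The entry/replica types and the refreshPendingProposals helper are hypothetical stand-ins for the proposal-refresh path the footnotes point at, not CockroachDB's actual code:

```go
package main

import "fmt"

// entry is a minimal stand-in for a Raft log entry; an empty Data field is
// the new leader's no-op entry described above.
type entry struct {
	Index uint64
	Data  []byte
}

// replica holds command IDs that were proposed while no leader was known.
type replica struct {
	pending []string
}

// refreshPendingProposals re-submits every parked proposal with the given reason.
func (r *replica) refreshPendingProposals(reason string) {
	for _, id := range r.pending {
		fmt.Printf("re-submitting command %s: %s\n", id, reason)
	}
}

// onEntryAppended reacts to a freshly appended entry: seeing the new leader's
// empty entry is the cue to flush everything that was parked.
func (r *replica) onEntryAppended(e entry) {
	if len(e.Data) == 0 {
		r.refreshPendingProposals("reasonNewLeaderOrConfChange")
	}
}

func main() {
	r := &replica{pending: []string{"14043b6a9f42d6df"}}
	r.onEntryAppended(entry{Index: 42}) // the leader's empty entry triggers the refresh
}
```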

The trace above seems likely to have originated from a slow follower. I am assuming it's in the wake of a split (can't tell from the trace). We see that the follower requests a lease, likely because the lease in the split looks expired to the follower, which makes sense if it is applying the split "much later" than it happened in real time. The proposal is flushed to Raft, and we can't tell whether it was actually sent or dropped, but either way nothing happens for a while. It is flushed again, notably with a 390ms delay, so this node is probably pretty slow, and it is then rejected because by now there is a Raft leader (note that we didn't hit this condition in the first round, so there likely wasn't a Raft leader then). It's also not unlikely that the reproposal was actually triggered by the aforementioned reasonNewLeaderOrConfChange.

All in all, there is something interesting going on here: it looks like we were perhaps waiting for a Raft election, in which case the first proposal attempt was likely dropped. (With 6b8391c there would be an entry in the trace for that, but alas.)

I'll run an experiment locally to see if I can get "bad" lease latencies by just splitting a bunch.

Footnotes

[1] https://github.com/cockroachdb/cockroach/blob/0a1f8ff21f33c2030509b5d089a49603f63412db/pkg/kv/kvserver/replica_raft.go#L1079-L1085

[2] https://github.com/cockroachdb/cockroach/blob/0a1f8ff21f33c2030509b5d089a49603f63412db/pkg/kv/kvserver/replica_raft.go#L1439-L1443

erikgrinaker (Contributor, Author) commented

Definitely possible that we hit an election tie and/or election timeout here. There may not be anything to do here if we already flush pending proposals on leader changes.

tbg (Member) commented Jul 3, 2023

On a standard 3-node GCE roachprod cluster,

roachprod ssh tobias-split:1 -- ./cockroach workload run kv --init --splits 3000 --read-percent 0 --max-rate 500

produced this satisfying pattern:

[image: lease request latency histogram]

This isn't using expiration-based leases everywhere, so I switched on kv.expiration_leases_only.enabled at 12:51 and stopped the workload. This notably changes the histogram, since we are now acquiring many more leases.

[image: latency histogram after enabling kv.expiration_leases_only.enabled]

After letting it settle in for a few minutes, I ran the split+workload again at around 12:53:30. If this had any effect, it wasn't visible from the histogram. In particular, the pMax seems about the same, at ~70ms.

[image: latency histogram during the second split+workload run]

tbg (Member) commented Jul 3, 2023

My takeaway here is that there's nothing "obviously" broken, and that the problems are likely a layer below, having to do with Raft election stalemates or missed places where we ought to call .Campaign. I also tested on master as of today, so this has all of your latest work, which may also have fixed whatever you were originally seeing.

I think we should focus our energy on improving observability for "all things below raft". I'll add metrics for reproposals and replication latencies.

In the above experiment, we did see dropped proposals, though they were all dropped on followers:

[image: graph of dropped proposals]

I assume (but can't prove, due to the lack of a replication latency metric) that these were all reactively reproposed once a leader was known, and so the delay incurred was near optimal (assuming the leader was known soon thereafter). #83262 would help prove this, so it's something we should get in.
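As a rough illustration of the observability called for above, here is a small self-contained Go sketch of what reproposal and replication-latency metrics could track. The names and shapes are hypothetical and use only the standard library, not CockroachDB's metric package:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// reproposalStats counts reproposals by reason; the reasons mirror the ones
// visible in the trace above (reasonTicks, reasonNewLeader), but the metric
// names and shapes here are invented for this sketch.
type reproposalStats struct {
	ticks     atomic.Int64
	newLeader atomic.Int64
}

func (s *reproposalStats) record(reason string) {
	switch reason {
	case "reasonTicks":
		s.ticks.Add(1)
	case "reasonNewLeader":
		s.newLeader.Add(1)
	}
}

// replicationLatency is what a per-command latency metric would observe: the
// time from handing a proposal to Raft until it applies.
func replicationLatency(proposedAt, appliedAt time.Time) time.Duration {
	return appliedAt.Sub(proposedAt)
}

func main() {
	var s reproposalStats
	s.record("reasonTicks")
	s.record("reasonNewLeader")
	fmt.Printf("reproposals: ticks=%d newLeader=%d\n", s.ticks.Load(), s.newLeader.Load())

	proposed := time.Now()
	applied := proposed.Add(23 * time.Millisecond)
	fmt.Println("replication latency:", replicationLatency(proposed, applied))
}
```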

erikgrinaker (Contributor, Author) commented

The original experiments where we saw this used a cluster with 200 ms RTTs and heavy load (a TPCC import). We probably hit a few election stalemates following election timeouts. Reproposing on leader change seems sufficient; I think we can close this for now.
