
kvserver: more aggressive reproposals for lease requests #104024

Closed
erikgrinaker opened this issue May 28, 2023 · 8 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

erikgrinaker (Contributor) commented May 28, 2023

In #98124 we saw that lease requests would sometimes get reproposed, typically due to leader changes and often following range splits. Since lease requests are latency-sensitive, especially with expiration lease extensions, we should consider more aggressively reproposing them (e.g. after 1 second).

Jira issue: CRDB-28319

Epic CRDB-25199
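As a sketch of the idea (not CockroachDB's actual implementation), a time-based check like the following could re-submit pending lease requests after roughly one second, while ordinary proposals keep the usual multi-second, tick-based threshold. All type and field names here are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// proposal is a minimal stand-in for a pending Raft proposal. The real
// kvserver proposal type carries far more state; isLeaseRequest and
// proposedAt are hypothetical fields used only for this sketch.
type proposal struct {
	id             string
	isLeaseRequest bool
	proposedAt     time.Time
}

// Hypothetical thresholds: lease requests get the aggressive ~1s reproposal
// suggested above, everything else keeps a multi-second, tick-like delay.
const (
	leaseReproposalDelay   = 1 * time.Second
	defaultReproposalDelay = 3 * time.Second
)

// shouldRepropose reports whether a pending proposal has waited long enough
// to be re-submitted to Raft, using the shorter threshold for lease requests.
func shouldRepropose(p proposal, now time.Time) bool {
	delay := defaultReproposalDelay
	if p.isLeaseRequest {
		delay = leaseReproposalDelay
	}
	return now.Sub(p.proposedAt) >= delay
}

func main() {
	now := time.Now()
	pending := []proposal{
		{id: "lease-req", isLeaseRequest: true, proposedAt: now.Add(-1500 * time.Millisecond)},
		{id: "write", isLeaseRequest: false, proposedAt: now.Add(-1500 * time.Millisecond)},
	}
	for _, p := range pending {
		// Only the lease request crosses its (shorter) threshold here.
		fmt.Printf("%s: repropose=%t\n", p.id, shouldRepropose(p, now))
	}
}
```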

erikgrinaker added the C-enhancement, A-kv-replication, and T-kv-replication labels on May 28, 2023
blathers-crl bot commented May 28, 2023

cc @cockroachdb/replication

erikgrinaker (Contributor, Author) commented

Here's an example trace:

     0.000ms      0.000ms    === operation:request-lease-async _verbose:‹1› node:‹1› store:‹1› range:‹285/1:/Table/1{09/1/728…-10/1}›
     0.053ms      0.053ms    event:‹kv/kvserver/replica_send.go:185 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] read-write path›
     0.073ms      0.020ms    event:‹kv/kvserver/concurrency/concurrency_manager.go:195 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] sequencing request›
     0.084ms      0.011ms    event:‹kv/kvserver/concurrency/concurrency_manager.go:239 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] not acquiring latches›
     0.100ms      0.016ms    event:‹kv/kvserver/replica_write.go:164 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] applied timestamp cache›
     0.114ms      0.014ms    event:‹kv/kvserver/replica_write.go:403 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] executing read-write batch›
     0.402ms      0.288ms    event:‹kv/kvserver/replica_evaluate.go:496 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] evaluated RequestLease command header:<kvnemesis_seq:<> key:"\365\211\367\034x\022\272\226\323\226\225\203H\000\377\200\000\377\000\377\000\377\r\010yk\000\001" > lease:<start:<wall_time:1685299706457607449 > expiration:<wall_time:1685299712457607449 > replica:<node_id:1 store_id:1 r..., txn=<nil> : resp=header:<> , err=<nil>›
     0.424ms      0.022ms    event:‹kv/kvserver/replica_proposal.go:771 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] need consensus on write batch with op count=2›
     1.232ms      0.808ms    event:‹kv/kvserver/replica_raft.go:125 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] evaluated request›
     1.278ms      0.046ms    event:‹kv/kvserver/replica_raft.go:169 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] proposing command 14043b6a9f42d6df to write 0 new keys, 0 new values, 0 new intents, write batch size=177 bytes›
     1.299ms      0.021ms    event:‹kv/kvserver/replica_raft.go:274 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquiring proposal quota›
     1.310ms      0.011ms    event:‹kv/kvserver/replica_raft.go:280 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquired proposal quota›
     1.320ms      0.010ms    event:‹kv/kvserver/replica_raft.go:460 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] proposing command›
     1.344ms      0.024ms    event:‹kv/kvserver/replica_proposal_buf.go:239 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] acquiring proposal buffer read lock for command 14043b6a9f42d6df›
     1.356ms      0.012ms    event:‹kv/kvserver/replica_proposal_buf.go:254 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] allocating proposal buffer index for command 14043b6a9f42d6df›
     1.372ms      0.015ms    event:‹kv/kvserver/replica_proposal_buf.go:263 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] submitting proposal for command 14043b6a9f42d6df at proposal buffer index 0›
    23.672ms     22.300ms    event:‹kv/kvserver/replica_proposal_buf.go:544 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] flushing proposal for command 14043b6a9f42d6df to Raft›
  2563.758ms   2540.087ms    event:‹kv/kvserver/replica_raft.go:1457 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] re-submitting command 31343034336236613966343264366466 (MLI 0, CT <nil>): reasonTicks›
  2563.798ms      0.040ms    event:‹kv/kvserver/replica_proposal_buf.go:544 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] flushing proposal for command 14043b6a9f42d6df to Raft›
  2954.006ms    390.208ms    event:‹kv/kvserver/replica_raft.go:1457 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] re-submitting command 31343034336236613966343264366466 (MLI 0, CT <nil>): reasonNewLeader›
  2954.233ms      0.228ms    event:‹kv/kvserver/replica_send.go:200 [n1,s1,r285/1:/Table/1{09/1/728…-10/1}] replica.Send got error: [NotLeaseHolderError] refusing to acquire lease on follower; r285: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=0 start=0,0 exp=<nil>›

erikgrinaker (Contributor, Author) commented

We should also be careful to avoid quadratic log growth under unavailability, since the Raft scheduler will attempt to reacquire leases every 3 seconds. This is similar to the buildup we've seen with replica circuit breaker probes. See #103908.
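One way to keep that buildup bounded (a minimal sketch under the assumption that a replica only ever needs one outstanding lease acquisition; the names are hypothetical, not CockroachDB's) is to coalesce repeated triggers into a single in-flight attempt:

```go
package main

import (
	"fmt"
	"sync"
)

// leaseAcquisitionGuard coalesces repeated lease-acquisition triggers into a
// single in-flight attempt, so a periodic trigger (like the ~3s scheduler
// pass mentioned above) cannot append a fresh proposal to the log each time
// while the range is unavailable.
type leaseAcquisitionGuard struct {
	mu       sync.Mutex
	inFlight bool
}

// tryStart reports whether the caller should begin a new lease acquisition.
// It returns false while a previous attempt has not yet resolved.
func (g *leaseAcquisitionGuard) tryStart() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight {
		return false
	}
	g.inFlight = true
	return true
}

// finish marks the current attempt as resolved so the next trigger can start
// a fresh acquisition.
func (g *leaseAcquisitionGuard) finish() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inFlight = false
}

func main() {
	var g leaseAcquisitionGuard
	fmt.Println(g.tryStart()) // true: the first trigger starts an acquisition
	fmt.Println(g.tryStart()) // false: later triggers are coalesced, no log growth
	g.finish()
	fmt.Println(g.tryStart()) // true again once the attempt has resolved
}
```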

tbg (Member) commented Jul 3, 2023

I've looked at the above trace and the code, and while no thought experiment survives contact with reality, at least in theory our reproposal behavior should already be near-optimal for reducing latency.

When no leader is known, we don't repropose (actually we do, but Raft just drops the proposals). The first thing a new leader does is append an empty entry, and when anyone sees an empty entry being appended, we repropose everything [1][2]. So the mechanism is reactive, and it should be really snappy in practice. (If a follower is behind on the Raft log, it might as well catch up a bit before adding more work to the system; strictly speaking it is not quite as reactive as it could be here, since it learns of the leader before seeing the empty entry.)
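For illustration, here is a minimal, self-contained Go sketch of that reactive mechanism. The entry/replica types and the refreshPendingProposals helper are hypothetical stand-ins for the proposal-refresh path the footnotes point at, not CockroachDB's actual code:

```go
package main

import "fmt"

// entry is a minimal stand-in for a Raft log entry; an empty Data field is
// the new leader's no-op entry described above.
type entry struct {
	Index uint64
	Data  []byte
}

// replica holds command IDs that were proposed while no leader was known.
type replica struct {
	pending []string
}

// refreshPendingProposals re-submits every parked proposal with the given reason.
func (r *replica) refreshPendingProposals(reason string) {
	for _, id := range r.pending {
		fmt.Printf("re-submitting command %s: %s\n", id, reason)
	}
}

// onEntryAppended reacts to a freshly appended entry: seeing the new leader's
// empty entry is the cue to flush everything that was parked.
func (r *replica) onEntryAppended(e entry) {
	if len(e.Data) == 0 {
		r.refreshPendingProposals("reasonNewLeaderOrConfChange")
	}
}

func main() {
	r := &replica{pending: []string{"14043b6a9f42d6df"}}
	r.onEntryAppended(entry{Index: 42}) // the leader's empty entry triggers the refresh
}
```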

The trace above seems likely to have originated from a slow follower. I am assuming it's in the wake of a split (can't tell from the trace). We see that the follower requests a lease, likely because the lease in the split looks expired to the follower, which makes sense if it is applying the split "much later" than it happened in real time. The proposal is flushed to Raft, and we can't tell whether it was actually sent or dropped, but either way nothing happens for a while. It is flushed again, notably with a 390ms delay, so this node is probably pretty slow, and it is then rejected because by now there is a Raft leader (note that we didn't hit this condition in the first round, so there likely wasn't a Raft leader then). It's also not unlikely that the reproposal was actually triggered by the aforementioned reasonNewLeaderOrConfChange.

All in all, there is something interesting going on here: it looks like we were perhaps waiting for a Raft election, in which case the first proposal attempt was likely dropped. (With 6b8391c there would be an entry in the trace for that, but alas.)

I'll run an experiment locally to see if I can get "bad" lease latencies by just splitting a bunch.

Footnotes

[1] https://github.com/cockroachdb/cockroach/blob/0a1f8ff21f33c2030509b5d089a49603f63412db/pkg/kv/kvserver/replica_raft.go#L1079-L1085

[2] https://github.com/cockroachdb/cockroach/blob/0a1f8ff21f33c2030509b5d089a49603f63412db/pkg/kv/kvserver/replica_raft.go#L1439-L1443

erikgrinaker (Contributor, Author) commented

Definitely possible that we hit an election tie and/or election timeout here. There may not be anything to do here if we already flush pending proposals on leader changes.

tbg (Member) commented Jul 3, 2023

On a standard 3-node GCE roachprod cluster,

roachprod ssh tobias-split:1 -- ./cockroach workload run kv --init --splits 3000 --read-percent 0 --max-rate 500

produced this satisfying pattern:

[image: lease request latency histogram]

This isn't using expiration-based leases everywhere, so I switched on kv.expiration_leases_only.enabled at 12:51 and stopped the workload. This notably changes the histogram, since we are now acquiring many more leases.

[image: latency histogram after enabling kv.expiration_leases_only.enabled]

After letting it settle in for a few minutes, I ran the split+workload again at around 12:53:30. If this had any effect, it wasn't visible from the histogram. In particular, the pMax seems about the same, at ~70ms.

[image: latency histogram during the second split+workload run]

tbg (Member) commented Jul 3, 2023

My takeaway here is that there's nothing "obviously" broken, and that the problems are likely a layer below, having to do with Raft election stalemates or missed places where we ought to call .Campaign. I also tested on master as of today, so this has all of your latest work, which may also have fixed whatever you were originally seeing.

I think we should focus our energy on improving observability for "all things below raft". I'll add metrics for reproposals and replication latencies.

In the above experiment, we did see dropped proposals, though they were all dropped on followers:

[image: graph of dropped proposals]

I assume (but can't prove, due to the lack of a replication latency metric) that these were all reactively reproposed once a leader was known, and so the delay incurred was near optimal (assuming the leader was known soon thereafter). #83262 would help prove this, so it's something we should get in.
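As a rough illustration of the observability called for above, here is a small self-contained Go sketch of what reproposal and replication-latency metrics could track. The names and shapes are hypothetical and use only the standard library, not CockroachDB's metric package:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// reproposalStats counts reproposals by reason; the reasons mirror the ones
// visible in the trace above (reasonTicks, reasonNewLeader), but the metric
// names and shapes here are invented for this sketch.
type reproposalStats struct {
	ticks     atomic.Int64
	newLeader atomic.Int64
}

func (s *reproposalStats) record(reason string) {
	switch reason {
	case "reasonTicks":
		s.ticks.Add(1)
	case "reasonNewLeader":
		s.newLeader.Add(1)
	}
}

// replicationLatency is what a per-command latency metric would observe: the
// time from handing a proposal to Raft until it applies.
func replicationLatency(proposedAt, appliedAt time.Time) time.Duration {
	return appliedAt.Sub(proposedAt)
}

func main() {
	var s reproposalStats
	s.record("reasonTicks")
	s.record("reasonNewLeader")
	fmt.Printf("reproposals: ticks=%d newLeader=%d\n", s.ticks.Load(), s.newLeader.Load())

	proposed := time.Now()
	applied := proposed.Add(23 * time.Millisecond)
	fmt.Println("replication latency:", replicationLatency(proposed, applied))
}
```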

erikgrinaker (Contributor, Author) commented

The original experiments where we saw this used a cluster with 200 ms RTTs and heavy load (a TPCC import). We probably hit a few election stalemates following election timeouts. Reproposing on leader change seems sufficient; I think we can close this for now.
