Skip to content

Commit

Permalink
kv: Don't evict from leaseholder cache due to context cancellations
Browse files Browse the repository at this point in the history
This was a major contributor to the hundreds of NotLeaseHolderErrors per
second that we see whenever we run tpc-c at scale. A non-essential batch
request like a QueryTxn would get cancelled, causing the range to be
evicted from the leaseholder cache and the next request to that range to
have to guess at the leaseholder.

Release note: None
  • Loading branch information
a-robinson committed Sep 13, 2018
1 parent 1be2119 commit 397409c
Showing 1 changed file with 15 additions and 14 deletions.
29 changes: 15 additions & 14 deletions pkg/kv/dist_sender.go
Original file line number Diff line number Diff line change
Expand Up @@ -1324,20 +1324,21 @@ func (ds *DistSender) sendToReplicas(
ambiguousError = err
}
log.VErrEventf(ctx, 2, "RPC error: %s", err)
if storeID, ok := ds.leaseHolderCache.Lookup(ctx, rangeID); ok && curReplica.StoreID == storeID {
// If the down replica is cached as the lease holder, evict
// it. The only other eviction happens below on
// NotLeaseHolderError, but if the next replica is the
// actual lease holder, we're never going to receive one of
// those and will thus pay the price of trying the down node
// first forever.
//
// NB: we could consider instead adding a successful reply
// from the next replica into the cache, but without a
// leaseholder (and taking into account that the local
// node can't be down) it won't take long until we talk
// to a replica that tells us who the leaseholder is.
ds.leaseHolderCache.Update(ctx, rangeID, 0 /* evict */)

// If the error wasn't just a context cancellation and the down replica
// is cached as the lease holder, evict it. The only other eviction
// happens below on NotLeaseHolderError, but if the next replica is the
// actual lease holder, we're never going to receive one of those and
// will thus pay the price of trying the down node first forever.
//
// NB: we should consider instead adding a successful reply from the next
// replica into the cache, but without a leaseholder (and taking into
// account that the local node can't be down) it won't take long until we
// talk to a replica that tells us who the leaseholder is.
if ctx.Err() == nil {
if storeID, ok := ds.leaseHolderCache.Lookup(ctx, rangeID); ok && curReplica.StoreID == storeID {
ds.leaseHolderCache.Update(ctx, rangeID, 0 /* evict */)
}
}
} else {
propagateError := false
Expand Down

0 comments on commit 397409c

Please sign in to comment.