
kv: Don't evict from leaseholder cache on context cancellations #30163

Merged
merged 2 commits into cockroachdb:master on Sep 13, 2018

Conversation

a-robinson (Contributor)

This was a major contributor to the hundreds of NotLeaseHolderErrors per
second that we see whenever we run tpc-c at scale. A non-essential batch
request like a QueryTxn would get cancelled, causing the range to be
evicted from the leaseholder cache and the next request to that range to
have to guess at the leaseholder.

This is effectively an extension of #26764 that we should have thought to inspect more closely at the time.

Actually fixes #23543, which was not fully fixed before. I still haven't seen the errors drop all the way to 0, so I'm letting tpc-c 10k continue to run for a while longer to verify that they keep falling; they are still decreasing about 15 minutes in. I don't think getting to 0 will be possible (there are still occasional splits and lease transfers), but it looks like this change should get us down to single-digit errors per second from the hundreds we were at before.

Also, avoid unnecessarily sorting replicas by latency in the dist_sender in the common case where we know who the leaseholder is and plan to send our request there.
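
For illustration only, here is a minimal Go sketch of the two behaviors described above. The `leaseholderCache`, `replicaInfo`, and `send` names are hypothetical, not the actual DistSender code: the cached leaseholder is evicted only when the failure was not caused by our own context being canceled, and the latency sort is skipped when we already know who the leaseholder is.

```go
// Illustrative sketch only; names (leaseholderCache, replicaInfo, send) are
// hypothetical and not the actual DistSender implementation.
package kv

import (
	"context"
	"sort"
	"time"
)

type replicaInfo struct {
	nodeID  int
	latency time.Duration
}

type leaseholderCache struct {
	byRange map[int64]int // rangeID -> nodeID of the cached leaseholder
}

func (c *leaseholderCache) Lookup(rangeID int64) (int, bool) {
	nodeID, ok := c.byRange[rangeID]
	return nodeID, ok
}

func (c *leaseholderCache) Evict(rangeID int64) {
	delete(c.byRange, rangeID)
}

// sendToRange orders the replicas and sends the request, evicting the cached
// leaseholder only when the failure actually tells us something about the
// lease rather than about our own canceled context.
func sendToRange(ctx context.Context, cache *leaseholderCache, rangeID int64, replicas []replicaInfo) error {
	if leaseholder, ok := cache.Lookup(rangeID); ok {
		// Common case: move the known leaseholder to the front and skip the
		// latency sort entirely.
		for i := range replicas {
			if replicas[i].nodeID == leaseholder {
				replicas[0], replicas[i] = replicas[i], replicas[0]
				break
			}
		}
	} else {
		// Only pay for the latency sort when we have to guess.
		sort.Slice(replicas, func(i, j int) bool {
			return replicas[i].latency < replicas[j].latency
		})
	}

	if err := send(ctx, replicas); err != nil {
		// A canceled or expired context says nothing about who holds the
		// lease, so keep the cache entry in that case.
		if ctx.Err() == nil {
			cache.Evict(rangeID)
		}
		return err
	}
	return nil
}

// send stands in for the actual RPC fan-out to the ordered replicas.
func send(ctx context.Context, replicas []replicaInfo) error {
	return ctx.Err()
}
```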

@a-robinson requested review from nvanbenschoten and a team on September 12, 2018 at 23:18
@cockroach-teamcity (Member)

This change is Reviewable

@a-robinson (Contributor Author)

It may be a fluke, but this also has me hitting higher tpmC efficiencies than before (98.7% vs. 94.6% on a 5-minute run; 99.7% on a 15-minute run).

@nvanbenschoten (Member) left a comment

:lgtm: nice catch! I've also run into significant numbers of NotLeaseHolderErrors on larger clusters, so I'm very happy we tracked this down.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/kv/dist_sender.go, line 1338 at r2 (raw file):

			// account that the local node can't be down) it won't take long until we
			// talk to a replica that tells us who the leaseholder is.
			if ctx.Err() == nil {

nit: is this preferable to `if err != context.Canceled && err != context.DeadlineExceeded {`?

@nvanbenschoten (Member)

Looks like this was actually somewhat new: #23885.

@a-robinson (Contributor Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 1338 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: is this preferable to `if err != context.Canceled && err != context.DeadlineExceeded {`?

An argument could be made either way. I like this way because, with all the wrapping of errors that we do throughout our code base, I don't necessarily trust `SendNext` to have returned the context's error directly. Also, this matches the check on `ctx.Err()` that we do below to determine whether to mark the error as a `SendError`.

If you prefer the other way for similarly compelling reasons, I can change it.

@nvanbenschoten (Member) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 1338 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

An argument could be made either way. I like this way because, with all the wrapping of errors that we do throughout our code base, I don't necessarily trust `SendNext` to have returned the context's error directly. Also, this matches the check on `ctx.Err()` that we do below to determine whether to mark the error as a `SendError`.

If you prefer the other way for similarly compelling reasons, I can change it.

The wrapping argument is valid. Let's keep this as is.
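
To make the wrapping concern concrete, here is a small standalone Go example (not from this PR): once a lower layer wraps the context's error, a direct comparison against `context.Canceled` no longer matches, while `ctx.Err()` still reports that the context is done, so the eviction can be skipped reliably.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	// Simulate a lower layer (e.g. something like SendNext) wrapping the
	// context's error before returning it.
	err := fmt.Errorf("sending to replica: %v", ctx.Err())

	fmt.Println(err == context.Canceled) // false: the wrapped error is a different value
	fmt.Println(ctx.Err() == nil)        // false: the context still knows it was canceled,
	// so the leaseholder cache entry can be kept.
}
```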

This avoids a bunch of unnecessary latency lookups and sorting in the
common case.

Release note: None

This was a major contributor to the hundreds of NotLeaseHolderErrors per
second that we see whenever we run tpc-c at scale. A non-essential batch
request like a QueryTxn would get cancelled, causing the range to be
evicted from the leaseholder cache and the next request to that range to
have to guess at the leaseholder.

Release note: None
@a-robinson (Contributor Author)

bors r+

craig bot pushed a commit that referenced this pull request Sep 13, 2018
29692: ui: various glue fixes r=vilterp,couchand a=benesch

This PR restores the "no UI installed" message in short binaries:

![image](https://user-images.githubusercontent.com/882976/45196553-e713b880-b22a-11e8-928a-06c7a2da0f63.png)

I came across a few minor nits that seemed worth fixing too.

30135: sql: add '2.0' setting for distsql r=jordanlewis a=jordanlewis

The 2.0 setting for distsql (both a cluster setting and a session
setting) instructs the executor to use the 2.0 method of determining how
to execute a query: the query runs via local SQL unless it is both
distributable and recommended to be distributed, in which case it runs
via DistSQL and is actually distributed.

Release note (sql change): add the '2.0' value for both the distsql
session setting and the sql.defaults.distsql cluster setting, which
instructs the database to use the 2.0 'auto' behavior for determining
whether queries run via distsql or not.

30148: storage: add new metrics for the RaftEntryCache r=nvanbenschoten a=nvanbenschoten

Four new metrics are introduced:
- `raft.entrycache.bytes`
- `raft.entrycache.size`
- `raft.entrycache.accesses`
- `raft.entrycache.hits`

30163: kv: Don't evict from leaseholder cache on context cancellations r=a-robinson a=a-robinson

This was a major contributor to the hundreds of NotLeaseHolderErrors per
second that we see whenever we run tpc-c at scale. A non-essential batch
request like a QueryTxn would get cancelled, causing the range to be
evicted from the leaseholder cache and the next request to that range to
have to guess at the leaseholder.

This is effectively an extension of #26764 that we should have thought to inspect more closely at the time.

Actually fixes #23543, which was not fully fixed before. I still haven't seen the errors drop all the way to 0, so I'm letting tpc-c 10k continue to run for a while longer to verify that they keep falling; they are still decreasing about 15 minutes in. I don't think getting to 0 will be possible (there are still occasional splits and lease transfers), but it looks like this change should get us down to single-digit errors per second from the hundreds we were at before.

Also, avoid unnecessarily sorting replicas by latency in the dist_sender in the common case where we know who the leaseholder is and plan to send our request there.

30197: sql/parser: fix the action for empty rules r=knz a=knz

Fixes #30141.


Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig (Contributor)

craig bot commented Sep 13, 2018

Build succeeded

craig bot merged commit 397409c into cockroachdb:master on Sep 13, 2018
Successfully merging this pull request may close these issues.

perf: excessive not lease holder errors running tpcc