perf: excessive not lease holder errors running tpcc #23543

petermattis · 2018-03-07T18:08:48Z

When starting up tpcc we see a spike in not lease holder errors that eventually settles down to a steady stream that never completely goes away. This was noticed on a cluster running tpcc-10k. I'm not sure if it happens on smaller tpcc datasets. There might be a simple explanation for this (e.g. the cluster has 54k ranges). My simple attempts to ensure that the leaseholder cache is fully populated on each node in the cluster were unable to completely eliminate the errors.

The text was updated successfully, but these errors were encountered:

bdarnell · 2018-03-07T18:38:20Z

The caching in DistSender is not very smart. We're pretty quick to invalidate the cache if we get any error (especially a timeout). So if the cluster is just barely able to handle its load, we might get a timeout, invalidate the cache, and then the next query would have a good chance of a NotLeaseHolderError. I don't think we're setting short deadlines on the TPCC requests themselves, but we do have short deadlines on some internal operations such as liveness updates.

petermattis · 2018-03-07T20:11:30Z

The not leaseholder errors seem to have decreased over time as the cluster has remained up. They are now down in the 1-2/sec range. So perhaps my efforts to populate the leaseholder cache were failing.

jordanlewis · 2018-03-07T20:19:12Z

This is what it looks like on TPCC-1000:

jordanlewis · 2018-03-07T20:19:49Z

Not a great screenshot. This is 1000, 2000, and 3000 next to each other on a cluster that didn't have any splits or partitioning.

nvanbenschoten · 2018-03-07T20:24:40Z

So perhaps my efforts to populate the leaseholder cache were failing.

What were these efforts exactly?

petermattis · 2018-03-07T20:28:17Z

What were these efforts exactly?

Running select count(*) on the various tables pointed at the various nodes. But perhaps I need to do that with distsql=off. I know distsql populates the range cache, but perhaps not 100% of the time.

tbg · 2018-03-13T22:46:40Z

@petermattis what's the appropriate (i.e. smallest) setting in which you think I could repro this?

petermattis · 2018-03-13T22:52:49Z

@tschottdorf I don't know. I'm going to set up a 3-node cluster for tpcc-1k tonight. I'll point you at it if it repros there. I don't recall seeing this problem at tpcc-2k or tpcc-5k, but I also wasn't paying attention.

nvanbenschoten · 2018-03-13T23:40:24Z

I set up a TPCC-1 3-node cluster and ran the following on all nodes before applying load:

USE tpcc;
SET DISTSQL = OFF;
SELECT COUNT(*) FROM warehouse;
...
SELECT COUNT(*) FROM order_line;

After doing so, I turned on the loadgen. I saw only one not leaseholder error spike, which amounted to about 8 errors. After this point, I did not see a single error.

One thing I'm curious about is how many nodes were in the cluster that saw this initial spike. @a-robinson, would you expect rebalancing to be less stable on a larger cluster?

Also, doing this experiment brought up an interesting problem. When I tried it initially on a TPCC-5k cluster, it took about half an hour because of the large amount of data that each COUNT(*) needed to fetch. I think it would be worth adding a SQL command to fetch the leaseholder for all ranges in a table and prime the leaseholder cache without needing to actually perform a COUNT(*) and scan anything from disk. This command would probably also have the effect of leasing the table's descriptor, which would help out with #23510. We could then run this on all tables before starting a load generator and avoid any initial performance hit.

petermattis · 2018-03-14T12:21:39Z

One thing I'm curious about is how many nodes were in the cluster that saw this initial spike.

I saw this on a 24-node cluster. Testing tpcc-2k last night on an 8-node cluster also showed a spike, but it went away quickly.

I think it would be worth adding a SQL command to fetch the leaseholder for all ranges in a table and prime the leaseholder cache without needing to actually perform a COUNT(*) and scan anything from disk. This command would probably also have the effect of leasing the table's descriptor, which would help out with #23510. We could then run this on all tables before starting a load generator and avoid any initial performance hit.

Agreed.

petermattis · 2018-03-14T12:22:02Z

Moving to 2.1 as I don't think there is anything to do here for 2.0.

a-robinson · 2018-03-14T14:32:16Z

One thing I'm curious about is how many nodes were in the cluster that saw this initial spike. @a-robinson, would you expect rebalancing to be less stable on a larger cluster?

No, I wouldn't expect it to be any less stable. I wish I had written something down about it, but I half-remember looking at @petermattis's roachprod cluster after he complained about this on slack and not seeing any rebalancing activity.

Moving to 2.1 as I don't think there is anything to do here for 2.0.

Agreed given where we are in the release cycle, but we should be more actively updating the leaseholder cache when a replica successfully serves a request, as alluded to by the discussion on #23601.

petermattis · 2018-07-13T14:46:44Z

@a-robinson Didn't you fix some bugs related to leaseholder errors? Can this issue be closed?

a-robinson · 2018-07-13T20:32:14Z

I did fix a couple problems, although I also found a problem that it looks like I never opened an issue for that can cause large spikes of the errors.

During a lease transfer, there can be a period of time in which the old leaseholder redirects requests to the new leaseholder, but the new leaseholder hasn't applied the lease yet and thus redirects requests to the old leaseholder. This can cause requests to ping pong back and forth from the dist sender to one replica then the other, over and over and over. When network latencies are low, this can lead to thousands of not-leaseholder errors.

I managed to grab these verbose logs from logspy while looking into the spikes: https://gist.github.com/a-robinson/eb931e4987219060f3107e6bd371c625

Here's a brief excerpt demonstrating the problem:

I180616 05:52:17.728179 20503404 kv/dist_sender.go:1317  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] error: (err: [NotLeaseHolderError] r55752: replica (n30,s30):5 not lease holder; current lease is repl=(n16,s16):1 seq=0 start=1529128337.569175562,0 epo=9 pro=1529128337.569181772,0) <nil>; trying next peer (n16,s16):1
I180616 05:52:17.767118 20503404 kv/leaseholder_cache.go:97  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] r55752: updating leaseholder: 30
I180616 05:52:17.767203 20503404 kv/dist_sender.go:1317  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] error: (err: [NotLeaseHolderError] r55752: replica (n16,s16):1 not lease holder; current lease is repl=(n30,s30):5 seq=25 start=1529123107.545581542,0 epo=9 pro=1529123107.545589819,0) <nil>; trying next peer (n30,s30):5
I180616 05:52:17.771239 20503404 kv/leaseholder_cache.go:97  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] r55752: updating leaseholder: 16
I180616 05:52:17.771327 20503404 kv/dist_sender.go:1317  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] error: (err: [NotLeaseHolderError] r55752: replica (n30,s30):5 not lease holder; current lease is repl=(n16,s16):1 seq=0 start=1529128337.569175562,0 epo=9 pro=1529128337.569181772,0) <nil>; trying next peer (n16,s16):1
I180616 05:52:17.774656 20503404 kv/leaseholder_cache.go:97  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] r55752: updating leaseholder: 30
I180616 05:52:17.774740 20503404 kv/dist_sender.go:1317  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] error: (err: [NotLeaseHolderError] r55752: replica (n16,s16):1 not lease holder; current lease is repl=(n30,s30):5 seq=25 start=1529123107.545581542,0 epo=9 pro=1529123107.545589819,0) <nil>; trying next peer (n30,s30):5
I180616 05:52:17.841658 20503404 kv/leaseholder_cache.go:97  [client=10.142.0.54:39264,user=root,txn=61557e35,ts=1529128299.977196718,4,n12] r55752: updating leaseholder: 16

The spike of errors:

Proof of a lease transfer happening right around that time on that range:

nvanbenschoten · 2018-07-13T21:25:18Z

Yeah, this is exactly what I saw in #22837. We'll want to do something about this sometime soon.

a-robinson · 2018-07-14T02:35:58Z

Alright, I'm going to close this in favor of #22837 then. Thanks @nvanbenschoten.

29692: ui: various glue fixes r=vilterp,couchand a=benesch This PR restores the "no UI installed" message in short binaries: ![image](https://user-images.githubusercontent.com/882976/45196553-e713b880-b22a-11e8-928a-06c7a2da0f63.png) I came across a few minor nits that seemed worth fixing too. 30135: sql: add '2.0' setting for distsql r=jordanlewis a=jordanlewis The 2.0 setting for distsql (both a cluster setting and a session setting) instructs the executor to use the 2.0 method of determining how to execute a query: the query runs via local sql unless the query is both distributable and recommended to be distributed, in which case it runs via the distsql and is actually distributed. Release note (sql change): add the '2.0' value for both the distsql session setting and the sql.defaults.distsql cluster setting, which instructs the database to use the 2.0 'auto' behavior for determining whether queries run via distsql or not. 30148: storage: add new metrics for the RaftEntryCache r=nvanbenschoten a=nvanbenschoten Four new metrics are introduced: - `raft.entrycache.bytes` - `raft.entrycache.size` - `raft.entrycache.accesses` - `raft.entrycache.hits` 30163: kv: Don't evict from leaseholder cache on context cancellations r=a-robinson a=a-robinson This was a major contributor to the hundreds of NotLeaseHolderErrors per second that we see whenever we run tpc-c at scale. A non-essential batch request like a QueryTxn would get cancelled, causing the range to be evicted from the leaseholder cache and the next request to that range to have to guess at the leaseholder. This is effectively an extension of #26764 that we should have thought to inspect more closely at the time. Actually fixes #23543, which was not fully fixed before. Although I still haven't seen the errors drop all the way to 0, so I'm letting tpc-c 10k continue to run for a while longer to verify that they do. They are continuing to decrease about 15 minutes in. I don't think getting to 0 will be possible because there are still occasional splits and lease transfers), but it looks like it should be able to get down to single digit errors per second from the hundreds it was at before this change. Also, avoid doing unnecessary sorting by latency of replicas in the dist_sender in the common case when we know who the leaseholder is and plan on sending our request there. 30197: sql/parser: fix the action for empty rules r=knz a=knz Fixes #30141. Co-authored-by: Nikhil Benesch <[email protected]> Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Alex Robinson <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>

petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Mar 7, 2018

petermattis added this to the 2.0 milestone Mar 7, 2018

petermattis assigned nvanbenschoten, petermattis and a-robinson Mar 7, 2018

a-robinson mentioned this issue Mar 12, 2018

sql: distsql plans against unavailable node #23601

Closed

petermattis modified the milestones: 2.0, 2.1 Mar 14, 2018

cockroachdb deleted a comment from a-robinson Mar 14, 2018

nvanbenschoten added the A-kv-distribution Relating to rebalancing and leasing. label Apr 24, 2018

a-robinson closed this as completed Jul 14, 2018

a-robinson mentioned this issue Sep 12, 2018

kv: Don't evict from leaseholder cache on context cancellations #30163

Merged

a-robinson mentioned this issue Sep 13, 2018

backport-2.1: kv: Don't evict from leaseholder cache on context cancellations #30214

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: excessive not lease holder errors running tpcc #23543

perf: excessive not lease holder errors running tpcc #23543

petermattis commented Mar 7, 2018

bdarnell commented Mar 7, 2018

petermattis commented Mar 7, 2018

jordanlewis commented Mar 7, 2018

jordanlewis commented Mar 7, 2018

nvanbenschoten commented Mar 7, 2018

petermattis commented Mar 7, 2018

tbg commented Mar 13, 2018

petermattis commented Mar 13, 2018

nvanbenschoten commented Mar 13, 2018

petermattis commented Mar 14, 2018

petermattis commented Mar 14, 2018

a-robinson commented Mar 14, 2018

petermattis commented Jul 13, 2018

a-robinson commented Jul 13, 2018

nvanbenschoten commented Jul 13, 2018

a-robinson commented Jul 14, 2018

perf: excessive not lease holder errors running tpcc #23543

perf: excessive not lease holder errors running tpcc #23543

Comments

petermattis commented Mar 7, 2018

bdarnell commented Mar 7, 2018

petermattis commented Mar 7, 2018

jordanlewis commented Mar 7, 2018

jordanlewis commented Mar 7, 2018

nvanbenschoten commented Mar 7, 2018

petermattis commented Mar 7, 2018

tbg commented Mar 13, 2018

petermattis commented Mar 13, 2018

nvanbenschoten commented Mar 13, 2018

petermattis commented Mar 14, 2018

petermattis commented Mar 14, 2018

a-robinson commented Mar 14, 2018

petermattis commented Jul 13, 2018

a-robinson commented Jul 13, 2018

nvanbenschoten commented Jul 13, 2018

a-robinson commented Jul 14, 2018