Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: Don't treat context cancellation as a SendError #26764

Merged
merged 1 commit into from
Jun 18, 2018

Conversation

a-robinson
Copy link
Contributor

SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on #26608.

Release note: None


kv is probably the package I'm least familiar with all the details of, so this may be a bad way of accomplishing this. If so, please let me know. If it's reasonable, I'll add a test that context cancellations don't affect the contents of the range cache.

@a-robinson a-robinson requested review from nvanbenschoten and a team June 16, 2018 04:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

a-robinson added a commit to a-robinson/cockroach that referenced this pull request Jun 16, 2018
When running TPC-C 10k on a 30 node cluster without partitioning, range
1 was receiving thousands of qps while all other ranges were receiving
no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was
context cancellations causing range descriptors to be evicted from the
range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns
being sent for transactions with no anchor key, accounting for thousands
of QPS even after cockroachdb#26764 was fixed.

This causes the same outcome as the old code without the load, because
without this change we'd just send the request and get back a
REASON_TXN_NOT_FOUND error, which would cause the function to return
true.

It's possible that we should instead avoid the heartbeat loop at all for
transactions without a key, or that we should put in more effort to
prevent such requests from even counting as transactions (a la cockroachdb#26741,
which perhaps makes this change unnecessary?). Advice would be great.

Release note: None
@a-robinson a-robinson force-pushed the context-evict branch 2 times, most recently from c564b90 to c6c7ab1 Compare June 16, 2018 06:17
@nvanbenschoten
Copy link
Member

:lgtm: thanks for tracking this interaction down.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


Comments from Reviewable

@bdarnell
Copy link
Contributor

:lgtm:

I'm surprised that context cancellations are occurring that frequently. Maybe we're sending HeartbeatTxn too aggressively?


Review status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


pkg/kv/dist_sender.go, line 1309 at r1 (raw file):

			// cause range cache evictions. Context cancellations just mean the
			// sender changed its mind or the request timed out.
			return nil, errors.New(errMsg)

If it's unambiguous, I think it's better to return ctx.Err() instead of a new error (we should at least wrap it instead of errors.New). I'm not sure if there's explicit handling for context errors on this path, but I know we do it in some places.


Comments from Reviewable

@a-robinson
Copy link
Contributor Author

Review status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 1309 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If it's unambiguous, I think it's better to return ctx.Err() instead of a new error (we should at least wrap it instead of errors.New). I'm not sure if there's explicit handling for context errors on this path, but I know we do it in some places.

Done.


Comments from Reviewable

@a-robinson a-robinson force-pushed the context-evict branch 2 times, most recently from 03c775b to b0847f0 Compare June 18, 2018 20:16
@a-robinson
Copy link
Contributor Author

Added a test case that covers this.

SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on cockroachdb#26608.

Release note: None
@a-robinson
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 18, 2018

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Jun 18, 2018
26764: kv: Don't treat context cancellation as a SendError r=a-robinson a=a-robinson

SendErrors cause range descriptors to be evicted from the range cache,
but happen innocuously all the time as requests like HeartbeatTxns are
cancelled because they're no longer needed.

This was part of the cause of range 1 getting thousands of qps on 30
node TPC-C testing, as initially called out in the comments on #26608.

Release note: None

---------------------------

`kv` is probably the package I'm least familiar with all the details of, so this may be a bad way of accomplishing this. If so, please let me know. If it's reasonable, I'll add a test that context cancellations don't affect the contents of the range cache.

Co-authored-by: Alex Robinson <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jun 18, 2018

Build succeeded

@craig craig bot merged commit 947fa79 into cockroachdb:master Jun 18, 2018
craig bot pushed a commit that referenced this pull request Sep 13, 2018
29692: ui: various glue fixes r=vilterp,couchand a=benesch

This PR restores the "no UI installed" message in short binaries:

![image](https://user-images.githubusercontent.com/882976/45196553-e713b880-b22a-11e8-928a-06c7a2da0f63.png)

I came across a few minor nits that seemed worth fixing too.

30135: sql: add '2.0' setting for distsql r=jordanlewis a=jordanlewis

The 2.0 setting for distsql (both a cluster setting and a session
setting) instructs the executor to use the 2.0 method of determining how
to execute a query: the query runs via local sql unless the query is
both distributable and recommended to be distributed, in which case it
runs via the distsql and is actually distributed.

Release note (sql change): add the '2.0' value for both the distsql
session setting and the sql.defaults.distsql cluster setting, which
instructs the database to use the 2.0 'auto' behavior for determining
whether queries run via distsql or not.

30148: storage: add new metrics for the RaftEntryCache r=nvanbenschoten a=nvanbenschoten

Four new metrics are introduced:
- `raft.entrycache.bytes`
- `raft.entrycache.size`
- `raft.entrycache.accesses`
- `raft.entrycache.hits`

30163: kv: Don't evict from leaseholder cache on context cancellations r=a-robinson a=a-robinson

This was a major contributor to the hundreds of NotLeaseHolderErrors per
second that we see whenever we run tpc-c at scale. A non-essential batch
request like a QueryTxn would get cancelled, causing the range to be
evicted from the leaseholder cache and the next request to that range to
have to guess at the leaseholder.

This is effectively an extension of #26764 that we should have thought to inspect more closely at the time.

Actually fixes #23543, which was not fully fixed before. Although I still haven't seen the errors drop all the way to 0, so I'm letting tpc-c 10k continue to run for a while longer to verify that they do. They are continuing to decrease about 15 minutes in. I don't think getting to 0 will be possible because there are still occasional splits and lease transfers), but it looks like it should be able to get down to single digit errors per second from the hundreds it was at before this change.

Also, avoid doing unnecessary sorting by latency of replicas in the dist_sender in the common case when we know who the leaseholder is and plan on sending our request there.

30197: sql/parser: fix the action for empty rules r=knz a=knz

Fixes #30141.


Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants