backport-2.1: kv: try next replica on RangeNotFoundError #31250

tbg · 2018-10-11T08:22:20Z

Backport 5/5 commits from #31013.

/cc @cockroachdb/release

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries against that replica, because DistSender doesn't
aggressively shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.
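To make the behavior change concrete, here is a minimal sketch of the retry loop described above. It is not the actual DistSender code: `replica`, `sendToReplicas`, and `errRangeNotFound` are hypothetical names made up for illustration, and the sketch only models the RangeNotFoundError case (any other error is simply returned, which is a simplification).

```go
package main

import (
	"errors"
	"fmt"
)

// errRangeNotFound is a hypothetical stand-in for RangeNotFoundError.
var errRangeNotFound = errors.New("range not found")

// replica is a hypothetical, simplified handle for one replica of a range.
type replica struct {
	storeID int
	send    func() (string, error)
}

// sendToReplicas tries the replicas in order. When a replica answers with
// errRangeNotFound, the loop moves on to the next replica instead of
// stopping, which is the behavior this change introduces. Any other error
// is returned right away (a simplification made for this sketch).
func sendToReplicas(replicas []replica) (string, int, error) {
	lastErr := errors.New("no replicas to try")
	for _, r := range replicas {
		resp, err := r.send()
		if err == nil {
			return resp, r.storeID, nil
		}
		if errors.Is(err, errRangeNotFound) {
			// Remember which store reported the error and keep going.
			lastErr = fmt.Errorf("s%d: %w", r.storeID, err)
			continue
		}
		return "", r.storeID, err
	}
	// Every replica reported errRangeNotFound; the caller would now evict
	// the range descriptor and retry after a back-off.
	return "", 0, lastErr
}

func main() {
	replicas := []replica{
		// s1 models the freshly added replica that is not yet initialized.
		{storeID: 1, send: func() (string, error) { return "", errRangeNotFound }},
		{storeID: 2, send: func() (string, error) { return "ok", nil }},
	}
	resp, storeID, err := sendToReplicas(replicas)
	fmt.Println(resp, storeID, err)
}
```

With the old behavior, the first `errRangeNotFound` would have ended the loop, and the next attempt could have picked s1 again.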

cockroach-teamcity · 2018-10-11T08:22:28Z


bdarnell

Reviewed 1 of 1 files at r1, 8 of 10 files at r3, 1 of 3 files at r4, 6 of 6 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained


Whenever a successful response is received from an RPC that we know has to contact the leaseholder to succeed, update the leaseholder cache.

The immediate motivation for this is to be able to land the preceding commits, which greatly exacerbated (as in, added a much faster failure mode to)

```
make stress PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join
```

However, the change is one we've wanted to make for a while; our caching and in particular the eviction of leaseholders has been deficient essentially ever since it was first introduced.

Touches cockroachdb#31068.

Release note: None
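As a rough illustration of the cache update described in this commit message, the sketch below records the responding store as leaseholder only when the request succeeded and is known to require the leaseholder. All names (`leaseHolderCache`, `onResponse`, `needsLeaseHolder`) are hypothetical and do not mirror the real `kv` package API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSend is a hypothetical error used to model a failed RPC.
var errSend = errors.New("send failed")

// leaseHolderCache is a hypothetical, simplified cache that maps a range ID
// to the store ID last believed to hold the lease.
type leaseHolderCache struct {
	mu sync.Mutex
	m  map[int]int
}

func newLeaseHolderCache() *leaseHolderCache {
	return &leaseHolderCache{m: make(map[int]int)}
}

// lookup returns the cached leaseholder store for rangeID, if any.
func (c *leaseHolderCache) lookup(rangeID int) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	storeID, ok := c.m[rangeID]
	return storeID, ok
}

// onResponse updates the cache only if the RPC succeeded and the request is
// one that must be served by the leaseholder. A success from such a request
// implies that the responding store holds the lease.
func (c *leaseHolderCache) onResponse(rangeID, storeID int, needsLeaseHolder bool, err error) {
	if err != nil || !needsLeaseHolder {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[rangeID] = storeID
}

func main() {
	cache := newLeaseHolderCache()
	cache.onResponse(7, 1, true, errSend) // failed RPC: cache unchanged
	cache.onResponse(7, 2, false, nil)    // may be served by a follower: cache unchanged
	cache.onResponse(7, 3, true, nil)     // success from the leaseholder: cache updated
	fmt.Println(cache.lookup(7))
}
```

Updating on success complements eviction on failure: after a request is redirected to the right store, later requests for the same range can go there directly.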

tbg requested review from bdarnell and a team October 11, 2018 08:22

tbg requested a review from a team as a code owner October 11, 2018 08:22

tbg requested a review from a team October 11, 2018 08:22

tbg mentioned this pull request Oct 11, 2018

storage: more proactively replicaGC replicas with stuck commands #26952

Closed

bdarnell approved these changes Oct 11, 2018


tbg force-pushed the backport2.1-31013 branch 3 times, most recently from 78c9981 to 2ebe157 Compare October 11, 2018 20:41

tbg added 5 commits October 12, 2018 21:39

roachtest: print output on failure in election test

629267e

This was hiding the output if the invocation itself failed, which is when you wanted it most. Release note: None

kv: improve a trace event about descriptor eviction

4d0fb44

Release note: None

storage: report optional StoreID with RangeNotFoundError

9998c7a

Release note: None

tbg force-pushed the backport2.1-31013 branch from 2ebe157 to fdca0bd Compare October 12, 2018 19:39

tbg merged commit a7a3cc1 into cockroachdb:release-2.1 Oct 12, 2018

tbg deleted the backport2.1-31013 branch October 12, 2018 21:19
