kv: try next replica on RangeNotFoundError #31013
Conversation
Nice find! These kinds of issues are very tricky to track down.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 10 of 10 files at r4, 1 of 1 files at r5, 2 of 2 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/dist_sender.go, line 1117 at r3 (raw file):
// descriptor. Invalidate the cache and try again with the new
// metadata.
log.Eventf(ctx, "evicting range descriptor on %T and backoff for re-lookup: %+v", tErr, desc)
Should we make this a VEventf?
pkg/kv/dist_sender.go, line 1360 at r6 (raw file):
// These errors are likely to be unique to the replica that reported
// them, so no action is required before the next retry.
case *roachpb.RangeNotFoundError:
The effect of this change is that cases where all replicas return RangeNotFoundErrors will now result in a SendError once the transport is exhausted. I think that's ok because we treat them the same in sendPartialBatch, but I want to make sure that's your intention.
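To make the shape of that concrete, here is a minimal sketch of the behavior under discussion, not the actual DistSender code; the identifiers (replica, sendOne, errRangeNotFound) are made up for illustration:

```go
// Sketch: treat RangeNotFoundError-style failures as per-replica errors and
// fall through to a SendError-like result only once every replica in the
// transport has been tried.
package distsketch

import (
	"context"
	"errors"
	"fmt"
)

type replica struct{ storeID int }

// errRangeNotFound stands in for roachpb.RangeNotFoundError.
var errRangeNotFound = errors.New("range not found on store")

func sendToReplicasSketch(
	ctx context.Context,
	replicas []replica,
	sendOne func(context.Context, replica) error,
) error {
	var lastErr error
	for _, r := range replicas {
		err := sendOne(ctx, r)
		if err == nil {
			return nil // success on this replica; stop here
		}
		if errors.Is(err, errRangeNotFound) {
			// Unique to this replica: remember it and try the next one.
			lastErr = err
			continue
		}
		return err // other errors propagate immediately
	}
	// Transport exhausted: report a SendError-like failure so the caller
	// (sendPartialBatch) can evict the descriptor and retry.
	return fmt.Errorf("sending to all %d replicas failed; last error: %v",
		len(replicas), lastErr)
}
```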
pkg/kv/dist_sender.go, line 1362 at r6 (raw file):
case *roachpb.RangeNotFoundError:
	// The store we routed to doesn't have this replica. This can happen when
	// our descriptor is outright outdated, but it can also be caused a replica
"caused by"
pkg/kv/dist_sender.go, line 1371 at r6 (raw file):
// up (which may never happen if the target range is dormant).
if tErr.StoreID != 0 {
	cachedStoreID, found := ds.leaseHolderCache.Lookup(ctx, tErr.RangeID)
In your test failure, was the replica that the request continued to get routed to the leaseholder? I'm wondering if we should propagate the error immediately if we think it is the leaseholder but it returns a RangeNotFoundError.
pkg/kv/dist_sender.go, line 1374 at r6 (raw file):
match := cachedStoreID == tErr.StoreID
evicting := found && match
log.Eventf(ctx, "evicting leaseholder s%d for r%d after RangeNotFound: %t (found=%t, match=%t)", tErr.StoreID, tErr.RangeID, evicting, found, match)
Should this be moved a line down?
pkg/kv/dist_sender.go, line 1391 at r6 (raw file):
if replicas.FindReplica(lh.StoreID) == -1 {
	// Replace NotLeaseHolderError with RangeNotFoundError.
	br.Error = roachpb.NewError(roachpb.NewRangeNotFoundError(rangeID, curReplica.StoreID))
It's surprising that a NotLeaseHolderError will be converted to a RangeNotFoundError and propagated immediately but a RangeNotFoundError won't be propagated (see comment above). Should we change this to a SendError and remove the handling of RangeNotFoundErrors entirely from sendPartialBatch?
pkg/storage/store.go, line 144 at r5 (raw file):
Settings:   st,
AmbientCtx: log.AmbientContext{Tracer: st.Tracer},
Clock:      clock,
Is this a gofmt 1.10 vs 1.11 thing?
Reviewed 1 of 1 files at r2, 7 of 10 files at r4, 1 of 1 files at r5, 1 of 2 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):
// eagerly that multiple elections conflict with each other).
start = timeutil.Now()
buf, err := c.RunWithBuffer(ctx, c.l, c.Node(1), `env COCKROACH_CONNECT_TIMEOUT=120 ./cockroach sql --insecure -e "
Why so long? The point of this test is that the cluster should be able to recover quickly.
pkg/kv/dist_sender.go, line 1391 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's surprising that a NotLeaseHolderError will be converted to a RangeNotFoundError and propagated immediately but a RangeNotFoundError won't be propagated (see comment above). Should we change this to a SendError and remove the handling of RangeNotFoundErrors entirely from sendPartialBatch?
+1 (I think, haven't thought through this part fully)
Force-pushed aae8c5b to ab08379 (compare)
Force-pushed ab08379 to 8b083e1 (compare)
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Why so long? The point of this test is that the cluster should be able to recover quickly.
All I know is that 5s is too short. I assume this is because Gossip needs a second to bootstrap (we start the processes asynchronously). I'll take this out and we can look at that separately.
pkg/kv/dist_sender.go, line 1117 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we make this a VEventf?
Sure, why not. Doubt it matters, but doesn't hurt.
pkg/kv/dist_sender.go, line 1360 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The effect of this change is that cases where all replicas return RangeNotFoundErrors will now result in a SendError once the transport is exhausted. I think that's ok because we treat them the same in sendPartialBatch, but I want to make sure that's your intention.
Yes, I think that's fine.
pkg/kv/dist_sender.go, line 1362 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"caused by"
Done.
pkg/kv/dist_sender.go, line 1371 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
In your test failure, was the replica that the request continued to get routed to the leaseholder? I'm wondering if we should propagate the error immediately if we think it is the leaseholder but it returns a RangeNotFoundError.
No, in my testing this code never fired. But if it were cached as leaseholder we'd try it first on every subsequent attempt. If the real leaseholder were, say, the second replica, we'd never evict the cache and always pay a round-trip to that first node.
The real fix here is updating the leaseholder cache on successful replies, but I want to keep this diff small.
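As a rough illustration of what that follow-up could look like, here is a toy sketch; the types (leaseCache, rangeID, storeID) and the gating condition are hypothetical stand-ins, not DistSender's actual leaseHolderCache API:

```go
// Sketch: after a successful reply from a request that can only have
// succeeded on the leaseholder, make the cache point at the replica that
// answered. Gated on a mismatch since in steady state the cache is already
// correct.
package distsketch

type (
	rangeID int64
	storeID int32
)

type leaseCache struct {
	m map[rangeID]storeID
}

func (c *leaseCache) lookup(id rangeID) (storeID, bool) {
	s, ok := c.m[id]
	return s, ok
}

func (c *leaseCache) update(id rangeID, s storeID) {
	c.m[id] = s
}

// maybeUpdateLeaseholder is called after a successful response.
// requiresLeaseholder says whether the batch could only have succeeded on
// the leaseholder (for example, because it contained writes).
func maybeUpdateLeaseholder(
	c *leaseCache, rid rangeID, respondent storeID, requiresLeaseholder bool,
) {
	if !requiresLeaseholder {
		return
	}
	// Avoid churn: only write when the cached entry actually disagrees.
	if cached, ok := c.lookup(rid); ok && cached == respondent {
		return
	}
	c.update(rid, respondent)
}
```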
pkg/kv/dist_sender.go, line 1374 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this be moved a line down?
This message was in the right place when I needed it (since I wanted to know what was going on) but now it seems right to move it. Done.
pkg/kv/dist_sender.go, line 1391 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
+1 (I think, haven't thought through this part fully)
Done.
pkg/storage/store.go, line 144 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this a gofmt 1.10 vs 1.11 thing?
Not sure, but CI doesn't like it, so removed.
Force-pushed 8b083e1 to 1db9737 (compare)
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
All I know is that 5s is too short. I assume this is because Gossip needs a second to bootstrap (we start the processes asynchronously). I'll take this out and we can look at that separately.
OK. We should probably increase the default connect timeout then. 5s is too short, since connections in the real world may need to wait for the lease timeout for the cluster to become live. (Also I see that that env var is defined as a String interpreted as integer seconds when it should probably be a Duration instead)
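On that last aside, a hedged sketch of accepting the env var as a Duration while staying compatible with a bare integer-seconds value; the function name and fallback policy are illustrative, not the actual CLI code:

```go
// Sketch: parse a connect-timeout env var as a time.Duration, falling back
// to interpreting a bare integer as seconds for backwards compatibility.
package distsketch

import (
	"os"
	"strconv"
	"time"
)

func connectTimeout(defaultTimeout time.Duration) time.Duration {
	v := os.Getenv("COCKROACH_CONNECT_TIMEOUT")
	if v == "" {
		return defaultTimeout
	}
	if d, err := time.ParseDuration(v); err == nil {
		return d // e.g. "120s" or "2m"
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second // legacy "120"
	}
	return defaultTimeout
}
```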
pkg/storage/store.go, line 144 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Not sure, but CI doesn't like it, so removed.
Yes, it is gofmt 1.11. You can configure your editor to use cockroach/bin/gofmt or crlfmt to make sure it uses the right version.
Ack. Holding off on the merge since #31068 fails like clockwork on this PR in CI.
Force-pushed 9123dcf to 50bad5f (compare)
Force-pushed c0010ee to faea6f4 (compare)
This should be green soon. I added a commit that populates the leaseholder cache on successful responses, PTAL!
Reviewed 11 of 11 files at r7, 1 of 1 files at r8, 10 of 10 files at r9, 3 of 3 files at r10, 5 of 5 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/dist_sender.go, line 1391 at r6 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Done.
The other half of this is that we no longer need a case for RangeNotFoundError in sendPartialBatch at all. Is there a blocker that's preventing us from removing that now?
pkg/kv/dist_sender.go, line 1371 at r11 (raw file):
// When a request that we know could only succeed on the leaseholder comes
// back as successful, make sure the leaseholder cache reflects this
// replica. In steady state, this should almost never be the case, and so we
This wording is a bit strange. This should almost always be the case, which is why we gate it, right?
pkg/kv/dist_sender.go, line 1386 at r11 (raw file):
// replica that has just been added but needs a snapshot to be caught up.
//
// We'll try other replicas which typically gives us the leaseholder, either
Glad we could remove that code.
pkg/kv/dist_sender.go, line 1396 at r11 (raw file):
// the range descriptor), but that's OK - this kind of error should be rare
// but when it happens, we want to populate the cache with something useful
// ASAP without waiting for another round trip.
Can we change the leaseHolder object to avoid this?
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
/5        NULL  6  {5}  5

# Verify data placement.
Expand on this comment a bit. Why does the output change?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Expand on this comment a bit. Why does the output change?
It's a different table (kv vs kw).
And to answer @tschottdorf's original question in #31068 (comment), I imagine this makes the tests pass because it ensures that all the ranges/leases are in the DistSender's caches.
Reviewed 1 of 11 files at r7, 8 of 10 files at r9, 1 of 3 files at r10, 5 of 5 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/dist_sender.go, line 433 at r11 (raw file):
}

type leaseHolderInfo struct {
These two things don't seem to fit together very well in one struct. I think it would make more sense to make them two separate parameters to sendToReplicas (or just recompute needsLeaseHolder on the fly from the batch when needed).
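To illustrate the "recompute on the fly" alternative, a toy sketch follows; the batch, request, and flag types are made up stand-ins for roachpb.BatchRequest, and the flag choice is an assumption:

```go
// Sketch: instead of threading a needsLeaseHolder flag around in a struct,
// derive it from the batch itself when it is needed.
package distsketch

type requestFlag int

const (
	isWrite requestFlag = 1 << iota
	isConsistentRead
)

type request struct{ flags requestFlag }

type batchRequest struct{ reqs []request }

// needsLeaseholder reports whether the batch can only succeed on the
// leaseholder, e.g. because it writes or performs consistent reads.
func (ba batchRequest) needsLeaseholder() bool {
	for _, r := range ba.reqs {
		if r.flags&(isWrite|isConsistentRead) != 0 {
			return true
		}
	}
	return false
}
```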
This was hiding the output if the invocation itself failed, which is when you wanted it most.

Release note: None
Release note: None
Release note: None
Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off. This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.

It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.

Fixes cockroachdb#30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.
Whenever a successful response is received from an RPC that we know has to contact the leaseholder to succeed, update the leaseholder cache.

The immediate motivation for this is to be able to land the preceding commits, which greatly exacerbated (as in, added a much faster failure mode to)

```
make stress PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join
```

However, the change is one we've wanted to make for a while; our caching and in particular the eviction of leaseholders has been deficient essentially ever since it was first introduced.

Touches cockroachdb#31068.

Release note: None
Force-pushed faea6f4 to 857b9c0 (compare)
Ok, ready for a final look. Happy with how this turned out.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
OK. We should probably increase the default connect timeout then. 5s is too short, since connections in the real world may need to wait for the lease timeout for the cluster to become live. (Also I see that that env var is defined as a String interpreted as integer seconds when it should probably be a Duration instead)
I'll keep this on the radar, but I'm going to leave it as-is on the off-chance that these timeouts were actually the same bug, hitting very early in one of the internal queries run by ./cockroach sql.
pkg/kv/dist_sender.go, line 1391 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The other half of this is that we no longer need a case for RangeNotFoundError in sendPartialBatch at all. Is there a blocker that's preventing us from removing that now?
Good point, done.
pkg/kv/dist_sender.go, line 433 at r11 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
These two things don't seem to fit together very well in one struct. I think it would make more sense to make them two separate parameters to sendToReplicas (or just recompute needsLeaseHolder on the fly from the batch when needed).
Done.
pkg/kv/dist_sender.go, line 1371 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This wording is a bit strange. This should almost always be the case, which is why we gate it, right?
Oops, fixed.
pkg/kv/dist_sender.go, line 1386 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Glad we could remove that code.
Done.
pkg/kv/dist_sender.go, line 1396 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we change the leaseHolder object to avoid this?
Done.
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
It's a different table (kv vs kw). And to answer @tschottdorf's original question in #31068 (comment), I imagine this makes the tests pass because it ensures that all the ranges/leases are in the DistSender's caches.
Yeah, my only explanation here is that without this query, some ranges may just never have been touched and so the leaseholders aren't cached.
On the flip side, I don't know why this lease placement here is deterministic. Do these tests apply constraints or something like that? I mentioned this trick to @andreimatei and his eyebrow raised visibly, so perhaps he wants to comment.
(I'm not planning to block this PR on that, though).
Reviewed 8 of 10 files at r14, 1 of 3 files at r15, 6 of 6 files at r16.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
TFTR! bors r=nvanbenschoten,bdarnell
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, my only explanation here is that without this query, some ranges may just never have been touched and so the leaseholders aren't cached.
On the flip side, I don't know why this lease placement here is deterministic. Do these tests apply constraints or something like that? I mentioned this trick to @andreimatei and his eyebrow raised visibly, so perhaps he wants to comment.
(I'm not planning to block this PR on that, though).
why wouldn't the placement be deterministic? Doesn't the EXPERIMENTAL_RELOCATE do deterministic moves? And after that runs, I assume there's nothing to cause a change.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why wouldn't the placement be deterministic? Doesn't the EXPERIMENTAL_RELOCATE do deterministic moves? And after that runs, I assume there's nothing to cause a change.
Lease placement is deterministic because that's what EXPERIMENTAL_RELOCATE guarantees. It always transfers the lease to the first listed store, as specified in the original proposal https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170314_sql_split_syntax.md#3-alter-tableindex-experimental_relocate
Thanks for the clarifications, @a-robinson and @andreimatei.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Lease placement is deterministic because that's what
EXPERIMENTAL_RELOCATE
guarantees. It always transfers the lease to the first listed store, as specified in the original proposal https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170314_sql_split_syntax.md#3-alter-tableindex-experimental_relocate
right, so there's no mystery here, is there?
these show experimental_ranges queries could be replaced with select count(1) for the purposes of populating the caches, but I don't know if that'd be better or worse for the test.
I'm mildly surprised that show experimental_ranges populates any caches, but I don't know how it's implemented.
Reviewed 8 of 10 files at r14, 1 of 3 files at r15, 6 of 6 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
right, so there's no mystery here, is there? these show experimental_ranges queries could be replaced with select count(1) for the purposes of populating the caches, but I don't know if that'd be better or worse for the test. I'm mildly surprised that show experimental_ranges populates any caches, but I don't know how it's implemented.
show ranges works for populating the caches because in order to populate the lease_holder column it has to send a LeaseInfo batch request to every range.
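A stripped-down sketch of that mechanism, for context only; the types and the per-range request function are hypothetical placeholders rather than the actual SHOW RANGES implementation:

```go
// Sketch: ask each range for its lease and record the leaseholder, which is
// how a SHOW ... RANGES-style query ends up warming the leaseholder cache.
package distsketch

import "context"

type rangeDesc struct{ rangeID int64 }

type leaseInfo struct{ leaseholderStoreID int32 }

func warmLeaseholderCache(
	ctx context.Context,
	ranges []rangeDesc,
	leaseInfoFor func(context.Context, rangeDesc) (leaseInfo, error),
	record func(rangeID int64, storeID int32),
) error {
	for _, d := range ranges {
		li, err := leaseInfoFor(ctx, d) // one LeaseInfo-style request per range
		if err != nil {
			return err
		}
		record(d.rangeID, li.leaseholderStoreID)
	}
	return nil
}
```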
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off. This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.

It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs the `synctest` cli command against a nemesis that injects random I/O errors.

The synctest command is new. It simulates a Raft log and can be directed at a filesystem that is being hit with random failures. The workload essentially writes ascending keys (flushing each one to disk synchronously) until an I/O error occurs, at which point it re-opens the instance to verify that all persisted writes are still there. If the RocksDB instance was permanently corrupted, it switches to a new, pristine, directory.

This is used in the roachtest, but is also useful for manual use in user deployments in which we suspect there is a failure to persist data to disk.

This hasn't found anything, but it's fun to watch and also shows us a number of errors that we know and love from sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down before it processed the split at "b". Teach the test to wait for the LHS replica on store2 to process the split before blocking Raft traffic to it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

this makes the testdata work on mysql 8.0.2+, where the timestamp type no longer has the implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Build succeeded
This is no longer flaky, perhaps as a result of cockroachdb#31013. Verified via:

$ roachprod create tobias-stress -n 10 --gce-machine-type=n1-standard-8 --local-ssd=false
$ make roachprod-stressrace PKG=./pkg/sql/distsqlplan TESTS=TestSpanResolverUsesCaches CLUSTER=tobias-stress
...
8431 runs so far, 0 failures, over 15m15s

Release note: None
31726: storage: unskip TestSpanResolverUsesCaches r=petermattis a=tschottdorf

This is no longer flaky, perhaps as a result of #31013. Verified via:

$ roachprod create tobias-stress -n 10 --gce-machine-type=n1-standard-8 --local-ssd=false
$ make roachprod-stressrace PKG=./pkg/sql/distsqlplan TESTS=TestSpanResolverUsesCaches CLUSTER=tobias-stress
...
8431 runs so far, 0 failures, over 15m15s

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.
This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.
It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem
@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.
Fixes #30613.
Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.