kv: try next replica on RangeNotFoundError #31013

Merged
merged 5 commits into cockroachdb:master from roachtest/election-after-restart
Oct 10, 2018

Conversation


@tbg tbg commented Oct 5, 2018

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

  1. new replica receives a preemptive snapshot and the ConfChange
  2. cluster restarts
  3. now the new replica is in this state until the range wakes
     up, which may not happen for some time
  4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.
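
To make the behavior change concrete, here is a minimal, self-contained Go sketch (all types and names are hypothetical stand-ins, not the actual DistSender API): on a RangeNotFoundError the sender now advances to the next replica in the transport, and only surfaces an error once the transport is exhausted.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for DistSender internals; illustrative only.
type replica struct{ storeID int }

var errRangeNotFound = errors.New("range not found")

// sendToReplicas sketches the new behavior: a RangeNotFoundError no longer
// aborts the whole attempt; the next replica in the transport is tried.
func sendToReplicas(replicas []replica, send func(replica) error) error {
	var lastErr error
	for _, r := range replicas {
		err := send(r)
		if err == nil {
			return nil
		}
		if errors.Is(err, errRangeNotFound) {
			// Old behavior: evict the descriptor and back off, possibly
			// retrying the same replica. New behavior: keep going.
			lastErr = err
			continue
		}
		return err
	}
	// Transport exhausted: report a send error so the caller evicts the
	// cached range descriptor and retries with fresh metadata.
	return fmt.Errorf("transport exhausted: %w", lastErr)
}

func main() {
	rs := []replica{{storeID: 1}, {storeID: 2}, {storeID: 3}}
	err := sendToReplicas(rs, func(r replica) error {
		if r.storeID == 1 {
			return errRangeNotFound // stale replica still being added
		}
		return nil // the second replica serves the request
	})
	fmt.Println(err) // <nil>
}
```

Under the old scheme the loop would have returned on the first errRangeNotFound, and the retry after back-off could land on store 1 again.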

@tbg tbg requested a review from a team October 5, 2018 16:57
@cockroach-teamcity

This change is Reviewable

@tbg tbg requested a review from nvanbenschoten October 5, 2018 16:57

@nvanbenschoten nvanbenschoten left a comment


Nice find! These kinds of issues are very tricky to track down.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 10 of 10 files at r4, 1 of 1 files at r5, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 1117 at r3 (raw file):

			// descriptor. Invalidate the cache and try again with the new
			// metadata.
			log.Eventf(ctx, "evicting range descriptor on %T and backoff for re-lookup: %+v", tErr, desc)

Should we make this a VEventf?


pkg/kv/dist_sender.go, line 1360 at r6 (raw file):

				// These errors are likely to be unique to the replica that reported
				// them, so no action is required before the next retry.
			case *roachpb.RangeNotFoundError:

The effect of this change is that cases where all replicas return RangeNotFoundErrors will now result in a SendError once the transport is exhausted. I think that's ok because we treat them the same in sendPartialBatch, but I want to make sure that's your intention.


pkg/kv/dist_sender.go, line 1362 at r6 (raw file):

			case *roachpb.RangeNotFoundError:
				// The store we routed to doesn't have this replica. This can happen when
				// our descriptor is outright outdated, but it can also be caused a replica

"caused by"


pkg/kv/dist_sender.go, line 1371 at r6 (raw file):

				// up (which may never happen if the target range is dormant).
				if tErr.StoreID != 0 {
					cachedStoreID, found := ds.leaseHolderCache.Lookup(ctx, tErr.RangeID)

In your test failure, was the replica that the request continued to get routed to the leaseholder? I'm wondering if we should propagate the error immediately if we think it is the leaseholder but it returns a RangeNotFoundError.
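
For context on this thread, a rough sketch of the eviction check under discussion (hypothetical names; a plain map stands in for the real leaseHolderCache): the cached leaseholder is evicted only when it is exactly the store that reported the RangeNotFoundError.

```go
package main

import "fmt"

// leaseHolderCache is an illustrative stand-in: rangeID -> storeID.
type leaseHolderCache map[int]int

// maybeEvict drops the cached leaseholder for rangeID only if it matches
// the store that reported the error, and reports whether it evicted.
func maybeEvict(c leaseHolderCache, rangeID, errStoreID int) bool {
	cached, found := c[rangeID]
	if found && cached == errStoreID {
		delete(c, rangeID)
		return true
	}
	return false
}

func main() {
	c := leaseHolderCache{7: 3}
	fmt.Println(maybeEvict(c, 7, 2)) // false: a different store errored
	fmt.Println(maybeEvict(c, 7, 3)) // true: the cached leaseholder errored
}
```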


pkg/kv/dist_sender.go, line 1374 at r6 (raw file):

					match := cachedStoreID == tErr.StoreID
					evicting := found && match
					log.Eventf(ctx, "evicting leaseholder s%d for r%d after RangeNotFound: %t (found=%t, match=%t)", tErr.StoreID, tErr.RangeID, evicting, found, match)

Should this be moved a line down?


pkg/kv/dist_sender.go, line 1391 at r6 (raw file):

					if replicas.FindReplica(lh.StoreID) == -1 {
						// Replace NotLeaseHolderError with RangeNotFoundError.
						br.Error = roachpb.NewError(roachpb.NewRangeNotFoundError(rangeID, curReplica.StoreID))

It's surprising that a NotLeaseHolderError will be converted to a RangeNotFoundError and propagated immediately but a RangeNotFoundError won't be propagated (see comment above). Should we change this to a SendError and remove the handling of RangeNotFoundErrors entirely from sendPartialBatch?
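
To illustrate the suggestion (names are invented for the sketch): when a NotLeaseHolderError points at a store that is not in the cached descriptor, fail the attempt with a generic send error, so sendPartialBatch only has one error shape to handle.

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct{ storeID int }

// onUnknownLeaseHolder sketches the proposed handling: if the indicated
// leaseholder is absent from the cached descriptor, return a send error
// (prompting a descriptor refresh) instead of a RangeNotFoundError.
func onUnknownLeaseHolder(replicas []replica, lhStoreID int) error {
	for _, r := range replicas {
		if r.storeID == lhStoreID {
			return nil // leaseholder known: route the retry to it
		}
	}
	return errors.New("send error: leaseholder not in cached descriptor")
}

func main() {
	fmt.Println(onUnknownLeaseHolder([]replica{{1}, {2}}, 3))
}
```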


pkg/storage/store.go, line 144 at r5 (raw file):

		Settings:                    st,
		AmbientCtx:                  log.AmbientContext{Tracer: st.Tracer},
		Clock:                       clock,

Is this a gofmt 1.10 vs 1.11 thing?


@bdarnell bdarnell left a comment


Reviewed 1 of 1 files at r2, 7 of 10 files at r4, 1 of 1 files at r5, 1 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):

			// eagerly that multiple elections conflict with each other).
			start = timeutil.Now()
			buf, err := c.RunWithBuffer(ctx, c.l, c.Node(1), `env COCKROACH_CONNECT_TIMEOUT=120 ./cockroach sql --insecure -e "

Why so long? The point of this test is that the cluster should be able to recover quickly.


pkg/kv/dist_sender.go, line 1391 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's surprising that a NotLeaseHolderError will be converted to a RangeNotFoundError and propagated immediately but a RangeNotFoundError won't be propagated (see comment above). Should we change this to a SendError and remove the handling of RangeNotFoundErrors entirely from sendPartialBatch?

+1 (I think, haven't thought through this part fully)

@tbg tbg force-pushed the roachtest/election-after-restart branch 3 times, most recently from aae8c5b to ab08379 Compare October 8, 2018 11:17
@tbg tbg force-pushed the roachtest/election-after-restart branch from ab08379 to 8b083e1 Compare October 8, 2018 11:34

@tbg tbg left a comment


TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why so long? The point of this test is that the cluster should be able to recover quickly.

All I know is that 5s is too short. I assume this is because Gossip needs a second to bootstrap (we start the processes asynchronously). I'll take this out and we can look at that separately.


pkg/kv/dist_sender.go, line 1117 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we make this a VEventf?

Sure, why not. Doubt it matters, but doesn't hurt.


pkg/kv/dist_sender.go, line 1360 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The effect of this change is that cases where all replicas return RangeNotFoundErrors will now result in a SendError once the transport is exhausted. I think that's ok because we treat them the same in sendPartialBatch, but I want to make sure that's your intention.

Yes, I think that's fine.


pkg/kv/dist_sender.go, line 1362 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"caused by"

Done.


pkg/kv/dist_sender.go, line 1371 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

In your test failure, was the replica that the request continued to get routed to the leaseholder? I'm wondering if we should propagate the error immediately if we think it is the leaseholder but it returns a RangeNotFoundError.

No, in my testing this code never fired. But if it were cached as leaseholder we'd try it first on every subsequent attempt. If the real leaseholder were, say, the second replica, we'd never evict the cache and always pay a round-trip to that first node.

The real fix here is updating the leaseholder cache on successful replies, but I want to keep this diff small.


pkg/kv/dist_sender.go, line 1374 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be moved a line down?

This message was in the right place when I needed it (since I wanted to know what was going on) but now it seems right to move it. Done.


pkg/kv/dist_sender.go, line 1391 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

+1 (I think, haven't thought through this part fully)

Done.


pkg/storage/store.go, line 144 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this a gofmt 1.10 vs 1.11 thing?

Not sure, but CI doesn't like it, so removed.

@tbg tbg force-pushed the roachtest/election-after-restart branch from 8b083e1 to 1db9737 Compare October 8, 2018 13:09

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

All I know is that 5s is too short. I assume this is because Gossip needs a second to bootstrap (we start the processes asynchronously). I'll take this out and we can look at that separately.

OK. We should probably increase the default connect timeout then. 5s is too short, since connections in the real world may need to wait for the lease timeout for the cluster to become live. (Also I see that that env var is defined as a String interpreted as integer seconds when it should probably be a Duration instead)


pkg/storage/store.go, line 144 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Not sure, but CI doesn't like it, so removed.

Yes, it is gofmt 1.11. You can configure your editor to use cockroach/bin/gofmt or crlfmt to make sure it uses the right version.


tbg commented Oct 8, 2018

Ack. Holding off on the merge since #31068 fails like clockwork on this PR in CI.

@tbg tbg requested a review from a team October 8, 2018 17:50
@tbg tbg force-pushed the roachtest/election-after-restart branch 2 times, most recently from 9123dcf to 50bad5f Compare October 9, 2018 12:48
@tbg tbg requested a review from a team as a code owner October 9, 2018 12:48
@tbg tbg force-pushed the roachtest/election-after-restart branch 2 times, most recently from c0010ee to faea6f4 Compare October 9, 2018 14:55

tbg commented Oct 9, 2018

This should be green soon. I added a commit that populates the leaseholder cache on successful responses, PTAL!
I'm moderately excited about that commit because we've been wanting it for a while and it was actually easy. I'm less excited about the point in the cycle at which it's happening.


@nvanbenschoten nvanbenschoten left a comment


Reviewed 11 of 11 files at r7, 1 of 1 files at r8, 10 of 10 files at r9, 3 of 3 files at r10, 5 of 5 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 1391 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

The other half of this is that we no longer need a case for RangeNotFoundError in sendPartialBatch at all. Is there a blocker that's preventing us from removing that now?


pkg/kv/dist_sender.go, line 1371 at r11 (raw file):

				// When a request that we know could only succeed on the leaseholder comes
				// back as successful, make sure the leaseholder cache reflects this
				// replica. In steady state, this should almost never be the case, and so we

This wording is a bit strange. This should almost always be the case, which is why we gate it, right?
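
A hedged sketch of the gating in question (hypothetical names, a map standing in for the cache): on a success that only the leaseholder could have produced, the cache is written only when it does not already point at the responding store, so in steady state the write is skipped.

```go
package main

import "fmt"

// updateOnSuccess sketches the gate: write to the cache only when the
// entry is missing or points at a different store than the one that just
// served the leaseholder-only request.
func updateOnSuccess(cache map[int]int, rangeID, respStoreID int) {
	if cached, ok := cache[rangeID]; !ok || cached != respStoreID {
		cache[rangeID] = respStoreID // rare outside of lease movement
	}
}

func main() {
	cache := map[int]int{}
	updateOnSuccess(cache, 7, 2) // miss: populate
	updateOnSuccess(cache, 7, 2) // hit and match: no-op
	fmt.Println(cache)           // map[7:2]
}
```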


pkg/kv/dist_sender.go, line 1386 at r11 (raw file):

				// replica that has just been added but needs a snapshot to be caught up.
				//
				// We'll try other replicas which typically gives us the leaseholder, either

Glad we could remove that code.


pkg/kv/dist_sender.go, line 1396 at r11 (raw file):

					// the range descriptor), but that's OK - this kind of error should be rare
					// but when it happens, we want to populate the cache with something useful
					// ASAP without waiting for another round trip.

Can we change the leaseHolder object to avoid this?


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

/5         NULL     6         {5}       5

# Verify data placement.

Expand on this comment a bit. Why does the output change?


@a-robinson a-robinson left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Expand on this comment a bit. Why does the output change?

It's a different table (kv vs kw).

And to answer @tschottdorf's original question in #31068 (comment), I imagine this makes the tests pass because it ensures that all the ranges/leases are in the DistSender's caches.


@bdarnell bdarnell left a comment


Reviewed 1 of 11 files at r7, 8 of 10 files at r9, 1 of 3 files at r10, 5 of 5 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 433 at r11 (raw file):

}

type leaseHolderInfo struct {

These two things don't seem to fit together very well in one struct. I think it would make more sense to make them two separate parameters to sendToReplicas (or just recompute needsLeaseHolder on the fly from the batch when needed).

tbg added 5 commits October 10, 2018 18:23
This was hiding the output if the invocation itself failed, which is
when you wanted it most.

Release note: None
Previously, if a Batch RPC came back with a RangeNotFoundError,
we would immediately stop trying to send to more replicas, evict the
range descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't
aggressively shuffle the replicas.

It turns out that there are such situations, and the
election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the
tendency of DistSender to get stuck in unfortunate configurations.

Fixes cockroachdb#30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found
to be in the process of being added.
Whenever a successful response is received from an RPC that we know has
to contact the leaseholder to succeed, update the leaseholder cache.

The immediate motivation for this is to be able to land the preceding
commits, which greatly exacerbated (as in, added a much faster failure
mode to)

```
make stress PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join
```

However, the change is one we've wanted to make for a while; our caching
and in particular the eviction of leaseholders has been deficient
essentially ever since it was first introduced.

Touches cockroachdb#31068.

Release note: None

@tbg tbg left a comment


Ok, ready for a final look. Happy with how this turned out.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/election.go, line 60 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

OK. We should probably increase the default connect timeout then. 5s is too short, since connections in the real world may need to wait for the lease timeout for the cluster to become live. (Also I see that that env var is defined as a String interpreted as integer seconds when it should probably be a Duration instead)

I'll keep this on the radar, but I'm going to leave as-is on the off-chance that these timeouts were actually the same bug, hitting very early in one of the internal queries run by ./cockroach sql.


pkg/kv/dist_sender.go, line 1391 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The other half of this is that we no longer need a case for RangeNotFoundError in sendPartialBatch at all. Is there a blocker that's preventing us from removing that now?

Good point, done.


pkg/kv/dist_sender.go, line 433 at r11 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

These two things don't seem to fit together very well in one struct. I think it would make more sense to make them two separate parameters to sendToReplicas (or just recompute needsLeaseHolder on the fly from the batch when needed).

Done.


pkg/kv/dist_sender.go, line 1371 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This wording is a bit strange. This should almost always be the case, which is why we gate it, right?

Oops, fixed.


pkg/kv/dist_sender.go, line 1386 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Glad we could remove that code.

Done.


pkg/kv/dist_sender.go, line 1396 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we change the leaseHolder object to avoid this?

Done.


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It's a different table (kv vs kw).

And to answer @tschottdorf's original question in #31068 (comment), I imagine this makes the tests pass because it ensures that all the ranges/leases are in the DistSender's caches.

Yeah, my only explanation here is that without this query, some ranges may just never have been touched and so the leaseholders aren't cached.

On the flip side, I don't know why this lease placement here is deterministic. Do these tests apply constraints or something like that? I mentioned this trick to @andreimatei and his eyebrow raised visibly, so perhaps he wants to comment.

(I'm not planning to block this PR on that, though).


@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 8 of 10 files at r14, 1 of 3 files at r15, 6 of 6 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


tbg commented Oct 10, 2018

TFTR!

bors r=nvanbenschoten,bdarnell


@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah, my only explanation here is that without this query, some ranges may just never have been touched and so the leaseholders aren't cached.

On the flip side, I don't know why this lease placement here is deterministic. Do these tests apply constraints or something like that? I mentioned this trick to @andreimatei and his eyebrow raised visibly, so perhaps he wants to comment.

(I'm not planning to block this PR on that, though).

why wouldn't the placement be deterministic? Doesn't the EXPERIMENTAL_RELOCATE do deterministic moves? And after that runs, I assume there's nothing to cause a change.


@a-robinson a-robinson left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why wouldn't the placement be deterministic? Doesn't the EXPERIMENTAL_RELOCATE do deterministic moves? And after that runs, I assume there's nothing to cause a change.

Lease placement is deterministic because that's what EXPERIMENTAL_RELOCATE guarantees. It always transfers the lease to the first listed store, as specified in the original proposal https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170314_sql_split_syntax.md#3-alter-tableindex-experimental_relocate


tbg commented Oct 10, 2018

Thanks for the clarifications, @a-robinson and @andreimatei.


@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Lease placement is deterministic because that's what EXPERIMENTAL_RELOCATE guarantees. It always transfers the lease to the first listed store, as specified in the original proposal https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170314_sql_split_syntax.md#3-alter-tableindex-experimental_relocate

right, so there's no mystery here, is there?
these show experimental_ranges queries could be replaced with select count(1) for the purposes of populating the caches, but I don't know if that'd be better or worse for the test.
I'm mildly surprised that show experimental_ranges populates any caches, but I don't know how it's implemented.


@bdarnell bdarnell left a comment


Reviewed 8 of 10 files at r14, 1 of 3 files at r15, 6 of 6 files at r16.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


@a-robinson a-robinson left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans, line 50 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

right, so there's no mystery here, is there?
these show experimental_ranges queries could be replaced with select count(1) for the purposes of populating the caches, but I don't know if that'd be better or worse for the test.
I'm mildly surprised that show experimental_ranges populates any caches, but I don't know how it's implemented.

show ranges works for populating the caches because in order to populate the lease_holder column it has to send a LeaseInfo batch request to every range.
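
A loose sketch of that side effect (all names hypothetical): resolving the lease_holder column forces one LeaseInfo-style round trip per range, and recording each answer warms the leaseholder cache as a byproduct.

```go
package main

import "fmt"

type rangeDesc struct{ rangeID int }

// leaseInfo stands in for sending a LeaseInfo request to a range and
// learning which store currently holds its lease.
func leaseInfo(r rangeDesc) int { return 1 + r.rangeID%3 }

// showRangesLeaseholders models the side effect described above: computing
// the lease_holder column populates the cache for every range it visits.
func showRangesLeaseholders(ranges []rangeDesc, cache map[int]int) {
	for _, r := range ranges {
		cache[r.rangeID] = leaseInfo(r)
	}
}

func main() {
	cache := map[int]int{}
	showRangesLeaseholders([]rangeDesc{{1}, {2}, {3}}, cache)
	fmt.Println(cache) // every visited range now has a cached leaseholder
}
```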

craig bot pushed a commit that referenced this pull request Oct 10, 2018
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs
the `synctest` cli command against a nemesis that injects random I/O
errors.

The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine,
directory.
This is used in the roachtest, but is also useful for manual use in user
deployments in which we suspect there is a failure to persist data to disk.

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down
before it processed the split at "b". Teach the test to wait for the LHS
replica on store2 to process the split before blocking Raft traffic to
it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

this makes the testdata work on mysql 8.0.2+, where the timestamp type no longer has the implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>

craig bot commented Oct 10, 2018

Build succeeded

@craig craig bot merged commit 857b9c0 into cockroachdb:master Oct 10, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Oct 23, 2018
This is no longer flaky, perhaps as a result of cockroachdb#31013.

Verified via:

    $ roachprod create tobias-stress -n 10 --gce-machine-type=n1-standard-8 --local-ssd=false
    $ make roachprod-stressrace PKG=./pkg/sql/distsqlplan TESTS=TestSpanResolverUsesCaches CLUSTER=tobias-stress
    ...
    8431 runs so far, 0 failures, over 15m15s

Release note: None
craig bot pushed a commit that referenced this pull request Oct 23, 2018
31726: storage: unskip TestSpanResolverUsesCaches r=petermattis a=tschottdorf

This is no longer flaky, perhaps as a result of #31013.

Verified via:

    $ roachprod create tobias-stress -n 10 --gce-machine-type=n1-standard-8 --local-ssd=false
    $ make roachprod-stressrace PKG=./pkg/sql/distsqlplan TESTS=TestSpanResolverUsesCaches CLUSTER=tobias-stress
    ...
    8431 runs so far, 0 failures, over 15m15s

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg deleted the roachtest/election-after-restart branch March 13, 2019 11:51