
all: fix various badness when resuscitating a node #8163

Merged

Conversation

@petermattis (Collaborator) commented Jul 31, 2016

Prior to this PR, zerosum would experience a significant hiccup when
restarting a node:

chaos: starting node 1
     35s      9425       26203    789.3       15        0        0        6              2 6 5 5
     36s      9430       26319    115.4       42        0        0        6              2 6 5 5
     37s      9430       26319      0.0       42        0        0        6              2 6 5 5
     38s      9430       26319      0.0       42        0        0        6              2 6 5 5
     39s      9430       26319      0.0       43        0        0       -1              0 0 0 0
     40s      9430       26319      0.0       43        0        0        6              2 6 5 5
     41s      9430       26319      0.0       43        0        0        6              2 6 5 5
     42s      9430       26319      0.0       43        0        0        6              2 6 5 5
     43s      9430       26319      0.0       43        0        0        6              2 6 5 5
     44s      9430       26319      0.0       43        0        0        6              2 6 5 5
     45s      9444       26528    208.7       43        0        0        6              2 6 5 5
     46s      9480       27312    783.9       43        0        0        6              2 6 5 5

And now:

chaos: starting node 1
     35s      9399       24615    873.5        6        0        0        6              4 5 5 4
     36s      9399       24663     47.8        6        0        0        6              4 5 5 4
     37s      9399       24663      0.0        6        0        0        6              4 5 5 4
     38s      9425       25124    460.9        7        0        0       -1              0 0 0 0
     39s      9463       25890    765.9        7        0        0        6              4 5 5 4

This change is Reviewable

@tamird (Contributor) commented Jul 31, 2016

I don't understand what the first commit is doing.

In the third commit: s/following/fouling/


Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


cmd/zerosum/cluster.go, line 182 [r3] (raw file):

  if err != nil {
      if log.V(1) {
          log.Infof(context.Background(), "scan failed: %s", err)

this looks unrelated, and i can't tell why it's being changed from the commit message


kv/dist_sender.go, line 205 [r2] (raw file):

                  var desc roachpb.RangeDescriptor
                  if err := value.GetProto(&desc); err != nil {
                      log.Errorf(context.Background(),

extract a local context?

also, this needs a test. ah, i see you filed an issue.



@petermattis force-pushed the pmattis/evict-first-range branch from 6f69bd7 to 1aa6ad6 on July 31, 2016 23:18
@petermattis (Collaborator, Author) commented

The first commit fixes a bit of silliness Spencer introduced in RaftSender.SendAsync, where it was retrieving a gRPC connection on every call instead of only when a queue is being created.


Review status: 2 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


cmd/zerosum/cluster.go, line 182 [r3] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this looks unrelated, and i can't tell why it's being changed from the commit message

`internal/client` operations now fail-fast instead of retrying until they hit a deadline.

kv/dist_sender.go, line 205 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extract a local context?

also, this needs a test. ah, i see you filed an issue.

Sure on the local context.

I tried valiantly to test this, but the test infrastructure in dist_sender_test.go beat me down.
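
For illustration, here is a minimal, self-contained sketch of the "extract a local context" refactor suggested above. The helpers (`logf`, `parseDescriptor`, `processValues`) are stand-ins to keep the sketch runnable; they are not CockroachDB's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// logf stands in for CockroachDB's log.Errorf, which takes a context as its
// first argument; it exists only to keep this sketch self-contained.
func logf(ctx context.Context, format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

// parseDescriptor stands in for value.GetProto(&desc) in the quoted hunk.
func parseDescriptor(value []byte) error {
	if len(value) == 0 {
		return fmt.Errorf("empty descriptor")
	}
	return nil
}

// processValues shows the suggested refactor: build the context once, in a
// local variable, instead of calling context.Background() at each log site.
func processValues(values [][]byte) {
	ctx := context.Background() // extracted local context
	for _, value := range values {
		if err := parseDescriptor(value); err != nil {
			logf(ctx, "unable to parse range descriptor: %s", err)
			continue
		}
		// ... use the parsed descriptor ...
	}
}

func main() {
	processValues([][]byte{nil, []byte("ok")})
}
```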



@tamird (Contributor) commented Jul 31, 2016

Right, but why does that matter?


Reviewed 4 of 4 files at r4, 3 of 3 files at r5.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@petermattis (Collaborator, Author) commented

It doesn't matter for correctness; it's just something I noticed. Why should we retrieve the connection when the active queue already has it? When the connection closes, the associated streams are closed, which closes the queue, and the next time RaftSender.SendAsync is called we'll re-open the connection.
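
For readers following along, here is a hypothetical, much-simplified sketch of the queue/connection lifecycle described above. The types and helpers (`message`, `conn`, `processQueue`, `dial`) are stand-ins and do not reflect CockroachDB's actual RaftSender implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// message and conn are placeholders for Raft messages and gRPC connections.
type message struct {
	payload string
}

type conn struct {
	addr string
}

func (c *conn) send(m message) error {
	fmt.Printf("-> %s: %s\n", c.addr, m.payload)
	return nil
}

// raftSender keeps one outgoing queue per remote node. A connection is dialed
// only when a new queue is created; an existing queue already holds one.
type raftSender struct {
	mu     sync.Mutex
	queues map[string]chan message
	dial   func(addr string) (*conn, error)
}

func newRaftSender(dial func(string) (*conn, error)) *raftSender {
	return &raftSender{queues: map[string]chan message{}, dial: dial}
}

// SendAsync enqueues msg for addr, creating the queue (and dialing a
// connection) only if no queue exists yet.
func (s *raftSender) SendAsync(addr string, msg message) bool {
	s.mu.Lock()
	q, ok := s.queues[addr]
	if !ok {
		c, err := s.dial(addr)
		if err != nil {
			s.mu.Unlock()
			return false
		}
		q = make(chan message, 16)
		s.queues[addr] = q
		go s.processQueue(addr, c, q)
	}
	s.mu.Unlock()

	select {
	case q <- msg:
		return true
	default:
		return false // queue full; drop and let Raft retry
	}
}

// processQueue drains the queue until a send fails (e.g. the remote went down
// and the connection was closed), then removes the queue so that the next
// SendAsync call re-dials.
func (s *raftSender) processQueue(addr string, c *conn, q chan message) {
	for m := range q {
		if err := c.send(m); err != nil {
			break
		}
	}
	s.mu.Lock()
	delete(s.queues, addr)
	s.mu.Unlock()
}

func main() {
	s := newRaftSender(func(addr string) (*conn, error) { return &conn{addr: addr}, nil })
	s.SendAsync("node2", message{payload: "MsgHeartbeat"})
	time.Sleep(100 * time.Millisecond) // give the queue goroutine a chance to run
}
```

The point of the change is visible in SendAsync: the dial happens only on the branch that creates a new queue, rather than on every call.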


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@tamird (Contributor) commented Jul 31, 2016

Ah, I see. Could you expand the comment?


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@bdarnell (Contributor) commented

:lgtm:


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



Previously we were getting the cached gRPC connection on every call to
RaftSender.SendAsync even though it was only needed when a queue was
being created. This was unnecessary waste: in the majority of cases the
cached gRPC connection was retrieved and never used. Note that when the
gRPC connection is closed (e.g. because the remote terminates), the
associated streams will be closed, which in turn causes the queues to be
removed.

We want heartbeat RPCs to be fail-fast so that we get notified of
transport failures and close the connection. Failure to do this left the
gRPC client connections permanently open and trying to reconnect to down
nodes, fouling up the circuit breaker expectations in Raft transport.

Changed client.NewSender to create a separate gRPC connection which is
not heartbeated by the rpc.Context. We don't want the heartbeat service,
and we don't want these connections closed.

Fixes cockroachdb#8130
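
As a rough illustration of the fail-fast heartbeat idea in the commit message above, here is a minimal sketch. It uses the standard gRPC health-check service as a stand-in for CockroachDB's heartbeat service; the address and interval are arbitrary, and this is not the rpc.Context implementation:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// runHeartbeat issues periodic RPCs with fail-fast semantics: if the
// transport is down, the call returns an error immediately instead of
// blocking until a deadline, so the caller can close the connection and
// surface the failure (e.g. to a circuit breaker) rather than reconnecting
// forever in the background.
func runHeartbeat(conn *grpc.ClientConn, interval time.Duration) {
	client := healthpb.NewHealthClient(conn) // stand-in for a heartbeat service
	for {
		ctx, cancel := context.WithTimeout(context.Background(), interval)
		_, err := client.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.FailFast(true))
		cancel()
		if err != nil {
			log.Printf("heartbeat failed: %v; closing connection", err)
			_ = conn.Close()
			return
		}
		time.Sleep(interval)
	}
}

func main() {
	conn, err := grpc.Dial("localhost:26257", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	runHeartbeat(conn, 3*time.Second)
}
```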
@petermattis force-pushed the pmattis/evict-first-range branch from 1aa6ad6 to a79b1b9 on August 1, 2016 00:46
@petermattis (Collaborator, Author) commented

Expanded the first commit message. I also reverted the usage of fail-fast by client.sender. The acceptance tests are currently relying on fail-fast=false.


Review status: 1 of 5 files reviewed at latest revision, all discussions resolved, some commit checks pending.



@tamird (Contributor) commented Aug 1, 2016

:lgtm:


Reviewed 1 of 1 files at r1, 5 of 5 files at r6, 2 of 2 files at r7, 2 of 2 files at r8.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



@petermattis merged commit 1e94f89 into cockroachdb:master on Aug 1, 2016
@petermattis deleted the pmattis/evict-first-range branch on August 1, 2016 01:05
tbg added a commit to tbg/cockroach that referenced this pull request Aug 3, 2016
I briefly stress tested this to make sure it isn't obviously flaky, and also
verified that it fails without cockroachdb#8163.

Fixes cockroachdb#8164.
tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2016
I briefly stress tested this to make sure it isn't obviously flaky, and also
verified that it fails without cockroachdb#8163.

Fixes cockroachdb#8164.