
all: fix various badness when resuscitating a node #8163

Merged

Conversation

@petermattis (Collaborator) commented Jul 31, 2016

Prior to this PR, zerosum would experience a significant hiccup when
restarting a node:

chaos: starting node 1
     35s      9425       26203    789.3       15        0        0        6              2 6 5 5
     36s      9430       26319    115.4       42        0        0        6              2 6 5 5
     37s      9430       26319      0.0       42        0        0        6              2 6 5 5
     38s      9430       26319      0.0       42        0        0        6              2 6 5 5
     39s      9430       26319      0.0       43        0        0       -1              0 0 0 0
     40s      9430       26319      0.0       43        0        0        6              2 6 5 5
     41s      9430       26319      0.0       43        0        0        6              2 6 5 5
     42s      9430       26319      0.0       43        0        0        6              2 6 5 5
     43s      9430       26319      0.0       43        0        0        6              2 6 5 5
     44s      9430       26319      0.0       43        0        0        6              2 6 5 5
     45s      9444       26528    208.7       43        0        0        6              2 6 5 5
     46s      9480       27312    783.9       43        0        0        6              2 6 5 5

And now:

chaos: starting node 1
     35s      9399       24615    873.5        6        0        0        6              4 5 5 4
     36s      9399       24663     47.8        6        0        0        6              4 5 5 4
     37s      9399       24663      0.0        6        0        0        6              4 5 5 4
     38s      9425       25124    460.9        7        0        0       -1              0 0 0 0
     39s      9463       25890    765.9        7        0        0        6              4 5 5 4

This change is Reviewable

@tamird (Contributor) commented Jul 31, 2016

I don't understand what the first commit is doing.

In the third commit: s/following/fouling/


Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


cmd/zerosum/cluster.go, line 182 [r3] (raw file):

  if err != nil {
      if log.V(1) {
          log.Infof(context.Background(), "scan failed: %s", err)

this looks unrelated, and i can't tell why it's being changed from the commit message


kv/dist_sender.go, line 205 [r2] (raw file):

                  var desc roachpb.RangeDescriptor
                  if err := value.GetProto(&desc); err != nil {
                      log.Errorf(context.Background(),

extract a local context?

also, this needs a test. ah, i see you filed an issue.



@petermattis force-pushed the pmattis/evict-first-range branch from 6f69bd7 to 1aa6ad6 on July 31, 2016 23:18
@petermattis (Collaborator, Author) commented

The first commit fixes a bit of silliness Spencer introduced in RaftSender.SendAsync, where it was retrieving a gRPC connection on every call instead of only when a queue is being created.


Review status: 2 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


cmd/zerosum/cluster.go, line 182 [r3] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this looks unrelated, and i can't tell why it's being changed from the commit message

`internal/client` operations now fail-fast instead of retrying until they hit a deadline.

kv/dist_sender.go, line 205 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extract a local context?

also, this needs a test. ah, i see you filed an issue.

Sure on the local context.

I tried valiantly to test this, but the test infrastructure in dist_sender_test.go beat me down.
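
For illustration, here is a minimal, self-contained sketch of the "extract a local context" refactor suggested above. The helpers (`logf`, `parseDescriptor`, `processValues`) are stand-ins to keep the sketch runnable; they are not CockroachDB's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// logf stands in for CockroachDB's log.Errorf, which takes a context as its
// first argument; it exists only to keep this sketch self-contained.
func logf(ctx context.Context, format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

// parseDescriptor stands in for value.GetProto(&desc) in the quoted hunk.
func parseDescriptor(value []byte) error {
	if len(value) == 0 {
		return fmt.Errorf("empty descriptor")
	}
	return nil
}

// processValues shows the suggested refactor: build the context once, in a
// local variable, instead of calling context.Background() at each log site.
func processValues(values [][]byte) {
	ctx := context.Background() // extracted local context
	for _, value := range values {
		if err := parseDescriptor(value); err != nil {
			logf(ctx, "unable to parse range descriptor: %s", err)
			continue
		}
		// ... use the parsed descriptor ...
	}
}

func main() {
	processValues([][]byte{nil, []byte("ok")})
}
```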



@tamird (Contributor) commented Jul 31, 2016

Right, but why does that matter?


Reviewed 4 of 4 files at r4, 3 of 3 files at r5.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@petermattis (Collaborator, Author) commented

It doesn't matter for correctness; it's just something I noticed. Why should we retrieve the connection when the active queue already has it? When the connection closes, the associated streams are closed, which closes the queue, and the next time RaftSender.SendAsync is called we'll re-open the connection.
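
For readers following along, here is a hypothetical, much-simplified sketch of the queue/connection lifecycle described above. The types and helpers (`message`, `conn`, `processQueue`, `dial`) are stand-ins and do not reflect CockroachDB's actual RaftSender implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// message and conn are placeholders for Raft messages and gRPC connections.
type message struct {
	payload string
}

type conn struct {
	addr string
}

func (c *conn) send(m message) error {
	fmt.Printf("-> %s: %s\n", c.addr, m.payload)
	return nil
}

// raftSender keeps one outgoing queue per remote node. A connection is dialed
// only when a new queue is created; an existing queue already holds one.
type raftSender struct {
	mu     sync.Mutex
	queues map[string]chan message
	dial   func(addr string) (*conn, error)
}

func newRaftSender(dial func(string) (*conn, error)) *raftSender {
	return &raftSender{queues: map[string]chan message{}, dial: dial}
}

// SendAsync enqueues msg for addr, creating the queue (and dialing a
// connection) only if no queue exists yet.
func (s *raftSender) SendAsync(addr string, msg message) bool {
	s.mu.Lock()
	q, ok := s.queues[addr]
	if !ok {
		c, err := s.dial(addr)
		if err != nil {
			s.mu.Unlock()
			return false
		}
		q = make(chan message, 16)
		s.queues[addr] = q
		go s.processQueue(addr, c, q)
	}
	s.mu.Unlock()

	select {
	case q <- msg:
		return true
	default:
		return false // queue full; drop and let Raft retry
	}
}

// processQueue drains the queue until a send fails (e.g. the remote went down
// and the connection was closed), then removes the queue so that the next
// SendAsync call re-dials.
func (s *raftSender) processQueue(addr string, c *conn, q chan message) {
	for m := range q {
		if err := c.send(m); err != nil {
			break
		}
	}
	s.mu.Lock()
	delete(s.queues, addr)
	s.mu.Unlock()
}

func main() {
	s := newRaftSender(func(addr string) (*conn, error) { return &conn{addr: addr}, nil })
	s.SendAsync("node2", message{payload: "MsgHeartbeat"})
	time.Sleep(100 * time.Millisecond) // give the queue goroutine a chance to run
}
```

The point of the change is visible in SendAsync: the dial happens only on the branch that creates a new queue, rather than on every call.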


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@tamird (Contributor) commented Jul 31, 2016

Ah, I see. Could you expand the comment?


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@bdarnell (Contributor) commented

:lgtm:


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



Previously we were getting the cached gRPC connection on every call to
RaftSender.SendAsync even though it was only needed when a queue was
being created. This was unnecessary waste: in the majority of cases the
cached gRPC connection was retrieved and never used. Note that when the
gRPC connection is closed (e.g. because the remote terminates), the
associated streams will be closed, which in turn causes the queues to be
removed.

We want heartbeat RPCs to be fail-fast so that we get notified of
transport failures and close the connection. Failure to do this left the
gRPC client connections permanently open and trying to reconnect to down
nodes, fouling up the circuit breaker expectations in Raft transport.

Changed client.NewSender to create a separate gRPC connection which is
not heartbeated by the rpc.Context. We don't want the heartbeat service,
and we don't want these connections closed.

Fixes cockroachdb#8130
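
As a rough illustration of the fail-fast heartbeat idea in the commit message above, here is a minimal sketch. It uses the standard gRPC health-check service as a stand-in for CockroachDB's heartbeat service; the address and interval are arbitrary, and this is not the rpc.Context implementation:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// runHeartbeat issues periodic RPCs with fail-fast semantics: if the
// transport is down, the call returns an error immediately instead of
// blocking until a deadline, so the caller can close the connection and
// surface the failure (e.g. to a circuit breaker) rather than reconnecting
// forever in the background.
func runHeartbeat(conn *grpc.ClientConn, interval time.Duration) {
	client := healthpb.NewHealthClient(conn) // stand-in for a heartbeat service
	for {
		ctx, cancel := context.WithTimeout(context.Background(), interval)
		_, err := client.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.FailFast(true))
		cancel()
		if err != nil {
			log.Printf("heartbeat failed: %v; closing connection", err)
			_ = conn.Close()
			return
		}
		time.Sleep(interval)
	}
}

func main() {
	conn, err := grpc.Dial("localhost:26257", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	runHeartbeat(conn, 3*time.Second)
}
```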
@petermattis force-pushed the pmattis/evict-first-range branch from 1aa6ad6 to a79b1b9 on August 1, 2016 00:46
@petermattis (Collaborator, Author) commented

Expanded the first commit message. I also reverted the usage of fail-fast by client.sender. The acceptance tests are currently relying on fail-fast=false.


Review status: 1 of 5 files reviewed at latest revision, all discussions resolved, some commit checks pending.



@tamird (Contributor) commented Aug 1, 2016

:lgtm:


Reviewed 1 of 1 files at r1, 5 of 5 files at r6, 2 of 2 files at r7, 2 of 2 files at r8.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



@petermattis merged commit 1e94f89 into cockroachdb:master on Aug 1, 2016
@petermattis deleted the pmattis/evict-first-range branch on August 1, 2016 01:05
tbg added a commit to tbg/cockroach that referenced this pull request Aug 3, 2016
I briefly stress tested this to make sure it isn't obviously flaky, and also
verified that it fails without cockroachdb#8163.

Fixes cockroachdb#8164.
tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2016
I briefly stress tested this to make sure it isn't obviously flaky, and also
verified that it fails without cockroachdb#8163.

Fixes cockroachdb#8164.