distsql: query failures with no inbound stream connection error #31361

Closed
asubiotto opened this issue Oct 15, 2018 · 64 comments
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@asubiotto
Contributor

tl;dr: the current workaround is to SET CLUSTER SETTING sql.distsql.flow_stream_timeout=<high value> to increase the wait period. This trades the errors for higher latencies in affected queries.

This issue has been observed in the wild a couple of times in 16+ node clusters with high inter-node latencies or other network slowdowns.

What happens is that the gateway sends out two or more SetupFlowRequests to nodes that will be communicating with each other, but when setting up a stream, either the sending or the receiving node does not set up its flow in time (10s by default), so the query fails with no inbound stream connection. This is a problem because the 10s timeout should never be hit in practice.

This is not due to #27746, because this seems to happen even when flows do not hit the 500 limit.

#27753 would probably fix this, but it would be good to get to the root cause of this issue, i.e. why is it taking so long for a node to set up a flow?

One theory is that we only have 16 workers on the gateway node that send out SetupFlowRequest RPCs. Maybe those are being flooded with too many requests and some are not being sent out in time. Network slowdowns could also exacerbate this.

So far it has been very difficult to reproduce this. I managed to do so with n1-standard-16 machines running tpcc only after 2h30min by adding network delay between the nodes:

roachprod run <cluster> "sudo tc qdisc add dev ens5 root netem delay 120ms"

Decreasing the flow timeout:

SET CLUSTER SETTING sql.distsql.flow_stream_timeout='2s';

And running tpcc with a stockLevel=1 mix to stress flow creation:

./workload run tpcc --warehouses=1000 --wait=false --mix=stockLevel=1
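For cleanup (assuming the same ens5 interface as above), the injected delay can be removed with:

roachprod run <cluster> "sudo tc qdisc del dev ens5 root netem"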

cc @nvanbenschoten @jordanlewis @andreimatei

@asubiotto asubiotto added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. S-1 High impact: many users impacted, serious risk of high unavailability or data loss labels Oct 15, 2018
@asubiotto asubiotto self-assigned this Oct 15, 2018
@andreimatei
Contributor

I think @nvanbenschoten claimed that he can repro this reliably on a geo-distributed cluster.

@asubiotto
Contributor Author

Yeah, I was trying to simulate that with the network latencies. I'm trying to reproduce with some extra instrumentation, but it might be worth setting up exactly what @nvanbenschoten had.

@asubiotto
Contributor Author

Nothing came up today with the instrumented binary. Will attempt to reproduce @nvanbenschoten's setup tomorrow.

One interesting thing that jumped out at me: if none of the 16 workers is available (very probable with high network latencies), the main connection goroutine sends the SetupFlowRequest to the remote node itself. Since we have no notion of priority, this can lead to a scenario in which every SetupFlowRequest is sent serially, and only once every single one has been set up do we set up the gateway flow, which is the one with inbound streams. However, even in the worst case where a flow uses all 15 other nodes and all of them are ~150ms away, that is 14 x 150ms = ~2.1s (an upper bound: the timeout only starts after the first node that has to connect back sends an RPC to the gateway, by which time we've probably already sent the RPC to the next node). That is nowhere near enough to hit the 10s timeout, so where does the remaining time go?
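To make the serialization scenario concrete, here is a rough sketch of the bounded-worker pattern with a synchronous fallback (hypothetical names, not the actual distsql_running.go code):

// Hypothetical sketch (not the real distsql_running.go code) of the bounded-worker
// pattern described above: if none of the 16 workers is free, the caller sends the
// RPC itself and the remaining requests serialize behind each ~150ms round trip.
package main

import (
	"fmt"
	"time"
)

type setupFlowRequest struct{ nodeID int }

func sendSetupFlow(req setupFlowRequest) {
	time.Sleep(150 * time.Millisecond) // simulate a cross-region RPC
	fmt.Printf("sent SetupFlowRequest to node %d\n", req.nodeID)
}

func main() {
	const numWorkers = 16
	runnerRequests := make(chan setupFlowRequest)
	for i := 0; i < numWorkers; i++ {
		go func() {
			for req := range runnerRequests {
				sendSetupFlow(req)
			}
		}()
	}

	// Hand each request to a worker if one is free; otherwise fall back to
	// sending synchronously, blocking the caller for roughly one RTT per node.
	for node := 2; node <= 16; node++ {
		select {
		case runnerRequests <- setupFlowRequest{nodeID: node}:
		default:
			sendSetupFlow(setupFlowRequest{nodeID: node})
		}
	}
	// Only after this loop does the gateway set up its own flow, which is the
	// one that registers inbound streams and starts the 10s timer.
	time.Sleep(time.Second) // crude wait so async sends finish in this toy example
}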

@asubiotto
Contributor Author

I haven't been able to set up a multi-region cluster because I'm having issues restoring tpcc-5k (due to #31172). Hopefully I will as soon as that's fixed.

@asubiotto
Contributor Author

The restore has completed and I'm running tpcc-5k on a 12-node geo-distributed cluster. No repro so far. Verified that

root@localhost:26257/defaultdb> SET CLUSTER SETTING sql.distsql.flow_stream_timeout='0s';

immediately fails the workload. @nvanbenschoten, any further guidance you can provide? I remember you said this error occurred relatively quickly.

@nvanbenschoten
Member

@asubiotto mind posting the command you're using to run the load?

@asubiotto
Contributor Author

Running ./bin/roachtest. I removed the --tolerate-errors flag:

> /Users/asubiotto/go/bin/roachprod run alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region:8 -- ./workload run tpcc --warehouses=5000 --active-warehouses=550  --ramp=3m0s --duration=15m0s --partitions=4 --partition-affinity=1 --split {pgurl:5-7}
> /Users/asubiotto/go/bin/roachprod run alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region:16 -- ./workload run tpcc --warehouses=5000 --active-warehouses=550  --ramp=3m0s --duration=15m0s --partitions=4 --partition-affinity=3 --split {pgurl:13-15}
> /Users/asubiotto/go/bin/roachprod run alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=550  --ramp=3m0s --duration=15m0s --partitions=4 --partition-affinity=0 --split {pgurl:1-3}
> /Users/asubiotto/go/bin/roachprod run alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region:12 -- ./workload run tpcc --warehouses=5000 --active-warehouses=550  --ramp=3m0s --duration=15m0s --partitions=4 --partition-affinity=2 --split {pgurl:9-11}

@nvanbenschoten
Member

I think those regions are wrong. Take a look at roachprod list alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region --details. The ordering is pretty strange. I would expect one of the invocations to look like:

/Users/asubiotto/go/bin/roachprod run alfonso-1539704375-tpccbench-nodes-12-cpu-16-multi-region:2 -- ./workload run tpcc --warehouses=5000 --active-warehouses=550  --ramp=3m0s --duration=15m0s --partitions=4 --partition-affinity=1 --split {pgurl:1,9,13}

@asubiotto
Contributor Author

Thanks for noticing that! They were definitely off. I tried again today after correcting the topology but no failure. Will try again on Monday.

@asubiotto
Contributor Author

Got a reproduction on alfonso--multi-region. Node 10 (alfonso--multi-region:10) set up a flow:

37629:I181115 16:29:48.229313 38132612 sql/distsql_running.go:236  [n10,client=10.146.0.3:52728,user=root] about to run multi node flow 45366cd9-da63-4529-a1ac-bd7eaef90add
37630:I181115 16:29:48.229332 99 sql/distsql_running.go:90  sending setup flow request for flow 45366cd9-da63-4529-a1ac-bd7eaef90add to node 6
37643:I181115 16:29:48.318352 38132612 sql/distsql_running.go:280  [n10,client=10.146.0.3:52728,user=root] setting up local portion of multinode flow 45366cd9-da63-4529-a1ac-bd7eaef90add, time since before setup 89.033889ms
38743:E181115 16:29:58.319815 50448493 sql/distsqlrun/flow_registry.go:232  [n10,client=10.146.0.3:52728,user=root] flow id:45366cd9-da63-4529-a1ac-bd7eaef90add : 1 inbound streams timed out after 10s; propagated error throughout flow

Node 6 (alfonso--multi-region:14) received the setup flow request:

I181115 16:29:48.273862 41455089 sql/distsqlrun/server.go:541  received setup flow request for flow 45366cd9-da63-4529-a1ac-bd7eaef90add

But for some reason node 10 never received a FlowStream request. This invalidates my assumption that something was wrong on the sending side (which would have explained the lack of logging on the receiving side); that part seems fine, and the issue lies on the receiving side. As stated in the original post, the number of flows never exceeds ~5, so it is not a flow scheduler backup issue.

Adding more logging in suspicious areas and will try to reproduce while I attempt to puzzle this out.

Additional information: node 10 is in asia-northeast while node 6 is in us-west. There is a round-trip latency of around 90ms.

@nvanbenschoten
Member

Great to see this reproducing!

@asubiotto
Contributor Author

asubiotto commented Nov 17, 2018

Another failure but this time with a different profile. Node 8 sets up a flow:

55459:I181116 19:49:29.380685 54653 sql/distsql_running.go:236  [n8,client=10.138.0.5:45134,user=root] about to run multi node flow 8984e4df-0ef0-41c9-856f-6b9d9f34faa8, total num flows: 3
55460:I181116 19:49:29.380705 73 sql/distsql_running.go:90  sending setup flow request for flow 8984e4df-0ef0-41c9-856f-6b9d9f34faa8 to node 9
55461:I181116 19:49:29.380713 65 sql/distsql_running.go:90  sending setup flow request for flow 8984e4df-0ef0-41c9-856f-6b9d9f34faa8 to node 1

Node 9 receives it around the same time:

23983:I181116 19:49:29.381124 83747431 sql/distsqlrun/server.go:542  received setup flow request for flow 8984e4df-0ef0-41c9-856f-6b9d9f34faa8

Node 1 receives it 19s later:

9093:I181116 19:49:48.032835 62981109 sql/distsqlrun/server.go:542  received setup flow request for flow 8984e4df-0ef0-41c9-856f-6b9d9f34faa8

The gateway is stuck waiting for the RPC to be received by node 1. In the meantime, node 9 attempts to set up a stream, waits 10 seconds, and gives up.

Let’s look at a different failure (same run). The interesting thing here is that we’re not too far apart in time. Note that the failure of the previous load generator does not affect the load in this different region.

This is the failure profile of a flow that looks very similar to what I posted in my previous comment. Node 1 attempts to set up a flow:

6363:I181116 19:49:27.126146 41536 sql/distsql_running.go:236  [n1,client=10.154.0.3:35054,user=root] about to run multi node flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8, total num flows: 2
6364:I181116 19:49:27.126175 93 sql/distsql_running.go:90  sending setup flow request for flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8 to node 12
6507:I181116 19:49:27.348255 41536 sql/distsql_running.go:280  [n1,client=10.154.0.3:35054,user=root] setting up local portion of multinode flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8, time since before setup 222.094696ms
6508:I181116 19:49:27.348326 41536 sql/distsqlrun/server.go:451  [n1,client=10.154.0.3:35054,user=root] setting up flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8
6510:I181116 19:49:27.348383 41536 sql/distsqlrun/flow_registry.go:208  registering 1 inbound streams for flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8

Node 12 gets the setup message fine and apparently sends back a FlowStream message:

50298:I181116 19:49:27.235961 74399000 sql/distsqlrun/outbox.go:420  [n12] starting outbox for flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8
50299:I181116 19:49:27.235969 74399000 sql/distsqlrun/outbox.go:221  [n12] dialing node 1 for flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8
50300:I181116 19:49:27.235981 74399000 sql/distsqlrun/outbox.go:223  [n12] DONE dialed node 1 for flow 024cb85e-3c43-4bae-955c-f6a1402a3ea8
50301:I181116 19:49:27.235987 74399000 sql/distsqlrun/outbox.go:236  [n12] outbox: calling FlowStream 024cb85e-3c43-4bae-955c-f6a1402a3ea8
50302:I181116 19:49:27.236029 74399000 sql/distsqlrun/outbox.go:242  [n12] outbox: FlowStream returned 024cb85e-3c43-4bae-955c-f6a1402a3ea8

But node 1 never seems to get this message. No error was returned from FlowStream either, otherwise node 12 would have logged it. When no error occurs, FlowStream on the receiving side logs something like:

60614:I181116 20:20:05.994694 66543202 sql/distsqlrun/server.go:575  [n8] connecting inbound stream a24d5f76/2

The weird part is that nothing of the sort was logged, implying that the initial Send on that FlowStream blocked. Node 1 seems to have received other FlowStream messages and successfully set up the corresponding streams in the same time frame. It also seems to have successfully set up multi-region flows from nodes in the same region as node 12, e.g.:

I181116 19:49:27.785167 81 sql/distsql_running.go:90  sending setup flow request for flow 59c90652-516e-4879-802b-78f715276240 to node 11
 7808 I181116 19:49:30.440835 62974363 sql/distsqlrun/server.go:583  [n1] connected inbound stream 59c90652/2

Albeit slowly.

Both of these examples seem to point to something fishy going on at the gRPC/network level. @nvanbenschoten, @andreimatei, do you have any suggestions as to how to proceed with the debugging, or more insight into this part of the stack? I'm looking at GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info, but to be honest I'm having a hard time finding the output.

Put up my branch at https://github.com/asubiotto/cockroach/tree/timeout

@andreimatei
Contributor

My only suggestion regarding gRPC itself is what I was telling you about the other day - to try digging into the connection window management code in grpc and see if new HTTP2 streams are being starved for connection window. Although I guess if new streams couldn't be established, then the cluster as a whole would have bigger issues...
What I would probably do is create another RPC service and have nodes blast each other with requests continuously and collect statistics on those requests (using the same RPCContext/conn as DistSQL and everything else uses), and see if the results confirm networking as a problem.
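A minimal sketch of that idea (the Ping service and types are hypothetical; nothing like this exists today, and a real version would ride on the same RPCContext/connections):

// Hypothetical "blast pings and collect stats" sketch. PingRequest, PingResponse
// and PingerClient stand in for a generated gRPC service that does not exist in
// CockroachDB.
package pingblast

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

type PingRequest struct{}
type PingResponse struct{}

type PingerClient interface {
	Ping(ctx context.Context, in *PingRequest, opts ...grpc.CallOption) (*PingResponse, error)
}

// Blast sends a Ping every interval and logs the observed latency so that pure
// networking problems can be told apart from DistSQL-level problems.
func Blast(ctx context.Context, c PingerClient, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			start := time.Now()
			if _, err := c.Ping(ctx, &PingRequest{}); err != nil {
				log.Printf("ping error after %s: %v", time.Since(start), err)
				continue
			}
			log.Printf("ping took %s", time.Since(start))
		}
	}
}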

@asubiotto
Contributor Author

asubiotto commented Dec 19, 2018

Found one "bug": the outbox doesn't retry dialing a connection when a breaker is tripped. Retrying in distsql is something we still need to flesh out, but I have some temporary code to handle this case until we can make a more general plan. Anyway, I think this is more a symptom than a cause.

I've been able to reproduce the cases above with a bit more instrumentation and it does seem like we're hitting some issue with flow control. Notably, one load generator fails with:

Error: error in payment: pq: result is ambiguous (error=rpc error: code = Canceled desc = grpc: the client connection is closing [exhausted])

This tells us that the connection closed with a ResourceExhausted error code, which gRPC seems to return when a flow control limit is hit while receiving data:
https://github.com/grpc/grpc-go/blob/1b41b79fd11cd00e51be5b0cb24cc2a518a1ed94/internal/transport/http2_client.go#L932

There seem to be a couple of limits. I've also been looking at a blocking limit called writeQuota, which causes the stream to block on the sending side:
https://github.com/grpc/grpc-go/blob/1b41b79fd11cd00e51be5b0cb24cc2a518a1ed94/internal/transport/http2_client.go#L835
Unfortunately, due to human error, the build I reproduced the issue with didn't include the instrumentation I added to grpc other than request tracing, but I guess this is enough to gather that there is an issue with flow control.

Knowing there is an issue with flow control, I initially thought that we might not be closing streams properly in general. As this commit states (grpc/grpc-go@365770f), calling CloseSend on a stream is not sufficient to actually close it, and we seem to be running into this issue, e.g.:

if err := stream.CloseSend(); err != nil {
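For context, the pattern that commit implies is roughly the following generic sketch (not the actual outbox code): after CloseSend, the client still has to drain the stream, or cancel its context, before the underlying HTTP/2 stream is released.

// Generic sketch, not the actual outbox code: CloseSend alone does not release the
// underlying HTTP/2 stream; the client must also Recv until an error (io.EOF on a
// clean shutdown) or cancel the RPC's context.
package streamutil

import (
	"context"
	"io"
)

// recvCloser stands in for a generated gRPC client stream (the real Recv returns a
// concrete response type).
type recvCloser interface {
	CloseSend() error
	Recv() (interface{}, error)
}

func drainStream(cancel context.CancelFunc, stream recvCloser) error {
	defer cancel() // guarantees teardown even if draining bails out early
	if err := stream.CloseSend(); err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}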

So I added more functionality to our current RPC StatsHandler to count the number of active RPCs through stats.Begin and stats.End. The number of active RPCs stays pretty close to 0 (second over second), so this seems less likely to be the issue, although it would be interesting to get some insight into the data that is clogging up gRPC (and whether it is doing so unexpectedly). More reading related to closing streams: grpc/grpc-go#2015.
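For reference, the counting part is roughly the following (a minimal sketch of a grpc stats.Handler, not the actual CockroachDB StatsHandler, which tracks more than this):

// Minimal sketch of a grpc stats.Handler that keeps a live count of in-flight RPCs
// via stats.Begin/stats.End events.
package rpcstats

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/stats"
)

type activeRPCs struct {
	count int64
}

var _ stats.Handler = (*activeRPCs)(nil)

func (a *activeRPCs) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context { return ctx }

func (a *activeRPCs) HandleRPC(ctx context.Context, s stats.RPCStats) {
	switch s.(type) {
	case *stats.Begin:
		atomic.AddInt64(&a.count, 1)
	case *stats.End:
		atomic.AddInt64(&a.count, -1)
	}
}

func (a *activeRPCs) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (a *activeRPCs) HandleConn(ctx context.Context, _ stats.ConnStats)                 {}

// Active returns the current number of in-flight RPCs.
func (a *activeRPCs) Active() int64 { return atomic.LoadInt64(&a.count) }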

Having RPC tracing on in this cluster did seem to show some weird stuff. Slowly but surely, we seemed to be seeing longer and longer RPC durations, most notably in MultiRaft.{RaftSnapshot, RaftMessageBatch} (more so RaftMessageBatch towards the actual occurrence of the FlowStream issue):
[screenshots: RPC duration traces, 2018-12-18 and 2018-12-19]

@nvanbenschoten, @bdarnell: are the stream.Sends not supposed to last this long, or is this a red herring?

This is pretty much where I'm at now; there is no other new information. The next step is to find a way to keep track of how much data is read from/written to a given stream and see whether I can associate that with a gRPC method name.

@tbg
Member

tbg commented Dec 19, 2018 via email

@asubiotto
Contributor Author

But I believe this traces each Send call on the stream itself, so we're seeing some stream.Send calls take upwards of a minute.

@tbg
Member

tbg commented Dec 19, 2018

Oh, that would be unexpected. The items sent on that stream are batches of Raft requests. In regular operation you don't expect very large blobs there, though during import/restore I don't doubt it could happen because we submit ~22mb Raft commands.
This dashboard looks interesting. Could you write a few lines on where it is and how to enable it? BTW, I seem to remember that we disabled gRPC tracing at some point because it would hold on to data from the requests indefinitely, and that data could be snapshots and so it was effectively mem-leaking us into OOMs. Sure hope that isn't the case any more.

@asubiotto
Contributor Author

asubiotto commented Dec 19, 2018

Yes, thanks for reminding me! It's a matter of setting this line to true (although I believe true is the default):

grpc.EnableTracing = false

The comment mentions the issue about the high memory usage: grpc/grpc-go#695

The RPC requests then show up in /debug/requests.

@asubiotto
Contributor Author

asubiotto commented Dec 19, 2018

Something that jumps out to me (although it's just speculation) is that in:

for {
	resp, err := stream.Recv()
	if err != nil {
		return err
	}
	atomic.AddInt64(&stats.clientRecv, 1)
	handler, ok := t.getHandler(resp.ToReplica.StoreID)
	if !ok {
		log.Warningf(ctx, "no handler found for store %s in response %s",
			resp.ToReplica.StoreID, resp)
		continue
	}
	if err := handler.HandleRaftResponse(ctx, resp); err != nil {
		return err
	}
}

We seem to be calling
if err := handler.HandleRaftResponse(ctx, resp); err != nil {

which looks like it has the potential to take a fair amount of time. This is time not spent in Recv, so couldn't that result in a backup if messages are being sent more quickly than they are being processed?

@tbg
Member

tbg commented Dec 19, 2018

HandleRaftResponse is O(1) (unless things are really broken).

cockroach/pkg/storage/store.go

Lines 3580 to 3656 in 3678d85

func (s *Store) HandleRaftResponse(ctx context.Context, resp *RaftMessageResponse) error {
	ctx = s.AnnotateCtx(ctx)
	repl, replErr := s.GetReplica(resp.RangeID)
	if replErr == nil {
		// Best-effort context annotation of replica.
		ctx = repl.AnnotateCtx(ctx)
	}
	switch val := resp.Union.GetValue().(type) {
	case *roachpb.Error:
		switch tErr := val.GetDetail().(type) {
		case *roachpb.ReplicaTooOldError:
			if replErr != nil {
				// RangeNotFoundErrors are expected here; nothing else is.
				if _, ok := replErr.(*roachpb.RangeNotFoundError); !ok {
					log.Error(ctx, replErr)
				}
				return nil
			}
			repl.mu.Lock()
			// If the replica ID in the error does not match then we know
			// that the replica has been removed and re-added quickly. In
			// that case, we don't want to add it to the replicaGCQueue.
			if tErr.ReplicaID != repl.mu.replicaID {
				repl.mu.Unlock()
				log.Infof(ctx, "replica too old response with old replica ID: %s", tErr.ReplicaID)
				return nil
			}
			// If the replica ID in the error does match, we know the replica
			// will be removed and we can cancel any pending commands. This is
			// sometimes necessary to unblock PushTxn operations that are
			// necessary for the replica GC to succeed.
			repl.cancelPendingCommandsLocked()
			// The replica will be garbage collected soon (we are sure
			// since our replicaID is definitely too old), but in the meantime we
			// already want to bounce all traffic from it. Note that the replica
			// could be re-added with a higher replicaID, in which this error is
			// cleared in setReplicaIDRaftMuLockedMuLocked.
			if repl.mu.destroyStatus.IsAlive() {
				storeID := repl.store.StoreID()
				repl.mu.destroyStatus.Set(roachpb.NewRangeNotFoundError(repl.RangeID, storeID), destroyReasonRemovalPending)
			}
			repl.mu.Unlock()
			if _, err := s.replicaGCQueue.Add(repl, replicaGCPriorityRemoved); err != nil {
				log.Errorf(ctx, "unable to add to replica GC queue: %s", err)
			}
		case *roachpb.RaftGroupDeletedError:
			if replErr != nil {
				// RangeNotFoundErrors are expected here; nothing else is.
				if _, ok := replErr.(*roachpb.RangeNotFoundError); !ok {
					log.Error(ctx, replErr)
				}
				return nil
			}
			// If the replica is talking to a replica that's been deleted, it must be
			// out of date. While this may just mean it's slightly behind, it can
			// also mean that it is so far behind it no longer knows where any of the
			// other replicas are (#23994). Add it to the replica GC queue to do a
			// proper check.
			if _, err := s.replicaGCQueue.Add(repl, replicaGCPriorityDefault); err != nil {
				log.Errorf(ctx, "unable to add to replica GC queue: %s", err)
			}
		case *roachpb.StoreNotFoundError:
			log.Warningf(ctx, "raft error: node %d claims to not contain store %d for replica %s: %s",
				resp.FromReplica.NodeID, resp.FromReplica.StoreID, resp.FromReplica, val)
			return val.GetDetail() // close Raft connection
		default:
			log.Warningf(ctx, "got error from r%d, replica %s: %s",
				resp.RangeID, resp.FromReplica, val)
		}
	default:
		log.Warningf(ctx, "got unknown raft response type %T from replica %s: %s", val, resp.FromReplica, val)
	}
	return nil
}

@tbg
Member

tbg commented Dec 19, 2018

The more interesting direction is HandleRaftRequest

if pErr := t.handleRaftRequest(ctx, req, stream); pErr != nil {
but that also makes sure to add to a queue, and it will start dropping messages if that queue is full, so it shouldn't ever backpressure.
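For illustration, that non-blocking hand-off looks roughly like the following generic sketch (not the actual raft_transport.go scheduler):

// Generic sketch: incoming requests are handed to a bounded queue drained by a
// separate worker; if the queue is full the message is dropped rather than blocking
// the gRPC Recv loop, so a slow consumer cannot backpressure the HTTP/2 stream.
package raftqueue

import "log"

type request struct{ payload []byte }

type queue struct {
	ch chan request
}

func newQueue(capacity int) *queue {
	q := &queue{ch: make(chan request, capacity)}
	go q.process()
	return q
}

func (q *queue) enqueueOrDrop(r request) {
	select {
	case q.ch <- r:
	default:
		log.Printf("raft request queue full; dropping message")
	}
}

func (q *queue) process() {
	for r := range q.ch {
		_ = r // handle the request; this can be slow without stalling Recv
	}
}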

@petermattis
Collaborator

Which is telling us that the connection closed with a ResourceExhausted error code, which seems to be returned by grpc when a flow control limit is hit when receiving data.

My reading of this is that ResourceExhausted should never occur unless there is a bug in gRPC. The sender shouldn't send more data than the receiver has room for in its flow control buffer.

@asubiotto
Contributor Author

@petermattis that's a good point...

@tbg I added some log lines for > 10ms handle raft requests:

start := time.Now()
if pErr := t.handleRaftRequest(ctx, req, stream); pErr != nil {
	atomic.AddInt64(&stats.serverSent, 1)
	log.Infof(ctx, "handle raft request error time since before handle raft request is %s", time.Since(start))
	if err := stream.Send(newRaftMessageResponse(req, pErr)); err != nil {
		return err
	}
}
if time.Since(start) > 10*time.Millisecond {
	log.Infof(ctx, "time since before handle raft request is %s", time.Since(start))
}

And it seems like it's printing out pretty consistently:

logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38657:I181219 14:22:48.397741 1046 storage/raft_transport.go:350  time since before handle raft request is 11.408513ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38658:I181219 14:22:48.397762 1079 storage/raft_transport.go:350  time since before handle raft request is 11.408201ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38659:I181219 14:22:48.397807 1282 storage/raft_transport.go:350  time since before handle raft request is 11.323127ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38660:I181219 14:22:48.397831 931 storage/raft_transport.go:350  time since before handle raft request is 11.34724ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38661:I181219 14:22:48.397851 830 storage/raft_transport.go:350  time since before handle raft request is 11.343122ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38662:I181219 14:22:48.397919 1255 storage/raft_transport.go:350  time since before handle raft request is 11.1765ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38663:I181219 14:22:48.397947 828 storage/raft_transport.go:350  time since before handle raft request is 10.915496ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:38664:I181219 14:22:48.397972 903 storage/raft_transport.go:350  time since before handle raft request is 10.857759ms
logs/cockroach.alfonso-multi-region-0015.asubiotto.2018-12-19T14_21_39Z.024261.log:42107:I181219 14:22:55.399206 1079 storage/raft_transport.go:350  time since before handle raft request is 10.90981ms

The cluster is running fine for now, though. This definitely surprises me, as that's a really long time for handleRaftRequest to take, right?

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 15, 2019
Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging cockroachdb#31361.

Release note: None
@ajwerner
Contributor

I believe the circuit breaker is something of a red herring. We were tripping it in cases where we should not have been because we were not first checking for context cancellation (see #34026).

The description of this issue states that #27746 is not the cause, but I question how that conclusion was reached. In my reproduction, which follows @asubiotto's branch based on Nathan's configuration, I was able to reliably reproduce the failure, but with vmodule logging on I observed that we are indeed hitting the queue limit (even though the number of active flows according to the admin UI never seemed to reach 500). How was it determined that we were not hitting the limit?

I've seen a number of

ajwerner-1547481579-tpccbench-nodes-12-cpu-4-multi-region-0005> I190115 17:16:45.421604 339 sql/distsqlrun/flow_scheduler.go:94  [n10] flow scheduler running flow 6f478e39-6c41-4041-91f2-52101fb9cfc7, currently running 499

My immediate plan is to add several new metrics to give us visibility into how many flows are running, how many flows are queued, and what the distribution of wait times for queued flows looks like. Such a change should be safe to backport and can inform users as to when they should increase sql.distsql.max_running_flows.

Longer term, we should come up with more graceful handling of overload in situations where flows are queued for too long before starting. Ideally we'd have some mechanism to reject flows from being queued at all, which would propagate to the gateway and lead to flow teardown everywhere.
My supposition is that once the circuit breaker issues have been resolved, this issue boils down to #27746.
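For illustration, the new metrics could look roughly like this (a sketch assuming CockroachDB's util/metric package; the names and exact helpers are illustrative):

// Hypothetical metric definitions for the flow scheduler.
package flowmetrics

import "github.com/cockroachdb/cockroach/pkg/util/metric"

var (
	metaFlowsActive = metric.Metadata{
		Name: "sql.distsql.flows.active",
		Help: "Number of distributed SQL flows currently executing"}
	metaFlowsQueued = metric.Metadata{
		Name: "sql.distsql.flows.queued",
		Help: "Number of distributed SQL flows waiting to execute"}
	metaFlowsQueueWait = metric.Metadata{
		Name: "sql.distsql.flows.queue_wait",
		Help: "Time spent by distributed SQL flows waiting in the queue"}
)

// flowSchedulerMetrics would hang off the flow scheduler and be registered with
// the node's metric registry.
type flowSchedulerMetrics struct {
	FlowsActive    *metric.Gauge
	FlowsQueued    *metric.Gauge
	FlowsQueueWait *metric.Histogram
}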

@tbg
Member

tbg commented Jan 15, 2019

Thanks for looking into this. Seems promising. Just to double check, have you tried without vmodule but with targeted logging? I'm asking because vmodule can make the logging path slow enough to change global behavior. I hope this isn't relevant here, though.

@ajwerner
Contributor

I'll reproduce with no vmodule logging and with these new metrics. One concern I have is that my "reproduction" was #27746 while @asubiotto was seeing something different, but I'm not sure I'd be able to tell.

craig bot pushed a commit that referenced this issue Jan 15, 2019
34026: rpc/nodedialer: check context for cancellation before dialing r=ajwerner a=ajwerner

Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging #31361.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@asubiotto
Contributor Author

Thanks for looking into this @ajwerner. I initially came to the conclusion that this wasn't #27746 only because of the number of flows shown in the admin UI. Since then, however, I added logging around the gRPC calls that shows unexpectedly long times between the send and the receipt of a SetupFlow request, as well as mysteriously dropped FlowStream RPCs (#31361 (comment)).

If the issue were a flow scheduler backup, we would see the remote nodes receive the SetupFlow request normally but then never send a corresponding FlowStream RPC. Instead, I observed during reproductions both that SetupFlow RPCs took an egregious amount of time to arrive (longer than the timeout), and that in failure cases the remote node did receive the SetupFlow RPC and attempted to send a FlowStream RPC that was never received by the gateway node. So I came to the conclusion that the cause must lie in RPC/networking.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 23, 2019
Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging cockroachdb#31361.

Release note: None
@asubiotto
Contributor Author

asubiotto commented Jan 24, 2019

Attempted to reproduce this yesterday with @ajwerner on an updated master (with the circuit breaker changes). Although the load generator did not observe a no inbound stream connection error, we still got a context canceled error related to transport issues that seems closely related (an outbox error was also observed in the logs due to a closing transport, indicating a possible failure to send a FlowStream message).

gRPC flow control seems to be normal (apart from streams showing long times to actually read data, although the amount of pending data is not large enough to warrant flow control issues).

31076 W190123 11:01:00.230141 263831061 storage/intent_resolver.go:756  [n2,s2,r31883/4:/Table/58/1/16{00/0-10/0}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure: result is ambiguous (context canceled)
31077 I190123 11:01:00.230221 89 sql/distsql_running.go:79  sending setup flow request for flow ff16ac21-fca0-4112-91b8-b0b5d0d42b32 to node 12 took 7.25217303s
31078 I190123 11:01:00.231552 86 sql/distsql_running.go:79  sending setup flow request for flow 717dedaf-1942-4d92-8f75-1856d0001f90 to node 12 took 7.298748902s
31079 W190123 11:01:00.260277 381 storage/raft_transport.go:599  [n2] while processing outgoing Raft queue to node 12: EOF:
31080 W190123 11:01:00.260743 263835008 storage/raft_transport.go:594  [n2] creating batch client for node 12 failed: rpc error: code = Unavailable desc = transport is closing
31081 W190123 11:01:00.260772 564 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {alfonso-multi-region-0011:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
31082 I190123 11:01:00.261311 263835111 vendor/google.golang.org/grpc/picker_wrapper.go:218  blockingPicker: the picked transport is not ready, loop back to repick
31083 I190123 11:01:00.262314 263828006 vendor/google.golang.org/grpc/picker_wrapper.go:218  blockingPicker: the picked transport is not ready, loop back to repick
31084 I190123 11:01:00.262448 563 vendor/google.golang.org/grpc/pickfirst.go:71  pickfirstBalancer: HandleSubConnStateChange: 0xc0059d5c40, TRANSIENT_FAILURE
31085 I190123 11:01:00.262461 563 vendor/google.golang.org/grpc/pickfirst.go:71  pickfirstBalancer: HandleSubConnStateChange: 0xc0059d5c40, CONNECTING
31086 I190123 11:01:00.262469 563 vendor/google.golang.org/grpc/pickfirst.go:71  pickfirstBalancer: HandleSubConnStateChange: 0xc0059d5c40, TRANSIENT_FAILURE
31087 W190123 11:01:00.262539 564 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
31088 W190123 11:01:00.262542 263835111 storage/raft_transport.go:594  [n2] creating batch client for node 12 failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing cannot reuse client connection"

Notice also the elevated SetupFlow latencies.

The outbox error occurred because of a closing transport:

I190123 11:00:59.237056 263827966 sql/distsqlrun/outbox.go:441  [n2] 29075753-0097-4ae2-b44c-fb68e3143760 err from main loop rpc error: code = Unavailable desc = transport is closing

This happens before the above log messages. There is no mention of a canceled context, though. Are transport issues causing the canceled context, or is a canceled context causing transport issues? Reproducing now with instrumentation in context.go to figure out where the cancellation is coming from.

The theory that a transport issue was causing context cancellation was briefly explored, focusing on rpc heartbeat failures canceling goCtx:

cancel()

Although we've moved away from that as we would probably be seeing deadline exceeded errors, and I'm not sure this cancellation would propagate anywhere.

We also briefly looked at whether the TCP connection was saturated. @ajwerner suggested getting some throughput numbers, and in this new run, between the offending nodes, we reach 7.9MB/s for a 159MB file after 20s (note that the dynamic window size slowly increases this number over time). The packet loss measured through ping was 0. Maybe these numbers differ at the actual time of reproduction, but I'm wondering how to pinpoint that. For comparison, the admin UI shows a steady send rate of 6MB/s from n2 to (I assume) all nodes. Here's the output of ss -ti for the current run:

ESTAB      0      71300                10.164.0.2:60134                             10.146.0.26:26257
	 cubic wscale:7,7 rto:428 rtt:226.948/0.507 ato:40 mss:1408 cwnd:528 ssthresh:521 bytes_acked:9303611892 bytes_received:3122731772 segs_out:47945028 segs_in:27899274 send 26.2Mbps lastrcv:56 pacing_rate 31.4Mbps unacked:328 retrans:0/7 reordering:6 rcv_rtt:227.765 rcv_space:113583

I wonder if anything weird jumps out to anyone here.
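For future runs, a quick way to get a raw TCP throughput baseline between two nodes (assuming iperf3 is installed on both; 10.146.0.26 is the peer address from the ss output above) would be to run iperf3 -s on the receiving node and, on the sending node:

iperf3 -c 10.146.0.26 -t 20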

One interesting thing that we noticed was a drop in heartbeat latencies for a period, going back up to previous levels at exactly the same time as the reproduction:
[screenshot: heartbeat latency graph, 2019-01-23]

Unfortunately the logs had rolled over (due to extra instrumentation) so we weren't able to observe anything strange in the logs.

This is pretty weird; we're wondering what's going on here and how it would affect what we're seeing (cc @petermattis).

In my current run, I'm observing a similar (but inverse) effect with no errors yet (RTT in the elevated stage is 370ms between concerned nodes):
[screenshot: heartbeat latency graph, 2019-01-24]

Hopefully I'll get more insight into the context cancelation in this next reproduction, but I'm posting this in case any of this information stands out to anyone.

cc @ajwerner in case you want to add anything

@asubiotto
Contributor Author

asubiotto commented Jan 26, 2019

So it seems that the context cancellation occurs because gRPC keepalives are timing out, which results in a closed connection. Adding a stack trace when a connection closes shows the following:

goroutine 90 [running]:
runtime/debug.Stack(0x3, 0x1, 0x0)
    /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
runtime/debug.PrintStack()
    /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.(*http2Client).Close(0xc00a398900, 0xc00a657e90, 0x2)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/http2_client.go:701 +0x34
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.(*http2Client).keepalive(0xc00a398900)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/http2_client.go:1224 +0x307
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.newHTTP2Client
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/http2_client.go:266 +0x11f9

The node that closes the connection doesn't seem to exhibit slow requests before this; however, the receiving node seems to have been experiencing slow node liveness heartbeats (note that the node that closes the connection is the leaseholder) for 7s before that:

alfonso-multi-region-0009> W190125 01:01:45.084215 579 storage/node_liveness.go:522  [n10,hb] slow heartbeat took 4.5s
alfonso-multi-region-0009> I190125 01:01:45.084308 579 storage/node_liveness.go:523  [n10,hb]  9223372036854.777ms  9223372036854.777ms    operation:liveness heartbeat hb: node:10 sb:1
 9223372036854.777ms      0.044ms        operation:txn coordinator send hb: node:10 sb:1 txnID:7069192f-eeff-4326-bb94-f542c4b14fa7
 9223372036854.777ms      0.028ms            operation:dist sender send hb: node:10 sb:1 txn:7069192f txnID:7069192f-eeff-4326-bb94-f542c4b14fa7
 9223372036854.777ms      0.028ms            event:[n10,hb,txn=7069192f] querying next range at /System/NodeLiveness/10
 9223372036854.777ms      0.047ms            event:[n10,hb,txn=7069192f] r3: sending batch 1 CPut, 1 EndTxn to (n3,s3):35
 9223372036854.777ms      0.008ms            event:[n10,hb,txn=7069192f] sending request to alfonso-multi-region-0007:26257
 9223372036854.777ms      0.010ms                operation:/cockroach.roachpb.Internal/Batch component:gRPC error:true response_class:5xx response_code:DeadlineExceeded sb:1 span.kind:client txnID:7069192f-eeff-4326-bb94-f542c4b14fa7
 9223372036854.777ms   4500.029ms                event:error message:rpc error: code = DeadlineExceeded desc = context deadline exceeded
 9223372036854.777ms      0.018ms            event:[n10,hb,txn=7069192f] RPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
 9223372036854.777ms      0.005ms            event:[n10,hb,txn=7069192f] context done during DistSender.Send: context deadline exceeded
 9223372036854.777ms      0.005ms            event:[n10,hb,txn=7069192f] result is ambiguous (context done during DistSender.Send: context deadline exceeded)
 9223372036854.777ms      0.023ms            event:[n10,hb,txn=7069192f] reply error [txn: 7069192f], ConditionalPut [/System/NodeLiveness/10,/Min), EndTransaction(commit:true) [/System/NodeLiveness/10]: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
 9223372036854.777ms      0.037ms        event:[n10,hb,txn=7069192f] failed batch: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
 9223372036854.777ms      0.014ms    event:[n10,hb] rolling back transaction
 9223372036854.777ms      0.027ms    event:storage/node_liveness.go:781 [n10,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
 9223372036854.777ms      0.019ms    event:storage/node_liveness.go:522 [n10,hb] slow heartbeat took 4.5s

The fact that pings are timing out definitely rules out gRPC flow control, as pings are not subject to flow control. What could slow them down on the sending side at this level is possibly an overloaded controlBuffer (the queue that messages are put on before being dequeued by the loopyWriter and written to the framer) or some other lower-level transport issue. However, it seems more likely that the node receiving the ping is for some reason taking a long time to write out the response, as evidenced by slow requests on its side and not on the sending side. Any slowness there might also be due to an overfull controlBuffer or some other slow network issue. I will add instrumentation to count how many items are on the control buffer, print blocking, mutex, and 2s CPU profiles when this happens on the client side, and enable periodic (3s) CPU profiles with extra blocking profiles to hopefully catch anything interesting on the server side.
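For reference, the keepalive machinery in the stack trace above is configured at dial time; a minimal sketch of the client side (illustrative values, not CockroachDB's actual dial options):

// The client transport sends an HTTP/2 PING after Time of inactivity and closes
// the connection, canceling every stream on it, if no ack arrives within Timeout.
package rpcdial

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dial(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                time.Minute,      // ping after this much idle time
			Timeout:             10 * time.Second, // close the conn if the ping is not acked in time
			PermitWithoutStream: true,             // keep pinging even with no active RPCs
		}),
	)
}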

Here are the merged (client n3 and server n10) logs of the failure in case anyone is curious and wants to take a look (and maybe sees something I didn't):

merged.log

Something interesting that I don't know what to make of: n10 still sees a bunch of slow heartbeats after the connection closes on the client side, which slowly recover back to <1s (does this recovery pattern tell us anything?).

@tbg
Member

tbg commented Mar 20, 2019

What's the status here? We're seeing this in tpcc/nodes=3/w=max (on AWS). That test blocks the release.

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  14m53s        0           37.1           48.0   6710.9   9126.8   9663.7   9663.7 delivery
  14m53s        0          593.7          473.2   7784.6  11274.3  13421.8  19327.4 newOrder
  14m53s        0           51.1           48.2   2281.7   3489.7   3758.1   4160.7 orderStatus
  14m53s        0          122.3          480.1   5368.7  12348.0  13958.6  13958.6 payment
  14m53s        0           68.2           48.3   2684.4   7784.6   8321.5  12884.9 stockLevel
  14m54s        0           45.9           48.0   5368.7   9663.7  10200.5  10200.5 delivery
  14m54s        0          923.5          473.7   5637.1  10737.4  12884.9  20401.1 newOrder
  14m54s        0           35.9           48.2   2013.3   2415.9   2550.1   2550.1 orderStatus
  14m54s        0          553.1          480.2   4831.8  11811.2  14495.5  19327.4 payment
  14m54s        0           31.9           48.3   2818.6   6174.0   7247.8   7247.8 stockLevel
  14m55s        0           46.2           48.0   4831.8  11274.3  14495.5  14495.5 delivery
  14m55s        0          335.3          473.6   6710.9  12348.0  16106.1  21474.8 newOrder
  14m55s        0           37.1           48.2   1677.7   2550.1   4831.8   4831.8 orderStatus
  14m55s        0          214.8          479.9   6174.0  12348.0  13958.6  16106.1 payment
  14m55s        0           27.1           48.2   2281.7   5905.6   5905.6   5905.6 stockLevel
  14m56s        0           59.0           48.0   4563.4  13958.6  14495.5  14495.5 delivery
  14m56s        0          357.2          473.5   6442.5  11274.3  13958.6  15032.4 newOrder
  14m56s        0           31.0           48.1   1208.0   1543.5   1610.6   1610.6 orderStatus
  14m56s        0          324.2          479.8   7247.8  13958.6  17179.9  18253.6 payment
  14m56s        0           26.0           48.2   1409.3   5905.6   6442.5   6442.5 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  14m57s        0            6.0           48.0   6442.5  15569.3  15569.3  15569.3 delivery
  14m57s        0            1.0          472.9  12884.9  12884.9  12884.9  12884.9 newOrder
  14m57s        0            0.0           48.1      0.0      0.0      0.0      0.0 orderStatus
  14m57s        0            0.0          479.2      0.0      0.0      0.0      0.0 payment
  14m57s        0            0.0           48.2      0.0      0.0      0.0      0.0 stockLevel
  14m58s        0            0.0           47.9      0.0      0.0      0.0      0.0 delivery
  14m58s        0            0.0          472.4      0.0      0.0      0.0      0.0 newOrder
  14m58s        0            0.0           48.0      0.0      0.0      0.0      0.0 orderStatus
  14m58s        0            0.0          478.7      0.0      0.0      0.0      0.0 payment
  14m58s        0            0.0           48.1      0.0      0.0      0.0      0.0 stockLevel
  14m59s        0            0.0           47.9      0.0      0.0      0.0      0.0 delivery
  14m59s        0            0.0          471.9      0.0      0.0      0.0      0.0 newOrder
  14m59s        0            0.0           48.0      0.0      0.0      0.0      0.0 orderStatus
  14m59s        0            0.0          478.2      0.0      0.0      0.0      0.0 payment
  14m59s        0            0.0           48.1      0.0      0.0      0.0      0.0 stockLevel
   15m0s        0            0.0           47.8      0.0      0.0      0.0      0.0 delivery
   15m0s        0            0.0          471.4      0.0      0.0      0.0      0.0 newOrder
   15m0s        0            0.0           47.9      0.0      0.0      0.0      0.0 orderStatus
   15m0s        0            0.0          477.6      0.0      0.0      0.0      0.0 payment
   15m0s        0            0.0           48.0      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   15m1s        0            0.0           47.8      0.0      0.0      0.0      0.0 delivery
   15m1s        0            0.0          470.8      0.0      0.0      0.0      0.0 newOrder
   15m1s        0            0.0           47.9      0.0      0.0      0.0      0.0 orderStatus
   15m1s        0            0.0          477.1      0.0      0.0      0.0      0.0 payment
   15m1s        0            0.0           47.9      0.0      0.0      0.0      0.0 stockLevel
   15m2s        0            0.0           47.7      0.0      0.0      0.0      0.0 delivery
   15m2s        0            0.0          470.3      0.0      0.0      0.0      0.0 newOrder
   15m2s        0            0.0           47.8      0.0      0.0      0.0      0.0 orderStatus
   15m2s        0            0.0          476.6      0.0      0.0      0.0      0.0 payment
   15m2s        0            0.0           47.9      0.0      0.0      0.0      0.0 stockLevel
   15m3s        0            0.0           47.7      0.0      0.0      0.0      0.0 delivery
   15m3s        0            0.0          469.8      0.0      0.0      0.0      0.0 newOrder
   15m3s        0            0.0           47.8      0.0      0.0      0.0      0.0 orderStatus
   15m3s        0            0.0          476.0      0.0      0.0      0.0      0.0 payment
   15m3s        0            0.0           47.8      0.0      0.0      0.0      0.0 stockLevel
Error: error in delivery: ERROR: no inbound stream connection (SQLSTATE XXUUU)
Error:  exit status 1

@tbg
Member

tbg commented Mar 20, 2019

FWIW I ran this manually yesterday and got a pass:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    28634.2  96.8%    768.9    604.0   1342.2   1811.9   4563.4  47244.6

@asubiotto
Contributor Author

Nothing has been done since I last posted; this slipped in my priority queue given that there didn't seem to be an urgent need to fix it, especially with the circuit breaker fixes.

no inbound stream connection is a symptom of many possible errors, so it's not necessarily the case that the tpcc max failure is the same as what this issue is tracking. We're currently investigating similar failures with a workload generator (high load, inability to connect streams in time, not necessarily due to cluster overload); the plan is to track those down, since they are relatively easy to reproduce, and see if tpcc max is suffering from the same issue. Let's discuss further on the test failure issue.

@awoods187
Contributor

@jordanlewis have we looked at this recently? Seems like we should make sure we address this sooner rather than later.

@tbg
Member

tbg commented Jul 1, 2019

I have yet to dig into the details, but it seems that with PR #38484 we have a roachtest that reproduces this readily. I'm starting to look into that now.

@tbg
Member

tbg commented Jul 2, 2019

^- it turns out (details in #38484) that it's not the same. It was related to a circuit breaker preventing the *outbox from dialing the recipient. The recipient would then time out with a no inbound stream error. Filed #38602

@ajwerner
Contributor

^- it turns out (details in #38484) that it's not the same. It was related to a circuit breaker preventing the *outbox from dialing the recipient. The recipient would then time out with a no inbound stream error. Filed #38602

I'm a little bit late to this party. I too find this to be a problem. In debugging node liveness outages I noticed that client connections from time to time just spontaneously close. I suspect that's related to grpc/grpc-go#1905 and am in the process of updating gRPC #39041. I suspect that these unexpected connection closing events exacerbated this no inbound stream connection error precisely because they were hitting the breaker after the connection entered TRANSIENT_FAILURE.

@tbg
Member

tbg commented Jul 23, 2019

It feels like we're using breakers in too many places. We baked them into the process of dialing to make sure everybody felt the effect of the breaker, but now it turns out that that's sometimes the wrong thing to do.

@ajwerner
Contributor

Closing this with #39041 and moving the remainder of no inbound stream errors to be tracked by #34229 and #38602.
