
distsql: flow scheduler back up could result in failed streams #27746

Closed
asubiotto opened this issue Jul 19, 2018 · 7 comments
Assignees
Labels
A-admission-control A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Milestone

Comments

@asubiotto
Contributor

When scheduling a flow, the flowScheduler checks how many flows are currently running:

return fs.mu.numRunning < fs.mu.maxRunningFlows

and enqueues the flow to be run later if too many are already running. This is a potential issue when two nodes have uneven load: say node A has low load and node B has high load, and both receive a request to set up a flow for the same query. Node A will try to connect to node B, but node B may still have its flow sitting in the queue. If it takes too long to dequeue the flow on node B, the whole flow fails.
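A rough, simplified sketch of that check-and-enqueue behavior (only the quoted condition comes from the source; the surrounding structure and names are assumed):

```go
package distsqlsketch

import "sync"

// Flow stands in for the real distsqlrun Flow; details omitted.
type Flow struct{}

// flowScheduler is a simplified model of the scheduler discussed above. The
// numRunning/maxRunningFlows check mirrors the snippet quoted in this issue;
// the rest of the structure and the names are assumed.
type flowScheduler struct {
	mu struct {
		sync.Mutex
		numRunning      int
		maxRunningFlows int // backed by sql.distsql.max_running_flows
		queue           []*Flow
	}
}

// canRunFlow must be called with fs.mu held.
func (fs *flowScheduler) canRunFlow() bool {
	return fs.mu.numRunning < fs.mu.maxRunningFlows
}

// ScheduleFlow either starts the flow right away or parks it in the queue.
// While a flow sits in the queue, remote nodes that try to connect their
// streams to it have no way of knowing that it hasn't started yet.
func (fs *flowScheduler) ScheduleFlow(f *Flow, run func(*Flow)) {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	if fs.canRunFlow() {
		fs.mu.numRunning++
		go run(f)
		return
	}
	fs.mu.queue = append(fs.mu.queue, f)
}
```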

A solution might be to set up these connections early regardless of load, i.e. register a flow through RegisterFlow as soon as we receive the request to set it up, prior to a possible enqueue.
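Building on the sketch above, the proposal would look roughly like this; RegisterFlow is named in the issue, but the registry and server types and their signatures here are hypothetical:

```go
// Continuing the sketch above: register first, then hand off to the scheduler.
// flowRegistrySketch and serverSketch are hypothetical stand-ins.
type flowRegistrySketch struct{}

// RegisterFlow makes the flow visible to incoming stream RPCs from other
// nodes (the real RegisterFlow has a richer signature).
func (r *flowRegistrySketch) RegisterFlow(f *Flow) error { return nil }

type serverSketch struct {
	registry  *flowRegistrySketch
	scheduler *flowScheduler
}

// setupFlow registers the flow as soon as the setup request arrives, so that
// remote nodes can establish their streams even if the scheduler then decides
// to queue the flow rather than run it immediately.
func (s *serverSketch) setupFlow(f *Flow, run func(*Flow)) error {
	if err := s.registry.RegisterFlow(f); err != nil {
		return err
	}
	s.scheduler.ScheduleFlow(f, run)
	return nil
}
```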

@asubiotto asubiotto added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. labels Jul 19, 2018
@asubiotto asubiotto added this to the 2.1 milestone Jul 19, 2018
@nstewart nstewart added the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Sep 18, 2018
@jordanlewis jordanlewis modified the milestones: 2.1, 2.2 Oct 11, 2018
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 15, 2019
Long queuing can lead to errors like the one described in cockroachdb#27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None
craig bot pushed a commit that referenced this issue Jan 15, 2019
33690: server: refactor the grpc server r=andreimatei a=andreimatei

This patch creates the grpcServer object to encapsulate the grpc.Server
object, the serveMode object, and the interceptor that our server.Server
uses to filter out RPCs.
The idea is to work towards decoupling the grpc.Server from our
server.Server object. I'd like to have the grpc server be created before
Server.Start(): I'd like the cluster id and node id to be available by
the time Server.Start() is called, so that we can get rid of the
nodeIDContainer that everybody and their dog uses. But to get
these ids, a grpc server needs to be running early to handle the "init"
rpc which bootstraps a cluster.
This patch doesn't accomplish too much - it doesn't do anything about
actually starting to serve requests without a Server (i.e. create
the needed listeners), but I think the patch stands on its own as
a good refactoring.

Release note: None

34027: distsqlrun: add metrics for queue size and wait duration r=ajwerner a=ajwerner

Long queuing can lead to errors like the one described in #27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
craig bot pushed a commit that referenced this issue Jan 16, 2019
34027: distsqlrun: add metrics for queue size and wait duration r=ajwerner a=ajwerner

Long queuing can lead to errors like the one described in #27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@ajwerner
Contributor

This behavior has become a regular nuisance. Having a fixed limit in the first place is probably wrong. The queuing can be okay if the expected wait time is low (probably not more than some multiple of the expected lifetime of the flow itself). As the queue builds up, the failure mode is that flow setup times out from the perspective of the upstream node. This ultimately surfaces to the user as a cryptic error ("no inbound stream connection") after the connection setup times out (10s!). There is essentially no back pressure mechanism whatsoever at this level. This is all made worse by the fact that when flow setup fails this way, there is no mechanism to go and clean up all the other associated flow state.
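A minimal sketch of that timeout path; the channel plumbing and function name are assumed for illustration, and only the 10s timeout and the error text come from the behavior described above:

```go
package distsqlsketch

import (
	"context"
	"errors"
	"time"
)

// flowConnectionTimeout models the 10s connection setup timeout mentioned above.
const flowConnectionTimeout = 10 * time.Second

// waitForRegistration illustrates the upstream node's side of the problem: it
// blocks until the remote flow registers itself and gives up after the
// timeout. If the remote scheduler still has the flow queued at that point,
// the query fails with the cryptic "no inbound stream connection" error.
func waitForRegistration(ctx context.Context, registered <-chan struct{}) error {
	select {
	case <-registered:
		return nil // the remote flow showed up in time
	case <-time.After(flowConnectionTimeout):
		return errors.New("no inbound stream connection")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```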

We need a mechanism to determine when there are too many flows, i.e. when the cluster is overloaded, and to gracefully report that to the user.

@ajwerner
Contributor

In the very short term, should we just question the default value of 500? Is it justified? My expectation is that the resource we're preserving here is RAM. What is a simplistic RAM model for each flow?

@petermattis
Collaborator

We're limiting network and CPU use as well, though RAM is probably the most important (network and CPU will get slower, RAM will hit a wall and cause the node to die).

The flow limit is a form of admission control, but it is interesting that this is happening at an internal layer and not when the query first arrives. My intuition is that once we decide to process a query, we should process it.

Bumping the value of sql.distsql.max_running_flows significantly in the short term sounds reasonable. Medium term, we should figure out a better story for admission control / queuing / back pressure.

Cc @RaduBerinde and @andreimatei for their insight on the current design.
Cc @andy-kimball who has had thoughts about admission control.
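For reference, bumping the setting mentioned above is a one-statement cluster setting change; the value shown here is only an example, not a tested recommendation:

```sql
-- The default discussed above is 500; 1000 is purely illustrative.
SET CLUSTER SETTING sql.distsql.max_running_flows = 1000;
```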

@ajwerner
Contributor

My understanding is that @andreimatei and @andy-kimball have been discussing this throughout the day. I'm looking forward to seeing what falls out of the various discussions.

@jordanlewis jordanlewis removed the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label May 15, 2019
@awoods187
Contributor

@jordanlewis any update on addressing this item?

@ajwerner
Contributor

ajwerner commented Jun 7, 2019

I intend to explore putting an admission controller here. The basic approach that's being applied to kv should drop in nicely. There are some differences, especially that there's no clear expectation that running flows will finish in a relatively timely manner, the way there is for kv reads, but even the naive approach should be better than what we have today.
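To make the naive approach concrete, here is a minimal, illustrative admission-gate sketch (not the kv admission control work referenced above): bound concurrency, but fail fast with a clear error instead of letting stream setup time out.

```go
package distsqlsketch

import (
	"context"
	"errors"
	"time"
)

var errOverloaded = errors.New("too many queued flows; the cluster is overloaded")

// admissionGate caps the number of concurrently admitted flows.
type admissionGate struct {
	slots chan struct{} // buffered to the concurrency limit
}

func newAdmissionGate(limit int) *admissionGate {
	return &admissionGate{slots: make(chan struct{}, limit)}
}

// admit waits at most maxWait for a free slot. Returning a clear
// errOverloaded to the client beats a cryptic stream error after 10s. The
// returned release function must be called once the flow finishes.
func (g *admissionGate) admit(ctx context.Context, maxWait time.Duration) (release func(), err error) {
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	select {
	case g.slots <- struct{}{}:
		return func() { <-g.slots }, nil
	case <-timer.C:
		return nil, errOverloaded
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

The main design point in this sketch is that the wait budget is tied to how long a flow is expected to live rather than to a fixed connection timeout, so an overloaded node rejects work clearly instead of letting upstream streams fail.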

@asubiotto
Contributor Author

Superseded by #34229
