
distsql: flow scheduler back up could result in failed streams #27746

Closed
asubiotto opened this issue Jul 19, 2018 · 7 comments
Assignees
Labels
A-admission-control A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Milestone

Comments

@asubiotto
Contributor

When scheduling a flow, the flowScheduler checks how many flows are currently running:

return fs.mu.numRunning < fs.mu.maxRunningFlows

and enqueues the flow to be run later if too many are already running. This is a potential issue when two nodes have uneven load: say node A has low load and node B has high load, and both receive a request to set up a flow for the same query. Node A will try to connect to node B, but node B may still have its flow sitting in the queue. If it takes too long to dequeue the flow on node B, the whole flow fails.
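A rough, simplified sketch of that check-and-enqueue behavior (only the quoted condition comes from the source; the surrounding structure and names are assumed):

```go
package distsqlsketch

import "sync"

// Flow stands in for the real distsqlrun Flow; details omitted.
type Flow struct{}

// flowScheduler is a simplified model of the scheduler discussed above. The
// numRunning/maxRunningFlows check mirrors the snippet quoted in this issue;
// the rest of the structure and the names are assumed.
type flowScheduler struct {
	mu struct {
		sync.Mutex
		numRunning      int
		maxRunningFlows int // backed by sql.distsql.max_running_flows
		queue           []*Flow
	}
}

// canRunFlow must be called with fs.mu held.
func (fs *flowScheduler) canRunFlow() bool {
	return fs.mu.numRunning < fs.mu.maxRunningFlows
}

// ScheduleFlow either starts the flow right away or parks it in the queue.
// While a flow sits in the queue, remote nodes that try to connect their
// streams to it have no way of knowing that it hasn't started yet.
func (fs *flowScheduler) ScheduleFlow(f *Flow, run func(*Flow)) {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	if fs.canRunFlow() {
		fs.mu.numRunning++
		go run(f)
		return
	}
	fs.mu.queue = append(fs.mu.queue, f)
}
```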

A solution might be to set up these connections early regardless of load, i.e. register a flow through RegisterFlow as soon as we receive the request to set it up, prior to a possible enqueue.
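Building on the sketch above, the proposal would look roughly like this; RegisterFlow is named in the issue, but the registry and server types and their signatures here are hypothetical:

```go
// Continuing the sketch above: register first, then hand off to the scheduler.
// flowRegistrySketch and serverSketch are hypothetical stand-ins.
type flowRegistrySketch struct{}

// RegisterFlow makes the flow visible to incoming stream RPCs from other
// nodes (the real RegisterFlow has a richer signature).
func (r *flowRegistrySketch) RegisterFlow(f *Flow) error { return nil }

type serverSketch struct {
	registry  *flowRegistrySketch
	scheduler *flowScheduler
}

// setupFlow registers the flow as soon as the setup request arrives, so that
// remote nodes can establish their streams even if the scheduler then decides
// to queue the flow rather than run it immediately.
func (s *serverSketch) setupFlow(f *Flow, run func(*Flow)) error {
	if err := s.registry.RegisterFlow(f); err != nil {
		return err
	}
	s.scheduler.ScheduleFlow(f, run)
	return nil
}
```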

@asubiotto asubiotto added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. labels Jul 19, 2018
@asubiotto asubiotto added this to the 2.1 milestone Jul 19, 2018
@nstewart nstewart added the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Sep 18, 2018
@jordanlewis jordanlewis modified the milestones: 2.1, 2.2 Oct 11, 2018
ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 15, 2019
Long queuing can lead to errors like the one described in cockroachdb#27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None
craig bot pushed a commit that referenced this issue Jan 15, 2019
33690: server: refactor the grpc server r=andreimatei a=andreimatei

This patch creates the grpcServer object to encapsulate the grpc.Server
object, the serveMode object, and the interceptor that our server.Server
uses to filter out RPCs.
The idea is to work towards decoupling the grpc.Server from our
server.Server object. I'd like to have the grpc server be created before
Server.Start(): I'd like the cluster id and node id to be available by
the time Server.Start() is called, so that we can get rid of the
nodeIDContainer that everybody and their dog uses. But to get
these ids, a grpc server needs to be running early to handle the "init"
rpc which bootstraps a cluster.
This patch doesn't accomplish too much - it doesn't do anything about
actually starting to serve requests without a Server (i.e. create
the needed listeners), but I think the patch stands on its own as
a good refactoring.

Release note: None

34027: distsqlrun: add metrics for queue size and wait duration r=ajwerner a=ajwerner

Long queuing can lead to errors like the one described in #27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
craig bot pushed a commit that referenced this issue Jan 16, 2019
34027: distsqlrun: add metrics for queue size and wait duration r=ajwerner a=ajwerner

Long queuing can lead to errors like the one described in #27746.
This change should give us more visibility into when queuing is occurring
and how problematic it is.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@ajwerner
Contributor

This behavior has become a regular nuisance. Having a fixed limit in the first place is probably wrong. The queuing can be okay if the expected wait time is low (probably not more than some multiple of the expected lifetime of the flow itself). As the queue builds up, the failure mode is that flow setup times out from the perspective of the upstream node. This ultimately surfaces to the user as a cryptic error ("no inbound stream connection") after the connection setup times out (10s!). There is essentially no back pressure mechanism whatsoever at this level. This is all made worse by the fact that when flow setup fails this way, there is no mechanism to go and clean up all the other associated flow state.
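A minimal sketch of that timeout path; the channel plumbing and function name are assumed for illustration, and only the 10s timeout and the error text come from the behavior described above:

```go
package distsqlsketch

import (
	"context"
	"errors"
	"time"
)

// flowConnectionTimeout models the 10s connection setup timeout mentioned above.
const flowConnectionTimeout = 10 * time.Second

// waitForRegistration illustrates the upstream node's side of the problem: it
// blocks until the remote flow registers itself and gives up after the
// timeout. If the remote scheduler still has the flow queued at that point,
// the query fails with the cryptic "no inbound stream connection" error.
func waitForRegistration(ctx context.Context, registered <-chan struct{}) error {
	select {
	case <-registered:
		return nil // the remote flow showed up in time
	case <-time.After(flowConnectionTimeout):
		return errors.New("no inbound stream connection")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```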

We need a mechanism to determine when there are too many flows, i.e. when the cluster is overloaded, and to gracefully report that to the user.

@ajwerner
Contributor

In the very short term, should we just question the default value of 500? Is it justified? My expectation is that the resource we're preserving here is RAM. What is a simplistic RAM model for each flow?

@petermattis
Collaborator

We're limiting network and CPU use as well, though RAM is probably the most important (network and CPU will get slower, RAM will hit a wall and cause the node to die).

The flow limit is a form of admission control, but it is interesting that this is happening at an internal layer and not when the query first arrives. My intuition is that once we decide to process a query, we should process it.

Bumping the value of sql.distsql.max_running_flows significantly in the short term sounds reasonable. Medium term, we should figure out a better story for admission control / queuing / back pressure.

Cc @RaduBerinde and @andreimatei for their insight on the current design.
Cc @andy-kimball who has had thoughts about admission control.
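For reference, bumping the setting mentioned above is a one-statement cluster setting change; the value shown here is only an example, not a tested recommendation:

```sql
-- The default discussed above is 500; 1000 is purely illustrative.
SET CLUSTER SETTING sql.distsql.max_running_flows = 1000;
```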

@ajwerner
Contributor

My understanding is that @andreimatei and @andy-kimball have been discussing this throughout the day. I'm looking forward to seeing what falls out of the various discussions.

@jordanlewis jordanlewis removed the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label May 15, 2019
@awoods187
Contributor

@jordanlewis any update on addressing this item?

@ajwerner
Contributor

ajwerner commented Jun 7, 2019

I intend to explore putting an admission controller here. The basic approach that's being applied to kv should drop in nicely. There are some differences, especially that there's no clear expectation that running flows will finish in a relatively timely manner, the way there is for kv reads, but even the naive approach should be better than what we have today.
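To make the naive approach concrete, here is a minimal, illustrative admission-gate sketch (not the kv admission control work referenced above): bound concurrency, but fail fast with a clear error instead of letting stream setup time out.

```go
package distsqlsketch

import (
	"context"
	"errors"
	"time"
)

var errOverloaded = errors.New("too many queued flows; the cluster is overloaded")

// admissionGate caps the number of concurrently admitted flows.
type admissionGate struct {
	slots chan struct{} // buffered to the concurrency limit
}

func newAdmissionGate(limit int) *admissionGate {
	return &admissionGate{slots: make(chan struct{}, limit)}
}

// admit waits at most maxWait for a free slot. Returning a clear
// errOverloaded to the client beats a cryptic stream error after 10s. The
// returned release function must be called once the flow finishes.
func (g *admissionGate) admit(ctx context.Context, maxWait time.Duration) (release func(), err error) {
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	select {
	case g.slots <- struct{}{}:
		return func() { <-g.slots }, nil
	case <-timer.C:
		return nil, errOverloaded
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

The main design point in this sketch is that the wait budget is tied to how long a flow is expected to live rather than to a fixed connection timeout, so an overloaded node rejects work clearly instead of letting upstream streams fail.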

@asubiotto
Contributor Author

Superseded by #34229
