distsqlrun: remove queuing behavior from flowScheduler #34430

Conversation

ajwerner (Contributor)

Prior to this PR, when SetupFlowRequests in excess of max_running_flows were
received, they would be queued for processing. This queueing was unbounded and
had no notion of a "full" queue. While the queue was well intentioned and may
have led to desirable behavior in bursty, short-lived workloads, under high
volumes of longer-running flows the system would observe arbitrarily long
queuing delays. These delays would ultimately propagate to the user in the
form of a timeout while setting up the other side of the connection, an error
that was both delayed and difficult for customers to interpret. The immediate
action is to remove the queuing behavior and increase the flow limit. It is
worth noting that the new limit is still arbitrary and unjustified, except to
say that it is larger than the previous limit by a factor of 2. After this PR,
when the max_running_flows limit is hit, clients will receive a more
meaningful error sooner, which will hopefully guide them to adjust the setting
when appropriate or to scale out.

This commit could use a unit test to ensure that we do indeed see the newly
introduced error, and I intend to write one soon. The machinery around setting
up a unit test seemed much more daunting than this change, but I wanted to get
this out just to start a discussion about whether this is the right thing to
do and, if not, what better approaches y'all might see.

Fixes #34229.

Release note: None
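
For illustration only, here is a minimal sketch of the reject-instead-of-queue
idea. The names below (scheduler, ScheduleFlow, maxRunningFlows) are
hypothetical stand-ins and do not reproduce the actual flowScheduler code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTooManyFlows is returned instead of queueing when the scheduler is at
// capacity, so the client sees a prompt, meaningful error.
var errTooManyFlows = errors.New(
	"too many running flows; consider raising max_running_flows or scaling out")

// scheduler is a toy stand-in for the flow scheduler: it admits a flow only
// while fewer than maxRunningFlows are running, and otherwise rejects it
// immediately rather than placing it on an unbounded queue.
type scheduler struct {
	mu              sync.Mutex
	numRunning      int
	maxRunningFlows int
}

// ScheduleFlow starts the flow right away or returns errTooManyFlows.
func (s *scheduler) ScheduleFlow(run func()) error {
	s.mu.Lock()
	if s.numRunning >= s.maxRunningFlows {
		s.mu.Unlock()
		return errTooManyFlows // fail fast instead of queueing
	}
	s.numRunning++
	s.mu.Unlock()

	go func() {
		defer func() {
			s.mu.Lock()
			s.numRunning--
			s.mu.Unlock()
		}()
		run()
	}()
	return nil
}

func main() {
	s := &scheduler{maxRunningFlows: 1}
	_ = s.ScheduleFlow(func() { select {} }) // occupy the only slot forever
	if err := s.ScheduleFlow(func() {}); err != nil {
		fmt.Println(err) // the second flow is rejected, not queued
	}
}
```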

@ajwerner ajwerner requested review from asubiotto and a team January 30, 2019 22:23
@cockroach-teamcity (Member)

This change is Reviewable

@RaduBerinde RaduBerinde (Member) left a comment

I agree that queueing at this layer is a bad idea; we don't want each node to queue flows for the same query independently. Ideally we'd keep track of relevant stats (e.g. the number of flows running on each node) and queue the entire query if needed before setting up any flows.


pkg/sql/distsqlrun/flow_scheduler.go, line 126 at r1 (raw file):

			}
			log.VEventf(ctx, 1, "flow scheduler enqueuing flow %s to be run later", f.id)
			fs.metrics.FlowsQueued.Inc(1)

Maybe we want a metric of how many flows errored out this way

@ajwerner ajwerner (Contributor, Author) left a comment


pkg/sql/distsqlrun/flow_scheduler.go, line 126 at r1 (raw file):

Previously, RaduBerinde wrote…

Maybe we want a metric of how many flows errored out this way

Definitely a good thing to track, will add.
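
For illustration, a small sketch of what such a counter could look like;
flowMetrics and FlowsRejected are hypothetical stand-ins (in the real code the
counter would presumably be registered alongside FlowsQueued):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyFlows = errors.New("too many running flows")

// flowMetrics is a stand-in for the scheduler's metrics struct; FlowsRejected
// is a hypothetical counter tracking flows that errored out instead of queueing.
type flowMetrics struct {
	FlowsRejected int64
}

// rejectFlow bumps the rejection counter and returns the error the client sees.
func (m *flowMetrics) rejectFlow() error {
	atomic.AddInt64(&m.FlowsRejected, 1)
	return errTooManyFlows
}

func main() {
	m := &flowMetrics{}
	err := m.rejectFlow()
	fmt.Println(err, "- rejected so far:", atomic.LoadInt64(&m.FlowsRejected))
}
```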

@asubiotto (Contributor)

@RaduBerinde, good suggestion but I think we need to clarify our desired behavior in these scenarios. If a user sends a query and we cannot run it due to overload of some sort, should we immediately return an error or do we still want to queue the query on the gateway? Queuing on the gateway will get rid of inbound stream connection errors as well, but might still have undesirable behavior in the form of long wait times (possibly bounded though, if we do introduce a notion of a "full" query queue).

@ajwerner, I think this change is definitely a good place to start, thank you for taking the initiative. I think that apart from unit tests, we might want to add some integration tests which exercise scenarios in which a SetupFlowRequest errors out but other requests have already been sent to other nodes (so any setup that has been performed there needs to be torn down). As discussed, I don't think there should be any difficulty in doing this: a context cancellation on the gateway node should propagate to the RPC handlers, which would notify inbound connections. However, we might need to do some thinking on whether this is always the case (setup might be stuck for some reason, or there is a flow that's trying to connect to a non-gateway node).
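
As a rough, self-contained illustration of the cancellation mechanism being
relied on here (plain context plumbing, not the actual DistSQL RPC handlers;
inboundStream is a hypothetical stand-in for the remote side):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// inboundStream stands in for a remote node waiting on an inbound connection;
// it gives up as soon as the (propagated) context is canceled instead of
// waiting for the full connection timeout.
func inboundStream(ctx context.Context, done chan<- error) {
	select {
	case <-ctx.Done():
		done <- ctx.Err() // setup failed elsewhere; tear down promptly
	case <-time.After(10 * time.Second):
		done <- fmt.Errorf("timed out waiting for inbound connection")
	}
}

func main() {
	// The gateway cancels the query's context when a SetupFlowRequest errors out...
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go inboundStream(ctx, done)
	cancel()
	// ...and the waiting side observes the cancellation promptly.
	fmt.Println(<-done) // prints "context canceled"
}
```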

@ajwerner (Contributor, Author)

ajwerner commented Feb 6, 2019

> @RaduBerinde, good suggestion but I think we need to clarify our desired behavior in these scenarios. If a user sends a query and we cannot run it due to overload of some sort, should we immediately return an error or do we still want to queue the query on the gateway? Queuing on the gateway will get rid of inbound stream connection errors as well, but might still have undesirable behavior in the form of long wait times (possibly bounded though, if we do introduce a notion of a "full" query queue).

I totally buy the idea that it'd be a better and more robust solution to add some sort of queuing or backpressure mechanism combined with some retry logic in the face of failure due to a resource constraint. My feeling is that this change shouldn't be blocked on the more complete solution, though it certainly is worth verifying that this change doesn't break existing workloads which may have relied on the queuing behavior. I've been giving some thought to admission control, backpressure, and queueing lately, and this is an obvious area for improvement, both at the top level of the query graph as well as at the leaves here. For now this is a generally inadequate stopgap that hopefully has easier-to-understand failure properties than the previous inadequate stopgap.

> As discussed I don't think there should be any difficulty in doing this: a context cancellation on the gateway node should propagate to the RPC handlers which would notify inbound connections. However, we might need to do some thinking on whether this is always the case (setup might be stuck for some reason, or there is a flow that's trying to connect to a non-gateway node).

@asubiotto any chance I could bother you some time this week for advice on how to contrive such flows in a testing setting?

As of now I've added the metric but I haven't gotten around to the testing. I'm digging into the flows and all that they entail; it looks like they aren't super straightforward to mock out, but I may be missing something.

@ajwerner ajwerner force-pushed the ajwerner/eliminate-queueing-in-flow-scheduler branch from 1ca6cc2 to cb3612b on February 6, 2019 20:33
@tbg tbg added the X-noremind label on Jun 19, 2019
@ajwerner (Contributor, Author)

This happened at some point.

@ajwerner ajwerner closed this Dec 13, 2022