Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql: FlowScheduler.ScheduleFlow is source of significant mutex contention #50022

Closed
nvanbenschoten opened this issue Jun 9, 2020 · 1 comment · Fixed by #51055
Closed
Assignees
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Jun 9, 2020

In an instance of TPC-E, I'm seeing that the mutex locking in FlowScheduler.ScheduleFlow here is the single largest source of mutex contention in the system, at a little over 15% of total mutex contention delay. This makes some sense, as TPC-E makes heavy use of DistSQL and this appears to be a serialization point between all DistSQL flows scheduled on a machine.

I don't know this code, so I'm hoping to bring this to the attention of those that do (@asubiotto, @yuzefovich). Are there any easy wins here? Do we need to serialize the call to Flow.Start across all flows on a node? Does this call need to be protected by the mutex at all? If not, is the mutex only protecting fs.mu.numRunning? Can we manipulate this counter using atomics to avoid blocking in the happy path where fs.canRunFlow(f) == true?

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-sql-execution Relating to SQL execution. labels Jun 9, 2020
@asubiotto
Copy link
Contributor

Good catch, we definitely don't need to lock the mutex for the happy case. In fact, I think we should probably rework the flow scheduler to remove any queueing behavior which would remove the need for the mutex (#34430) but it just hasn't been high priority.

@asubiotto asubiotto self-assigned this Jun 16, 2020
@craig craig bot closed this as completed in e4d5f35 Aug 10, 2020
vxio pushed a commit to vxio/cockroach that referenced this issue Aug 11, 2020
The point of locking the mutex is to check whether a flow the scheduler is
about to run would push the number of running flows over the limit. This has
shown to be a source of significant mutex contention in TPC-E (see cockroachdb#50022).

This commit implements that check using atomics, which is cheaper than locking
the mutex.

Release note (performance improvement): unnecessary mutex contention observed
in heavy read workloads has been removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants