LIFO task slot optimization in worker is potential footgun when a task doesn't yield. #4323

tobz · 2021-12-15T20:15:22Z

Version
tokio 1.14.0

Platform
Linux derp 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description
While benchmarking code that writes to and reads from an external data source in a small application, I hit a mysterious issue related to a Tokio worker scheduler optimization: the worker stores the "next task to poll" in a slot that the normal work-stealing algorithm does not consider.

Essentially[1], the code has a #[tokio::main] annotated async fn main(), which then spawns two tasks -- one for the reader, and one for the writer -- and runs them until both complete. The complication is that the writer, as written, has no need to yield: its work happens off-thread, so it is wrapped in an asynchronous interface (Sink) but never does anything that triggers a yield, nor does it yield manually.

This caused an issue because the writer task was on the same worker as the reader task and actively notifies the reader of progress (via AtomicWaker). That led to a situation where the worker was holding on to the reader task in its "next task to poll" slot, which I've been led to understand is not considered when the normal work-stealing algorithm runs.
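To make the mechanism concrete, here is a minimal std-only sketch of the behavior described above. It is a toy model, not Tokio's actual implementation: the `Worker` type and the `schedule_local`, `steal`, and `next_task` methods are hypothetical names chosen for illustration, and the model assumes (as I was told) that stealing only looks at the run queue.

```rust
use std::collections::VecDeque;

/// Toy model of one worker's scheduling state. This is NOT Tokio's
/// real data structure; all names here are hypothetical.
#[derive(Debug, Default)]
struct Worker {
    /// Most recently woken task; only this worker will poll it.
    lifo_slot: Option<&'static str>,
    /// Local run queue; idle peer workers may steal from here.
    run_queue: VecDeque<&'static str>,
}

impl Worker {
    /// A task woken from this worker takes the LIFO slot. Any previous
    /// occupant is moved to the run queue.
    fn schedule_local(&mut self, task: &'static str) {
        if let Some(prev) = self.lifo_slot.replace(task) {
            self.run_queue.push_back(prev);
        }
    }

    /// What an idle peer can take: only the run queue is visible.
    fn steal(&mut self) -> Option<&'static str> {
        self.run_queue.pop_front()
    }

    /// What this worker polls next once its current task yields.
    fn next_task(&mut self) -> Option<&'static str> {
        self.lifo_slot.take().or_else(|| self.run_queue.pop_front())
    }
}

fn main() {
    let mut busy = Worker::default();

    // The writer is running on `busy` and wakes the reader.
    busy.schedule_local("reader");

    // An idle worker tries to steal while the writer keeps running.
    println!("idle worker stole: {:?}", busy.steal());

    // Only once the writer yields does the reader get polled.
    println!("busy worker polls next: {:?}", busy.next_task());
}
```

In this model the steal attempt comes back empty even though a runnable task exists, which matches what I observed: the reader only makes progress once the writer gives control back to the worker.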

While the documentation, in many places, talks about tasks not yielding as being detrimental, and having the ability to cause other tasks to not be scheduled/polled, it was very unintuitive that having only two tasks spawned onto a multithreaded runtime with 16 worker threads still had no way to push the second task to another worker.

I'm not sure if there's even a reasonable way to avoid this, and maybe the answer is simply having something like tokio-console be able to better surface this issue, but it definitely felt like a quirk, and ultimately required an answer from a core Tokio dev: there was no existing blog post, GitHub issue, or other piece of information that explained this particular quirk.


Summary: Our event consumer task is getting stuck in the lifo slot of other long-running tasks, causing events to queue until that long-running task yields. This change adds a separate single-thread tokio runtime to run that consumer task. This avoids it getting stuck in another task's lifo slot. See also tokio-rs/tokio#4323

Reviewed By: swgillespie
Differential Revision: D35067199
fbshipit-source-id: 63f68a00b33e45089eb6f02a4df6322c5991dcc6

carllerche · 2022-08-24T16:34:16Z

Closing in favor of #4941

tobz added the C-bug (Category: This is a bug.) and A-tokio (Area: The main tokio crate) labels Dec 15, 2021

Darksonn added the M-runtime (Module: tokio/runtime) label Dec 15, 2021

carllerche closed this as completed Aug 24, 2022
