Most tasks swallow exceptions #826

llucax · 2023-12-28T08:59:51Z

What happened?

When there is an unhandled exception inside a task spawn by the SDK, the error is silently swallowed, leading to obscure, hard-to-debug bugs.

What did you expect instead?

Unhandled exceptions are at least logged or the whole Python process crashes.

Affected part(s)

Core components (data structures, etc.) (part:core)

Extra information

I discovered this while working at #806. I added a sanity check which failed inside a task that was sending messages to a channel, so no messages was sent, and other task waiting for messages just got stuck, leaving no clues about where the problem might be. This makes the problem really hard to debug.

This issue is actually a problem across all areas of the SDK and projects in general.

A way to cope with it in the SDK would be to extend BackgroundService to provide a create_task() method that automatically adds the task to the task list and then also adds a done callback where we can either log the unhandled exception, or just raise a SystemExit exception to exit the program. We could even give the user the option to decide how to handle unhandled exceptions by either passing a callback or letting them override the default callback as a method of the instance. This callback should also remove the task from the tasks list, something users need to do manually at the moment.

Related issues

The solution to this issue needs to have in mind the following related issues:

The text was updated successfully, but these errors were encountered:

The channel registry is using `Any` as the message type, which is not type safe, as it completely *disables* type checking for the messages. This commit makes the channel registry type-aware, so channels are stored with their message type and the registry checks that the same channel is not used for different message types. This also makes the registry just a plain container for channels, the wrapper methods to create new senders and receivers, and to configure the `resend_latest` flag are removed. The `ReceiverFetcher` abstraction is also not needed if we just return the channel directly, as the channel itself is a `ReceiverFetcher`. Also the method to close and remove a channel is made public and the name more explicit, as it is used in normal code paths. The new registry only provide 2 main methods: * `get_or_create()`: Get or create a channel for the given key, doing the type checking to make sure the requested message type matches the existing channel message type if it already exists. * `close_and_remove()`: Close and remove the channel for the given key. This change uncovered 5 issues: * `MockMigrogrid/Resampler: Fail on unhandled exceptions` (in this PR, but more generally #826) * `Fix checks in tests` (in this PR) * `Fix missing conversion to `QuantityT` (workaroud in this PR, but should be better solved by #821) * #807 * #823 Fixes #806.

llucax · 2024-04-29T09:25:55Z

Maybe we can add a SDK specific create_task() that catches and logs unhandled exceptions, then use that function also in BackgroundService.

llucax · 2024-04-29T10:04:28Z

Oh, yeah, of course: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.set_exception_handler

llucax · 2024-04-29T10:11:36Z

It is not trivial to override the default handler, because it will store the exception in the task object, and maybe users will inspect this task object in the future and report the exception properly in the future. Because of this, we might want to use a custom exception handler that only prints a debug message about unhandled exceptions, otherwise we might spam the logs with spurious messages about tasks stopping due to unhandled exceptions when exception are really properly handled, just not inside the task function itself, adding more noise to the logs and making things even harder to debug.

According to ChatGPT:

In Python's asyncio library, when a task raises an exception that is not caught within the task itself, the default behavior is somewhat subtle and important to understand:

Exception Capturing: If an exception occurs within an asyncio Task and is not caught, the exception is captured and stored in the Task object. This captured exception will not be raised until the task's result is explicitly retrieved (e.g., using await task, task.result(), or task.exception()).

Logging of Unretrieved Exceptions: If a Task object is destroyed (garbage collected) while still holding an exception that was neither retrieved nor consumed, asyncio logs a warning through the Python logging module. This warning contains the exception's traceback and indicates that the exception was never retrieved. The warning is logged to the asyncio logger with a level of ERROR.

Event Loop Does Not Stop: Unlike some other frameworks that might halt an event loop on unhandled exceptions, asyncio's default behavior is to continue running the event loop even if a task fails with an exception.

So if this is correct and unless we are keeping dead task object around, we should actually get a log error about swallowed exceptions. We got unreported swallowed exceptions in tests, but maybe it is a pytest issue that they are not reported (or were kept alive in tests), as pytest replaces the event loop with a testing one AFAIK.

llucax added type:bug Something isn't working part:core Affects the SDK core components (data structures, etc.) labels Dec 28, 2023

llucax added this to the v1.0.0-rc4 milestone Dec 28, 2023

github-project-automation bot added this to Python SDK Roadmap Dec 28, 2023

github-project-automation bot moved this to To do in Python SDK Roadmap Dec 28, 2023

llucax mentioned this issue Dec 28, 2023

Make the registry type-aware #830

Merged

llucax mentioned this issue Jan 16, 2024

Improve error handling when background services are not initialized #840

Closed

llucax modified the milestones: v1.0.0-rc4, v1.0.0-rc5, v1.0.0-rc6 Feb 1, 2024

llucax modified the milestones: v1.0.0-rc6, v1.0.0-rc7 Mar 26, 2024

llucax added the priority:high Address this as soon as possible label Apr 9, 2024

This was referenced Apr 9, 2024

Replace run_forever implementations with a generic one in _internal #906

Open

Make actor automatic restarting more resilient #632

Open

Improve the Actor interface to handling internal tasks #819

Closed

llucax mentioned this issue Jun 14, 2024

Improve the BackgroundService interface to handling internal tasks frequenz-floss/frequenz-core-python#8

Closed

llucax modified the milestones: v1.0.0-rc800, v1.0.0 Jul 29, 2024

shsms modified the milestones: v1.0.0-rc800, v1.0.0-rc900 Aug 22, 2024

llucax modified the milestones: v1.0.0-rc900, 1.0.0-rc1000 Sep 2, 2024

llucax modified the milestones: v1.0.0-rc1000, v1.0.0-rc1100 Oct 21, 2024

llucax modified the milestones: v1.0.0-rc1100, v1.0.0-rc1200 Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Most tasks swallow exceptions #826

Most tasks swallow exceptions #826

llucax commented Dec 28, 2023 •

edited

Loading

llucax commented Apr 29, 2024

llucax commented Apr 29, 2024

llucax commented Apr 29, 2024 •

edited

Loading

Most tasks swallow exceptions #826

Most tasks swallow exceptions #826

Comments

llucax commented Dec 28, 2023 • edited Loading

What happened?

What did you expect instead?

Affected part(s)

Extra information

Related issues

llucax commented Apr 29, 2024

llucax commented Apr 29, 2024

llucax commented Apr 29, 2024 • edited Loading

llucax commented Dec 28, 2023 •

edited

Loading

llucax commented Apr 29, 2024 •

edited

Loading