Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Most tasks swallow exceptions #826

Open
llucax opened this issue Dec 28, 2023 · 3 comments
Open

Most tasks swallow exceptions #826

llucax opened this issue Dec 28, 2023 · 3 comments
Labels
part:core Affects the SDK core components (data structures, etc.) priority:high Address this as soon as possible type:bug Something isn't working
Milestone

Comments

@llucax
Copy link
Contributor

llucax commented Dec 28, 2023

What happened?

When there is an unhandled exception inside a task spawn by the SDK, the error is silently swallowed, leading to obscure, hard-to-debug bugs.

What did you expect instead?

Unhandled exceptions are at least logged or the whole Python process crashes.

Affected part(s)

Core components (data structures, etc.) (part:core)

Extra information

I discovered this while working at #806. I added a sanity check which failed inside a task that was sending messages to a channel, so no messages was sent, and other task waiting for messages just got stuck, leaving no clues about where the problem might be. This makes the problem really hard to debug.

This issue is actually a problem across all areas of the SDK and projects in general.

A way to cope with it in the SDK would be to extend BackgroundService to provide a create_task() method that automatically adds the task to the task list and then also adds a done callback where we can either log the unhandled exception, or just raise a SystemExit exception to exit the program. We could even give the user the option to decide how to handle unhandled exceptions by either passing a callback or letting them override the default callback as a method of the instance. This callback should also remove the task from the tasks list, something users need to do manually at the moment.

Related issues

The solution to this issue needs to have in mind the following related issues:

@llucax llucax added type:bug Something isn't working part:core Affects the SDK core components (data structures, etc.) labels Dec 28, 2023
@llucax llucax added this to the v1.0.0-rc4 milestone Dec 28, 2023
github-merge-queue bot pushed a commit that referenced this issue Jan 10, 2024
The channel registry is using `Any` as the message type, which is not
type safe, as it completely *disables* type checking for the messages.

This commit makes the channel registry type-aware, so channels are
stored with their message type and the registry checks that the same
channel is not used for different message types.

This also makes the registry just a plain container for channels, the
wrapper methods to create new senders and receivers, and to configure
the `resend_latest` flag are removed. The `ReceiverFetcher` abstraction
is also not needed if we just return the channel directly, as the
channel itself is a `ReceiverFetcher`.

Also the method to close and remove a channel is made public and the
name more explicit, as it is used in normal code paths.

The new registry only provide 2 main methods:

* `get_or_create()`: Get or create a channel for the given key, doing
the type checking to make sure the requested message type matches the
existing channel message type if it already exists.

* `close_and_remove()`: Close and remove the channel for the given key.

This change uncovered 5 issues:

* `MockMigrogrid/Resampler: Fail on unhandled exceptions` (in this PR,
but more generally #826)
* `Fix checks in tests` (in this PR)
* `Fix missing conversion to `QuantityT` (workaroud in this PR, but
should be better solved by #821)
* #807
* #823

Fixes #806.
@llucax llucax modified the milestones: v1.0.0-rc4, v1.0.0-rc5, v1.0.0-rc6 Feb 1, 2024
@llucax llucax modified the milestones: v1.0.0-rc6, v1.0.0-rc7 Mar 26, 2024
@llucax llucax added the priority:high Address this as soon as possible label Apr 9, 2024
@llucax
Copy link
Contributor Author

llucax commented Apr 29, 2024

Maybe we can add a SDK specific create_task() that catches and logs unhandled exceptions, then use that function also in BackgroundService.

@llucax
Copy link
Contributor Author

llucax commented Apr 29, 2024

@llucax
Copy link
Contributor Author

llucax commented Apr 29, 2024

It is not trivial to override the default handler, because it will store the exception in the task object, and maybe users will inspect this task object in the future and report the exception properly in the future. Because of this, we might want to use a custom exception handler that only prints a debug message about unhandled exceptions, otherwise we might spam the logs with spurious messages about tasks stopping due to unhandled exceptions when exception are really properly handled, just not inside the task function itself, adding more noise to the logs and making things even harder to debug.

According to ChatGPT:

In Python's asyncio library, when a task raises an exception that is not caught within the task itself, the default behavior is somewhat subtle and important to understand:

  1. Exception Capturing: If an exception occurs within an asyncio Task and is not caught, the exception is captured and stored in the Task object. This captured exception will not be raised until the task's result is explicitly retrieved (e.g., using await task, task.result(), or task.exception()).

  2. Logging of Unretrieved Exceptions: If a Task object is destroyed (garbage collected) while still holding an exception that was neither retrieved nor consumed, asyncio logs a warning through the Python logging module. This warning contains the exception's traceback and indicates that the exception was never retrieved. The warning is logged to the asyncio logger with a level of ERROR.

  3. Event Loop Does Not Stop: Unlike some other frameworks that might halt an event loop on unhandled exceptions, asyncio's default behavior is to continue running the event loop even if a task fails with an exception.

So if this is correct and unless we are keeping dead task object around, we should actually get a log error about swallowed exceptions. We got unreported swallowed exceptions in tests, but maybe it is a pytest issue that they are not reported (or were kept alive in tests), as pytest replaces the event loop with a testing one AFAIK.

@llucax llucax modified the milestones: v1.0.0-rc1000, v1.0.0-rc1100 Oct 21, 2024
@llucax llucax modified the milestones: v1.0.0-rc1100, v1.0.0-rc1200 Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
part:core Affects the SDK core components (data structures, etc.) priority:high Address this as soon as possible type:bug Something isn't working
Projects
Status: To do
Development

No branches or pull requests

2 participants