Add a safety mechanism to `BackgroundService` to notify about crashed tasks #9

llucax · 2024-06-14T15:52:39Z

What happened?

When there is an unhandled exception inside a task spawn by a BackgroundService task, the error is silently swallowed, leading to obscure, hard-to-debug bugs.

What did you expect instead?

Unhandled exceptions are at least logged or the whole Python process crashes.

Extra information

I discovered this while working at frequenz-floss/frequenz-sdk-python#806. I added a sanity check which failed inside a task that was sending messages to a channel, so no messages was sent, and other task waiting for messages just got stuck, leaving no clues about where the problem might be. This makes the problem really hard to debug.

A way to cope with it in the BackgroundService is to extend it to provide a create_task() method that automatically adds the task to the task list and then also adds a done callback where we can either log the unhandled exception, or just raise a SystemExit exception to exit the program. We could even give the user the option to decide how to handle unhandled exceptions by either passing a callback or letting them override the default callback as a method of the instance. This callback should also remove the task from the tasks list, something users need to do manually at the moment.

Related issues

The solution to this issue needs to have in mind the following related issues:

The text was updated successfully, but these errors were encountered:

The name `BackgroundService` is a bit too verbose now in the context of an asyncio library. Services are usually already understood as running in the background, so the `Background` prefix is redundant and makes the class name too long. The `Service` class is made an abstract base class intended for end users that just want to use services, and a new `ServiceBase` class is added for developers that want to implement services. This way, the documentation can be separated between the two use cases and the `Service` class can be used as a generic type for services. We also avoid leaking implementation details to services users. The `ServiceBase` class also have a new method `create_task` that is used to create tasks in the service, and a new `TaskCreator` protocol is added to allow overriding how tasks are created (for example, using a custom `loop` or a `TaskGroup`). `create_task` automatically adds the task to the internal set, so hopefully users won't forget so easily to register tasks. The `create_task` method also have a new `log_exceptions` parameter that allows to log exceptions raised by tasks. This is useful because services tasks are supposed to run forever, and there are no many opportunities to await for their completion appropriately and handle the failure. With this at least we make sure we will get a log message when a task raises an exception, so we can at least know what happened and they just silently disappear. Fixes #8, improves #9.

llucax · 2024-08-09T12:13:08Z

I will close this, now we have PersistentTaskGroup that will log exceptions and offer users an easy way to check for failed tasks.

`asyncio.TaskGroup` is a very convenient construct when using parallelization for doing calculations for example, where the results for all the tasks need to be merged together to produce a final result. In this case if one of the tasks fails, it makes sense to cancel the others and abort as soon as possible, as any further calculations would be thrown away. This PR introduces a new `PersistentTaskGroup` class, intended to help managing a group of tasks that should persist even if other tasks in the group fail, usually by either only discarding the failed task or by restarting it somehow. The `ServiceBase` class is updated to use a `PersistentTaskGroup` underneath, and it is simplified for single-task services by making the service driven by a single `main` task, which can use the new group to monitor sub-tasks and act accordingly. This is part of #27 and #9.

llucax added scope:breaking-change Breaking change, users will need to update their code type:enhancement New feature or enhancement visitble to users part:asyncio Affects the asyncio module labels Jun 14, 2024

llucax added this to the v1.0.0 milestone Jun 14, 2024

llucax self-assigned this Jun 14, 2024

This was referenced Jun 14, 2024

Improve the BackgroundService interface to handling internal tasks #8

Closed

Most tasks swallow exceptions frequenz-floss/frequenz-sdk-python#826

Open

llucax mentioned this issue Jul 26, 2024

Revamp BackgroundService and rename to Service #27

Merged

llucax mentioned this issue Aug 8, 2024

Implement ServiceBase using a new PersistentTaskGroup #30

Merged

llucax closed this as completed Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a safety mechanism to `BackgroundService` to notify about crashed tasks #9

Add a safety mechanism to `BackgroundService` to notify about crashed tasks #9

llucax commented Jun 14, 2024

llucax commented Aug 9, 2024

Add a safety mechanism to BackgroundService to notify about crashed tasks #9

Add a safety mechanism to BackgroundService to notify about crashed tasks #9

Comments

llucax commented Jun 14, 2024

What happened?

What did you expect instead?

Extra information

Related issues

llucax commented Aug 9, 2024

Add a safety mechanism to `BackgroundService` to notify about crashed tasks #9

Add a safety mechanism to `BackgroundService` to notify about crashed tasks #9