
RFC 72: Background workers #72

Closed
127 changes: 127 additions & 0 deletions text/072-background-workers.md
@@ -0,0 +1,127 @@
# RFC 72: Background workers

* RFC: 72
* Author: Jake Howard, with help from the Performance sub-team
* Created: 2021-09-17
* Last Modified: 2021-09-17

## Abstract

Wagtail currently doesn't have a first-party solution for long-running tasks. Other CMSs in the ecosystem such as WordPress and Drupal have background workers, allowing them to push tasks into the background to be processed at a later date, without requiring the end user to wait for them to occur.

One of the key goals behind this proposal is removing the requirement for the user to wait for tasks they don't need to.

This proposed implementation specifically doesn't assume anything about the user's setup. This not only reduces the chances of Wagtail conflicting with any existing task system implemented by applications, but also allows it to work with almost any hosting environment a user might be using.

## Background

Some tasks done as part of certain Wagtail requests don't need to block the user, and could instead be pushed to the background, improving the perceived responsiveness of the application. Having a first-party solution would also remove the need for downstream users to build a background worker pipeline themselves.

A prime example of this kind of improvement is re-indexing pages. Currently, when a user publishes a page, the "Publish" action also re-indexes the page, which slows down the request unnecessarily. The user doesn't need to wait for the indexes to be updated, so they could move on to whatever they need to do next sooner. Moving tasks into the background also means longer tasks don't tie up the application server, so it should be able to handle more editor traffic.

Other CMSs such as WordPress and Drupal have background workers to accelerate these kinds of non-blocking tasks. These APIs allow both the tools themselves to push tasks to the background and users to submit their own tasks.

## Requirements

This feature has some basic requirements for it to be considered "complete":

- Wagtail's background tasks should be opt-in, and Wagtail should function as it does now without it.
- Users should be able to choose from either running a persistent background process, or periodic execution with cron
- Users should have multiple options for task backends, depending on their scale and hosting environment. By default, Redis and Django's ORM should be supported.
- Users should be able to easily add their own tasks to be executed, whether through Wagtail hooks or entirely manually.
- Tasks should be able to specify a priority, so they can be executed sooner, regardless of when they were submitted.
- Users should need to neither know nor care about the specific implementation details. This covers both how tasks are executed and which backend is being used (mostly applicable to library authors).
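
The priority requirement above can be illustrated with a minimal sketch. It uses only the standard library and is not how Huey or Wagtail implement priorities; `PriorityTaskQueue` and its methods are hypothetical names chosen for illustration. It assumes that a higher number means a higher priority and that tasks of equal priority run in submission order.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Toy in-memory queue: higher priority runs first, ties run in submission order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving submission order

    def submit(self, func, *args, priority=0):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), func, args))

    def run_pending(self):
        """Execute every queued task and return the results in execution order."""
        results = []
        while self._heap:
            _, _, func, args = heapq.heappop(self._heap)
            results.append(func(*args))
        return results


queue = PriorityTaskQueue()
queue.submit(str.upper, "reindex page")
queue.submit(str.upper, "send notification")
queue.submit(str.upper, "purge cache", priority=10)  # submitted last, runs first
executed = queue.run_pending()
print(executed)
```

The counter in each heap entry guarantees that two entries never compare equal on the first two fields, so the task functions themselves are never compared.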

## Implementation

The proposed implementation is powered by [Huey](https://huey.readthedocs.io), a Python task scheduler which is both composable enough for our needs and appears stable enough to rely on. Huey provides the building blocks for task storage, execution and submission.

By default, Huey has no integrations with Django, allowing it to work in isolation, reducing any potential impact on people's existing sites. Huey does ship with an [optional Django integration](https://huey.readthedocs.io/en/latest/contrib.html#django), but it makes several assumptions about how Huey is used, and would conflict with users already using Huey or with any extended customization we may need.

Whilst this proposal also covers scheduled tasks for Wagtail, those should be opted into separately from background tasks.

Member

I'm a little unclear on the scheduled tasks part of this. Would we replace crons (such as publish_scheduled_pages) as Huey periodic tasks, or would they remain as crons?

Member Author

My imagining was both. They'd remain as management commands exactly as they are now, but there'd be the option of triggering them as scheduled tasks through a setting in settings.py. I've just realised I've not fleshed out what that API would look like yet.

Member

Got it. Is the main benefit for the user that they wouldn't need to set up separate scheduled tasks if they're already running a worker, or are there other benefits you're considering?

Member Author

Partly that - may save on infrastructure costs. But it also adds the ability for installed packages to "register" scheduled tasks which just get run automatically, without the user having to do anything, or even know / remember they're there.


### Why Huey?

The decision to use Huey was not taken lightly. A number of different queuing and worker libraries were reviewed, including but not limited to [Celery](https://docs.celeryproject.org), [`django-db-queue`](https://github.com/dabapps/django-db-queue), [`django-lightweight-queue`](https://github.com/thread/django-lightweight-queue/), [RQ](https://python-rq.org/), and [APScheduler](https://apscheduler.readthedocs.io/).

Celery, probably the largest package in this space, would feature-wise do everything we need and more, but it has one key drawback: complexity. This background worker needs to be both sufficiently distinct from anything else the user might be doing around background workers and simple enough to get up and running. Celery is neither of those things.

There are a few critical features which Huey does well which weren't all available in the other offerings:

Wagtail's background workers needed to be sufficiently separate from any other background workers used on the project. `django-db-queue` and `django-lightweight-queue` both integrate deeply with Django, and whilst it's possible to have a dedicated queue just for Wagtail, it may cause issues and concerns for consumers. Huey doesn't know or care about Django, and so even if a user were already using Huey, it could still be used within Wagtail without conflicts.

Wagtail's background worker also needed to assume as little as possible about the environment it was being run in, especially around what to use as a job store. This meant any library which didn't support multiple backends was immediately discarded.

Wagtail is used by a number of large and high-profile companies, meaning using smaller libraries was less desirable from a maintenance and supply-chain security perspective.

APScheduler and Huey made the short list, but at the time of writing, support for out-of-process workers in APScheduler is still a work in progress and unreleased. Huey not only ticked all the boxes in terms of features, but was also sufficiently simple to work with and integrate, and popular enough to feel confident adding it as a dependency for Wagtail.

## Proposed API

Similarly to Django's caching framework, a global "background" object can be imported, which is used to add tasks. This will be a "Huey" instance, potentially subclassed with some Wagtail sugar. This global will be pre-configured based on the application's settings such as backend, immediate mode, and connection details.

```python
from wagtail.contrib.tasks import background


@background.task()
def do_a_task():
    pass


# And now, run the task
do_a_task()
```

Member

might be a bit early in the process to think about naming, but maybe we should add a section for proposed names.

I think task is already taken up in the context of Workflow/Tasks.

Maybe; job or worker

But I realise Huey uses Task in its terminology so maybe unavoidable.

Member Author

Yeah the name here is definitely just a WIP one to fill a gap.

"worker" feels like it's the thing which performs the task, as opposed to the task itself.

"job" is definitely a common enough term that it could be fine. Agree the conflict with workflows isn't ideal. We could be more explicit and call them "background tasks" or alike?

Using this object, tasks are submitted using the existing Huey constructs and patterns.

For Wagtail [hooks](https://docs.wagtail.io/en/stable/reference/hooks.html), there will be an additional argument passed when registering the hook, which will transparently convert the hook to a task and ensure it's submitted as a task when the hook is called. Only certain hooks will support background tasks. Others, such as those for registering URLs or menu items, must be run synchronously, and so will ignore the `background` argument.

Member

Perhaps it is the responsibility of the caller of the hook to decide whether it should be run asynchronously?

Member Author

There may be cases with some hooks where the user wants it to run synchronously, which I've intentionally not ruled out. It will have to be down to the caller of the hook as to whether it's possible for it to be run async, though.


```python
from wagtail.core import hooks

@hooks.register('name_of_hook', background=True)
def my_hook_function(arg1, arg2):  # the arguments depend on the hook
    pass
```

Whilst it's possible to define tasks anywhere in an application, the convention will be to put them alongside hooks in the `wagtail_hooks.py` file, to ensure they're imported at the right times.

### Settings

```python
WAGTAIL_TASKS = {
    "BACKEND": "",  # str: module path to the backend to use for tasks (default empty)
    "BACKEND_OPTIONS": {},  # dict: any additional options the backend may take, e.g. connection parameters (default empty)
}
```

Most of the settings are passed through to Huey as [its configuration](https://huey.readthedocs.io/en/latest/api.html) with little to no modification.

Because Huey doesn't handle "Immediate" tasks as a distinct backend, a blank `BACKEND` is synonymous with immediate mode.
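
As a rough sketch of that rule, the settings could be normalised as follows. The helper `resolve_task_config` and the backend module path are hypothetical. Only the `BACKEND` and `BACKEND_OPTIONS` keys and the "blank means immediate" behaviour come from this RFC.

```python
def resolve_task_config(wagtail_tasks=None):
    """Normalise a WAGTAIL_TASKS-style dict into (backend_path, options, immediate)."""
    config = wagtail_tasks or {}
    backend = config.get("BACKEND", "")
    options = dict(config.get("BACKEND_OPTIONS", {}))
    # A blank backend means there is nowhere to queue tasks, so run them inline.
    immediate = not backend
    return backend, options, immediate


# No setting at all: Wagtail behaves as it does today, running everything inline.
default_config = resolve_task_config()

redis_config = resolve_task_config(
    {
        "BACKEND": "example.backends.RedisBackend",  # hypothetical module path
        "BACKEND_OPTIONS": {"url": "redis://localhost:6379/0"},
    }
)
```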

## Current workarounds

For Wagtail's internals, it's currently not possible to control how these are executed. For user-controlled code, it's possible to implement a task queueing system separate from Wagtail, and manually submit tasks to it as needed.

For scheduled tasks, Wagtail currently relies on Django's management commands, which requires the user to use a tool such as cron or Heroku's Scheduler to execute them.

## Implementation plan

1. Contribute cron-style consumer to Huey
2. Contribute a Django ORM based storage backend for Huey
3. Create the basic plumbing and configuration required to get a worker running as a part of the Wagtail codebase, based off the existing Huey constructs
4. Enable creating Wagtail hooks as tasks
5. Documentation
6. Begin migrating background-compatible bits of Wagtail to tasks
7. Documentation
8. Initial release?
9. Complete migrating background-compatible bits of Wagtail to tasks
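
To make the difference between the cron-style consumer in step 1 and a persistent worker concrete, here is a minimal standard-library sketch. It does not use Huey's consumer; `drain_once` and `run_forever` are hypothetical names, and the in-memory `queue.Queue` stands in for a real task store.

```python
import queue
import threading


def drain_once(tasks):
    """Cron-style consumer: run whatever is queued right now, then exit."""
    completed = 0
    while True:
        try:
            func = tasks.get_nowait()
        except queue.Empty:
            return completed
        func()
        completed += 1


def run_forever(tasks, stop_event, poll_interval=0.05):
    """Persistent consumer: keep polling for new tasks until asked to stop."""
    while not stop_event.is_set():
        try:
            func = tasks.get(timeout=poll_interval)
        except queue.Empty:
            continue
        func()


tasks = queue.Queue()
done = []
tasks.put(lambda: done.append("a"))
tasks.put(lambda: done.append("b"))
drained = drain_once(tasks)  # what a cron entry would invoke periodically

stop_event = threading.Event()
finished = threading.Event()
worker = threading.Thread(target=run_forever, args=(tasks, stop_event))
worker.start()
tasks.put(finished.set)  # picked up while the worker is running
finished.wait(timeout=2)
stop_event.set()
worker.join(timeout=2)
```

The cron-style consumer trades latency for simplicity: tasks wait until the next invocation, but no long-lived process needs to be supervised.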

## Open Questions

- How will / should this interact with the ongoing "Bulk Actions" work?
- Should the executors be a unified management command with an additional flag, or 2 distinct commands?
- Should we not use Huey and write something ourselves?
- Is this _contrib_?
- Should Huey be an optional dependency (thus requiring us to implement "immediate mode" ourselves) or a required dependency?
- The background runner should probably have a name of some sort (and be consistent around terminology eg tasks)
- Should Django ORM support be a day-1 feature? (Wider support base, but delays release)

Member

Additional questions

  • Should we integrate with audit logs or reporting
  • Should there be some kind of UI built to review status/logs from these tasks
  • What other Wagtail (or Django) packages exist
  • Should a more generic problem be looked at (message queue/publish / subscribe etc), for example being able to request a webhook (or set of webhooks) when certain actions are made, probably out of scope but just a thought
  • is it worth preparing a Wagtail package that implements this RFC (as much as possible), in a similar process to https://github.com/wagtail/wagtail-generic-chooser (although, that is not an RFC), this way it can get some practical usage and feedback before going into core... or maybe an isolated package will be good enough

Some relevant links

General feedback

  • I actually thought this week about a Workflow Task that does a background job (e.g. cliche SEO check or maybe runs Google lighthouse on the draft published page), this RFC would be a great candidate for this.

Member Author

> Should there be some kind of UI built to review status/logs from these tasks?

I suspect not. These tasks are really just an implementation detail, as opposed to anything someone may want to monitor. With that said, I could definitely see a use for a list of scheduled tasks, giving admins the ability to trigger them manually if they wish?

> What other Wagtail (or Django) packages exist

See "Why Huey?" section

> Should a more generic problem be looked at (message queue/publish / subscribe etc), for example being able to request a webhook (or set of webhooks) when certain actions are made, probably out of scope but just a thought

I think they can be the same thing. Doing external network requests in the request/response cycle can hurt performance in many cases, so if the webhook doesn't respond immediately it'll slow everything down. Given that, webhooks could definitely be an interesting (future) extension, and probably warrants having background workers regardless.

> is it worth preparing a Wagtail package that implements this RFC

In an ideal world, yes, although some of the benefits of background workers are in moving parts of wagtail-core into the background, which can't be done with an external package. With that said, integration with things like signals and hooks could definitely be done externally.

One of the main benefits of this solution though is that it's built in. If it's an external package, it just becomes another competing standard.