docs: Add workers doc #4628

Merged
merged 5 commits into from
Apr 7, 2024
4 changes: 2 additions & 2 deletions docs/source/guides/concurrency.rst
@@ -23,7 +23,7 @@ To specify concurrency for a BentoML Service, use the concurrency field in traff

Key points about concurrency in BentoML:

- Concurrency represents the ideal number of requests a Service can simultaneously process. By default, BentoML does not impose a limit on concurrency to avoid bottlenecks.
- ``concurrency`` is a new field introduced in BentoML 1.2.8. It represents the ideal number of requests that a BentoML Service (namely, all :doc:`workers </guides/workers>` in the Service) can simultaneously process. By default, BentoML does not impose a limit on concurrency to avoid bottlenecks.
- If your Service supports :doc:`adaptive batching </guides/adaptive-batching>` or continuous batching, set ``concurrency`` to match the batch size. This aligns processing capacity with batch requirements, optimizing throughput.
- If a Service spawns multiple workers to leverage the parallelism of the underlying hardware accelerators (for example, multi-device GPUs), set ``concurrency`` to the degree of parallelism the devices can support.
- For Services designed to handle one request at a time, set ``concurrency`` to ``1``, ensuring that requests are processed sequentially without overlap.
@@ -47,7 +47,7 @@ When using the ``traffic`` field in the ``@bentoml.service`` decorator, you can
Note that they serve different purposes:

- ``concurrency``: Indicates the ideal number of simultaneous requests that a Service is designed to handle efficiently. It's a guideline for optimizing performance, particularly in terms of how batching or parallel processing is implemented. This means that the simultaneous requests being processed by a Service instance can still exceed the ``concurrency`` configured.
- ``max_concurrency``: Acts as a hard limit on the number of requests that can be processed simultaneously by a single instance of a Service. It's used to prevent a Service from being overwhelmed by too many requests at once, which could degrade performance or lead to resource exhaustion. Requests that exceed the ``max_concurrency`` limit will be rejected to maintain QoS and ensure that each request is handled within an acceptable time frame.
- ``max_concurrency``: Acts as a hard limit on the number of requests that can be processed simultaneously by a single instance of a Service. It's used to prevent a Service from being overwhelmed by too many requests at once, which could degrade performance or lead to resource exhaustion. Requests that exceed the ``max_concurrency`` limit will be rejected to maintain QoS and ensure that each request is handled within an acceptable time frame. Note that starting from BentoML 1.2.8, ``max_concurrency`` applies to the aggregate of all workers within a Service. For prior versions, it works on a per-worker basis.
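
The difference between the two fields can be sketched in plain Python. This is not BentoML's implementation: ``AdmissionGate``, ``try_acquire``, ``release``, and ``over_target`` are hypothetical names used only to illustrate a soft target next to a hard limit.

```python
from dataclasses import dataclass


@dataclass
class AdmissionGate:
    """Toy model of a soft target (concurrency) and a hard cap (max_concurrency)."""

    concurrency: int      # ideal number of in-flight requests (a guideline only)
    max_concurrency: int  # hard limit; requests beyond it are rejected
    in_flight: int = 0

    def try_acquire(self) -> bool:
        """Admit a request unless the hard limit has already been reached."""
        if self.in_flight >= self.max_concurrency:
            return False  # rejected
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one in-flight request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1

    @property
    def over_target(self) -> bool:
        """True when load exceeds the ideal level but is still being served."""
        return self.in_flight > self.concurrency


gate = AdmissionGate(concurrency=2, max_concurrency=3)
results = [gate.try_acquire() for _ in range(4)]
```

In this sketch the third request is admitted even though it exceeds the ``concurrency`` target, while the fourth is rejected because it would exceed ``max_concurrency``.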

Concurrency-based autoscaling
-----------------------------
7 changes: 7 additions & 0 deletions docs/source/guides/index.rst
@@ -33,6 +33,12 @@ This chapter introduces the key features of BentoML. We recommend you read :doc:

Create an OCI-compliant image for your BentoML project and deploy it anywhere.

.. grid-item-card:: :doc:`/guides/workers`
    :link: /guides/workers
    :link-type: doc

    Understand BentoML workers and how to configure them.

.. grid-item-card:: :doc:`/guides/build-options`
    :link: /guides/build-options
    :link-type: doc
@@ -100,6 +106,7 @@ This chapter introduces the key features of BentoML. We recommend you read :doc:
iotypes
deployment
containerization
workers
build-options
model-store
distributed-services
83 changes: 83 additions & 0 deletions docs/source/guides/workers.rst
@@ -0,0 +1,83 @@
=======
Workers
=======

BentoML workers enhance the parallel processing capabilities of machine learning models. Under the hood, a BentoML :doc:`Service </guides/services>` contains one or more workers. They are the processes that actually run the code logic within the Service. This design leverages the parallelism of the underlying hardware, whether it's multi-core CPUs or multi-device GPUs.

This document explains how to configure and allocate workers for different use cases.

Configure workers
-----------------

When you define a BentoML Service, use the ``workers`` parameter to set the number of workers. For example, setting ``workers=4`` launches four worker instances of the Service, each running in its own process. Workers are homogeneous, which means they all perform the same tasks.

.. code-block:: python

    import bentoml

    @bentoml.service(
        workers=4,
    )
    class MyService:
        # Service implementation
        ...

The number of workers isn't necessarily equivalent to the number of concurrent requests a BentoML Service can serve in parallel. With optimizations like :doc:`adaptive batching </guides/adaptive-batching>` and continuous batching, each worker can potentially handle many requests simultaneously to enhance the throughput of your Service. To specify the ideal number of concurrent requests for a Service (namely, all workers within the Service), you can configure :doc:`concurrency </guides/concurrency>`.

Use cases
---------

Workers allow a BentoML Service to make effective use of the underlying hardware, such as multi-core CPUs and multiple GPUs.

The default worker count in BentoML is ``1``. Depending on your computational workload and hardware configuration, you might need to adjust this number.

CPU workloads
^^^^^^^^^^^^^

Python processes are subject to the Global Interpreter Lock (GIL), a mechanism that prevents multiple native threads from executing Python code at once. This means that in a multi-threaded Python program, even one running on a multi-core processor, only one thread can execute Python code at a time. As a result, CPU-bound Python programs cannot fully utilize the computational power of multi-core CPUs through multi-threading alone.

To avoid this and fully leverage multi-core CPUs, you can start multiple workers. However, be mindful of the memory implications, as each worker will load a copy of the model into memory. Ensure that your machine's memory can support the cumulative memory requirements of all workers.
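
The memory trade-off can be estimated before choosing a worker count. The sketch below is only an illustration: ``suggest_worker_count`` is a hypothetical helper, not part of BentoML, and it assumes you already know roughly how much memory one loaded copy of the model needs.

```python
import os


def suggest_worker_count(
    cpu_cores: int,
    total_memory_gb: float,
    per_worker_memory_gb: float,
) -> int:
    """Return a worker count bounded by both CPU cores and available memory."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    if per_worker_memory_gb <= 0:
        raise ValueError("per_worker_memory_gb must be positive")
    # Each worker loads its own copy of the model, so memory caps the count.
    memory_bound = int(total_memory_gb // per_worker_memory_gb)
    # Never return fewer than one worker; the caller must still verify that
    # a single copy of the model actually fits in memory.
    return max(1, min(cpu_cores, memory_bound))


# Number of cores on the current machine (os.cpu_count() may return None).
detected_cores = os.cpu_count() or 1

# Example: 8 cores, 16 GB of RAM, roughly 3 GB per loaded model copy.
workers = suggest_worker_count(cpu_cores=8, total_memory_gb=16, per_worker_memory_gb=3)
```

Here memory, not the core count, is the limiting factor, so the result is smaller than the number of cores. The returned integer could then be passed as the ``workers`` value of the Service.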

To match the number of worker processes to the available CPU cores, set ``workers`` to ``"cpu_count"``.

.. code-block:: python

    import bentoml

    @bentoml.service(workers="cpu_count")
    class MyService:
        # Service implementation
        ...

GPU workloads
^^^^^^^^^^^^^

In scenarios with multi-device GPUs, allocating specific GPUs to different workers allows each worker to process tasks independently. This can maximize parallel processing, increase throughput, and reduce overall inference time.

Use ``worker_index`` to identify a worker instance. It is a unique identifier for each worker process within a BentoML Service, starting from ``1``. This index is used primarily to allocate GPUs among multiple workers. One common use case is to load one model per CUDA device to ensure that each GPU is utilized efficiently and to prevent resource contention between models.

Here is an example:

.. code-block:: python

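    import bentoml
    import torch
    from torchvision import models

    @bentoml.service(
        resources={"gpu": 2},
        workers=2,
    )
    class MyService:

        def __init__(self):
            # worker_index is 1-based, so subtract 1 to get the 0-based CUDA device ID
            cuda = torch.device(f"cuda:{bentoml.server_context.worker_index - 1}")
            # Keep a reference on the instance so that API methods can use the model
            self.model = models.resnet18(pretrained=True)
            self.model.to(cuda)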

This Service dynamically determines the GPU device to use for the model by creating a ``torch.device`` object. The device ID is set by ``bentoml.server_context.worker_index - 1`` to allocate a specific GPU to each worker process. Worker 1 (``worker_index = 1``) uses GPU 0 and worker 2 (``worker_index = 2``) uses GPU 1. See the figure below for details.

.. image:: ../../_static/img/guides/workers/workers-models-gpus.png
    :width: 400px
    :align: center

Because ``worker_index`` is 1-based while hardware devices like GPUs are usually indexed from 0, subtract 1 from ``worker_index`` to get the 0-based device ID when assigning devices to workers, for example when loading models onto GPUs. For more information, see GPU inference.
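
The index arithmetic can be isolated in a small function. This is a minimal sketch that assumes a 1-based ``worker_index`` as described above; ``device_for_worker`` is a hypothetical helper, not a BentoML API.

```python
def device_for_worker(worker_index: int, num_gpus: int) -> str:
    """Map a 1-based worker index to a 0-based CUDA device string."""
    if not 1 <= worker_index <= num_gpus:
        raise ValueError(
            f"worker_index {worker_index} has no matching GPU (num_gpus={num_gpus})"
        )
    # Worker 1 -> cuda:0, worker 2 -> cuda:1, and so on.
    return f"cuda:{worker_index - 1}"


devices = [device_for_worker(i, num_gpus=2) for i in (1, 2)]
```

Validating the range up front surfaces a mismatch between the ``workers`` count and the number of GPUs early, instead of failing later when the model is moved to a device that does not exist.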

If you want to use multiple GPUs for distributed operations (multiple GPUs for the same worker), PyTorch and TensorFlow offer different methods:

- PyTorch: `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`_ and `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
- TensorFlow: `Distributed training <https://www.tensorflow.org/guide/distributed_training>`_