I've been deploying models with Triton Inference Server and noticed that it seems to use more memory when loading models compared to directly loading them in a Python script.
I'm curious to know if there are specific Triton configuration options or best practices that can help manage memory usage more effectively.
Additionally, I have observed differences in memory consumption between loading models via the Triton backend and directly through a FastAPI service. Below is a comparative overview:
```
    PID  USER  DEV  TYPE     GPU  GPU MEM      CPU  HOST MEM  Command
1895697  user    0  Compute   0%  3282MiB 14%   0%  869MiB    /opt/tritonserver/backends/python/triton_python_backend_stub models/whisper-large-v3/1/model.py triton_python_backend_shm_region_5 1048576 1048576 1895069 /opt/tritonserver/backends/python 336 whisper-large-v3_0_0 DEFAULT
1895698  user    0  Compute   0%  3282MiB 14%   0%  869MiB    /opt/tritonserver/backends/python/triton_python_backend_stub models/whisper-large-v3/1/model.py triton_python_backend_shm_region_6 1048576 1048576 1895069 /opt/tritonserver/backends/python 336 whisper-large-v3_0_1 DEFAULT
1884134  root    1  Compute   0%  1778MiB  8%   5%  938MiB    gunicorn: worker [gunicorn]
1884133  root    1  Compute   0%  1778MiB  8%   5%  938MiB    gunicorn: worker [gunicorn]
```
From the above, loading the model through the Triton Python backend consumes noticeably more GPU memory per process (about 3282 MiB per model instance) than serving the same model from a FastAPI service under Gunicorn (about 1778 MiB per worker). This raises questions about the inherent differences in memory management between the two approaches.
In the Triton case the per-instance footprint is significantly higher, which is a concern for scalability and resource optimization, especially in environments where multiple models or instances are required.
For reference, model.py follows the standard Python-backend layout: a TritonPythonModel class whose initialize() loads the model and whose execute() handles inference requests.
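In outline it looks like the sketch below (not the exact file, just a minimal approximation that assumes the Whisper checkpoint is loaded from Hugging Face via the transformers pipeline in initialize(), with hypothetical input/output tensor names AUDIO and TEXT):

```python
import numpy as np
import torch
from transformers import pipeline

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Runs once per model instance; each instance lives in its own
        # triton_python_backend_stub process and loads its own copy of the weights.
        device_id = int(args["model_instance_device_id"])
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",  # assumed Hugging Face checkpoint
            torch_dtype=torch.float16,
            device=device_id,
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumed input: a mono float32 waveform at 16 kHz named "AUDIO".
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            text = self.pipe(audio.squeeze())["text"]
            out = pb_utils.Tensor("TEXT", np.array([text], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.pipe = None
```

Because initialize() runs once per instance, the two instances in the table (whisper-large-v3_0_0 and whisper-large-v3_0_1) each hold a full copy of the weights, which lines up with the two ~3.2 GiB stub processes.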
I'm interested in exploring mechanisms or configurations within Triton that could help reduce this memory footprint, ensuring efficient resource use without altering the model itself. Any insights or guidance on this matter from the community would be greatly appreciated.
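One setting I am already aware of is instance_group: Triton starts one triton_python_backend_stub process per instance, so the instance count multiplies the per-model GPU footprint. As a sketch (illustrative values only; input/output definitions omitted), the relevant part of config.pbtxt for running a single instance on GPU 0 and relying on dynamic batching instead of extra instances would be:

```
name: "whisper-large-v3"
backend: "python"
max_batch_size: 8

# One stub process (and one copy of the weights) per instance;
# count: 1 roughly halves the GPU memory compared to the two instances above.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Batch concurrent requests instead of adding instances.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

That only trades concurrency for memory, though; the per-instance footprint is still much larger than a Gunicorn worker, so I'm hoping there are options beyond simply lowering the instance count.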