I've been deploying models with Triton Inference Server and noticed that it seems to use more memory when loading models compared to directly loading them in a Python script.
I'm curious to know if there are specific Triton configuration options or best practices that can help manage memory usage more effectively.
Additionally, I have observed differences in memory consumption between loading models via the Triton backend and directly through a FastAPI service. Below is a comparative overview:
```
    PID  USER  DEV  TYPE     GPU  GPU MEM      CPU  HOST MEM  Command
1895697  user    0  Compute   0%  3282MiB 14%   0%  869MiB    /opt/tritonserver/backends/python/triton_python_backend_stub models/whisper-large-v3/1/model.py triton_python_backend_shm_region_5 1048576 1048576 1895069 /opt/tritonserver/backends/python 336 whisper-large-v3_0_0 DEFAULT
1895698  user    0  Compute   0%  3282MiB 14%   0%  869MiB    /opt/tritonserver/backends/python/triton_python_backend_stub models/whisper-large-v3/1/model.py triton_python_backend_shm_region_6 1048576 1048576 1895069 /opt/tritonserver/backends/python 336 whisper-large-v3_0_1 DEFAULT
1884134  root    1  Compute   0%  1778MiB  8%   5%  938MiB    gunicorn: worker [gunicorn]
1884133  root    1  Compute   0%  1778MiB  8%   5%  938MiB    gunicorn: worker [gunicorn]
```
From the above, loading the model through the Triton Python backend consumes noticeably more GPU memory per process (about 3282 MiB per model instance) than serving the same model from a FastAPI service under Gunicorn (about 1778 MiB per worker). This raises questions about the inherent differences in memory management between the two approaches.
In the Triton case the per-instance footprint is significantly higher, which is a concern for scalability and resource optimization, especially in environments where multiple models or instances are required.
For reference, model.py follows the standard Python-backend layout: a TritonPythonModel class whose initialize() loads the model and whose execute() handles inference requests.
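In outline it looks like the sketch below (not the exact file, just a minimal approximation that assumes the Whisper checkpoint is loaded from Hugging Face via the transformers pipeline in initialize(), with hypothetical input/output tensor names AUDIO and TEXT):

```python
import numpy as np
import torch
from transformers import pipeline

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Runs once per model instance; each instance lives in its own
        # triton_python_backend_stub process and loads its own copy of the weights.
        device_id = int(args["model_instance_device_id"])
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",  # assumed Hugging Face checkpoint
            torch_dtype=torch.float16,
            device=device_id,
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumed input: a mono float32 waveform at 16 kHz named "AUDIO".
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            text = self.pipe(audio.squeeze())["text"]
            out = pb_utils.Tensor("TEXT", np.array([text], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.pipe = None
```

Because initialize() runs once per instance, the two instances in the table (whisper-large-v3_0_0 and whisper-large-v3_0_1) each hold a full copy of the weights, which lines up with the two ~3.2 GiB stub processes.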
I'm interested in exploring mechanisms or configurations within Triton that could help reduce this memory footprint, ensuring efficient resource use without altering the model itself. Any insights or guidance on this matter from the community would be greatly appreciated.
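One setting I am already aware of is instance_group: Triton starts one triton_python_backend_stub process per instance, so the instance count multiplies the per-model GPU footprint. As a sketch (illustrative values only; input/output definitions omitted), the relevant part of config.pbtxt for running a single instance on GPU 0 and relying on dynamic batching instead of extra instances would be:

```
name: "whisper-large-v3"
backend: "python"
max_batch_size: 8

# One stub process (and one copy of the weights) per instance;
# count: 1 roughly halves the GPU memory compared to the two instances above.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Batch concurrent requests instead of adding instances.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

That only trades concurrency for memory, though; the per-instance footprint is still much larger than a Gunicorn worker, so I'm hoping there are options beyond simply lowering the instance count.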