Is your feature request related to a problem? Please describe.
I’m facing an issue when deploying large models in Kubernetes, especially when the pod’s ephemeral storage is limited. Triton Inference Server seems to download models to local disk (ephemeral storage) before loading them into GPU memory, which poses a problem when the available local storage is insufficient for large models. This issue becomes critical when running Triton in environments with constrained disk space, such as cloud environments where models reside in S3, but ephemeral storage on the pods is too limited for full model downloads.
Describe the solution you'd like
I would like Triton Inference Server to support direct streaming of model weights from cloud storage (e.g., S3, GCS) to GPU memory without storing the model on disk first. This feature would allow Triton to efficiently load large models in resource-constrained environments by bypassing the need for intermediate storage and directly loading the model into GPU memory. The system could stream or partially load the model in memory/GPU as required for inference, optimizing the process for large-scale deployments.
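Conceptually, such streaming could look like the sketch below: fetch the object in byte ranges and copy each chunk straight into a preallocated buffer, so nothing ever touches disk and host memory holds at most one chunk at a time. The `read_range` callable, the chunk size, and the bytearray destination are all stand-ins for illustration (a real implementation would use a ranged S3 GET and a pinned-host or GPU allocation), not Triton APIs.

```python
CHUNK = 8 * 1024 * 1024  # 8 MiB per range request (illustrative)

def stream_to_buffer(read_range, total_size, chunk=CHUNK):
    """Stream a remote object into a preallocated buffer, chunk by chunk.

    `read_range(start, end)` is a hypothetical callable standing in for a
    ranged S3 GET (e.g. an HTTP Range request). The bytearray stands in
    for pinned host memory or a device allocation; no disk is involved.
    """
    dest = bytearray(total_size)
    offset = 0
    while offset < total_size:
        end = min(offset + chunk, total_size)
        dest[offset:end] = read_range(offset, end)
        offset = end
    return dest

# Demo: an in-memory blob stands in for the remote object.
blob = bytes(range(256)) * 1000
weights = stream_to_buffer(lambda s, e: blob[s:e], len(blob), chunk=4096)
```

The same loop generalizes to writing each chunk into a device buffer via a host-to-device copy instead of a bytearray slice assignment.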
Describe alternatives you've considered
Some alternatives to address this issue include:
- Mounting Persistent Volumes (PV) or Persistent Volume Claims (PVC) in Kubernetes to increase available storage, but this introduces additional overhead and complexity.
- Chunk-based loading or model parallelism techniques, but these require significant changes to model architecture and inference workflows.
- Mounting models in memory via tmpfs, which works for smaller models but is impractical for very large models.
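The chunk-based alternative can be sketched as loading one layer at a time, so peak staging memory is bounded by the largest single layer rather than the whole checkpoint. The layer names and the `fetch_layer`/`upload` hooks below are hypothetical; in a real pipeline they would be a ranged blob read and a host-to-device copy.

```python
def load_layers(layer_sizes, fetch_layer, upload):
    """Stage one layer at a time: fetch, upload, discard.

    `fetch_layer(name)` and `upload(name, data)` are hypothetical hooks;
    in practice they would be a ranged S3 read and a cudaMemcpy-style
    transfer. Returns the peak bytes staged in host memory at once.
    """
    peak = 0
    for name, size in layer_sizes.items():
        data = fetch_layer(name)     # only this layer is resident on host
        assert len(data) == size
        upload(name, data)           # move to device, then let `data` go
        peak = max(peak, len(data))  # host footprint never exceeds one layer
    return peak

# Demo: three fake layers; peak staging equals the largest layer.
sizes = {"embed": 1000, "block0": 4000, "lm_head": 2000}
store = {n: bytes(s) for n, s in sizes.items()}
uploaded = {}
peak = load_layers(sizes, store.__getitem__, uploaded.__setitem__)
```

This is the part that, per the list above, currently demands invasive changes to the model loading path rather than being something Triton offers out of the box.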
Additional context
Many modern large language models, such as DeepSeek-Coder-V2-Instruct or other transformer-based architectures, can be too large to fit into a pod’s ephemeral storage. Allowing Triton to stream models from cloud storage directly into GPU memory would simplify deployment in environments like Kubernetes, where scaling and efficient resource use are critical.
Given the current limitations in Triton Inference Server when dealing with constrained ephemeral storage, are there any workarounds or best practices you would recommend for efficiently loading large models from cloud storage (e.g., S3) directly to GPU memory without relying on significant local disk space? Any guidance or alternative solutions from your side would be greatly appreciated.
Not really sure what streaming would look like. It sounds like you want to attach your S3 object to your pod. I suggest downloading the model weights from blob storage and storing them in a Kubernetes PVC; you can then mount that PVC to multiple pods.
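One way to operationalize that suggestion is a one-time init job that copies the weights from blob storage onto the PVC mount before Triton starts. The sketch below uses stdlib stand-ins (an in-memory stream for the S3 response body, a temp file for the PVC path), so everything outside the copy loop is an assumption; the point is that `shutil.copyfileobj` keeps at most one chunk in memory regardless of model size.

```python
import io
import os
import shutil
import tempfile

def sync_to_pvc(source_stream, dest_path, chunk=8 * 1024 * 1024):
    """Copy a blob to a PVC-mounted path with bounded memory.

    `source_stream` stands in for the streaming body of an S3 GET;
    `dest_path` would be a path under the PVC mount point. The copy
    holds at most `chunk` bytes in memory at a time.
    """
    with open(dest_path, "wb") as dst:
        shutil.copyfileobj(source_stream, dst, length=chunk)
    return os.path.getsize(dest_path)

# Demo: in-memory stream and a temp file stand in for S3 and the PVC.
payload = b"\x00" * 100_000
dest = os.path.join(tempfile.mkdtemp(), "model.bin")
size = sync_to_pvc(io.BytesIO(payload), dest, chunk=4096)
```

Running this once (e.g. in an initContainer) lets every Triton pod mount the populated PVC read-only, trading the ephemeral-storage limit for a shared persistent volume.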