Is your feature request related to a problem? Please describe.
I’m facing an issue when deploying large models in Kubernetes, especially when the pod’s ephemeral storage is limited. Triton Inference Server seems to download models to local disk (ephemeral storage) before loading them into GPU memory, which poses a problem when the available local storage is insufficient for large models. This issue becomes critical when running Triton in environments with constrained disk space, such as cloud environments where models reside in S3, but ephemeral storage on the pods is too limited for full model downloads.
Describe the solution you'd like
I would like Triton Inference Server to support direct streaming of model weights from cloud storage (e.g., S3, GCS) to GPU memory without storing the model on disk first. This feature would allow Triton to efficiently load large models in resource-constrained environments by bypassing the need for intermediate storage and directly loading the model into GPU memory. The system could stream or partially load the model in memory/GPU as required for inference, optimizing the process for large-scale deployments.
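Conceptually, such streaming could look like the sketch below: fetch the object in byte ranges and copy each chunk straight into a preallocated buffer, so nothing ever touches disk and host memory holds at most one chunk at a time. The `read_range` callable, the chunk size, and the bytearray destination are all stand-ins for illustration (a real implementation would use a ranged S3 GET and a pinned-host or GPU allocation), not Triton APIs.

```python
CHUNK = 8 * 1024 * 1024  # 8 MiB per range request (illustrative)

def stream_to_buffer(read_range, total_size, chunk=CHUNK):
    """Stream a remote object into a preallocated buffer, chunk by chunk.

    `read_range(start, end)` is a hypothetical callable standing in for a
    ranged S3 GET (e.g. an HTTP Range request). The bytearray stands in
    for pinned host memory or a device allocation; no disk is involved.
    """
    dest = bytearray(total_size)
    offset = 0
    while offset < total_size:
        end = min(offset + chunk, total_size)
        dest[offset:end] = read_range(offset, end)
        offset = end
    return dest

# Demo: an in-memory blob stands in for the remote object.
blob = bytes(range(256)) * 1000
weights = stream_to_buffer(lambda s, e: blob[s:e], len(blob), chunk=4096)
```

The same loop generalizes to writing each chunk into a device buffer via a host-to-device copy instead of a bytearray slice assignment.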
Describe alternatives you've considered
Some alternatives to address this issue include:
- Mounting Persistent Volumes (PV) or Persistent Volume Claims (PVC) in Kubernetes to increase available storage, but this introduces additional overhead and complexity.
- Chunk-based loading or model parallelism techniques, but these require significant changes to model architecture and inference workflows.
- Mounting models in memory via tmpfs, which works for smaller models but is impractical for very large models.
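The chunk-based alternative can be sketched as loading one layer at a time, so peak staging memory is bounded by the largest single layer rather than the whole checkpoint. The layer names and the `fetch_layer`/`upload` hooks below are hypothetical; in a real pipeline they would be a ranged blob read and a host-to-device copy.

```python
def load_layers(layer_sizes, fetch_layer, upload):
    """Stage one layer at a time: fetch, upload, discard.

    `fetch_layer(name)` and `upload(name, data)` are hypothetical hooks;
    in practice they would be a ranged S3 read and a cudaMemcpy-style
    transfer. Returns the peak bytes staged in host memory at once.
    """
    peak = 0
    for name, size in layer_sizes.items():
        data = fetch_layer(name)     # only this layer is resident on host
        assert len(data) == size
        upload(name, data)           # move to device, then let `data` go
        peak = max(peak, len(data))  # host footprint never exceeds one layer
    return peak

# Demo: three fake layers; peak staging equals the largest layer.
sizes = {"embed": 1000, "block0": 4000, "lm_head": 2000}
store = {n: bytes(s) for n, s in sizes.items()}
uploaded = {}
peak = load_layers(sizes, store.__getitem__, uploaded.__setitem__)
```

This is the part that, per the list above, currently demands invasive changes to the model loading path rather than being something Triton offers out of the box.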
Additional context
Many modern large language models, such as DeepSeek-Coder-V2-Instruct or other transformer-based architectures, can be too large to fit into a pod’s ephemeral storage. Allowing Triton to stream models from cloud storage directly into GPU memory would simplify deployment in environments like Kubernetes, where scaling and efficient resource use are critical.
Given the current limitations in Triton Inference Server when dealing with constrained ephemeral storage, are there any workarounds or best practices you would recommend for efficiently loading large models from cloud storage (e.g., S3) directly to GPU memory without relying on significant local disk space? Any guidance or alternative solutions from your side would be greatly appreciated.
Not really sure what streaming would look like. It sounds like you want to attach your S3 object to your pod. I suggest downloading the model weights from blob storage and storing them in a Kubernetes PVC; you can then mount that PVC to multiple pods.
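One way to operationalize that suggestion is a one-time init job that copies the weights from blob storage onto the PVC mount before Triton starts. The sketch below uses stdlib stand-ins (an in-memory stream for the S3 response body, a temp file for the PVC path), so everything outside the copy loop is an assumption; the point is that `shutil.copyfileobj` keeps at most one chunk in memory regardless of model size.

```python
import io
import os
import shutil
import tempfile

def sync_to_pvc(source_stream, dest_path, chunk=8 * 1024 * 1024):
    """Copy a blob to a PVC-mounted path with bounded memory.

    `source_stream` stands in for the streaming body of an S3 GET;
    `dest_path` would be a path under the PVC mount point. The copy
    holds at most `chunk` bytes in memory at a time.
    """
    with open(dest_path, "wb") as dst:
        shutil.copyfileobj(source_stream, dst, length=chunk)
    return os.path.getsize(dest_path)

# Demo: in-memory stream and a temp file stand in for S3 and the PVC.
payload = b"\x00" * 100_000
dest = os.path.join(tempfile.mkdtemp(), "model.bin")
size = sync_to_pvc(io.BytesIO(payload), dest, chunk=4096)
```

Running this once (e.g. in an initContainer) lets every Triton pod mount the populated PVC read-only, trading the ephemeral-storage limit for a shared persistent volume.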