Are FP8 models supported in Triton? #7678
Comments
@oandreeva-nv, can you help with this?
Hi @jayakommuru, let me verify. I'll get back to you.
The TRT backend does not support FP8 I/O for the TRT engine. However, weights and internal tensors can be FP8.
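For example, such an engine would still be declared with FP16 (or FP32) I/O in its model configuration. A minimal config.pbtxt sketch, assuming an FP8-quantized TensorRT plan (the model name, tensor names, and dims are hypothetical):

```
name: "encoder_fp8"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"          # hypothetical tensor name
    data_type: TYPE_FP16   # I/O stays FP16; weights/internal tensors can be FP8
    dims: [ 128 ]
  }
]
output [
  {
    name: "output"         # hypothetical tensor name
    data_type: TYPE_FP16
    dims: [ 768 ]
  }
]
```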
@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine with FP16 I/O? And which Triton data type should be used with an FP8 TRT engine file in the TRT backend?
@oandreeva-nv Can you confirm whether using FP16 Triton I/O data types with an FP8 TRT engine gives any benefit? Thanks.
Hi @jayakommuru, we have a perf_analyzer tool that can help you analyze the performance of your model.
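A sketch of how the two variants could be compared, assuming both are deployed on a local Triton server (the model names and gRPC endpoint are assumptions):

```shell
# Measure throughput/latency of the current FP16 deployment
perf_analyzer -m encoder_fp16 -u localhost:8001 -i grpc --concurrency-range 1:8

# Repeat for the FP8 engine variant and compare the reported
# inferences/sec and latency percentiles
perf_analyzer -m encoder_fp8 -u localhost:8001 -i grpc --concurrency-range 1:8
```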
@oandreeva-nv Sure, we will explore perf_analyzer. Any idea whether to use the FP32 or FP16 Triton I/O data type for TensorRT FP8 models?
Original issue description:
We have an encoder-based model currently deployed in production in FP16 mode, and we want to reduce latency further. Does Triton support FP8? In the data types documentation here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes I don't see FP8 listed.
We are using the trtexec CLI to convert ONNX to a TRT engine file, and I see an option --fp8 to generate FP8 serialized engine files. Can anyone confirm whether we can deploy FP8 models in Triton? An example of the conversion command is shown below.
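For reference, the conversion looks roughly like this (file paths are placeholders; as we understand it, the --fp8 flag generally expects an ONNX model that already carries explicit Q/DQ quantization nodes):

```shell
# Build an FP8-enabled serialized engine from an ONNX model (placeholder paths)
trtexec --onnx=encoder.onnx \
        --fp8 \
        --saveEngine=model.plan
```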