
Are FP8 models supported in Triton? #7678

Open

jayakommuru opened this issue Oct 4, 2024 · 7 comments
Labels
question Further information is requested

Comments

@jayakommuru

We have an encoder-based model that is currently deployed in FP16 mode in production, and we want to reduce the latency further.

Does Triton support FP8? I don't see FP8 listed in the datatypes documentation here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes

We are using the trtexec CLI to convert ONNX models to TRT engine files, and I see an `--fp8` option for generating FP8 serialized engines. Can anyone confirm whether we can deploy FP8 models in Triton?
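
For context, a minimal sketch of the conversion we are experimenting with (file names are placeholders for our actual model; exact flags depend on the TensorRT version):

```shell
# Build an FP8-quantized engine from an ONNX model (illustrative invocation)
trtexec --onnx=encoder.onnx \
        --fp8 \
        --saveEngine=encoder_fp8.plan
```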

@jayakommuru
Author

@oandreeva-nv can you help with this? ^^

@oandreeva-nv
Contributor

oandreeva-nv commented Oct 4, 2024

Hi @jayakommuru, let me verify it. I'll get back to you.

@oandreeva-nv oandreeva-nv added the question Further information is requested label Oct 4, 2024
@oandreeva-nv
Contributor

The TRT backend does not support FP8 I/O for the TRT engine. However, weights and internal tensors can be FP8.
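
So on the Triton side you would declare a supported datatype such as TYPE_FP16 for the inputs and outputs, even when the engine is FP8 internally. A minimal config.pbtxt sketch, with placeholder model name, tensor names, and shapes:

```
# Sketch only: FP16 I/O wrapping an FP8-quantized TensorRT engine
name: "encoder_fp8"        # placeholder model name
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"          # placeholder tensor name
    data_type: TYPE_FP16   # FP8 is not a valid Triton I/O datatype
    dims: [ 128 ]          # placeholder shape
  }
]
output [
  {
    name: "output"         # placeholder tensor name
    data_type: TYPE_FP16
    dims: [ 768 ]          # placeholder shape
  }
]
```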

@jayakommuru
Author

@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine file with FP16 I/O? Which Triton datatype should be used with an FP8 TRT engine file in the TRT backend?

@jayakommuru
Author

@oandreeva-nv can you confirm whether using FP16 I/O Triton datatypes with an FP8 TRT engine gives any benefit? Thanks

@oandreeva-nv
Contributor

oandreeva-nv commented Oct 7, 2024

Hi @jayakommuru, we have a perf_analyzer tool that can help you analyze the performance of your model.
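
A minimal sketch of how you could compare the two engine variants (model names and concurrency range are placeholders for your deployment):

```shell
# Measure the FP8 engine with FP16 I/O under increasing client concurrency
perf_analyzer -m encoder_fp8 --concurrency-range 1:8
# Repeat with the FP16 baseline and compare throughput/latency
perf_analyzer -m encoder_fp16 --concurrency-range 1:8
```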

@jayakommuru
Author

@oandreeva-nv Sure, I will explore perf_analyzer. Any idea whether to use the FP32 or FP16 I/O Triton datatype for TensorRT FP8 models?
