You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Support static EC fp8 quantization
We already support quantization for fp8 in the static cache. HPS will perform fp8 quantization on the embedding vector when reading the embedding table by enable fp8_quant configuration, and perform fp32 dequantization on the embedding vector corresponding to the queried embedding key in the static embedding cache, so as to ensure the accuracy of dense part prediction.
Large model deployment demo based on HPS TensorRT-plugin
This demo shows how to use the HPS TRT-plugin to build a complete TRT engine for deploying a 147GB embedding table based on a 1TB Criteo dataset. We also provide static embedding implementation for fully offloading embedding tables to host page-locke memory for benchmarks on x86 and Grace Hopper Superchip.
Issues Fixed
Resolve Kafka update ingestion error. There was an error that prevented handing over online parameter updates coming from Kafka message queues to Redis database backends.
Fixed HPS Triton backend re-initializing the embedding cache issue due to undefined null when getting the embedded cache on the corresponding device.