Model description

I used michaelf34/infinity:0.0.55 to deploy the mixedbread-ai large reranker. The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 seconds, versus 0.8 seconds for 100 requests with TEI serving BGE-large, even though BGE-large and the mixedbread large reranker have the same size (335M parameters).
What is the best way to optimize the deployment and inference?
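For reference, a minimal benchmark sketch with python requests. The /rerank route, port 7997, payload shape, and model id are assumptions based on infinity's defaults; adjust them to your deployment.

```python
import time

import requests

# Assumed deployment details -- adjust host, port, and model to your setup.
URL = "http://localhost:7997/rerank"  # 7997 is infinity's default port
PAYLOAD = {
    "model": "mixedbread-ai/mxbai-rerank-large-v1",  # assumed model id
    "query": "What is the capital of France?",
    "documents": ["Paris is the capital of France.", "Berlin is in Germany."],
}

# Time 100 sequential requests, mirroring the benchmark described above.
start = time.perf_counter()
for _ in range(100):
    resp = requests.post(URL, json=PAYLOAD, timeout=30)
    resp.raise_for_status()
elapsed = time.perf_counter() - start
print(f"100 requests in {elapsed:.2f}s ({elapsed / 100 * 1000:.1f} ms/request)")
```

Note that sequential requests cannot benefit from infinity's dynamic batching; firing the 100 requests concurrently (thread pool or asyncio) usually improves throughput considerably.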
Open source status
The model implementation is available on transformers
The model weights are available on huggingface-hub
I verified that the model is currently not running in the latest version (pip install infinity_emb[all] --upgrade)
Provide useful links for the implementation
No response
BGE-large uses BERT (infinity DOES overwrite the modeling code with a flash-attention replacement).
MixedBread-large uses DeBERTa (infinity does NOT overwrite the modeling code / flash-attention replacement).
DeBERTa-V2 uses significantly more FLOPs (disentangled attention), and its implementation is also less optimized.
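To make the FLOPs gap concrete, here is an illustrative sketch (not the transformers implementation; the relative-distance gather and scaling are omitted, and all shapes are made up): DeBERTa's disentangled attention computes three score matrices where BERT computes one.

```python
import torch

B, H, L, D = 1, 16, 512, 64          # batch, heads, sequence length, head dim
q_c = torch.randn(B, H, L, D)        # content queries
k_c = torch.randn(B, H, L, D)        # content keys
q_r = torch.randn(H, 2 * L, D)       # relative-position queries (shared across batch)
k_r = torch.randn(H, 2 * L, D)       # relative-position keys (shared across batch)

# BERT: a single content-to-content score matrix.
scores_bert = q_c @ k_c.transpose(-1, -2)

# DeBERTa: three score matrices that are summed into the attention logits.
c2c = q_c @ k_c.transpose(-1, -2)    # content-to-content
c2p = q_c @ k_r.transpose(-1, -2)    # content-to-position
p2c = k_c @ q_r.transpose(-1, -2)    # position-to-content
# Real DeBERTa gathers c2p/p2c entries by relative distance before summing;
# the point here is only the roughly 3x score-matrix cost.
```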
michaelfeil changed the title from "Best Way to infer deployed models" to "Write a custom flash-attention function for the deberta model" on Sep 12, 2024
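One plausible direction for that fix, sketched below: precompute the c2p/p2c score terms as an additive bias and hand the fused softmax(QK^T + bias)V computation to PyTorch's scaled_dot_product_attention. This is not a drop-in patch for transformers' DisentangledSelfAttention; the function name and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def disentangled_sdpa(q, k, v, rel_bias, dropout_p=0.0):
    # q, k, v: (batch, heads, seq, head_dim) content projections.
    # rel_bias: (batch, heads, seq, seq) additive bias precomputed from the
    # c2p and p2c terms (the expensive relative gather stays outside the kernel).
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=rel_bias, dropout_p=dropout_p
    )
```

With an arbitrary additive bias, SDPA may select the memory-efficient backend rather than the pure flash kernel, but it still fuses the softmax and avoids materializing eager-mode intermediate tensors.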