Write a custom flash-attention function for the deberta model. #359

Open
wolfassi123 opened this issue Sep 12, 2024 · 1 comment

wolfassi123 commented Sep 12, 2024

Model description

I used michaelf34/infinity:0.0.55 to deploy the mixed_bread_large reranker.

The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 seconds, versus 0.8 seconds for 100 requests with TEI serving BGE, even though BGE-large and mixed_bread_large are the same size (335M parameters).
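For context, a minimal sketch of the benchmark being described (not the author's actual script), assuming infinity's default port (7997) and its /rerank endpoint; the model id, query, and documents are placeholders to substitute for your deployment:

```python
import time
import requests

# Editor's sketch: send 100 sequential rerank requests to a local
# infinity container. Endpoint path and payload shape follow infinity's
# /rerank API; the model id below is an assumption.
URL = "http://localhost:7997/rerank"
payload = {
    "model": "mixedbread-ai/mxbai-rerank-large-v1",  # assumed model id
    "query": "how can reranker inference be optimized?",
    "documents": ["Use dynamic batching.", "Use a smaller model."],
}

start = time.perf_counter()
for _ in range(100):
    resp = requests.post(URL, json=payload, timeout=30)
    resp.raise_for_status()
elapsed = time.perf_counter() - start
print(f"100 requests in {elapsed:.2f}s")
```

Note that sequential requests like this measure per-request latency; sending them concurrently would let infinity's dynamic batching improve throughput.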

What is the best way to optimize the deployment and inference?

Open source status

  • The model implementation is available on transformers
  • The model weights are available on huggingface-hub
  • I verified that the model is currently not running in the latest version: pip install infinity_emb[all] --upgrade

Provide useful links for the implementation

No response

michaelfeil (Owner) commented
BGE-large uses BERT (infinity DOES overwrite the modeling code with a flash-attention replacement).
MixedBread-large uses DeBERTa (infinity does not overwrite the modeling code with a flash-attention replacement).

DeBERTa-v2 uses significantly more FLOPs (disentangled attention), and its implementation is also less optimized.
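To illustrate why (editor's sketch, not infinity's or transformers' actual code): DeBERTa's disentangled attention adds content-to-position (c2p) and position-to-content (p2c) score terms on top of the standard content-to-content (c2c) matmul, so an off-the-shelf flash-attention kernel only accelerates one of the three terms. All shapes and names below are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch, heads, seq len, head dim, relative-position buckets.
B, H, L, D, K = 2, 12, 128, 64, 256

q_c = torch.randn(B, H, L, D)          # content queries
k_c = torch.randn(B, H, L, D)          # content keys
q_r = torch.randn(H, K, D)             # projected relative-position embeddings (query side)
k_r = torch.randn(H, K, D)             # projected relative-position embeddings (key side)
rel_idx = torch.randint(0, K, (L, L))  # bucketed relative position of each (i, j) pair

# c2c is an ordinary attention matmul -- the only term a standard
# flash-attention / SDPA kernel covers out of the box.
c2c = q_c @ k_c.transpose(-1, -2)                                   # (B, H, L, L)

# c2p and p2c score tokens against relative-position embeddings, then
# gather the entry matching each pair's actual relative distance.
c2p = torch.gather(q_c @ k_r.transpose(-1, -2), -1,
                   rel_idx.expand(B, H, L, L))                      # (B, H, L, L)
p2c = torch.gather(k_c @ q_r.transpose(-1, -2), -1,
                   rel_idx.t().expand(B, H, L, L))                  # (B, H, L, L)

# DeBERTa scales by sqrt(3 * d) because three score terms are summed.
scores = (c2c + c2p + p2c) / (3 * D) ** 0.5
attn = F.softmax(scores, dim=-1)
```

A custom flash-attention path for DeBERTa would therefore have to fuse the c2p/p2c gathers into the kernel (or restrict a fused kernel to the c2c term alone), which is why the drop-in replacement that exists for BERT does not carry over directly.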

@michaelfeil michaelfeil changed the title Best Way to infer deployed models Write a custom flash-attention function for the deberta model. Sep 12, 2024