Write a custom flash-attention function for the deberta model. #359

Open
wolfassi123 opened this issue Sep 12, 2024 · 1 comment

wolfassi123 commented Sep 12, 2024

Model description

I used michaelf34/infinity:0.0.55 to deploy the mixed_bread_large reranker.

The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 seconds, versus 0.8 seconds for 100 requests with TEI serving BGE, even though BGE-large and mixed_bread_large are the same size (335M parameters).
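For context, a minimal sketch of the benchmark being described (not the author's actual script), assuming infinity's default port (7997) and its /rerank endpoint; the model id, query, and documents are placeholders to substitute for your deployment:

```python
import time
import requests

# Editor's sketch: send 100 sequential rerank requests to a local
# infinity container. Endpoint path and payload shape follow infinity's
# /rerank API; the model id below is an assumption.
URL = "http://localhost:7997/rerank"
payload = {
    "model": "mixedbread-ai/mxbai-rerank-large-v1",  # assumed model id
    "query": "how can reranker inference be optimized?",
    "documents": ["Use dynamic batching.", "Use a smaller model."],
}

start = time.perf_counter()
for _ in range(100):
    resp = requests.post(URL, json=payload, timeout=30)
    resp.raise_for_status()
elapsed = time.perf_counter() - start
print(f"100 requests in {elapsed:.2f}s")
```

Note that sequential requests like this measure per-request latency; sending them concurrently would let infinity's dynamic batching improve throughput.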

What is the best way to optimize the deployment and inference?

Open source status

  • The model implementation is available on transformers
  • The model weights are available on huggingface-hub
  • I verified that the model is currently not running in the latest version: pip install infinity_emb[all] --upgrade

Provide useful links for the implementation

No response

michaelfeil (Owner) commented
BGE-large uses BERT (infinity DOES overwrite the modeling code with a flash-attention replacement).
MixedBread-large uses DeBERTa (infinity does not overwrite the modeling code with a flash-attention replacement).

DeBERTa-v2 uses significantly more FLOPs (disentangled attention), and its implementation is also less optimized.
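To illustrate why (editor's sketch, not infinity's or transformers' actual code): DeBERTa's disentangled attention adds content-to-position (c2p) and position-to-content (p2c) score terms on top of the standard content-to-content (c2c) matmul, so an off-the-shelf flash-attention kernel only accelerates one of the three terms. All shapes and names below are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch, heads, seq len, head dim, relative-position buckets.
B, H, L, D, K = 2, 12, 128, 64, 256

q_c = torch.randn(B, H, L, D)          # content queries
k_c = torch.randn(B, H, L, D)          # content keys
q_r = torch.randn(H, K, D)             # projected relative-position embeddings (query side)
k_r = torch.randn(H, K, D)             # projected relative-position embeddings (key side)
rel_idx = torch.randint(0, K, (L, L))  # bucketed relative position of each (i, j) pair

# c2c is an ordinary attention matmul -- the only term a standard
# flash-attention / SDPA kernel covers out of the box.
c2c = q_c @ k_c.transpose(-1, -2)                                   # (B, H, L, L)

# c2p and p2c score tokens against relative-position embeddings, then
# gather the entry matching each pair's actual relative distance.
c2p = torch.gather(q_c @ k_r.transpose(-1, -2), -1,
                   rel_idx.expand(B, H, L, L))                      # (B, H, L, L)
p2c = torch.gather(k_c @ q_r.transpose(-1, -2), -1,
                   rel_idx.t().expand(B, H, L, L))                  # (B, H, L, L)

# DeBERTa scales by sqrt(3 * d) because three score terms are summed.
scores = (c2c + c2p + p2c) / (3 * D) ** 0.5
attn = F.softmax(scores, dim=-1)
```

A custom flash-attention path for DeBERTa would therefore have to fuse the c2p/p2c gathers into the kernel (or restrict a fused kernel to the c2c term alone), which is why the drop-in replacement that exists for BERT does not carry over directly.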

@michaelfeil michaelfeil changed the title Best Way to infer deployed models Write a custom flash-attention function for the deberta model. Sep 12, 2024