Commit
Final additions and changes for first release
alpayariyak committed May 30, 2024
1 parent 82b0544 commit 5579902
Showing 4 changed files with 97 additions and 39 deletions.
5 changes: 4 additions & 1 deletion Dockerfile
@@ -1,6 +1,8 @@
ARG WORKER_CUDA_VERSION=12.1.0
FROM runpod/base:0.6.2-cuda${WORKER_CUDA_VERSION}

# Reinitialize, as the ARG is lost after the FROM command
ARG WORKER_CUDA_VERSION=12.1.0

# Python dependencies
COPY builder/requirements.txt /requirements.txt
@@ -9,7 +11,8 @@ RUN python3.11 -m pip install --upgrade pip && \
rm /requirements.txt

RUN pip uninstall torch -y && \
pip install --pre torch==2.4.0.dev20240518+cu${WORKER_CUDA_VERSION//./} --index-url https://download.pytorch.org/whl/nightly/cu${WORKER_CUDA_VERSION//./} --no-cache-dir
CUDA_VERSION_SHORT=$(echo ${WORKER_CUDA_VERSION} | cut -d. -f1,2 | tr -d .) && \
pip install --pre torch==2.4.0.dev20240518+cu${CUDA_VERSION_SHORT} --index-url https://download.pytorch.org/whl/nightly/cu${CUDA_VERSION_SHORT} --no-cache-dir

ENV HF_HOME=/runpod-volume

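The new `RUN` step above derives PyTorch's CUDA wheel suffix from the full version string (`cut -d. -f1,2` keeps the major and minor parts, `tr -d .` drops the dot). As a quick illustration of that mapping, here is a standalone sketch, not part of the Dockerfile:

```python
# Sketch of the Dockerfile's version-shortening step: "12.1.0" -> "cu121".
def cuda_wheel_suffix(worker_cuda_version: str) -> str:
    major, minor = worker_cuda_version.split(".")[:2]  # cut -d. -f1,2
    return f"cu{major}{minor}"                         # tr -d .

assert cuda_wheel_suffix("12.1.0") == "cu121"
assert cuda_wheel_suffix("11.8.0") == "cu118"
```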
122 changes: 85 additions & 37 deletions README.md
@@ -1,40 +1,91 @@
> [!WARNING]
> This is a work in progress and is not yet ready for use in production.
<div align="center">

# Infinity Embedding Serverless Worker

# Infinity Text Embedding and ReRanker Worker (OpenAI Compatible)
Based on [Infinity Text Embedding Engine](https://github.com/michaelfeil/infinity)
Deploy almost any text embedding and reranker model behind high-throughput, OpenAI-compatible endpoints on RunPod Serverless, powered by the fastest embedding inference engine built for serving: [Infinity](https://github.com/michaelfeil/infinity)

## Docker Image

</div>

# Supported Models
When using the `torch` backend, you can deploy any model supported by the sentence-transformers library.

This also means you can deploy any model from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), currently the most popular and comprehensive leaderboard for embedding models.



# Setting up the Serverless Endpoint
## Option 1: Deploy any models directly from RunPod Console with Pre-Built Docker Image

> [!NOTE]
> We are adding a UI for deployment similar to [Worker vLLM](https://github.com/runpod-workers/worker-vllm), but for now, you can manually create the endpoint with the regular serverless configurator.

We offer a pre-built Docker Image for the Infinity Embedding Serverless Worker that you can configure entirely with Environment Variables when creating the Endpoint:

### 1. Select Worker Image Version
You can directly use the following docker images and configure them via Environment Variables.
* CUDA 11.8: `not built`
* CUDA 12.1: `michaelf34/runpod-infinity-worker:0.0.5-cu121`

## RunPod Template Environment Variables
* `MODEL_NAMES`: HuggingFace repo of a single model or multiple models separated by semicolon.
* Example - Single Model: `BAAI/bge-small-en-v1.5;`
* Example - Multiple Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
* `BATCH_SIZES`: Batch size for each model separated by semicolon. If not provided, default batch size of 32 will be used.
* `BACKEND`: Backend for all models. Recommended is `torch` which is the default. Other options are `optimum` or `ctranslate2`.
* `DTYPES`: Dtype, by default `auto` or `fp16`.

## Supported Models
<details>
<summary>What models are supported?</summary>

- All models supported by the sentence-transformers library.
- All models reuploaded on the sentence transformers org https://huggingface.co/sentence-transformers / sbert.net.

With the command `--engine torch` the model must be compatible with sentence-transformers library

For the latest trends, you might want to check out one of the following models.
https://huggingface.co/spaces/mteb/leaderboard
| CUDA Version | Stable (Latest Release) | Development (Latest Commit) | Note |
|--------------|-----------------------------------|-----------------------------------|----------------------------------------------------------------------|
| 11.8.0 | `runpod/worker-infinity-embedding:stable-cuda11.8.0` | `runpod/worker-infinity-embedding:dev-cuda11.8.0` | Available on all RunPod Workers without additional selection needed. |
| 12.1.0 | `runpod/worker-infinity-embedding:stable-cuda12.1.0` | `runpod/worker-infinity-embedding:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Versions 12.4, 12.3, 12.2, and 12.1 in the filter. About 10% fewer total available machines than 11.8.0, but higher performance. |

### 2. Select your models and configure your deployment with Environment Variables
* `MODEL_NAMES`

HuggingFace repo of a single model, or multiple models separated by semicolons.

- Examples:
- **Single** Model: `BAAI/bge-small-en-v1.5`
- **Multiple** Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
* `BATCH_SIZES`

Batch size for each model, separated by semicolons and paired positionally with `MODEL_NAMES` (see the sketch after this list).

- Default: `32`
* `BACKEND`

Backend for all models.

</details>
- Options:
- `torch`
- `optimum`
- `ctranslate2`
- Default: `torch`
* `DTYPES`

Precision for each model, separated by semicolons.

- Options:
- `auto`
- `fp16`
- `fp8` (**New!** Only compatible with H100 and L40S)
- Default: `auto`

* `INFINITY_QUEUE_SIZE`

How many requests can be queued in the Infinity Engine.

- Default: `48000`

## Usage - OpenAI Compatibility
* `RUNPOD_MAX_CONCURRENCY`

How many requests can be processed concurrently by the RunPod Worker.

- Default: `300`
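To make the semicolon convention concrete, here is a minimal sketch of how paired variables such as `MODEL_NAMES` and `BATCH_SIZES` line up. It is illustrative only, with hypothetical values; the worker's actual parsing lives in `src/config.py`:

```python
import os

# Hypothetical endpoint configuration; in practice these are set as
# environment variables when creating the RunPod endpoint.
os.environ["MODEL_NAMES"] = "BAAI/bge-small-en-v1.5;intfloat/e5-large-v2"
os.environ["BATCH_SIZES"] = "32;16"

# Split on semicolons, ignoring any trailing empty entry.
models = [m for m in os.environ["MODEL_NAMES"].split(";") if m]
batch_sizes = [int(b) for b in os.environ["BATCH_SIZES"].split(";") if b]

for model, batch in zip(models, batch_sizes):
    print(f"{model}: batch size {batch}")
# BAAI/bge-small-en-v1.5: batch size 32
# intfloat/e5-large-v2: batch size 16
```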

## Option 2: Bake models into Docker Image
Coming soon!

# Usage
There are two ways to use the endpoint: [OpenAI Compatibility](#openai-compatibility), matching how you would use the OpenAI API, and [Standard Usage](#standard-usage) with the RunPod API. Note that reranking is only available with [Standard Usage](#standard-usage).
## OpenAI Compatibility
### Set up
Initialize OpenAI client and set the API Key to your RunPod API Key, and base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`
1. Install OpenAI Python SDK
```bash
pip install openai
```
2. Initialize the OpenAI client, setting the API key to your RunPod API key and the base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`, where `YOUR_ENDPOINT_ID` is the ID of your endpoint, e.g. `elftzf0lld1vw1`
```python
from openai import OpenAI

@@ -64,9 +115,9 @@ client = OpenAI(
```
Where `YOUR_DEPLOYED_MODEL_NAME` is the name of one of the models you deployed to the worker.
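Since the diff collapses the body of the snippet above, here is a complete end-to-end sketch, assuming the hidden lines follow the standard OpenAI client pattern; the API key, endpoint ID, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

response = client.embeddings.create(
    model="YOUR_DEPLOYED_MODEL_NAME",  # e.g. one entry from MODEL_NAMES
    input=["A sentence to embed.", "Another sentence to embed."],
)
print(len(response.data), "embeddings returned")
print(response.data[0].embedding[:5])  # first few dimensions
```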

## Usage - Standard
## Standard Usage
### Set up
You may use /run or /runsync
You may use `/run` (asynchronous: starts the job and returns a job ID) or `/runsync` (synchronous: waits for the job to finish and returns the result). A request sketch follows the input lists below.

### Embedding
Inputs:
@@ -81,9 +132,6 @@ Inputs:
* `return_docs`: whether to return the reranked documents or not
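For reference, a standard `/runsync` call with the reranking schema might look like the sketch below. The request shape mirrors the handler test input shown further down; the endpoint ID and API key are placeholders:

```python
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder

# Reranking payload, matching the handler's test input.
payload = {
    "input": {
        "query": "Where is Paris?",
        "docs": ["Paris is in France", "Rome is in Italy"],
        "model": "BAAI/bge-reranker-v2-m3",
    }
}
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
print(response.json())
```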


### Additional testing
# Acknowledgements
We'd like to thank [Michael Feil](https://github.com/michaelfeil) for creating the [Infinity Embedding Engine](https://github.com/michaelfeil/infinity) and for being actively involved in the development of this worker!

For the Reranker models
```bash
python src/handler.py --test_input '{"input": {"query": "Where is paris?", "docs": ["Paris is in France", "Rome is in Italy"], "model": "BAAI/bge-reranker-v2-m3"}}'
```
4 changes: 4 additions & 0 deletions src/config.py
@@ -46,3 +46,7 @@ def batch_sizes(self) -> list[int]:
    def dtypes(self) -> list[str]:
        dtypes = self._get_no_required_multi("DTYPES", "auto")
        return dtypes

    @cached_property
    def runpod_max_concurrency(self) -> int:
        return int(os.environ.get("RUNPOD_MAX_CONCURRENCY", 300))
5 changes: 4 additions & 1 deletion src/handler.py
@@ -58,5 +58,8 @@ async def async_generator_handler(job: dict[str, Any]):

if __name__ == "__main__":
    runpod.serverless.start(
        {"handler": async_generator_handler, "concurrency_modifier": lambda x: 3000}
        {
            "handler": async_generator_handler,
            "concurrency_modifier": lambda x: embedding_service.config.runpod_max_concurrency,
        }
    )
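To trace the wiring end to end, here is a condensed sketch combining the `config.py` and `handler.py` changes above, assuming RunPod's `concurrency_modifier` receives the worker's current concurrency and returns the desired target:

```python
import os

# The new config property reads RUNPOD_MAX_CONCURRENCY (default 300)...
max_concurrency = int(os.environ.get("RUNPOD_MAX_CONCURRENCY", 300))

# ...and the handler now returns that ceiling instead of a hard-coded 3000.
def concurrency_modifier(current_concurrency: int) -> int:
    # Ignore the current value; pin concurrency to the configured ceiling.
    return max_concurrency

print(concurrency_modifier(12))  # 300, regardless of the current value
```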
