diff --git a/Dockerfile b/Dockerfile
index 303a9db..711d148 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,6 +1,8 @@
 ARG WORKER_CUDA_VERSION=12.1.0
 FROM runpod/base:0.6.2-cuda${WORKER_CUDA_VERSION}
+# Reinitialize, as it's lost after the FROM command
+ARG WORKER_CUDA_VERSION=12.1.0

 # Python dependencies
 COPY builder/requirements.txt /requirements.txt
@@ -9,7 +11,8 @@ RUN python3.11 -m pip install --upgrade pip && \
     rm /requirements.txt
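+# Reinstall torch from the PyTorch nightly index; the short CUDA tag (e.g. 12.1.0 -> 121) is derived from WORKER_CUDA_VERSION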
 RUN pip uninstall torch -y && \
-    pip install --pre torch==2.4.0.dev20240518+cu${WORKER_CUDA_VERSION//./} --index-url https://download.pytorch.org/whl/nightly/cu${WORKER_CUDA_VERSION//./} --no-cache-dir
+    CUDA_VERSION_SHORT=$(echo ${WORKER_CUDA_VERSION} | cut -d. -f1,2 | tr -d .) && \
+    pip install --pre torch==2.4.0.dev20240518+cu${CUDA_VERSION_SHORT} --index-url https://download.pytorch.org/whl/nightly/cu${CUDA_VERSION_SHORT} --no-cache-dir

 ENV HF_HOME=/runpod-volume
diff --git a/README.md b/README.md
index d47cb3c..85b3715 100644
--- a/README.md
+++ b/README.md
@@ -1,40 +1,91 @@
-> [!WARNING]
-> This is a work in progress and is not yet ready for use in production.
+
+# Infinity Embedding Serverless Worker
-# Infinity Text Embedding and ReRanker Worker (OpenAI Compatible)
-Based on [Infinity Text Embedding Engine](https://github.com/michaelfeil/infinity)
+Deploy almost any Text Embedding and Reranker model with high-throughput, OpenAI-compatible endpoints on RunPod Serverless, powered by [Infinity](https://github.com/michaelfeil/infinity), the fastest embedding inference engine built for serving.
-## Docker Image
+
+
+
+# Supported Models
+When using the `torch` backend, you can deploy any model supported by the sentence-transformers library.
+
+This also means that you can deploy any model from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is currently the most popular and comprehensive leaderboard for embedding models.
+
+
+
+# Setting up the Serverless Endpoint
+## Option 1: Deploy any models directly from RunPod Console with Pre-Built Docker Image
+
+> [!NOTE]
+> We are adding a UI for deployment similar to [Worker vLLM](https://github.com/runpod-workers/worker-vllm), but for now, you can manually create the endpoint with the regular serverless configurator.
+
+
+We offer a pre-built Docker Image for the Infinity Embedding Serverless Worker that you can configure entirely with Environment Variables when creating the Endpoint:
+
+### 1. Select Worker Image Version
 You can directly use the following docker images and configure them via Environment Variables.
-* CUDA 11.8: `not built`
-* CUDA 12.1: `michaelf34/runpod-infinity-worker:0.0.5-cu121`
-
-## RunPod Template Environment Variables
-* `MODEL_NAMES`: HuggingFace repo of a single model or multiple models separated by semicolon.
-  * Example - Single Model: `BAAI/bge-small-en-v1.5;`
-  * Example - Multiple Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
-* `BATCH_SIZES`: Batch size for each model separated by semicolon. If not provided, default batch size of 32 will be used.
-* `BACKEND`: Backend for all models. Recommended is `torch` which is the default. Other options are `optimum` or `ctranslate2`.
-* `DTYPES`: Dtype, by default `auto` or `fp16`.
-
-## Supported Models
-
-  What models are supported?
-  - All models supported by the sentence-transformers library.
-  - All models reuploaded on the sentence transformers org https://huggingface.co/sentence-transformers / sbert.net.
-  - With the command `--engine torch` the model must be compatible with sentence-transformers library
-  - For the latest trends, you might want to check out one of the following models.
-    https://huggingface.co/spaces/mteb/leaderboard
+| CUDA Version | Stable (Latest Release) | Development (Latest Commit) | Note |
+|--------------|-------------------------|-----------------------------|------|
+| 11.8.0 | `runpod/worker-infinity-embedding:stable-cuda11.8.0` | `runpod/worker-infinity-embedding:dev-cuda11.8.0` | Available on all RunPod Workers without additional selection needed. |
+| 12.1.0 | `runpod/worker-infinity-embedding:stable-cuda12.1.0` | `runpod/worker-infinity-embedding:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Versions 12.4, 12.3, 12.2 and 12.1 in the filter. About 10% fewer available machines than 11.8.0, but higher performance. |
+
+### 2. Select your models and configure your deployment with Environment Variables
+* `MODEL_NAMES`
+
+  HuggingFace repo of a single model, or multiple models separated by semicolons.
+
+  - Examples:
+    - **Single** Model: `BAAI/bge-small-en-v1.5`
+    - **Multiple** Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
+* `BATCH_SIZES`
+
+  Batch size for each model, separated by semicolons.
+
+  - Default: `32`
+* `BACKEND`
+
+  Backend for all models.
-
+  - Options:
+    - `torch`
+    - `optimum`
+    - `ctranslate2`
+  - Default: `torch`
+* `DTYPES`
+
+  Precision for each model, separated by semicolons.
+
+  - Options:
+    - `auto`
+    - `fp16`
+    - `fp8` (**New!** Only compatible with H100 and L40S)
+  - Default: `auto`
+
+* `INFINITY_QUEUE_SIZE`
+
+  How many requests can be queued in the Infinity Engine.
+
+  - Default: `48000`
-## Usage - OpenAI Compatibility
+* `RUNPOD_MAX_CONCURRENT_REQUESTS`
+
+  How many requests can be processed concurrently by the RunPod Worker.
+
+  - Default: `300`
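+
+For illustration, an Endpoint serving the two example models above might use Environment Variable values like these (example values only; adapt them to your own deployment):
+
+```bash
+MODEL_NAMES="BAAI/bge-small-en-v1.5;intfloat/e5-large-v2"
+BATCH_SIZES="32;32"
+BACKEND="torch"
+DTYPES="auto;auto"
+```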
+
+## Option 2: Bake models into Docker Image
+Coming soon!
+
+# Usage
+There are two ways to use the endpoint: [OpenAI Compatibility](#openai-compatibility), which matches how you would use the OpenAI API, and [Standard Usage](#standard-usage) with the RunPod API. Note that reranking is only available with [Standard Usage](#standard-usage).
+## OpenAI Compatibility
 ### Set up
-Initialize OpenAI client and set the API Key to your RunPod API Key, and base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`
+1. Install the OpenAI Python SDK
+```bash
+pip install openai
+```
+2. Initialize the OpenAI client, setting the API key to your RunPod API key and the base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`, where `YOUR_ENDPOINT_ID` is the ID of your endpoint, e.g. `elftzf0lld1vw1`
 ```python
 from openai import OpenAI
@@ -64,9 +115,9 @@ client = OpenAI(
 ```
 Where `YOUR_DEPLOYED_MODEL_NAME` is the name of one of the models you deployed to the worker.

-## Usage - Standard
+## Standard Usage
 ### Set up
-You may use /run or /runsync
+You may use `/run` (asynchronous, starts the job and returns a job ID) or `/runsync` (synchronous, waits for the job to finish and returns the result).

 ### Embedding
 Inputs:
@@ -81,9 +132,6 @@ Inputs:
 * `return_docs`: whether to return the reranked documents or not
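+
+For example, a reranking request sent to `/runsync` could look like this (illustrative request; replace the endpoint ID, API key placeholder, and payload with your own values; the fields mirror the inputs listed above):
+
+```bash
+curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
+  -H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"input": {"query": "Where is Paris?", "docs": ["Paris is in France", "Rome is in Italy"], "model": "BAAI/bge-reranker-v2-m3", "return_docs": true}}'
+```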

-### Additional testing
+# Acknowledgements
+We'd like to thank [Michael Feil](https://github.com/michaelfeil) for creating the [Infinity Embedding Engine](https://github.com/michaelfeil/infinity) and for being actively involved in the development of this worker!

-For the Reranker models
-```bash
-python src/handler.py --test_input '{"input": {"query": "Where is paris?", "docs": ["Paris is in France", "Rome is in Italy"], "model": "BAAI/bge-reranker-v2-m3"}}'
-```
\ No newline at end of file
diff --git a/src/config.py b/src/config.py
index 0f9998a..39e8d48 100644
--- a/src/config.py
+++ b/src/config.py
@@ -46,3 +46,7 @@ def batch_sizes(self) -> list[int]:
     def dtypes(self) -> list[str]:
         dtypes = self._get_no_required_multi("DTYPES", "auto")
         return dtypes
+
+    @cached_property
+    def runpod_max_concurrency(self) -> int:
+        return int(os.environ.get("RUNPOD_MAX_CONCURRENT_REQUESTS", 300))
diff --git a/src/handler.py b/src/handler.py
index 4dd9696..a7bb04b 100644
--- a/src/handler.py
+++ b/src/handler.py
@@ -58,5 +58,8 @@ async def async_generator_handler(job: dict[str, Any]):

 if __name__ == "__main__":
     runpod.serverless.start(
-        {"handler": async_generator_handler, "concurrency_modifier": lambda x: 3000}
+        {
+            "handler": async_generator_handler,
+            "concurrency_modifier": lambda x: embedding_service.config.runpod_max_concurrency,
+        }
     )