diff --git a/Dockerfile b/Dockerfile
index 303a9db..711d148 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,6 +1,8 @@
ARG WORKER_CUDA_VERSION=12.1.0
FROM runpod/base:0.6.2-cuda${WORKER_CUDA_VERSION}
+# Redeclare WORKER_CUDA_VERSION, as ARG values are lost after the FROM instruction
+ARG WORKER_CUDA_VERSION=12.1.0
# Python dependencies
COPY builder/requirements.txt /requirements.txt
@@ -9,7 +11,8 @@ RUN python3.11 -m pip install --upgrade pip && \
rm /requirements.txt
RUN pip uninstall torch -y && \
- pip install --pre torch==2.4.0.dev20240518+cu${WORKER_CUDA_VERSION//./} --index-url https://download.pytorch.org/whl/nightly/cu${WORKER_CUDA_VERSION//./} --no-cache-dir
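+    # Derive the short CUDA tag from WORKER_CUDA_VERSION (e.g. 12.1.0 -> 121) for the PyTorch nightly wheel and index URL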
+ CUDA_VERSION_SHORT=$(echo ${WORKER_CUDA_VERSION} | cut -d. -f1,2 | tr -d .) && \
+ pip install --pre torch==2.4.0.dev20240518+cu${CUDA_VERSION_SHORT} --index-url https://download.pytorch.org/whl/nightly/cu${CUDA_VERSION_SHORT} --no-cache-dir
ENV HF_HOME=/runpod-volume
diff --git a/README.md b/README.md
index d47cb3c..85b3715 100644
--- a/README.md
+++ b/README.md
@@ -1,40 +1,91 @@
-> [!WARNING]
-> This is a work in progress and is not yet ready for use in production.
+
+# Infinity Embedding Serverless Worker
-# Infinity Text Embedding and ReRanker Worker (OpenAI Compatible)
-Based on [Infinity Text Embedding Engine](https://github.com/michaelfeil/infinity)
+Deploy almost any Text Embedding and Reranker model with high-throughput, OpenAI-compatible endpoints on RunPod Serverless, powered by [Infinity](https://github.com/michaelfeil/infinity), the fastest embedding inference engine built for serving.
-## Docker Image
+
+
+
+# Supported Models
+With the `torch` backend, you can deploy any model supported by the sentence-transformers library.
+
+This also means that you can deploy any model from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is currently the most popular and comprehensive leaderboard for embedding models.
+
+
+
+# Setting up the Serverless Endpoint
+## Option 1: Deploy any model directly from the RunPod Console with the Pre-Built Docker Image
+
+> [!NOTE]
+> We are adding a UI for deployment similar to [Worker vLLM](https://github.com/runpod-workers/worker-vllm), but for now, you can manually create the endpoint with the regular serverless configurator.
+
+
+We offer a pre-built Docker Image for the Infinity Embedding Serverless Worker that you can configure entirely with Environment Variables when creating the Endpoint:
+
+### 1. Select Worker Image Version
You can directly use the following docker images and configure them via Environment Variables.
-* CUDA 11.8: `not built`
-* CUDA 12.1: `michaelf34/runpod-infinity-worker:0.0.5-cu121`
-
-## RunPod Template Environment Variables
-* `MODEL_NAMES`: HuggingFace repo of a single model or multiple models separated by semicolon.
- * Example - Single Model: `BAAI/bge-small-en-v1.5;`
- * Example - Multiple Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
-* `BATCH_SIZES`: Batch size for each model separated by semicolon. If not provided, default batch size of 32 will be used.
-* `BACKEND`: Backend for all models. Recommended is `torch` which is the default. Other options are `optimum` or `ctranslate2`.
-* `DTYPES`: Dtype, by default `auto` or `fp16`.
-
-## Supported Models
-
- What models are supported?
-
- - All models supported by the sentence-transformers library.
- - All models reuploaded on the sentence transformers org https://huggingface.co/sentence-transformers / sbert.net.
-
- With the command `--engine torch` the model must be compatible with sentence-transformers library
-
- For the latest trends, you might want to check out one of the following models.
- https://huggingface.co/spaces/mteb/leaderboard
+| CUDA Version | Stable (Latest Release) | Development (Latest Commit) | Note |
+|--------------|-----------------------------------|-----------------------------------|----------------------------------------------------------------------|
+| 11.8.0 | `runpod/worker-infinity-embedding:stable-cuda11.8.0` | `runpod/worker-infinity-embedding:dev-cuda11.8.0` | Available on all RunPod workers; no additional CUDA version selection needed. |
+| 12.1.0 | `runpod/worker-infinity-embedding:stable-cuda12.1.0` | `runpod/worker-infinity-embedding:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Versions 12.1, 12.2, 12.3, and 12.4 in the filter. About 10% fewer available machines than 11.8.0, but higher performance. |
+
+### 2. Select your models and configure your deployment with Environment Variables
+* `MODEL_NAMES`
+
+ HuggingFace repo of a single model or multiple models separated by semicolon.
+
+ - Examples:
+ - **Single** Model: `BAAI/bge-small-en-v1.5`
+ - **Multiple** Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
+* `BATCH_SIZES`
+
+ Batch Size for each model separated by semicolon.
+
+ - Default: `32`
+* `BACKEND`
+
+ Backend for all models.
-
+ - Options:
+ - `torch`
+ - `optimum`
+ - `ctranslate2`
+ - Default: `torch`
+* `DTYPES`
+
+ Precision for each model separated by semicolon.
+
+ - Options:
+ - `auto`
+ - `fp16`
+ - `fp8` (**New!** Only compatible with H100 and L40S)
+ - Default: `auto`
+
+* `INFINITY_QUEUE_SIZE`
+
+ How many requests can be queued in the Infinity Engine.
+
+ - Default: `48000`
-## Usage - OpenAI Compatibility
+* `RUNPOD_MAX_CONCURRENT_REQUESTS`
+
+ How many requests can be processed concurrently by the RunPod Worker.
+
+ - Default: `300`
+
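+For example, a worker serving one embedding model and one reranker might be configured like this (the models, batch sizes, and dtypes below are purely illustrative; `BATCH_SIZES` and `DTYPES` take one value per model in `MODEL_NAMES`):
+
+```
+MODEL_NAMES=BAAI/bge-small-en-v1.5;BAAI/bge-reranker-v2-m3
+BATCH_SIZES=32;16
+DTYPES=fp16;fp16
+BACKEND=torch
+```
+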
+## Option 2: Bake models into Docker Image
+Coming soon!
+
+# Usage
+There are two ways to use the endpoint: [OpenAI Compatibility](#openai-compatibility), which mirrors how you would use the OpenAI API, and [Standard Usage](#standard-usage) via the RunPod API. Note that reranking is only available through [Standard Usage](#standard-usage).
+## OpenAI Compatibility
### Set up
-Initialize OpenAI client and set the API Key to your RunPod API Key, and base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`
+1. Install OpenAI Python SDK
+```bash
+pip install openai
+```
+2. Initialize the OpenAI client, setting the API key to your RunPod API Key and the base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`, where `YOUR_ENDPOINT_ID` is the ID of your endpoint, e.g. `elftzf0lld1vw1`.
```python
from openai import OpenAI
@@ -64,9 +115,9 @@ client = OpenAI(
```
Where `YOUR_DEPLOYED_MODEL_NAME` is the name of one of the models you deployed to the worker.
-## Usage - Standard
+## Standard Usage
### Set up
-You may use /run or /runsync
+You may use `/run` (asynchronous: starts the job and returns a job ID) or `/runsync` (synchronous: waits for the job to finish and returns the result).
### Embedding
Inputs:
@@ -81,9 +132,6 @@ Inputs:
* `return_docs`: whether to return the reranked documents or not
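+
+For example, a rerank request to `/runsync` might look like the following sketch (the endpoint ID, API key, and model name are placeholders; use one of the reranker models from your `MODEL_NAMES`):
+
+```python
+import requests
+
+# Placeholders: substitute your own endpoint ID and RunPod API key.
+url = "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync"
+headers = {"Authorization": "Bearer YOUR_RUNPOD_API_KEY"}
+
+payload = {
+    "input": {
+        "query": "Where is Paris?",
+        "docs": ["Paris is in France", "Rome is in Italy"],
+        "return_docs": True,
+        "model": "BAAI/bge-reranker-v2-m3",
+    }
+}
+
+response = requests.post(url, json=payload, headers=headers)
+print(response.json())
+```
+
+The same payload sent to `/run` returns a job ID immediately, which you can poll via the endpoint's `/status/JOB_ID` route.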
-### Additional testing
+# Acknowledgements
+We'd like to thank [Michael Feil](https://github.com/michaelfeil) for creating the [Infinity Embedding Engine](https://github.com/michaelfeil/infinity) and for his active involvement in the development of this worker!
-For the Reranker models
-```bash
-python src/handler.py --test_input '{"input": {"query": "Where is paris?", "docs": ["Paris is in France", "Rome is in Italy"], "model": "BAAI/bge-reranker-v2-m3"}}'
-```
\ No newline at end of file
diff --git a/src/config.py b/src/config.py
index 0f9998a..39e8d48 100644
--- a/src/config.py
+++ b/src/config.py
@@ -46,3 +46,7 @@ def batch_sizes(self) -> list[int]:
def dtypes(self) -> list[str]:
dtypes = self._get_no_required_multi("DTYPES", "auto")
return dtypes
+
+ @cached_property
+ def runpod_max_concurrency(self) -> int:
+        # Maximum number of concurrent requests per worker, documented as RUNPOD_MAX_CONCURRENT_REQUESTS in the README
+        return int(os.environ.get("RUNPOD_MAX_CONCURRENT_REQUESTS", 300))
diff --git a/src/handler.py b/src/handler.py
index 4dd9696..a7bb04b 100644
--- a/src/handler.py
+++ b/src/handler.py
@@ -58,5 +58,8 @@ async def async_generator_handler(job: dict[str, Any]):
if __name__ == "__main__":
runpod.serverless.start(
- {"handler": async_generator_handler, "concurrency_modifier": lambda x: 3000}
+ {
+ "handler": async_generator_handler,
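+            # Let this worker handle up to the configured number of concurrent jobs (default 300)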
+ "concurrency_modifier": lambda x: embedding_service.config.runpod_max_concurrency,
+ }
)