Commit
Final additions and changes for first release
alpayariyak committed May 30, 2024
1 parent 82b0544 commit 5579902
Showing 4 changed files with 97 additions and 39 deletions.
5 changes: 4 additions & 1 deletion Dockerfile
@@ -1,6 +1,8 @@
ARG WORKER_CUDA_VERSION=12.1.0
FROM runpod/base:0.6.2-cuda${WORKER_CUDA_VERSION}

# Reinitialize, as the ARG is lost after the FROM command
ARG WORKER_CUDA_VERSION=12.1.0

# Python dependencies
COPY builder/requirements.txt /requirements.txt
@@ -9,7 +11,8 @@ RUN python3.11 -m pip install --upgrade pip && \
rm /requirements.txt

RUN pip uninstall torch -y && \
pip install --pre torch==2.4.0.dev20240518+cu${WORKER_CUDA_VERSION//./} --index-url https://download.pytorch.org/whl/nightly/cu${WORKER_CUDA_VERSION//./} --no-cache-dir
CUDA_VERSION_SHORT=$(echo ${WORKER_CUDA_VERSION} | cut -d. -f1,2 | tr -d .) && \
pip install --pre torch==2.4.0.dev20240518+cu${CUDA_VERSION_SHORT} --index-url https://download.pytorch.org/whl/nightly/cu${CUDA_VERSION_SHORT} --no-cache-dir

ENV HF_HOME=/runpod-volume

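The new `RUN` step above derives PyTorch's CUDA wheel suffix from the full version string (`cut -d. -f1,2` keeps the major and minor parts, `tr -d .` drops the dot). As a quick illustration of that mapping, here is a standalone sketch, not part of the Dockerfile:

```python
# Sketch of the Dockerfile's version-shortening step: "12.1.0" -> "cu121".
def cuda_wheel_suffix(worker_cuda_version: str) -> str:
    major, minor = worker_cuda_version.split(".")[:2]  # cut -d. -f1,2
    return f"cu{major}{minor}"                         # tr -d .

assert cuda_wheel_suffix("12.1.0") == "cu121"
assert cuda_wheel_suffix("11.8.0") == "cu118"
```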
122 changes: 85 additions & 37 deletions README.md
@@ -1,40 +1,91 @@
> [!WARNING]
> This is a work in progress and is not yet ready for use in production.
<div align="center">

# Infinity Embedding Serverless Worker

# Infinity Text Embedding and ReRanker Worker (OpenAI Compatible)
Based on [Infinity Text Embedding Engine](https://github.com/michaelfeil/infinity)
Deploy almost any text embedding and reranker model behind high-throughput, OpenAI-compatible endpoints on RunPod Serverless, powered by the fastest embedding inference engine built for serving: [Infinity](https://github.com/michaelfeil/infinity)

## Docker Image

</div>

# Supported Models
When using the `torch` backend, you can deploy any model supported by the sentence-transformers library.

This also means you can deploy any model from the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), currently the most popular and comprehensive leaderboard for embedding models.



# Setting up the Serverless Endpoint
## Option 1: Deploy any models directly from RunPod Console with Pre-Built Docker Image

> [!NOTE]
> We are adding a UI for deployment similar to [Worker vLLM](https://github.com/runpod-workers/worker-vllm), but for now, you can manually create the endpoint with the regular serverless configurator.

We offer a pre-built Docker Image for the Infinity Embedding Serverless Worker that you can configure entirely with Environment Variables when creating the Endpoint:

### 1. Select Worker Image Version
You can directly use the following docker images and configure them via Environment Variables.
* CUDA 11.8: `not built`
* CUDA 12.1: `michaelf34/runpod-infinity-worker:0.0.5-cu121`

## RunPod Template Environment Variables
* `MODEL_NAMES`: HuggingFace repo of a single model or multiple models separated by semicolon.
* Example - Single Model: `BAAI/bge-small-en-v1.5;`
* Example - Multiple Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
* `BATCH_SIZES`: Batch size for each model separated by semicolon. If not provided, default batch size of 32 will be used.
* `BACKEND`: Backend for all models. Recommended is `torch` which is the default. Other options are `optimum` or `ctranslate2`.
* `DTYPES`: Dtype, by default `auto` or `fp16`.

## Supported Models
<details>
<summary>What models are supported?</summary>

- All models supported by the sentence-transformers library.
- All models reuploaded on the sentence transformers org https://huggingface.co/sentence-transformers / sbert.net.

With the command `--engine torch` the model must be compatible with sentence-transformers library

For the latest trends, you might want to check out one of the following models.
https://huggingface.co/spaces/mteb/leaderboard
| CUDA Version | Stable (Latest Release) | Development (Latest Commit) | Note |
|--------------|-----------------------------------|-----------------------------------|----------------------------------------------------------------------|
| 11.8.0 | `runpod/worker-infinity-embedding:stable-cuda11.8.0` | `runpod/worker-infinity-embedding:dev-cuda11.8.0` | Available on all RunPod Workers without additional selection needed. |
| 12.1.0 | `runpod/worker-infinity-embedding:stable-cuda12.1.0` | `runpod/worker-infinity-embedding:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Versions 12.4, 12.3, 12.2, and 12.1 in the filter. About 10% fewer total available machines than 11.8.0, but higher performance. |

### 2. Select your models and configure your deployment with Environment Variables
* `MODEL_NAMES`

HuggingFace repo of a single model, or multiple models separated by semicolons.

- Examples:
- **Single** Model: `BAAI/bge-small-en-v1.5`
- **Multiple** Models: `BAAI/bge-small-en-v1.5;intfloat/e5-large-v2;`
* `BATCH_SIZES`

Batch size for each model, separated by semicolons and paired positionally with `MODEL_NAMES` (see the sketch after this list).

- Default: `32`
* `BACKEND`

Backend for all models.

</details>
- Options:
- `torch`
- `optimum`
- `ctranslate2`
- Default: `torch`
* `DTYPES`

Precision for each model, separated by semicolons.

- Options:
- `auto`
- `fp16`
- `fp8` (**New!** Only compatible with H100 and L40S)
- Default: `auto`

* `INFINITY_QUEUE_SIZE`

How many requests can be queued in the Infinity Engine.

- Default: `48000`

## Usage - OpenAI Compatibility
* `RUNPOD_MAX_CONCURRENCY`

How many requests can be processed concurrently by the RunPod Worker.

- Default: `300`
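To make the semicolon convention concrete, here is a minimal sketch of how paired variables such as `MODEL_NAMES` and `BATCH_SIZES` line up. It is illustrative only, with hypothetical values; the worker's actual parsing lives in `src/config.py`:

```python
import os

# Hypothetical endpoint configuration; in practice these are set as
# environment variables when creating the RunPod endpoint.
os.environ["MODEL_NAMES"] = "BAAI/bge-small-en-v1.5;intfloat/e5-large-v2"
os.environ["BATCH_SIZES"] = "32;16"

# Split on semicolons, ignoring any trailing empty entry.
models = [m for m in os.environ["MODEL_NAMES"].split(";") if m]
batch_sizes = [int(b) for b in os.environ["BATCH_SIZES"].split(";") if b]

for model, batch in zip(models, batch_sizes):
    print(f"{model}: batch size {batch}")
# BAAI/bge-small-en-v1.5: batch size 32
# intfloat/e5-large-v2: batch size 16
```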

## Option 2: Bake models into Docker Image
Coming soon!

# Usage
There are two ways to use the endpoint: [OpenAI Compatibility](#openai-compatibility), matching how you would use the OpenAI API, and [Standard Usage](#standard-usage) with the RunPod API. Note that reranking is only available with [Standard Usage](#standard-usage).
## OpenAI Compatibility
### Set up
Initialize OpenAI client and set the API Key to your RunPod API Key, and base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`
1. Install OpenAI Python SDK
```bash
pip install openai
```
2. Initialize the OpenAI client, setting the API key to your RunPod API key and the base URL to `https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1`, where `YOUR_ENDPOINT_ID` is the ID of your endpoint, e.g. `elftzf0lld1vw1`
```python
from openai import OpenAI

@@ -64,9 +115,9 @@ client = OpenAI(
```
Where `YOUR_DEPLOYED_MODEL_NAME` is the name of one of the models you deployed to the worker.
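Since the diff collapses the body of the snippet above, here is a complete end-to-end sketch, assuming the hidden lines follow the standard OpenAI client pattern; the API key, endpoint ID, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

response = client.embeddings.create(
    model="YOUR_DEPLOYED_MODEL_NAME",  # e.g. one entry from MODEL_NAMES
    input=["A sentence to embed.", "Another sentence to embed."],
)
print(len(response.data), "embeddings returned")
print(response.data[0].embedding[:5])  # first few dimensions
```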

## Usage - Standard
## Standard Usage
### Set up
You may use /run or /runsync
You may use `/run` (asynchronous: starts the job and returns a job ID) or `/runsync` (synchronous: waits for the job to finish and returns the result). A request sketch follows the input lists below.

### Embedding
Inputs:
@@ -81,9 +132,6 @@ Inputs:
* `return_docs`: whether to return the reranked documents or not
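For reference, a standard `/runsync` call with the reranking schema might look like the sketch below. The request shape mirrors the handler test input shown further down; the endpoint ID and API key are placeholders:

```python
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder

# Reranking payload, matching the handler's test input.
payload = {
    "input": {
        "query": "Where is Paris?",
        "docs": ["Paris is in France", "Rome is in Italy"],
        "model": "BAAI/bge-reranker-v2-m3",
    }
}
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
print(response.json())
```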


### Additional testing
# Acknowledgements
We'd like to thank [Michael Feil](https://github.com/michaelfeil) for creating the [Infinity Embedding Engine](https://github.com/michaelfeil/infinity) and for being actively involved in the development of this worker!

For the Reranker models
```bash
python src/handler.py --test_input '{"input": {"query": "Where is paris?", "docs": ["Paris is in France", "Rome is in Italy"], "model": "BAAI/bge-reranker-v2-m3"}}'
```
4 changes: 4 additions & 0 deletions src/config.py
@@ -46,3 +46,7 @@ def batch_sizes(self) -> list[int]:
    def dtypes(self) -> list[str]:
        dtypes = self._get_no_required_multi("DTYPES", "auto")
        return dtypes

    @cached_property
    def runpod_max_concurrency(self) -> int:
        return int(os.environ.get("RUNPOD_MAX_CONCURRENCY", 300))
5 changes: 4 additions & 1 deletion src/handler.py
@@ -58,5 +58,8 @@ async def async_generator_handler(job: dict[str, Any]):

if __name__ == "__main__":
    runpod.serverless.start(
        {"handler": async_generator_handler, "concurrency_modifier": lambda x: 3000}
        {
            "handler": async_generator_handler,
            "concurrency_modifier": lambda x: embedding_service.config.runpod_max_concurrency,
        }
    )
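To trace the wiring end to end, here is a condensed sketch combining the `config.py` and `handler.py` changes above, assuming RunPod's `concurrency_modifier` receives the worker's current concurrency and returns the desired target:

```python
import os

# The new config property reads RUNPOD_MAX_CONCURRENCY (default 300)...
max_concurrency = int(os.environ.get("RUNPOD_MAX_CONCURRENCY", 300))

# ...and the handler now returns that ceiling instead of a hard-coded 3000.
def concurrency_modifier(current_concurrency: int) -> int:
    # Ignore the current value; pin concurrency to the configured ceiling.
    return max_concurrency

print(concurrency_modifier(12))  # 300, regardless of the current value
```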
