Add note on hf-token for llama3 model (#386)
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
sivanantha321 authored Aug 18, 2024
1 parent f991e85 commit 99b939a
Showing 1 changed file with 48 additions and 0 deletions.
@@ -6,6 +6,23 @@ In this example, we demonstrate how to deploy the `Llama3 model` for text generation
The KServe Hugging Face runtime uses vLLM by default to serve LLM models, providing faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels.
If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.

!!! note
    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
    environment variable.

Create a secret with the Hugging Face token.

=== "Yaml"
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-secret
type: Opaque
stringData:
HF_TOKEN: <token>
```
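
If you prefer not to keep the token in a manifest file, the same secret can be created directly from the command line. This is only a sketch using the standard `kubectl create secret` command with the same `hf-secret` name and `HF_TOKEN` key as the manifest above; it assumes the token is already exported in your shell as `HF_TOKEN`.

```bash
# Create the same Opaque secret as the manifest above, reading the token
# from the HF_TOKEN environment variable instead of writing it into a file.
kubectl create secret generic hf-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}"
```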

=== "Yaml"

```yaml
@@ -22,6 +39,13 @@ If the model is not supported by vLLM, KServe falls back to the Hugging Face backend
args:
  - --model_name=llama3
  - --model_id=meta-llama/meta-llama-3-8b-instruct
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
        optional: false
resources:
  limits:
    cpu: "6"
@@ -150,6 +174,23 @@ curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
You can use the `--backend=huggingface` argument to perform the inference using the Hugging Face API. The KServe Hugging Face backend runtime also
supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.
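
Once the InferenceService below is deployed, a chat completion request to the OpenAI-compatible endpoint looks roughly like the sketch that follows. The `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` variables follow the earlier curl examples, and the exact route (for example, whether an `/openai` prefix is required) can vary between KServe versions, so treat this as illustrative rather than authoritative.

```bash
# Illustrative chat completion request against the deployed llama3 model.
curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Write a poem about colors"}],
        "max_tokens": 30,
        "stream": false
      }'
```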

!!! note
    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the `HF_TOKEN`
    environment variable.

Create a secret with the Hugging Face token.

=== "Yaml"
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-secret
type: Opaque
stringData:
HF_TOKEN: <token>
```

=== "Yaml"

```yaml
@@ -167,6 +208,13 @@ supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference
  - --model_name=llama3
  - --model_id=meta-llama/meta-llama-3-8b-instruct
  - --backend=huggingface
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
        optional: false
resources:
  limits:
    cpu: "6"
