Add note on hf-token for llama3 model #386

Merged · 1 commit · Aug 18, 2024
Original file line number Diff line number Diff line change
@@ -6,6 +6,23 @@ In this example, we demonstrate how to deploy the `Llama3` model for text generation
KServe Hugging Face runtime by default uses vLLM to serve LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels.
If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.

!!! note
    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the
    `HF_TOKEN` environment variable.

Create a secret with the Hugging Face token.

=== "Yaml"

    ```yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-secret
    type: Opaque
    stringData:
      HF_TOKEN: <token>
    ```
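Equivalently, the secret can be created imperatively with `kubectl` (a sketch; the secret name `hf-secret` and the `HF_TOKEN` key match the manifest above, and `<token>` is a placeholder for your actual Hugging Face Hub token):

```shell
# Create the same secret directly from the token value, without a YAML file.
# Replace <token> with your Hugging Face Hub token before running.
kubectl create secret generic hf-secret \
  --from-literal=HF_TOKEN=<token>
```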

=== "Yaml"

```yaml
@@ -22,6 +39,13 @@ If the model is not supported by vLLM, KServe falls back to HuggingFace backend
args:
  - --model_name=llama3
  - --model_id=meta-llama/meta-llama-3-8b-instruct
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
        optional: false
resources:
  limits:
    cpu: "6"
@@ -150,6 +174,23 @@ curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
You can use the `--backend=huggingface` argument to perform inference using the Hugging Face API. The KServe Hugging Face backend runtime also
supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.
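As a sketch of what a chat completion request to the deployed service looks like: the ingress address, `SERVICE_HOSTNAME` value, and URL prefix below are assumptions for illustration (the `/openai` prefix and the exact host depend on your KServe version and cluster setup); `llama3` matches the `--model_name` argument above.

```python
import json
import urllib.request

# Assumed values; substitute your cluster's ingress host/port and the
# SERVICE_HOSTNAME from the curl examples above.
INGRESS = "http://localhost:8080"
SERVICE_HOSTNAME = "huggingface-llama3.default.example.com"

payload = {
    "model": "llama3",  # matches --model_name in the InferenceService spec
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    f"{INGRESS}/openai/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Host": SERVICE_HOSTNAME},
)
# resp = urllib.request.urlopen(req)  # uncomment against a live cluster
print(req.get_full_url())
```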

!!! note
    The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the
    `HF_TOKEN` environment variable.

Create a secret with the Hugging Face token.

=== "Yaml"

    ```yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-secret
    type: Opaque
    stringData:
      HF_TOKEN: <token>
    ```

=== "Yaml"

```yaml
@@ -167,6 +208,13 @@ supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for i
  - --model_name=llama3
  - --model_id=meta-llama/meta-llama-3-8b-instruct
  - --backend=huggingface
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
        optional: false
resources:
  limits:
    cpu: "6"