Issue getting Llama3 8b running on GKE #43

Open
francescov1 opened this issue May 24, 2024 · 22 comments

@francescov1

francescov1 commented May 24, 2024

I'm trying to deploy Llama3 8b on GKE using optimum-tpu but am running into some trouble.

I'm following the instructions here: https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference. I built the Docker image using the make command mentioned there.
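Roughly, the build-and-push flow looked like this (just a sketch: the local tag produced by make tpu-tgi may differ, check docker images, and the Artifact Registry path simply mirrors the image used in my manifest below):

git clone https://github.com/huggingface/optimum-tpu.git
cd optimum-tpu
make tpu-tgi                    # builds the TGI image locally

# Tag and push so GKE can pull it (registry path is illustrative)
docker tag <local-image-tag> us-central1-docker.pkg.dev/<project>/tpus/optimum-tpu:latest
docker push us-central1-docker.pkg.dev/<project>/tpus/optimum-tpu:latest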

The server starts booting up but gets stuck at "Warming up model". See the logs below:

2024-05-24T17:26:26.309789Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_c
2024-05-24T17:26:26.309895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:26.400493Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-24T17:26:26.400639Z  INFO download: text_generation_launcher: Starting download process.
2024-05-24T17:26:26.475982Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-05-24T17:26:51.727997Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-24T17:26:51.728345Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-24T17:26:54.273164Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-05-24T17:26:54.332635Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.603384915s rank=0
2024-05-24T17:26:54.431655Z  INFO text_generation_launcher: Starting Webserver
2024-05-24T17:26:54.453486Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-05-24T17:26:54.453528Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:54.739323Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_151|>' w

... (lots more tokenizer warnings, same as the ones above and below)

2024-05-24T17:26:54.739610Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_250|>' w
2024-05-24T17:26:54.866449Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 62bd457b6fe961a42a631306577e622c83876cb6 of model meta-llama/Meta-Llama-3-8B
2024-05-24T17:26:54.866479Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Llama)
2024-05-24T17:26:54.866493Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-05-24T17:28:23.784610Z  INFO text_generation_router: router/src/main.rs:314: Warming up model

Here's my config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimum-tpu-llama3-8b-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: optimum-tpu-llama3-8b-server
  template:
    metadata:
      labels:
        app: optimum-tpu-llama3-8b-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      hostIPC: true
      containers:
        - name: optimum-tpu-llama3-8b-server
          image: us-central1-docker.pkg.dev/project-lighthouse-403916/tpus/optimum-tpu:latest
          securityContext:
            privileged: true
          args:
            - "--model-id=meta-llama/Meta-Llama-3-8B"
            - "--max-concurrent-requests=1"
            - "--max-input-length=512"
            - "--max-total-tokens=1024"
            - "--max-batch-prefill-tokens=512"
            - "--max-batch-total-tokens=1024"
          env:
            - name: HF_TOKEN
              value: <token>
            - name: HUGGING_FACE_HUB_TOKEN
              value: <token>
            - name: HF_BATCH_SIZE
              value: "1"
            - name: HF_SEQUENCE_LENGTH
              value: "1024"
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data-volume
              mountPath: /data
          resources:
            requests:
              google.com/tpu: 8
            limits:
              google.com/tpu: 8
      volumes:
        - name: data-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: optimum-tpu-llama3-8b-svc
spec:
  selector:
    app: optimum-tpu-llama3-8b-server
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80

Any ideas?

@tengomucho
Collaborator

Hi Francesco,
Sorry we didn't have the chance to answer earlier... we'll be looking at this and get back to you soon!

@carlesoctav

any updates?

@tengomucho
Collaborator

I just re-tried this with llama3-8b and it worked fine, but I tested with lower values for input length and total tokens. With these settings the server takes ~15s to warm up. Can you retry with --max-input-length 32 --max-total-tokens 64?

@francescov1
Author

@tengomucho Unfortunately that didn't work. I used the same manifests as above with the changes you mentioned. I also rebuilt the docker image with the latest changes from main.

What TPU are you running on? Is it possible that the v5e node isn't big enough and it's unable to use multiple nodes? I can try on a v5p if that's better.

@tengomucho
Collaborator

I tried on a v5e-litepod8. The only difference I would say is that I did not use GKE, I used the docker container generated by make tpu-tgi as explained here.

@francescov1
Author

Hmm, I don't see why my K8s config would be any different from that.

Is there a prebuilt public Docker image I can test out?

@tengomucho
Collaborator

Let me cook one for you, I'll do it on Monday and I'll get back to you.

@rick-c-goog

Any update on this? I had the same issue with GKE; none of the Hugging Face models work (gemma-2b, Mistral, Llama, etc.). There's no error in the logs either, it just hangs at "INFO: Warming up model" for Gemma.

For Mistral it's a little bit different:
2024-06-23T00:48:10.071293Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:10.199181Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-23T00:48:10.199294Z INFO download: text_generation_launcher: Starting download process.
2024-06-23T00:48:10.272564Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-06-23T00:48:56.746082Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-23T00:48:56.791824Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-23T00:48:59.480818Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-23T00:48:59.495306Z INFO shard-manager: text_generation_launcher: Shard ready in 2.702693453s rank=0
2024-06-23T00:48:59.548993Z INFO text_generation_launcher: Starting Webserver
2024-06-23T00:48:59.554356Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-06-23T00:48:59.554399Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:59.727654Z WARN text_generation_router: router/src/main.rs:233: Could not retrieve model info from the Hugging Face hub.
2024-06-23T00:48:59.770889Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Mistral)
2024-06-23T00:48:59.770904Z WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model mistralai/Mistral-7B-v0.3

@rick-c-goog

At the same time, I was able to run the following example successfully inside the GKE pod that was created:
https://github.com/huggingface/optimum-tpu/blob/main/examples/text-generation/generation.py
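In case it helps reproduce, this is roughly how I ran it (a sketch; the pod name and the script path inside the container are assumptions, adjust to wherever the optimum-tpu checkout lives):

kubectl exec -it <optimum-tpu-pod-name> -- /bin/bash
# inside the pod (HF_TOKEN must be set for gated models):
python examples/text-generation/generation.py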

@rick-c-goog

@tengomucho, any comment on the optimum-tpu on GKE issues, or potentially a public image?

@tengomucho
Collaborator

Hey, sorry it took me longer to get this done, but you should be able to test this TGI image huggingface/optimum-tpu:v0.1.1-tgi.

@rick-c-goog

Thank you, @tengomucho, but it still gets stuck/hangs at the same "Warming up model" step:
2024-06-25 11:12:01.541 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541489Z}
2024-06-25 11:12:01.541 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541603Z}
2024-06-25 11:12:01.628 EDT
{fields: {…}, level: WARN, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.628394Z}
2024-06-25 11:12:12.752 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752135Z}
2024-06-25 11:12:12.752 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752408Z}
2024-06-25 11:12:15.687 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.687254Z}
2024-06-25 11:12:15.756 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.756244Z}
2024-06-25 11:12:15.855 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.855187Z}
2024-06-25 11:12:15.861 EDT
Using the Hugging Face API
2024-06-25 11:12:15.862 EDT
Token file not found "/root/.cache/huggingface/token"
2024-06-25 11:12:16.568 EDT
Could not retrieve model info from the Hugging Face hub.
2024-06-25 11:12:16.585 EDT
Using config Some(Gemma)
2024-06-25 11:12:16.585 EDT
Using the Hugging Face API to retrieve tokenizer config
2024-06-25 11:12:16.587 EDT
no pipeline tag found for model google/gemma-2b-it
2024-06-25 11:13:03.877 EDT
Warming up model

@tengomucho
Collaborator

tengomucho commented Jun 25, 2024

Hmm, strange, I just tested it and it worked fine. BTW, I tested with this command line:

HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b

sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1

And it took ~12s to warm up:

2024-06-25T15:56:14.798018Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-06-25T15:57:47.220655Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-25T15:57:54.872585Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
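
Once warm-up finishes, you can sanity-check it locally (a sketch; with --net=host the launcher listens on its default port, 80):

curl 127.0.0.1:80/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'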

@rick-c-goog

I believe it is GKE-specific.

@francescov1
Author

@tengomucho I'm seeing the same thing. I retried the deployment manifest I pasted above, but with the image huggingface/optimum-tpu:v0.1.1-tgi, and I'm still getting the same behavior.

@liurupeng

liurupeng commented Jun 26, 2024

this one works for me:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      volumes:
        - name: data-volume
          emptyDir: {}
      containers:
      - name: tgi-tpu
        image: {optimum-tpu-image}
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            value: {your_token}
          - name: HUGGING_FACE_HUB_TOKEN
            value: {your_token}
        ports:
        - containerPort: 80
        volumeMounts:
            - name: data-volume
              mountPath: /data
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80
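If anyone wants to reproduce, the usual apply-and-watch steps are enough (the file name is arbitrary):

kubectl apply -f tgi-tpu.yaml
kubectl get pods -l app=tgi-tpu -w    # wait for the pod to reach Running
kubectl logs -f deploy/tgi-tpu        # warm-up ends with "Setting max batch total tokens ..."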

@rick-c-goog

Thanks, @liurupeng. I got the following logs:

2024-06-27T02:43:50.500866Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.703506822s rank=0
2024-06-27T02:43:50.599561Z  INFO text_generation_launcher: Starting Webserver
2024-06-27T02:43:50.611767Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-06-27T02:43:50.611800Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-27T02:43:51.329230Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 2ac59a5d7bf4e1425010f0d457dde7d146658953 of model google/gemma-2b
2024-06-27T02:43:51.329250Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Gemma)
2024-06-27T02:43:51.329254Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-06-27T02:44:48.962935Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-27T02:44:55.038381Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-06-27T02:44:55.038396Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-06-27T02:44:55.038401Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

So I assume the TGI model should be up and running, but the curl validation command throws a connection refused error (I tried both container port 80 and 8000):
kubectl run -it busybox --image radial/busyboxplus:curl
If you don't see a command prompt, try pressing enter.
[ root@busybox:/ ]$ curl 34.118.229.124:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
curl: (7) Failed to connect to 34.118.229.124 port 8080: Connection refused
[ root@busybox:/ ]$

Did you try the curl connection to validate?

@liurupeng

@rick-c-goog I ran the commands below:

kubectl port-forward svc/service 8080:8080

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

@rick-c-goog

Thanks, @liurupeng, the port-forward curl to 127.0.0.1 worked, and the busybox curl to the service cluster IP worked afterwards as well.
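
For reference, the in-cluster call via the service DNS name also works once the model is connected (a sketch; this assumes the Service above is named service and sits in the default namespace):

[ root@busybox:/ ]$ curl http://service.default.svc.cluster.local:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'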

@Bihan

Bihan commented Jul 1, 2024

@tengomucho I am testing optimum-tpu with a v2-8 and getting similar issues to those discussed above. Does optimum-tpu only support v5e litepods?

@tengomucho
Collaborator

@Bihan For now we have only tested v5e configurations.

@Bihan

Bihan commented Jul 1, 2024

@Bihan For now we have only tested v5e configurations.

@tengomucho Thank you for the quick reply. Do you think testing with v2-8 or v3-8 would require major modifications?
