[TPU] Reduce compilation time & Upgrade PyTorch XLA version #6856

WoosukKwon · 2024-07-27T04:24:54Z

This PR bumps the PyTorch XLA version to 0726 and uses the new dynamic shape support to reduce compilation time. When the XLA graphs are already cached on disk, this reduces the compilation time from 30 minutes to 5 minutes.
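To illustrate why dynamic shape support can cut compilation time, here is a minimal toy sketch in plain Python. It does not use the PyTorch XLA API. `ToyCompileCache` and its `dynamic_dims` parameter are hypothetical names made up for this illustration, and the batch sizes and sequence lengths are arbitrary. It assumes a compiler that builds one graph per distinct input shape, and shows how treating one dimension as symbolic lets many shapes share a single graph.

```python
from itertools import product


class ToyCompileCache:
    """Counts the distinct graphs a shape-specializing compiler would build.

    Hypothetical illustration only; not part of vLLM or PyTorch XLA.
    """

    def __init__(self, dynamic_dims=()):
        # Indices of dimensions treated as symbolic (assumed parameter).
        self.dynamic_dims = frozenset(dynamic_dims)
        self.graphs = {}

    def _key(self, shape):
        # Replace dynamic dimensions with a placeholder so they share a graph.
        return tuple(
            "?" if i in self.dynamic_dims else d for i, d in enumerate(shape)
        )

    def run(self, shape):
        key = self._key(shape)
        if key not in self.graphs:
            # Stands in for an expensive compilation step.
            self.graphs[key] = f"graph{len(self.graphs)}"
        return self.graphs[key]

    @property
    def num_compilations(self):
        return len(self.graphs)


# Arbitrary example shapes: (batch_size, seq_len).
batch_sizes = [1, 2, 4, 8, 16]
seq_lens = [16, 32, 64]
shapes = list(product(batch_sizes, seq_lens))

static_cache = ToyCompileCache()                    # every shape is specialized
dynamic_cache = ToyCompileCache(dynamic_dims=(0,))  # batch dimension is symbolic

for shape in shapes:
    static_cache.run(shape)
    dynamic_cache.run(shape)

print(static_cache.num_compilations)   # 15
print(dynamic_cache.num_compilations)  # 3
```

In this toy model the number of compiled graphs drops from one per (batch size, sequence length) pair to one per sequence length. Which dimensions the PR actually treats as dynamic is not stated in the description, so the choice of dimension 0 here is only an assumption.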

github-actions · 2024-07-27T04:25:07Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

* Comment /ready on the PR
* Add ready label to the PR
* Enable auto-merge.

🚀

* upstream/main: (66 commits)
  [Bugfix] Fix PaliGemma MMP (vllm-project#6930)
  [TPU] Fix greedy decoding (vllm-project#6933)
  [Kernel] Tuned int8 kernels for Ada Lovelace (vllm-project#6848)
  [Kernel] Fix marlin divide-by-zero warnings (vllm-project#6904)
  [ci] GHA workflow to remove ready label upon "/notready" comment (vllm-project#6921)
  [Kernel] Remove unused variables in awq/gemm_kernels.cu (vllm-project#6908)
  [Frontend] New `allowed_token_ids` decoding request parameter (vllm-project#6753)
  [Bugfix] Allow vllm to still work if triton is not installed. (vllm-project#6786)
  [TPU] Support tensor parallelism in async llm engine (vllm-project#6891)
  [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (vllm-project#6901)
  [Core] Reduce unnecessary compute when logprobs=None (vllm-project#6532)
  [Kernel] Tuned FP8 Kernels for Ada Lovelace (vllm-project#6677)
  [Model] Initialize support for InternVL2 series models (vllm-project#6514)
  [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (vllm-project#6871)
  Add Nemotron to PP_SUPPORTED_MODELS (vllm-project#6863)
  [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (vllm-project#6795)
  [TPU] Reduce compilation time & Upgrade PyTorch XLA version (vllm-project#6856)
  [Docs] Add RunLLM chat widget (vllm-project#6857)
  [Model] Initial support for BLIP-2 (vllm-project#5920)
  [CI/Build][Doc] Update CI and Doc for VLM example changes (vllm-project#6860)
  ...

WoosukKwon added 30 commits June 24, 2024 01:53

Add & warnings

76fc072

Add in dummy_run

27a5ad8

Add is_driver_worker

5ab6f65

Make TPUExecutor similar to GPUExecutor

c4e79a0

Add multiprocessing-based TPU executor

ff81993

Use TPU to initialize Ray cluster

16e80b2

Add pjrt proc init

05884ce

Add Ray TPU executor

20d23eb

Use Ray TPU executor for tp

5d4df21

Minor

6b2c76c

Fix TPUWorker.execute_model

d91446b

Add is_driver_worker & input broadcast

ab1595d

Call xm._init_world_size_ordinal

4b45393

Bug fix on vocab

86451a2

Use all gather for TPU

0539299

Support TPU in GroupCoordinator

b35917c

Delete multiproc TPU executor

b9a84bc

Minor

c756b76

[Bugfix][TPU] Fix CPU cache allocation & swapping

16e9934

Merge branch 'fix-tpu-swpa' into tpu-n

e25f470

yapf

ca6d1d6

Add Ray to TPU dependency

cd4f68d

Merge branch 'main' into tpu-n

5df4164

Fix

546987a

Fix

330be6e

Merge branch 'main' into tpu-n

b45ed24

Add use_all_gather to LoRA

8fab9fd

Fix

c4cbe9f

Merge branch 'main' into tpu-n

2871c7c

Add an assert for dim == -1

db7adc7

WoosukKwon added 19 commits July 24, 2024 20:13

Merge branch 'main' into tpu-n

755fe0b

Merge branch 'main' into tpu-n

d5fadfd

[TPU] Support collective communications in XLA devices

af3a259

Use current_platform

0f2abea

is_xla -> is_tpu

8ebea7e

Define TPU communicator

782b182

Merge branch 'main' into tpu-n

76fd300

Merge branch 'add-xla-comm' into tpu-n

75f842b

Fix

8087227

Address comments

f04e179

Device init

f493c89

Fix patch

f14b085

Merge branch 'add-xla-comm' into tpu-n

1668582

0726

f9df97d

xr

9994742

Add dynamic=True

e0d3232

Remove import

2f6f54f

yapf

8bb1159

Merge branch 'main' into upgrade-xla

c11e129

WoosukKwon added the tpu (Related to Google TPUs) label Jul 27, 2024

WoosukKwon added 3 commits July 27, 2024 04:45

Add comment & doc

fafda57

Minor

79c45d5

Minor

4f0a23c

WoosukKwon merged commit fad5576 into main Jul 27, 2024
28 checks passed

WoosukKwon deleted the upgrade-xla branch July 27, 2024 17:28

dtrifiro mentioned this pull request Aug 5, 2024

Sync with [email protected] opendatahub-io/vllm#120

Closed

kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024

[TPU] Reduce compilation time & Upgrade PyTorch XLA version (vllm-pro…

0b4fcaf

…ject#6856)

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[TPU] Reduce compilation time & Upgrade PyTorch XLA version (vllm-pro…

078266f

…ject#6856) Signed-off-by: Alvant <[email protected]>
