[DO NOT MERGE] Upstream codebase diff #470

Draft · wants to merge 634 commits into base: main
Conversation

@kzawora-intel commented Nov 6, 2024

Scope of changes:

  • Contiguous PA
  • Multi-step scheduling
  • Automatic prefix caching
  • Padding-aware scheduling/max_num_prefill_seqs
  • Guided decoding fixes
  • FP8 support (INC/w8a8/weights_load_device)
  • ApplyToppTopkScalar sampler optimization
  • LoRA/MultiLoRA support
  • FusedMoE support
  • Model changes (adding mark_steps)
  • Tests
  • FakeHPU mode
  • CI stuff (.jenkins, .github)
  • Lots of minor stuff (RNG, FSDPA flag, reduced block fragmentation)

kzawora-intel and others added 30 commits October 28, 2024 15:52
To repro:

start server:
`VLLM_SKIP_WARMUP=true python -m vllm.entrypoints.openai.api_server`

send a request (this works fine):
```
 curl -v http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{"model": "facebook/opt-125m","prompt": "The future of AI is ","max_tokens": 100,"temperature": 0}'
```

If the request has a seed, it fails:
```
curl -v http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{"model": "facebook/opt-125m","prompt": "The future of AI is ","max_tokens": 100,"temperature": 0, "seed" : 37}'
```

Failure happens here:

[vllm-fork/vllm/model_executor/sampling_metadata.py at habana_main · HabanaAI/vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/sampling_metadata.py#L220)

```
if sampling_params.seed is not None:
    seq_group_metadata.state.generator = torch.Generator(
        device=device).manual_seed(sampling_params.seed)
```

`RuntimeError: Device type HPU is not supported for torch.Generator() api.`

This PR fixes the above issue by using htrandom: [Intel Gaudi PyTorch Python API (habana_frameworks.torch) — Gaudi Documentation 1.17.1 documentation](https://docs.habana.ai/en/latest/PyTorch/Reference/Python_Packages.html?highlight=htrandom#random-number-generator-apis)
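For illustration, a minimal sketch of the general idea: guard the generator creation so the HPU path never calls `torch.Generator(device="hpu")`. The actual patch wires in the htrandom APIs linked above rather than this CPU fallback.

```
import torch

def make_seeded_generator(device: str, seed: int) -> torch.Generator:
    # Sketch only: HPU does not support torch.Generator(device=...), so fall
    # back to a seeded CPU generator (the real fix uses htrandom instead).
    if str(device).startswith("hpu"):
        return torch.Generator().manual_seed(seed)
    return torch.Generator(device=device).manual_seed(seed)
```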
Fix one_hot bug in torch compile mode
```
>           block_mapping = torch.nn.functional.one_hot(metadata.block_mapping,
                                                        num_classes=batch_size)
E           RuntimeError: Class values must be non-negative.

../../vllm/worker/hpu_model_runner.py:311: RuntimeError
```
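A hedged sketch of one way to handle such negative (padding) indices before `one_hot`; the actual fix in this commit may differ.

```
import torch

def safe_one_hot(block_mapping: torch.Tensor, num_classes: int) -> torch.Tensor:
    valid = block_mapping >= 0
    # Clamp padded (negative) entries to 0 so one_hot accepts them...
    clamped = torch.where(valid, block_mapping, torch.zeros_like(block_mapping))
    one_hot = torch.nn.functional.one_hot(clamped, num_classes=num_classes)
    # ...then zero out the rows that were padding.
    return one_hot * valid.unsqueeze(-1)
```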
Due to the high dynamicity of logits processing, it is better to offload it completely to the CPU instead of computing it on the HPU.
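A hedged sketch of that offload pattern (not this PR's code), assuming the usual vLLM logits-processor convention of `processor(token_ids, logits) -> logits`:

```
from typing import Callable, List
import torch

LogitsProcessor = Callable[[List[int], torch.Tensor], torch.Tensor]

def apply_processors_on_cpu(logits: torch.Tensor,
                            processors: List[LogitsProcessor],
                            output_token_ids: List[int]) -> torch.Tensor:
    cpu_logits = logits.to("cpu")            # single device-to-host copy
    for proc in processors:
        # Dynamic, per-request control flow stays on the CPU...
        cpu_logits = proc(output_token_ids, cpu_logits)
    return cpu_logits.to(logits.device)      # ...and the result is copied back once
```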
This PR supports the unit test test_layers with a LoraMask-based approach.
This PR enables automatic prefix caching on Intel Gaudi HPUs.
Please refer to this
[RFC](vllm-project#2614) for detailed
information about prefix caching.
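As a hedged usage example, assuming the standard `enable_prefix_caching` engine argument is honored on HPU (an assumption, not confirmed by this PR; the model name is just the one from the repro above):

```
from vllm import LLM, SamplingParams

# Hypothetical offline-API usage enabling automatic prefix caching.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
outputs = llm.generate(["The future of AI is "], SamplingParams(max_tokens=32))
```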
Implementation of multi-step scheduling. To use the feature, pass
--num_scheduler_steps=[n] as a server parameter. In my tests, best
results were achieved with n==64, but this will vary depending on the
model.

---------

Co-authored-by: Karol Damaszke <[email protected]>
Co-authored-by: jmaksymczuk <[email protected]>
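A hedged offline-API equivalent of the --num_scheduler_steps server flag described above, assuming the engine argument of the same name is forwarded through `vllm.LLM` (an assumption, not confirmed by this PR):

```
from vllm import LLM, SamplingParams

# Hypothetical offline usage with multi-step scheduling set to 64 steps.
llm = LLM(model="facebook/opt-125m", num_scheduler_steps=64)
outputs = llm.generate(["The future of AI is "], SamplingParams(max_tokens=64))
```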
This removes the need to pass the VLLM_PROMPT_USE_FUSEDSDPA environment
variable in order to enable FusedSDPA attention. Fallback attention can
still be used if VLLM_PROMPT_USE_FUSEDSDPA=0 is provided.
Contiguous cache fetching to avoid the costly gather operation on
Gaudi3. Requires changes in vllm-hpu-extension
(HabanaAI/vllm-hpu-extension#17) to work.

It introduces redundant calculations in the decoding phase, but improves
the performance of all tested workloads over the entire benchmark
(5-12%) on Gaudi3. PR #426 further improves the performance of this
feature (9-22%). It is only compatible with the v2 block manager and
negatively impacts performance on Gaudi2.

Use the VLLM_CONTIGUOUS_PA=true environment variable to enable it.
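A minimal usage note: the flag is read from the environment, so set it before the engine is created (exporting it in the shell works just as well).

```
import os

# Enable contiguous PA before constructing the vLLM engine (Gaudi3 only).
os.environ["VLLM_CONTIGUOUS_PA"] = "true"
```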
This change fixes the performance issue I introduced in PR
#414: due to the usage of `torch.where`, both functions were being
called. Now only the selected one runs.
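A hedged illustration of that pitfall (not this PR's code): `torch.where` evaluates both branch tensors eagerly, so dispatching on a host-side flag avoids the wasted work.

```
import torch

def branch_a(x: torch.Tensor) -> torch.Tensor:
    return x * 2

def branch_b(x: torch.Tensor) -> torch.Tensor:
    return x + 1

def slow(x: torch.Tensor, use_a: bool) -> torch.Tensor:
    # Both branch_a(x) and branch_b(x) are computed; only one result is kept.
    return torch.where(torch.tensor(use_a), branch_a(x), branch_b(x))

def fast(x: torch.Tensor, use_a: bool) -> torch.Tensor:
    # Dispatch in Python so only the selected function runs.
    return branch_a(x) if use_a else branch_b(x)
```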
Change `NaiveBlockAllocator` to use a priority queue so that we always
allocate the lowest block id first.

This further increases the performance of contiguous paged attention.

- [ ] Add an option or env variable to enable/disable this behavior.
(Not sure if this is necessary)

---------

Co-authored-by: Yang Wang <[email protected]>
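A hedged standalone sketch of the lowest-id-first behavior described in the `NaiveBlockAllocator` change above; the real allocator is more involved, this only shows the priority-queue idea.

```
import heapq

class MinIdBlockPool:
    """Free-block pool that always hands out the lowest free block id."""

    def __init__(self, num_blocks: int) -> None:
        self._free = list(range(num_blocks))
        heapq.heapify(self._free)            # min-heap keyed on block id

    def allocate(self) -> int:
        return heapq.heappop(self._free)     # lowest free id first

    def free(self, block_id: int) -> None:
        heapq.heappush(self._free, block_id)
```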
Adds calculation of the OpenSSF Scorecard. Note: the badge (visible on the repo main page) will be disabled for now.
The max_num_prefill_seqs parameter is used only when
use_padding_aware_scheduling is True.

The default value of use_padding_aware_scheduling is False, so
max_num_prefill_seqs shouldn't be required every time
SchedulerConfig is initialized.

Dozens of tests in tests/core are failing due to this parameter issue.
This PR implements tensor parallelism for multi-step scheduling.
0.20.2 had some changes that break the lm_eval API
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: B-201 <[email protected]>
Co-authored-by: B-201 <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Varad Ahirwadkar <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Vinay Damodaran <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: charlifu <[email protected]>
Signed-off-by: Sam Stoelinga <[email protected]>
Signed-off-by: Vasily Alexeev <[email protected]>
Signed-off-by: Kevin-Yang <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: wangshuai09 <[email protected]>
Signed-off-by: Qishuai <[email protected]>
Signed-off-by: yuze.zyz <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Kunjan Patel <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: kevin <[email protected]>
Signed-off-by: YiSheng5 <[email protected]>
Signed-off-by: yan ma <[email protected]>
Signed-off-by: Went-Liang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: sasha0552 <[email protected]>
Signed-off-by: mzusman <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]>
Signed-off-by: André Jonasson <[email protected]>
Signed-off-by: Gene Su <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Peter Salas <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Michael Green <[email protected]>
Signed-off-by: Shanshan Wang <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: daitran2k1 <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Hissu Hyvarinen <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Linkun Chen <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Kai Wu <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Shashwat Srijan <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: laishzh <[email protected]>
Co-authored-by: Max de Bayser <[email protected]>
Co-authored-by: Max de Bayser <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Haoyu Wang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: tomeras91 <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Kunjan <[email protected]>
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh Yang <[email protected]>
Co-authored-by: Yue Zhang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Dhia Eddine Rhaiem <[email protected]>
Co-authored-by: yudian0504 <[email protected]>
Co-authored-by: Varad Ahirwadkar <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Baoyuan Qi <[email protected]>
Co-authored-by: Wallas Henrique <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: ngrozae <[email protected]>
Co-authored-by: Falko1 <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: wangshuai09 <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: xendo <[email protected]>
Co-authored-by: Jerzy Zagorski <[email protected]>
Co-authored-by: gopalsarda <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Gubrud, Aaron D <[email protected]>
Co-authored-by: adgubrud <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: Ronen Schaffer <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Jeremy Arnold <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: yulei <[email protected]>
Co-authored-by: Seth Kimmel <[email protected]>
Co-authored-by: Kaunil Dhruv <[email protected]>
Co-authored-by: Flex Wang <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Yongzao <[email protected]>
Co-authored-by: Yunfei Chu <[email protected]>
Co-authored-by: Vinay R Damodaran <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Will Johnson <[email protected]>
Co-authored-by: pavlo-ruban <[email protected]>
Co-authored-by: Sam Stoelinga <[email protected]>
Co-authored-by: ErkinSagiroglu <[email protected]>
Co-authored-by: Vasiliy Alekseev <[email protected]>
Co-authored-by: kakao-kevin-us <[email protected]>
Co-authored-by: Kevin-Yang <[email protected]>
Co-authored-by: 科英 <[email protected]>
Co-authored-by: madt2709 <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Zhong Qishuai <[email protected]>
Co-authored-by: tastelikefeet <[email protected]>
Co-authored-by: Sven Seeberg <[email protected]>
Co-authored-by: yannicks1 <[email protected]>
Co-authored-by: Junichi Sato <[email protected]>
Co-authored-by: Kunjan <[email protected]>
Co-authored-by: Will Eaton <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: Went-Liang <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Harsha vardhan manoj Bikki <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Prashant Gupta <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: André Jonasson <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Gene Der Su <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Peter Salas <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: Michael Green <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Nikita Furin <[email protected]>
Co-authored-by: shanshan wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Yang Zheng <[email protected]>
Co-authored-by: Yang Zheng(SW)(Alex) <[email protected]>
Co-authored-by: Tran Quang Dai <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: hissu-hyvarinen <[email protected]>
Co-authored-by: lkchen <[email protected]>
Co-authored-by: Linkun Chen <[email protected]>
Co-authored-by: Linkun Chen <[email protected]>
Co-authored-by: Gene Der Su <[email protected]>
imkero and others added 26 commits November 17, 2024 02:10
Set vllm-hpu-extension to 2542c18
This fixes a bug introduced by the last spec_decode PR's formatting commit.
Fix here
This PR introduces async copying into _prepare_prompt and
_prepare_decode, which makes copying faster.
It also moves the precompute_indices_and_offsets function into forward to
avoid unnecessary H2D copying.
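A hedged sketch of the async host-to-device copy pattern referenced above; the actual _prepare_prompt/_prepare_decode changes may differ in detail.

```
import torch

def copy_to_device_async(host_tensor: torch.Tensor, device: str = "hpu") -> torch.Tensor:
    # non_blocking=True lets the H2D copy overlap with subsequent host-side
    # work; the copy is only truly asynchronous from pinned host memory.
    return host_tensor.to(device, non_blocking=True)
```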
```
@@ -0,0 +1,45 @@
name: codespell
```

Check failure: Code scanning / Scorecard, Token-Permissions (High)

score is 0: no top-level permission defined. Remediation: restrict the workflow's GITHUB_TOKEN by declaring a top-level `permissions:` block (see https://app.stepsecurity.io/secureworkflow, or https://app.stepsecurity.io/securerepo to resolve multiple issues at once).
```
def test_stateless_process_group(worker):
    port1 = get_open_port()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", port1))
```

Check warning: Code scanning / CodeQL, Binding a socket to all network interfaces (Medium, test)

'' binds a socket to all interfaces.

Copilot Autofix suggestion: to fix the problem, bind the socket to a specific interface instead of all interfaces by replacing the empty string ('') with a specific IP address, such as 127.0.0.1, which binds the socket to the localhost interface. This change ensures that the socket only accepts connections from the local machine, mitigating the security risk.

Suggested changeset for tests/distributed/test_utils.py (apply locally with `git apply`):

```
diff --git a/tests/distributed/test_utils.py b/tests/distributed/test_utils.py
--- a/tests/distributed/test_utils.py
+++ b/tests/distributed/test_utils.py
@@ -126,3 +126,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
         port2 = get_open_port()
```
xuechendi and others added 2 commits November 18, 2024 16:06
…rror] (#502)

Fix argument incompatibility issue for FP8

```
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
ERROR 11-11 04:29:13 engine.py:143]     return self._call_impl(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
ERROR 11-11 04:29:13 engine.py:143]     result = forward_call(*args, **kwargs)
ERROR 11-11 04:29:13 engine.py:143] TypeError: PatchedVLLMKVCache.forward() missing 2 required positional arguments: 'block_indices' and 'block_offset'
```

FIX #453
https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#troubleshooting-tweaking-hpu-graphs
Labels: habana (Issues or PRs submitted by Habana Labs)