[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding. #5412

bong-furiosa · 2024-06-11T05:16:59Z

ISSUE
We identified that when using FlashInfer as the backend in vLLM Speculative Decoding, incorrect output results were generated. The main cause of this issue was found to be the incorrect input metadata of paged_kv_indices and paged_kv_indptr to FlashInfer during the execution of the _prepare_model_input function in the ModelRunner class of model_runner.py. The incorrect calculation of indices and indptr causes the draft model to read wrong kv cache values during the proposal generation phase.

SOLUTION
We have fixed the code to calculate the correct indices and indptr, allowing the draft model to propose accurate results.

RESULT
We sampled random prompts from the ShareGPT dataset to compare the results before and after the fix. While the FlashAttention Backend and the pre-fix FlashInfer Backend produced almost completely different output results, post-fix FlashInfer Backend and FlashAttention Backend both generated nearly identical output results.

FlashAttention Backend

Prompt: 'Roy goes to the grocery store to buy a can of juice that has 100 calories of calories and contains 1 gram of sugar. If he buys another can that has 150 calories and 1 gram of sugar and then buys another can that has 200 calories and 2 gram of sugar, how many calories and grams of sugar in total will he buy?'
Generated text: '\n\nAnswer:\nTotal calories = 100 + 150 + 200 = 450 calories\nTotal sugar = 1 + 1 + 2 = 4 grams of sugar\n\nExplanation:\nRoy buys 3 cans of juice in total. The first can has 100 calories and 1 gram of sugar, the second can has 150 calories and 1 gram of sugar, and the third can has 200 calories and 2 grams of sugar. Therefore, the total calories Roy buys are 100 + 150 + 200 = 450 calories. Similarly, the total sugar Roy buys is 1 + 1 + 2 = 4 grams of sugar.'

pre-fix FlashInfer Backend

Prompt: 'Roy goes to the grocery store to buy a can of juice that has 100 calories of calories and contains 1 gram of sugar. If he buys another can that has 150 calories and 1 gram of sugar and then buys another can that has 200 calories and 2 gram of sugar, how many calories and grams of sugar in total will he buy?'
Generated text: '\nRoy 2.\n\n\nAnswer:\n\nTotal calories = 1501 calories\nTotal sugar = 3 grams\n\nExplanation:\n\nRoy bought 3 cans of juice\n\nThe calories in each can of juice are:\n10 calories\n\n\n\nThe total calories Roy bought are 1 calories per can of juice.\n\nThe total sugar in the juice juice juice:\n\n\nThe total calories Roy bought are 3 grams\n\n\n\n\n\n\nTherefore, Roy will buy \n\n\n\n\n\n\n\n\n\n\n\n'

post-fix FlashInfer Backend

Prompt: 'Roy goes to the grocery store to buy a can of juice that has 100 calories of calories and contains 1 gram of sugar. If he buys another can that has 150 calories and 1 gram of sugar and then buys another can that has 200 calories and 2 gram of sugar, how many calories and grams of sugar in total will he buy?'
Generated text: '\n\nAnswer:\nTotal calories = 100 + 150 + 200 = 450 calories\nTotal sugar = 1 + 1 + 2 = 4 grams of sugar\n\nExplanation:\nRoy buys 3 cans of juice in total. The first can has 100 calories and 1 gram of sugar, the second can has 150 calories and 1 gram of sugar, and the third can has 200 calories and 2 grams of sugar. Therefore, the total calories Roy buys are 100 + 150 + 200 = 450 calories. Similarly, the total sugar Roy buys is 1 + 1 + 2 = 4 grams of sugar.'

CODE EXAMPLE

VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 test_specdec.py
VLLM_ATTENTION_BACKEND=FLASHINFER python3 test_specdec.py

When forcing the vLLM Backend to use FlashInfer, an error might occur in the __init__ function of the Attention class in layer.py. To resolve this, you can modify the __init__ function as follows

        if impl_cls == FlashInferImpl:
            self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                                 alibi_slopes, sliding_window, kv_cache_dtype)
        else:
            self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                                 alibi_slopes, sliding_window, kv_cache_dtype,
                                 blocksparse_params)

# test_specdec.py
import random
import json
from typing import List, Optional, Tuple
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          PreTrainedTokenizerBase)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


prompts = [
    "Pretend that you are one the greatest entrepreneurial thought leaders of our time. Make the for and against case on why someone should work at high growth start-ups versus quantitative trading.",
    "We already have html test reports. The info is in a xml and it is formatted with xsl. We also have javascript code in it.\n\nAll this is old but works as we want, but we want to we want to modernize this. The main reason is that it also uses activex, which is unsafe. It forces us to view the reports in Internet Explorer, which is deprecated, or to use IETab plugin to view them in Chrome.\n\nI am looking for ways to modernize it so wevcan use it on Chrome without plugin and without disabling security features.\n\nThe main reason we use activeX is that we have to access the file system and modify files on it. The test references, which are in tcml files (they are xml in fact, that's just an extension), have to be modified if the developer clicks on an Accept button to accept the new result of a test.",
    "Please re-write so that question 2 assesses biases without directly mentioning them so that people can answer honestly and not feel defensive. Please re-write question 5 to specify about primary care and its interactions with mental health care",
    "Cleanup the following: A startup incubator firm specializes in helping startups prepare and launch. The organization is is divided into groups that each focuses on supporting a specific startup. Each group is supposed to led by leader who has prior experience as a startup assistants. In scaling the firm and taking on a new client, a new group is formed without such a leader. A month into supporting this startup, one of their members leaves temporarily to join a startup as they launch, gaining experience as a startup assistant. Upon returning to his original group, his new experience places him as the missing leader of the group. This group now must navigate this disruption of a new hierarchy as well as this leader having been an equal previously. Furthermore, this new leader has returned with new affectations he naturally picked up from the startup group he joined for launch, now appears different from how they remembered him. 2 / 2",
    "Improve this email to university faculty and staff to encourage feedback about switching from Groupwise & Filr to the possibility of Google Suite or MS Office 365. This needs to be in a welcoming and encouraging tone, but not too formal given the audience. Here's the email to improve: Dear members of the Campus community,\n\nAs previously announced, a project is underway to select a new collaboration platform for the campus. This would include email, calendar, file sharing, video conferencing, team chat, standard document tools that support real-time collaborative editing, as well as the underlying IT architecture to support these services. \n\nAs we gather requirements on a potential future state, the Collaboration Platform Advisory Group invites all Faculty and Staff to provide feedback on their requirements by submitting the survey at: https://uregina.eu.qualtrics.com/jfe/form/SV\\_0Goe2XgXue51qKy\n\nThe survey is open now, and will close on Monday, March 27 at 9:00 AM. It will take about 5 minutes to complete and will assist us in defining requirements. There are 8 short questions, asking how you currently collaborate, what features of a collaboration platform are most important to you, and a request for additional comments.\n\nFurther information on this project is available at: https://ursource.uregina.ca/is/collaboration-project.html\n\nIf you have any questions about the project, or wish to provide further feedback to the advisory group, email can be sent to: [email protected]",
    "I am going to right here some input and I'd like you to use it afterwords.\nProject name: Team work onboarding\nIndustry : Saas product for teams\nAbout project: project management tool. In this project we'll work on the onboarding screen\nMain user personas: people that need project management tools\nDefault language / brand language (playful, informative, informal etc.)\\*: smart, interesting\nBackground story: user just signed up for the platform, now they need to learn how to use it. They need to name their project in this screen\n\nContent elements of this screen:\nHeader\nShort instructional description\nButton\n\nGenerate microcopy for the content elements header, Short instructional description, Button.",
    "Roy goes to the grocery store to buy a can of juice that has 100 calories of calories and contains 1 gram of sugar. If he buys another can that has 150 calories and 1 gram of sugar and then buys another can that has 200 calories and 2 gram of sugar, how many calories and grams of sugar in total will he buy?",
    "Which command entered on a switch configured with Rapid PVST\\* listens and learns for a specific\ntime period?\nA. switch(config)#spanning-tree vlan 1 max-age 6\nB. switch(config)#spanning-tree vlan 1 hello-time 10\nC. switch(config)#spanning-tree vlan 1 priority 4096\nD. switch(config)#spanning-tree vlan 1 forward-time 20"
]
tokenizer = AutoTokenizer.from_pretrained("NousResearch/llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0, top_k=1, max_tokens=512, stop_token_ids=[tokenizer.eos_token_id])

llm = LLM(
    model="NousResearch/llama-2-7b-chat-hf",                  # "NousResearch/llama-2-7b-chat-hf",
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",      # "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    enforce_eager=True,
    gpu_memory_utilization = 0.9,
    max_num_seqs=1
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated text: {generated_text!r}")

print("------------------------------------------------")

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code need to be well-documented to ensure future contributors can easily understand the code.
Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

comaniac · 2024-06-11T05:23:11Z

cc @LiuXiaoxuanPKU who is working on enabling CUDA graph for flash infer so this should be covered as well.

bong-furiosa · 2024-06-13T00:57:50Z

🙇 Thank you for your interests in this PR! @LiuXiaoxuanPKU @comaniac !

However, I feel that frequent terminations during the check process through Buildkite are preventing the proper review of the PR.
As far as I know, vLLM's Buildkite automatically checks up to 5 times. Currently, the check attempts for buildkite/ci/pr/distributed-tests have reached the maximum limit, so it seems the PR will no longer be re-checked.
This makes it difficult to evaluate whether the code modification is correct when applied to the current vLLM source code.

In this cases, should I cancel the current PR and request a new one, or would it be better to wait?
I believe the advice from both of you will be very helpful in resolving any difficulties that occur when contributing to vLLM in the future!

comaniac · 2024-06-13T01:04:38Z

You can just rebase or push a new comment to trigger it again.

bong-furiosa · 2024-06-17T00:53:24Z

Hello @LiuXiaoxuanPKU @cadedaniel !

We inquire about the progress regarding the removal of batch expansion in vLLM (@cadedaniel mentioned it in #5016).
We are currently independently implementing Speculative Decoding without batch expansion in vLLM, separate from the original vLLM repo.
Our goal is to support the implementation on the FlashInfer and FlashAttention backend.
And this PR addresses a small bug that was discovered during the implementation process.

Could you please let us know where we can find and catch information about your ongoing implementation of Speculative Decoding without batch expansion in vLLM and the support for FlashInfer CUDA Graph? 🤔
Never mind! I think I found similar topic. #4628

LiuXiaoxuanPKU · 2024-06-18T07:57:26Z

Hi thanks for the PR! It's a great fix, I've already added it to #4628. Since that PR will be merged soon, if you don't mind, I will just wait for that PR (eta tmr) to get merged and integrate your fix (will also add you as a coauthor on that PR).

LiuXiaoxuanPKU · 2024-06-28T22:29:49Z

Close this PR as #4628 is merged. Thanks!

flashinfer bug at _prepare_model_input function is fixed

cf9edaf

LiuXiaoxuanPKU self-assigned this Jun 11, 2024

bong-furiosa added 3 commits June 13, 2024 01:21

fix comment

d9f14fc

Merge remote-tracking branch 'upstream/main'

12a00cf

fix comment and applied new commits

12469af

This was referenced Jun 14, 2024

Add cuda_device_count_stateless #5473

Merged

[Bug]: Distribute Tests PR test fails #5544

Closed

bong-furiosa added 2 commits June 15, 2024 02:34

Merge remote-tracking branch 'upstream/main'

61df3e8

merged with upstream and modified comment

170c942

LiuXiaoxuanPKU closed this Jun 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding. #5412

[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding. #5412

bong-furiosa commented Jun 11, 2024

comaniac commented Jun 11, 2024

bong-furiosa commented Jun 13, 2024

comaniac commented Jun 13, 2024

bong-furiosa commented Jun 17, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Jun 18, 2024

LiuXiaoxuanPKU commented Jun 28, 2024

[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding. #5412

[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding. #5412

Conversation

bong-furiosa commented Jun 11, 2024

PR Title and Classification

Code Quality

Notes for Large Changes

What to Expect for the Reviews

Thank You

comaniac commented Jun 11, 2024

bong-furiosa commented Jun 13, 2024

comaniac commented Jun 13, 2024

bong-furiosa commented Jun 17, 2024 • edited Loading

LiuXiaoxuanPKU commented Jun 18, 2024

LiuXiaoxuanPKU commented Jun 28, 2024

bong-furiosa commented Jun 17, 2024 •

edited

Loading