[Core]: (Last/N) Support prefill only models by Workflow Defined Engine #8964
Conversation
For the PR that is going to be merged, can you please modify the existing vLLM code instead of transferring them into a completely new namespace/directory? Otherwise, it's difficult to see how the existing code needs to be changed. If the implementation is completely different, you can create a file in the existing directory but with wde suffix, similar to how we have block manager v2 right now. |
Right now the vLLM project is divided by function: modules with the same function are put together, all executors together, all workers together. In practice this is very painful to maintain. According to Conway's Law, systems should be built around business capabilities. Although I really want more people to discuss the wde architecture, it seems that no one is very interested. |
Dividing by function is like putting all the pens together, all the paper together, and all the paints together: the result is very inconvenient to use. Dividing by usage scenario is obviously much easier to use and maintain. That is how wde modules are divided. |
Let's discuss it for a while; maybe there is a better way. |
Let's focus on a more specific issue: the InputProcessor. Reranker models take input pairs. With wde, I only need to switch to RerankerInputProcessor, which is plug and play. So where should I put this piece of code? https://github.com/noooop/vllm/blob/wde_encode_only/vllm/wde/reranker/ |
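The plug-and-play idea could be sketched roughly as follows. All class and function names here are hypothetical illustrations, not the actual wde API:

```python
from abc import ABC, abstractmethod
from typing import Any


class InputProcessor(ABC):
    """Hypothetical plug-and-play interface: each workflow selects the
    processor that matches its usage scenario."""

    @abstractmethod
    def process(self, prompt: Any) -> dict:
        ...


class TextInputProcessor(InputProcessor):
    def process(self, prompt: str) -> dict:
        # Ordinary single-text input, e.g. for generation or embedding.
        return {"text": prompt}


class RerankerInputProcessor(InputProcessor):
    def process(self, prompt: tuple) -> dict:
        # Reranker models consume (query, passage) pairs rather than a
        # single string; swapping in this processor is the only change.
        query, passage = prompt
        return {"query": query, "passage": passage}


def make_processor(task: str) -> InputProcessor:
    # The workflow picks the processor by task name.
    return {"generate": TextInputProcessor,
            "reranker": RerankerInputProcessor}[task]()
```

Under this shape, switching a workflow from generation to reranking only changes which processor `make_processor` returns; the rest of the engine is untouched.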
Even if your architecture is perfect (this is never the case, I keep throwing away my old code 😅), there is an important practical issue to consider, which is how to adopt this without breaking the guarantees currently provided by vLLM. Keep in mind that this is not a personal project, so you don't have the freedom to change everything in one commit - rather, hundreds of devs are working on this project at the same time. If you can't solve this issue, no one will be willing to jump to completely new code. So, I suggest keeping the original directory layout where possible.
Currently, our code is organized function-first, capability-second, as you have said. But, there's nothing preventing you from adding capability-specific files into function-specific directories for now. |
Can we start reviewing the code as it is laid out now? Specifically, review the modules in the order I listed them, top down; my code is also submitted in that order. Focus on the parts where I have slightly modified vLLM modules: should I keep both files, or apply the changes to the original files? Once the review is complete, I will change it to a function-first organization. |
Please change the original files. It's difficult to see the diffs otherwise. Also this makes sure that your implementation can still work with existing vLLM code. |
Ideally we should be able to run our existing tests on your code to verify its correctness. |
Can I put this code in the inputs folder? https://github.com/vllm-project/vllm/tree/main/vllm/inputs |
Yes, please do so. |
The role of wde/core/schema/engine_io.py in wde is the same as that of inputs.data. At the time of my first commit, you hadn't finished refactoring inputs.data. I can change to using inputs.data, but where should I put the input of the reranker?
I will discuss the code organization structure with you only this time. Other modules will be reorganized according to the results of this discussion. |
I prefer option 3. We start with something like this:
Then in future PRs, we can gradually move the existing code, e.g.:
To increase the chance of acceptance, I further recommend only doing this reorganization within one main module (in this case,
Only afterwards can we even consider making the switch from function-first to capability-first at the top level:
|
next
In vllm/inputs/preprocess.py, the verification and encoding parts are merged. In wde, preprocess is separated into two parts. Are you OK with that? Of course, this process will only be used in the prefill-only scheduler and will not modify the previous scheduler. |
As long as this strategy works for other model types, I'm fine with it. (i.e. don't split them up only to find that it actually doesn't work e.g. for embedding models) |
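As a rough illustration of the split being discussed (hypothetical helper names, not the actual wde code), verification and encoding could be separated like this:

```python
# Hypothetical two-stage preprocessing: what vllm/inputs/preprocess.py
# does in one step is split into a cheap verification stage and an
# encoding stage that is the only part touching the tokenizer.

def verify_input(prompt: str, max_len: int = 512) -> str:
    # Stage 1: validation only -- no tokenizer needed, so it can run
    # early and fail fast.
    if not prompt:
        raise ValueError("prompt must be non-empty")
    if len(prompt) > max_len * 8:  # rough character-level bound
        raise ValueError("prompt too long")
    return prompt


def encode_input(prompt: str, tokenizer) -> list[int]:
    # Stage 2: encoding -- tokenizer access is isolated here.
    return tokenizer(prompt)


def preprocess(prompt: str, tokenizer) -> list[int]:
    # The two stages compose back into the original single-step behavior.
    return encode_input(verify_input(prompt), tokenizer)
```

The composed `preprocess` behaves like the merged version, so callers that don't care about the split are unaffected.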
I'm busy these days, so I'll submit new code on the 8th. |
next
If you have time, please review the prefill-only attention code. I will put prefill-only attention under the attention folder.
Are you OK with that? |
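For context on why prefill-only attention is simple: with no decode stage there is no KV cache to extend and no single-token query path, and encoder-only models such as BERT attend bidirectionally. A minimal pure-Python sketch (illustrative only, not the actual backend code):

```python
import math


def prefill_only_attention(q, k, v, causal=False):
    # Prefill-only models process every sequence in one full pass:
    # no KV cache append, no incremental decode step. Encoder-only
    # models additionally use non-causal (bidirectional) attention.
    # q, k, v: lists of vectors with shape (seq_len, head_dim).
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        limit = i + 1 if causal else len(k)
        # Scaled dot-product scores against all (visible) keys.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(limit)]
        # Numerically stable softmax.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value vectors.
        out.append([sum(w * v[j][t] for j, w in enumerate(weights))
                    for t in range(d)])
    return out
```

With `causal=False` every query position sees the whole sequence, which is the bidirectional behavior BERT-style models need.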
No worries, take your time.
I'm not familiar with the attention implementation, so you will have to ask someone else for this. This is another reason why we should split up the PR so each PR focuses on one module. Then we can have different people reviewing changes to each module. |
Can you help me find someone to take a look? Prefill-only attention has no decoding stage, so it is very simple. |
@comaniac may be able to comment on the attention backend. But I suggest you split up the PR first. |
So far it seems that you're just adding brand new files. Are you going to further refine those PRs so vLLM actually uses them? |
I organized the previous code function-first. |
Do you have any suggestions for the approximate location of the new code? |
I prefer you delete the old files that have been split up to the new locations. This will prevent the issue of forgetting to copy over the changes from more recent PRs since your PR was first created (by forcing you to handle merge conflicts). |
I know; this PR is temporarily being used to discuss how to organize the code, so I'm submitting all the code for now. |
For example:
|
This PR definitely needs to be resubmitted. |
I find myself frequently having to switch over to another copy of the repo and find where your code originally comes from. It would ease the review process if you actually "move" the code. |
Also, to avoid triggering CI, please mark those PRs with draft status. |
Sorry, I don't want to make it so complicated. Previously:
|
Honestly, I don't care too much about the location of the new files. The more important part is how to transition from the existing code to the new layout. |
I would rather you dive straight into the refactoring PRs one at a time instead of just partially showing the new layout. For looking at the new layout, I prefer the branch that has all the new files in one place so I can get the big picture. |
So how should I cooperate? This is all the code needed to support the first model: BERT. |
You can have two branches:
|
|
I've added back the previous wde code. Do you need me to split the wde code by module and submit it in multiple PRs? |
For viewing the layout, keep everything in the same PR (don't split it). |
Is everything okay now? (Just a moment ago I accidentally pushed -f to #8452; it's finally fixed. push -f is too dangerous.) |
It is better now. I don't think attention backends should be placed under |
Also, you have some inconsistent naming, where model tests are placed under |
The biggest difference between prefill-only models and decoding models lies in the attention and the scheduler.
These correspond to #9124 and #9181; the two PRs are not directly related and can be reviewed at the same time without conflict. Why add new files instead of modifying the existing code directly?
If we merge #9124 and #9181, the rest is very simple code and easy to merge. As for the Workflow Defined Engine architecture, it is actually similar to the code below and is not complicated.
|
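To make the scheduler difference concrete: a prefill-only scheduler has no running queue, no preemption, and no swapped state, since every request executes exactly once. A minimal sketch under those assumptions (hypothetical names, not the PR's actual code):

```python
from collections import deque


def prefill_only_schedule(waiting: deque, max_batched_tokens: int) -> list:
    # FCFS batching under a token budget is essentially the whole
    # policy: pop requests from the waiting queue until the next one
    # no longer fits, and run the batch once. There is no decode
    # phase, so scheduled requests never return to a running queue.
    batch, budget = [], max_batched_tokens
    while waiting and waiting[0]["num_tokens"] <= budget:
        req = waiting.popleft()
        budget -= req["num_tokens"]
        batch.append(req)
    return batch
```

For example, with four 300-token requests and a 1000-token budget, the first three are batched and the fourth waits for the next step.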
Since @WoosukKwon has already started the refactoring, you should coordinate this with him. |
Do you mean that this PR should be submitted directly on top of the refactored version? |
You should ping him with a link to this PR. |
FIX #8453
PTAL #8452 #8779
This is the first part of [Support prefill only models by Workflow Defined Engine]
The main goal is to support the BERT model
Includes the following parts:
[Core]: (1/N) Support prefill only models by Workflow Defined Engine - Prefill only scheduler #9181
InputProcessor: Input(request_id, inputs, params, arrival_time) -> InputProcessor -> Request
RequestProcessor: Request -> RequestProcessor -> SchedulableRequest
Scheduler
[Core]: (3/N) Support prefill only models by Workflow Defined Engine - new models
Modifications required to the vLLM code (not yet submitted):
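The InputProcessor/RequestProcessor pipeline listed above (Input -> InputProcessor -> Request -> RequestProcessor -> SchedulableRequest) could be sketched with minimal data shapes like these (hypothetical classes for illustration; the real wde schema lives in wde/core/schema/engine_io.py):

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Output of the InputProcessor: validated but not yet tokenized.
    request_id: str
    inputs: str
    arrival_time: float


@dataclass
class SchedulableRequest:
    # Output of the RequestProcessor: what the scheduler consumes.
    request: Request
    token_ids: list[int]

    @property
    def num_tokens(self) -> int:
        return len(self.token_ids)


def input_processor(request_id: str, inputs: str, params: dict,
                    arrival_time: float) -> Request:
    # Stage 1: light validation/normalization, no tokenizer involved.
    return Request(request_id, inputs.strip(), arrival_time)


def request_processor(req: Request, tokenizer) -> SchedulableRequest:
    # Stage 2: tokenization, producing a schedulable unit.
    return SchedulableRequest(req, tokenizer(req.inputs))
```

Keeping the two stages separate mirrors the preprocess split discussed earlier: the first stage can reject bad requests before any tokenizer work happens.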