[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface #6260

AllenDou · 2024-07-09T11:51:03Z

This PR,

Add a new model XLMRobertaForSequenceClassification (for RAG scenario) https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py
which processes input data through a multimodal interface like the following

outputs = llm.process([{
    "prompt": prompt,
    "multi_modal_data": {
        "xlmroberta": inputs,
    }
}])

Introduce class ModelMode [DECODER, ENCODER, ENCODER_DECODER, EMBEDDING, SIMPLE]
Add a new model runner(simple_model_runner.py), I didn't use model_runner.py or embedding_model_runner.py because they have procedures like logits(), sample(), and pooling() that we don't need. Maybe we could find a way to create a more general model runner to support 'any' model
Add SimpleSequenceGroupOutput, SimpleRequestOutput, and SimpleOutput for general output
Tensor-parallel is enabled, but no benefits have been achieved so far
Test case/example added

I have two goals:

To find a way to enable vllm to integrate 'any' model
To replace these models' layers with vllm/layers/* as well as TP/PP to achieve performance improvement

CLOSE #6424
CLOSE #6789
CLOSE #8022

DarkLight1337 · 2024-07-09T12:24:01Z

If I'm understanding this PR correctly, you are basically using the multi-modal interface to pass data directly to the model (in this case the input IDs and attention mask). Are you working towards making vLLM function out-of-the-box with generic HuggingFace models?

DarkLight1337

Left some initial comments.

examples/flagembedding.py

tests/conftest.py

vllm/entrypoints/llm.py

vllm/multimodal/bge.py

vllm/multimodal/registry.py

vllm/outputs.py

AllenDou · 2024-07-09T13:32:48Z

If I'm understanding this PR correctly, you are basically using the multi-modal interface to pass data directly to the model (in this case the input IDs and attention mask). Are you working towards making vLLM function out-of-the-box with generic HuggingFace models?

Yes, you are right. In fact, I have two goals:

To find a way to enable vllm to integrate 'any' model.
To replace these models' layers with vllm/layers/* as well as TP/PP to achieve performance improvement.

After I replaced XLMRobertaForSequenceClassification's query/key/value linear with QKVParallelLinear, I saw a 15% performance improvement.

vllm/entrypoints/llm.py

vllm/config.py

vllm/engine/llm_engine.py

vllm/core/scheduler.py

vllm/engine/llm_engine.py

vllm/entrypoints/llm.py

vllm/entrypoints/openai/serving_embedding.py

vllm/worker/worker.py

DarkLight1337

Added some suggestions to improve the type annotations.

vllm/model_executor/models/__init__.py

vllm/config.py

vllm/model_executor/models/__init__.py

vllm/core/interfaces.py

vllm/worker/worker.py

DarkLight1337

The config and multi-modal parts LGTM. I assume the model is implemented correctly since the tests pass.

However, since I'm not involved with the internals of model executor, block manager and workers, I'll leave it to @robertgshaw2-neuralmagic to review those. He will also see how to integrate this into the existing work for encoder-decoder support.

robertgshaw2-redhat · 2024-07-11T22:30:52Z

Im going to take a look at this over the weekend. Thanks @AllenDou!

AllenDou · 2024-07-15T03:03:17Z

mark #6424

AllenDou · 2024-07-16T02:58:18Z

/ready

AllenDou · 2024-07-18T12:14:12Z

Hello @robertgshaw2-neuralmagic , just a friendly reminder to review this PR when you get a chance,
We want to enable more models through this feature, thank a lot!

robertgshaw2-redhat · 2024-07-18T12:50:53Z

Thanks @AllenDou ! this is on my list

AllenDou · 2024-07-26T07:12:16Z

mark #6789

AllenDou · 2024-07-31T08:10:23Z

mark #1187 #205

AllenDou · 2024-08-09T09:33:08Z

offline_inference_xlmroberta_awq.py is deleted, as after hacking autoawq for xlmroberta model, I see no gain of performance benifit under vllm serving. Trying to fp8.

…quenceClassification through multimodal interface

prashil1996 · 2024-08-30T21:11:58Z

@AllenDou @robertgshaw2-neuralmagic @DarkLight1337
Is there any timeline on when this could be delivered?

AllenDou · 2024-09-01T02:45:41Z

@AllenDou @robertgshaw2-neuralmagic @DarkLight1337 Is there any timeline on when this could be delivered?

Maybe we should wait until @robertgshaw2-neuralmagic gets a chance to review this PR.

DarkLight1337 · 2024-09-13T17:33:56Z

A quick heads-up that the new locations of the model tests have been adjusted in #7820, so please merge from main.

DarkLight1337 · 2024-09-13T17:34:19Z

Also @robertgshaw2-neuralmagic do you have any timeframe on when you will be available to review this PR?

fan-niu · 2024-09-25T10:22:07Z

@AllenDou Hello, thank you for your contribution. I use concurrency to request services, and there will be a series of error reports, including the following errors:
File "/share5/projects/llm/finetune-accelerate/inference/tensorrtllm-experiment/sglang/code/vllm_test/vllm_sqclassification/vllm/vllm/model_executor/models/xlmroberta.py", line 721, in forward batch_size, seq_length = input_shape ValueError: not enough values to unpack (expected 2, got 1)

Can you please help to optimize async requests? Thanks

Note:
I use http://127.0.0.1:8081/v1/completions, concurrency = 2,
payload:
{
"prompt": "what time is it",
"model":MODEL_NAME,
"stop": "<\s>",
"max_tokens":128,
"stream": False,
"presence_penalty": 0,
"frequency_penalty": 0,
"temperature": 0.0,
"top_p": 1,
}

AllenDou · 2024-09-25T11:02:13Z

@AllenDou Hello, thank you for your contribution. I use concurrency to request services, and there will be a series of error reports, including the following errors: File "/share5/projects/llm/finetune-accelerate/inference/tensorrtllm-experiment/sglang/code/vllm_test/vllm_sqclassification/vllm/vllm/model_executor/models/xlmroberta.py", line 721, in forward batch_size, seq_length = input_shape ValueError: not enough values to unpack (expected 2, got 1)

Can you please help to optimize async requests? Thanks

Note: I use http://127.0.0.1:8081/v1/completions, concurrency = 2, payload: { "prompt": "what time is it", "model":MODEL_NAME, "stop": "<\s>", "max_tokens":128, "stream": False, "presence_penalty": 0, "frequency_penalty": 0, "temperature": 0.0, "top_p": 1, }

This PR currently does not support frontend access through the http protocol. By the way, the XLM-Roberta model compares the similarity of two strings. Therefore, you need to pass a tuple containing two strings (string, string) as input. Please refer to examples/offline_inference_xlmroberta.py for more details.

fan-niu · 2024-09-25T11:11:05Z

@AllenDou Thanks for your reply, can you tell me how to add support for http requests on your branch? Thank you, or can you add this function if it is convenient for you? My final goal is to need llamaForSequenceClassification based on http request, I have added llamaForSequenceClassification to llama.py with reference to your branch, and implemented the forward function, but currently I cannot make the correct request through http.

wangyuanxiong-hub · 2024-11-01T07:36:21Z

@AllenDou Thanks for your reply, can you tell me how to add support for http requests on your branch? Thank you, or can you add this function if it is convenient for you? My final goal is to need llamaForSequenceClassification based on http request, I have added llamaForSequenceClassification to llama.py with reference to your branch, and implemented the forward function, but currently I cannot make the correct request through http.

Is it accessible via http now?

wangyuanxiong-hub · 2024-11-05T09:28:22Z

The test results were not as expected, and the input and output methods were very confusing.

AllenDou · 2024-11-05T09:50:55Z

@AllenDou Thanks for your reply, can you tell me how to add support for http requests on your branch? Thank you, or can you add this function if it is convenient for you? My final goal is to need llamaForSequenceClassification based on http request, I have added llamaForSequenceClassification to llama.py with reference to your branch, and implemented the forward function, but currently I cannot make the correct request through http.

Is it accessible via http now?

I apologize for the delayed response. I've been working on cute&ptx recently, therefore, this PR will not be updated for now :(

mergify · 2024-11-08T05:05:15Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AllenDou.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

DarkLight1337 · 2024-11-25T12:47:14Z

Closing as superseded by #10400. Sorry your PR didn't make it!

DarkLight1337 self-assigned this Jul 9, 2024

DarkLight1337 requested a review from ywang96 July 9, 2024 12:07

DarkLight1337 reviewed Jul 9, 2024

View reviewed changes

robertgshaw2-redhat self-requested a review July 9, 2024 13:39