
[Feature]: CI - Split up "Models Test" and "Vision Language Models Test" #7439

Closed
jon-chuang opened this issue Aug 12, 2024 · 6 comments · Fixed by #9846

Comments

@jon-chuang
Contributor

jon-chuang commented Aug 12, 2024

🚀 The feature, motivation and pitch

The "Models Test" takes 1+ hour on CI, while most other test suites take under ~30 minutes, so it ends up being a bottleneck.

It should therefore be split up, similar to how the kernel tests are split.

CC: @khluu

@youkaichao
Member

root for this!

@DarkLight1337
Member

This sounds great! However, we should still avoid regressions. I suggest testing a representative subset of models per PR and performing full testing once in a while.

Any ideas on which models we should always test?

@DarkLight1337
Member

A more concrete plan:

Let's add a pytest marker core_model to explicitly tag model tests to be tested in each PR. By default, unmarked model tests will only be tested daily.
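
For concreteness, here is a minimal sketch of how the marker could be wired up (the file paths, test name, and marker description below are just placeholders for illustration, not the actual test layout):

```python
# conftest.py (sketch): register the custom marker so pytest doesn't warn about it.
import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "core_model: representative model tests that should run on every PR",
    )


# tests/models/test_example.py (hypothetical test file)
@pytest.mark.core_model
def test_generate_matches_hf_output():
    # Tagged tests are selected by `pytest -m core_model` in the per-PR pipeline;
    # unmarked model tests would only run in the daily full CI.
    ...
```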

Separately, we can organize the model tests as follows:

models/embedding/language
models/encoder_decoder/language
models/decoder_only/language
models/decoder_only/vision
models/decoder_only/audio (once #7615 is merged)

The first three are currently included in the Models test, while the fourth is currently in the VLM test. They are split apart further so that model PRs can easily run only the tests that are relevant to their model.

(By the way, the models directory in main vLLM is pretty cluttered... perhaps we can also organize the model files as above?)

In the normal CI, we can add the pytest flag -m core_model so that only the tagged tests are run. The full CI (run daily) omits this flag to run all tests.
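
As a rough sketch (not necessarily how our CI will actually be configured), the same split could also be expressed in a conftest.py hook instead of on the command line; the RUN_FULL_MODELS_CI environment variable here is a hypothetical stand-in for whatever trigger the daily job uses:

```python
# conftest.py (sketch): deselect non-core model tests unless the daily full run is requested.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_FULL_MODELS_CI"):  # hypothetical switch set only by the daily job
        return
    skip_non_core = pytest.mark.skip(reason="only core_model tests run per-PR")
    for item in items:
        if "core_model" not in item.keywords:
            item.add_marker(skip_non_core)
```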

Regarding which model tests to mark as core_model:

  • We should select one model per family of architectures.
  • For multi-modal models, we should consider one model whose multi-modal encoder is defined by Transformers, and another whose encoder is defined by ourselves.
  • There is no need to consider LoRA, TP/PP and quantization behavior since their basic implementations are covered by other tests already. The full CI should be able to catch any regressions of those features for individual models.

@jon-chuang
Contributor Author

Hmm, it sounds unreliable to ask users to run specified model tests... In PyTorch, for instance, it's mandatory to run a fairly comprehensive test suite on every PR.

I think that changing this behaviour will potentially introduce uncaught bugs.

Separately, we can organize the models tests as follows:

models/embedding/language
models/encoder_decoder/language
models/decoder_only/language
models/decoder_only/vision
models/decoder_only/audio (once #7615 is merged)

I think that splitting things up like this is a good first step.

@DarkLight1337
Member

DarkLight1337 commented Sep 2, 2024

After some offline discussion, we have decided to cut down on the number of models to regularly test in CI. This is due to the following reasons:

  • Most of the model tests aim to test the correctness of the model implementation. However, there is no need to repeatedly test this unless the model file itself is updated. Instead, we can cover the code paths in model loader, layers, executor and runner (which are shared between models) using a relatively small number of tests.
  • Some of the model PRs are from the same organization that originally developed the model. In this case, we trust the authors to implement their own model correctly.
  • With the growing number of models supported by vLLM, it becomes increasingly expensive to test them all in the long run.

For encoder-decoder and embedding models, we will preserve all tests since there are only a few tests involved.

For multimodal models, @ywang96 and I have decided to only test the following models regularly in CI:

  • LLaVA w/ our implementation of CLIPEncoder
  • PaliGemma w/ our implementation of SiglipEncoder
  • Ultravox w/ HuggingFace's implementation of WhisperEncoder
  • (A model that supports video input, once it's implemented in vLLM. Qwen2-VL will add M-ROPE so we should test that)

Meanwhile, @simon-mo will help narrow down the list of decoder-only language models to test.

Note that this does not remove the need to test new models. New model PRs should still implement tests so we can verify their implementation, as well as re-run those tests whenever a future PR updates related code. Until @khluu figures out how to create a separate test item for each model without clogging up the web UI, these tests will remain excluded from CI; instead, we can run these tests locally as necessary.

@DarkLight1337
Member

Closing as completed by #9488. VLM tests will be further split up after #9372.
