support qwen2-vl #32318

Merged · 103 commits merged into huggingface:main on Aug 26, 2024

Conversation

simonJJJ (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker (Collaborator) left a comment

You are missing a few files for the auto-mapping to work! I would recommend running transformers-cli add-new-model-like and overwriting the config, md, etc. with what you have here!

Then you should be able to ping @zucchini-nlp for a review on this new multimodal model!

@zucchini-nlp (Member) left a comment

Great addition! Yes, after adding the auto mappings and md files, feel free to tag me for review. Let me know if you need any help with that.

@simonJJJ (Contributor, Author) commented Aug 1, 2024

Hi @zucchini-nlp, I've tidied up all the files, and all test cases pass.

@zucchini-nlp (Member) left a comment

Thanks for working on this! Great to see more multimodal LLMs.

My main concern with the current implementation is the chat template format: I wouldn't recommend passing images or processing kwargs in the template. We also need some changes to stay consistent with other transformers models; I've left more comments below.
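
For illustration, a minimal sketch of the recommended pattern (the checkpoint name and exact message schema here are assumptions, not part of this PR): keep the chat template purely textual and pass the images and any processing kwargs to the processor call instead.

```python
from PIL import Image
from transformers import AutoProcessor

# Illustrative checkpoint; any Qwen2-VL checkpoint with a chat template would do.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # placeholder only; no image data in the template
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# The template renders text plus an image placeholder; the actual image and any
# processing kwargs go to the processor call, not into the template itself.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=Image.open("cat.png"), text=prompt, return_tensors="pt")
```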

Review threads (resolved): docs/source/en/perf_infer_gpu_one.md · docs/source/en/model_doc/qwen2-vl.md (×4) · src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (×4)
@ArthurZucker (Collaborator) left a comment

Some last nits, but this should be good to go! 🔥

if images is not None:
    pixel_values, vision_grid_thws = [], []
    for image in images:
        patches, image_grid_thw = self._preprocess(
Collaborator:

self._preprocess already loops over the provided images, so why are we not simply calling self._preprocess once?

Contributor (Author):

We batch the images into one long sequence, since different images have different sequence lengths.
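
For context, a toy sketch of that batching scheme (the shapes here are made up, not the real patch dimensions): variable-length per-image patch sequences are concatenated along one axis, with a grid tensor recording each image's patch layout.

```python
import torch

# Two images that produce different numbers of patches (toy shapes).
patches_per_image = [torch.randn(64, 1176), torch.randn(256, 1176)]
grid_thw_per_image = [(1, 8, 8), (1, 16, 16)]  # (temporal, height, width)

# Concatenate along the sequence axis instead of stacking a padded batch.
pixel_values = torch.cat(patches_per_image, dim=0)   # (64 + 256, 1176)
vision_grid_thws = torch.tensor(grid_thw_per_image)  # (num_images, 3)
```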

Review threads (resolved): src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (×2)
Comment on lines +264 to +268
self.mlp = nn.Sequential(
    nn.Linear(self.hidden_size, self.hidden_size),
    nn.GELU(),
    nn.Linear(self.hidden_size, dim),
)
Collaborator:

This could also just use the VisionMlp with GELU.

Contributor (Author):

We do not want to reuse VisionMlp here, since the two modules have different semantics.
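
For comparison, here is the nn.Sequential above written as a standalone module (a sketch only; the class name and signature are hypothetical, not the actual transformers VisionMlp):

```python
import torch
import torch.nn as nn

class MergerMlp(nn.Module):
    """Hypothetical module: the same hidden -> hidden -> out stack as above."""

    def __init__(self, hidden_size: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_size, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```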

Comment on lines +326 to +327
for i in range(1, len(cu_seqlens)):
    attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = True
Collaborator:

This can probably be vectorized, but it's good enough for now!

Contributor (Author):

It's not trivial to vectorize because of the dynamic sequence lengths.
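
For the record, one possible vectorization (a sketch under the assumption that cu_seqlens is a 1-D tensor of cumulative lengths starting at 0; not what this PR ships): bucketize positions into segments, then compare segment ids pairwise.

```python
import torch

cu_seqlens = torch.tensor([0, 3, 7, 9])  # toy cumulative sequence lengths
total_len = int(cu_seqlens[-1])

# Each position gets the id of the segment it falls into; two positions may
# attend to each other iff they share a segment id (block-diagonal mask).
segment_ids = torch.bucketize(torch.arange(total_len), cu_seqlens, right=True)
attention_mask = segment_ids[:, None] == segment_ids[None, :]
```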

kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += cache_position[0] + 1
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
Collaborator:

Why don't we use rotary_seq_len = cache_position[-1] here?

Contributor:

This segment of code is primarily copied from Qwen2. I've noticed some recent changes in the implementation. Would it be better to modify it like this to maintain consistency?

kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)

Collaborator:

Ah, actually no for this part, as get_usable_length is "old"; sorry for that. I was mostly commenting on the fact that cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) does not use the rotary sequence length argument, while the FlashAttention path used it.
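
To make the distinction concrete, a minimal sketch (values made up) of deriving the rotary length from cache_position, the pattern referenced above:

```python
import torch

# cache_position holds the absolute positions of the tokens processed in this
# forward pass, so the rotary table must cover up to its last entry + 1.
cache_position = torch.arange(10, 12)         # e.g. decoding tokens 10 and 11
rotary_seq_len = int(cache_position[-1]) + 1  # -> 12
```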

Reply:

I think line 763 should be kv_seq_len = cache_position[0] + 1. Btw, this line seems to be unused in Qwen2VLAttention.

Member:

RoPE for Qwen has been modified after this PR, so we don't rely on the kv-length anymore; so yes, the variable is useless now :)

@simonJJJ (Contributor, Author)

Hi @ArthurZucker, I think we are all good to merge this PR?

@zucchini-nlp (Member)

Totally forgot about this: can we swap the order of the input args for the processor so that it is 'images, text, ...'? We are doing processor standardization, and it'll be easier to have the correct order from the beginning instead of deprecating one more model. I'll take care of the whole standardization of the Qwen2VLProcessor kwargs later.

@simonJJJ (Contributor, Author)

> Totally forgot about this: can we swap the order of the input args for the processor so that it is 'images, text, ...'?

What is the correct order? Alphabetical order?

@zucchini-nlp (Member)

No, it's just the main inputs that should be in the order 'images, text, videos', while right now it is 'text, images, videos, ...'. You can leave the order of the other kwargs as it is; we'll take care of the rest.
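
A minimal signature sketch of the requested order (illustrative only, not the final transformers code):

```python
class Qwen2VLProcessor:  # sketch of the call signature, not the real class body
    def __call__(self, images=None, text=None, videos=None, **kwargs):
        """Main inputs in the standardized order: images, text, videos."""
        ...
```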

@simonJJJ (Contributor, Author)

> No, it's just the main inputs that should be in the order 'images, text, videos'.

Done.

@ArthurZucker (Collaborator)

Yep gimme a minute to check the new changes and merge accordingly!

@ArthurZucker (Collaborator) left a comment

Okay, one final nit (the rotary_emb seq_len thread above) and let's merge! 🔥

@ArthurZucker (Collaborator)

Thanks a lot for bearing with me; we'll actually take care of changing that in another PR. Let's merge 🤗
Compile won't be supported out of the box, but it's alright otherwise!
Congrats to the team for the awesome model!

@ArthurZucker merged commit 19e6e80 into huggingface:main on Aug 26, 2024
23 of 25 checks passed
@simonJJJ (Contributor, Author)

> Thanks a lot for bearing with me; we'll actually take care of changing that in another PR. Let's merge 🤗

Thanks a lot! I really appreciate you guys' effort!

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* support-qwen2-vl

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* hyphen->underscore

* make style

* add-flash2-tipd

* delete-tokenize=False

* remove-image_processor-in-init-file

* add-qwen2_vl-in-MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES

* format-doct

* support-Qwen2VLVisionConfig

* remove-standardize_cache_format

* fix-letter-varaibles

* remove-torch-in-image-processor

* remove-useless-docstring

* fix-one-letter-varaible-name

* change-block-name

* default-quick-gelu-in-vision

* remove-useless-doc

* use-preimplemented-flash-forward

* fix-doc

* fix-image-processing-doc

* fix-apply-rotary-embed

* fix-flash-attn-sliding-window

* refactor

* remove-default_template

* remove-reorder_cache

* simple-get-rope_deltas

* update-prepare_inputs_for_generation

* update-attention-mask

* update-rotary_seq_len

* remove-state

* kv_seq_length

* remove-warning

* _supports_static_cache

* remove-legacy-cache

* refactor

* fix-replace

* mrope-section-doc

* code-quality

* code-quality

* polish-doc

* fix-image-processing-test

* update readme

* Update qwen2_vl.md

* fix-test

* Update qwen2_vl.md

* nit

* processor-kwargs

* hard-code-norm_layer

* code-quality

* discard-pixel-values-in-gen

* fix-inconsistent-error-msg

* unify-image-video

* hidden_act

* add-docstring

* vision-encode-as-PreTrainedModel

* pixel-to-target-dtype

* update doc and low memoryvit

* format

* format

* channel-foramt

* fix vit_flashatt

* format

* inherit-Qwen2VLPreTrainedModel

* simplify

* format-test

* remove-one-line-func-in-image-processing

* avoid-one-line-reshape

* simplify-rotary_seq_len

* avoid-single-letter-variable

* no-for-loop-sdpa

* avoid-single-letter-variable

* remove-one-line-reshape

* remove-one-line-reshape

* remove-no-rope-in-vit-logic

* default-mrope

* add-copied-from

* more-docs-for-mrope

* polish-doc

* comment-and-link

* polish-doc

* single-letter-variables

* simplify-image-processing

* video->images

* kv_seq_len-update

* vision-rope-on-the-fly

* vision-eager-attention

* change-processor-order

---------

Co-authored-by: baishuai <[email protected]>
Co-authored-by: ShuaiBai623 <[email protected]>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
dataKim1201 pushed a commit to dataKim1201/transformers that referenced this pull request Oct 7, 2024