
Attn implementation for composite models #32238

Merged
74 commits merged into huggingface:main on Oct 22, 2024

Conversation

zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Jul 26, 2024

What does this PR do?

Related to #32221 and fixes #30565. As described in the issue, multimodal models currently set the attn implementation only in the LM backbone, and not in the vision backbone. While working on it, I found another bug: using a property to set the SDPA flag on the VLM class doesn't work, because we don't know what self.LM is before the VLM is initialized. I tried a class property, but that would throw an error that self has no attribute LM.

This PR makes a few modifications to how attn impl is set in modeling to fix the above issues.

More precisely:

  • If the model is composite, i.e. has 2 or more PretrainedConfigs as part of the main config (text/vision config), then we set the requested attention implementation for each of the sub-configs as a dict. For example, in the LLaVa case:
model.from_pretrained(id, attn_implementation="sdpa")

# internally we do the following, where the second line dispatches SDPA on the LM backbone
# if we can't dispatch, an error is raised, as it should be when loading an LLM with unsupported attention (i.e. as if llama didn't support SDPA)
self.config._attn_implementation = {"text_config": "sdpa", "vision_config": "sdpa"}
AutoModelForCausalLM._from_config(config.text_config, attn_implementation=config._attn_implementation["text_config"])
  • If the user just loads the composite model without any explicit attention implementation, then we dispatch SDPA in each of the sub-models whenever possible. Wherever a vision/text sub-model doesn't support SDPA, we silently fall back to eager for it. No warnings, no errors, which is in line with non-composite models
# suppose the LM backbone doesn't support SDPA here, then we'll use eager silently
# but dispatch SDPA on the vision backbone
model.from_pretrained(id, attn_implementation=None)
  • Users are now also free to set different attention implementations for different backbones, e.g. eager on the vision model and sdpa on the LM. That can be done by passing attn_implementation as a dict, whose keys have to be identical to the model config's attribute names for the sub-configs. In other words, if the config has config.text_config, then the dict key must be "text_config" to dispatch correctly (see the sketch after this list):
model.from_pretrained(id, attn_implementation={"text_config": "sdpa", "vision_config": "eager"})
  • For non-composite models nothing changes and everything works as before
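Putting the three options together, a minimal usage sketch (the checkpoint id is only illustrative, and reading `_attn_implementation` off the sub-configs assumes the propagation described above):

```python
from transformers import LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint, any LLaVa-style model works

# 1) A single string is dispatched to every sub-model
#    (an error is raised if one of them doesn't support it).
model = LlavaForConditionalGeneration.from_pretrained(model_id, attn_implementation="sdpa")

# 2) No explicit implementation: SDPA is tried per sub-model,
#    silently falling back to eager wherever it isn't supported.
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# 3) A dict keyed by the sub-config attribute names sets a different
#    implementation per backbone.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation={"text_config": "sdpa", "vision_config": "eager"},
)

# The requested implementation ends up on the corresponding sub-configs:
print(model.config.text_config._attn_implementation)    # expected: "sdpa"
print(model.config.vision_config._attn_implementation)  # expected: "eager"
```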

NOTE:
I had a few rare cases where the model was not really composite but had several sub-configs, for example DBRX, and I decided to get rid of the sub-configs there. I am not sure what the right way to deprecate that is, so any comments on that are welcome.
Also, some models have nested configs, where a vision config is part of the text config (Qwen2-VL). Only three models are like that, mostly because I didn't review properly when the model was added and didn't insist on changing things. We don't consider those models composite and just leave them working as they were.

@zucchini-nlp zucchini-nlp changed the title from "Vlm sdpa flag" to "VLMs: dispatch sdpa to each sub model" on Jul 26, 2024
@zucchini-nlp zucchini-nlp requested a review from qubvel July 26, 2024 07:01
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qubvel qubvel left a comment


Thanks for working on this! Not sure I fully understand the logic at this moment, so I left a few comments below

@ydshieh
Collaborator

ydshieh commented Jul 26, 2024

Regarding

text_model_cls = MODEL_MAPPING.get(type(config.text_config), None)
if text_model_cls is not None:

it could probably reuse

def _get_model_class(config, model_mapping):

But both will be somewhat slow, as you mentioned, and sometimes give unexpected errors (because they load a modeling module file). But we might have some way to avoid that.

@zucchini-nlp
Member Author

zucchini-nlp commented Jul 29, 2024

@ydshieh @qubvel I made some changes after our last discussion, where we agreed that changing the class attr is not good. So now we don't change the attr per se, but we change the config's attn_impl manually while setting SDPA. I checked that it also works this way. The last question is whether to set it to SDPA when both sub-configs support it or when at least one supports it.

I will add more tests if we agree on the design

And also, regarding

But both will be somewhat slow, as you mentioned, and sometimes give unexpected errors (because they load a modeling module file). But we might have some way to avoid that.

unfortunately we can't do that, because it raises an error if the key is not in the dict, and we have models that are not auto-mappable. So we rather try to get the class and, if it's not there, simply continue.
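For context, a rough sketch of the two lookup styles being debated (simplified, not the actual modeling_utils.py code; the import paths are from memory and only illustrative):

```python
from transformers.models.auto.modeling_auto import MODEL_MAPPING


def lookup_with_get(sub_config):
    # Returns None for sub-configs that have no auto-mapping, so the caller can just continue.
    return MODEL_MAPPING.get(type(sub_config), None)


def lookup_with_helper(sub_config):
    # Reusing the internal helper instead; it raises for unmapped config types,
    # so it would need a try/except around it.
    from transformers.models.auto.auto_factory import _get_model_class

    try:
        return _get_model_class(sub_config, MODEL_MAPPING)
    except (KeyError, ValueError, AttributeError):
        return None
```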

@ydshieh
Collaborator

ydshieh commented Jul 29, 2024

unfortunately we can't do that, because it raises an error if the key is not in the dict, and we have models that are not auto-mappable. So we rather try to get the class and, if it's not there, simply continue.

but why not put a try/except around the usage of _get_model_class? I don't feel very strongly here, but it is something well designed for getting the model class, so it would be good if we could reuse it.

@zucchini-nlp
Member Author

Hmm, not sure how good it is to use a try/except; could it be another source of hidden bugs even if we use except AttributeError? I'm okay with either way, as long as it's consistent with the library and doesn't cause bugs

@zucchini-nlp
Member Author

As we discussed internally, I made the loading logic model-agnostic. So now we check if the config contains any PretrainedConfig inside and set the attn implementation on each PretrainedConfig. This solves the vision-encoder-decoder problem we had in Slack and makes us more versatile for possible future multimodal models.

I added some tests for vision encoder-decoder models where only one sub-config supports SDPA and where both do (see the sketch after the TODO list below). Also added tests in Idefics (which were skipped earlier) to verify that the LM gets SDPA dispatched even though the other modules do not.

TODO in the next few hours or tomorrow:

  • Add similar tests to encoder-decoder models, not only vision enc-dec
  • Accept from users a config that already has separate attn_implementation values set, e.g. attn_implementation={"vision_config": "sdpa", "text_config": "eager"}
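A hedged sketch of what such a check could look like for a vision encoder-decoder model (the checkpoint and the exact attributes inspected are assumptions for illustration, not the PR's actual test code):

```python
from transformers import VisionEncoderDecoderModel

# Load without requesting an implementation: SDPA should be dispatched per
# sub-model where supported, with a silent fallback to eager elsewhere.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Each backbone carries its own resolved implementation on its config.
print(model.encoder.config._attn_implementation)  # e.g. "sdpa"
print(model.decoder.config._attn_implementation)  # e.g. "sdpa" or "eager"
```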

@ArthurZucker
Collaborator

Related to #33953 as well, will have a look, but having to change modeling code means we are doing something wrong as it should be a layer of abstraction!

@zucchini-nlp
Member Author

Cool, thanks! The modeling is mostly changed to get the attn implementation from the dict, and for IDEFICS (1 and 2), which fall outside the standard and enforce hard requirements on SDPA. Hope we can get this merged soon, as it becomes more of a problem as new quirky models are added

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks, it is indeed important work!

Collaborator

have seen these modified elsewhere, let's revert as unrelated

Comment on lines 1452 to 1454
self.vision_model = Blip2VisionModel._from_config(
config.vision_config, attn_implementation=config._attn_implementation["vision_config"]
)
Collaborator

my main question is why is this info not stored in the config.vision_config directly?
Initializing the composite config should propagate the attn_implementation attribute to the sub configs, instead of extracting it in modeling
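A hedged illustration of the suggestion (LlavaConfig is only used here as an example composite config; where exactly the propagation happens is the open question):

```python
from transformers import LlavaConfig

config = LlavaConfig()  # builds default vision_config / text_config sub-configs
config._attn_implementation = "sdpa"

# The idea: building/configuring the composite config would also push the value down...
config.vision_config._attn_implementation = "sdpa"
config.text_config._attn_implementation = "sdpa"

# ...so that modeling code can stay a plain call like
#     Blip2VisionModel._from_config(config.vision_config)
# without indexing a dict stored on the parent config.
```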

Collaborator

That means updating from_config to support extracting the attn_implementation from the config 😉

Member Author

oh right! The initial plan was similar, but it was supposed to recursively call autoset_attn_impl from within autoset_attn_impl for each sub-module class. But that was too much extra code and checks

But I think I got what you mean here: instead of building a dict on the general config, we can simply propagate the user-requested attn to each sub-config. Then we hope it will get picked up in the later AutoModel.from_config calls. Good idea and a lot cleaner, let me see how it goes. Hopefully that way we don't need special treatment for nested configs

Collaborator

🤗 good luck!

Comment on lines +1587 to +1589
# Composite models consisting of several PretrainedModels have to specify attention impl as a dict
# where keys are sub-config names. But most people will specify one `str` which means that should dispatch it
# for all sub-models.
Collaborator

It makes sense, sometimes you want to output hidden states of one module but not the other, so I guess it's minimal flexibility.

Collaborator

but IMO we should propagate here

@zucchini-nlp
Member Author

@ArthurZucker I made changes as you suggested, but I want to warn that it still needed some extra logic. Making this PR revealed so many issues. So we now have two options to indicate attn for composite models:

Attn implementation as a dict:
The previous solution, where we simply copy the user-requested attn into all sub-configs and form a dict, so that the general config.attn = {"text_config": None, "vision_config": None}

  • no need for a flag indicating whether the attn implementation is set or not, because when we have a dict we always use the originally requested attn for all sub-configs. We just need to save it as a dict once and then go back to that dict to get()
  • we don't need to unset the attn implementation when saving, so one can set a different attn when loading any model back.
  • BUT we change a lot of modeling code to pass it in explicitly (attn_implementation=config._attn_implementation["text_config"]), and we need to account for cases like Chameleon or DBRX where the sub-configs don't even need attn

Attn implementation directly propagated to the corresponding configs:
The current solution, where we simply propagate the attn to all sub-configs we can find and don't touch the general config

  • no need to change any modeling code, except for general clean up
  • BUT we need a flag attn_implementation_autoset so we don't dispatch attn on the same config twice. Also, we would need to unset the flag when saving. See below for the reason

Imagine we have a composite or nested config. Currently we set attn in three places: from_pretrained, from_config and init. When we don't request any attn (None) and load with model.from_pretrained(id), internally we call autoset_attn, set SDPA on the general config, and propagate None to the sub-configs. But then the same autoset_attn method is called a second time when we init the class, and this time we try to propagate SDPA to all sub-configs (because the general config already has SDPA set). If any sub-module has no SDPA support, an error is raised as if SDPA had been requested directly. From the user's perspective they did nothing wrong, yet they see an error.
So the decision is to stop auto-setting attn for a config once it has been set, which is what happens now. In the previous version I had the attn implementations saved as a dict, which basically did the same: when we got a dict attn, we had nothing to set and simply returned
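A schematic sketch of the guard described above (plain Python pseudologic of the flow, not the actual modeling_utils.py implementation; the flag name follows the attn_implementation_autoset idea mentioned earlier):

```python
from types import SimpleNamespace


def autoset_attn_implementation(config, requested=None, sdpa_supported=True):
    """Only auto-dispatch once per config, so the second call made during class init
    doesn't re-read an auto-chosen "sdpa" as a hard user request and error out on
    sub-models without SDPA support."""
    if getattr(config, "_attn_implementation_autoset", False):
        return config
    if requested is not None:
        config._attn_implementation = requested  # explicit request: error elsewhere if unsupported
    else:
        config._attn_implementation = "sdpa" if sdpa_supported else "eager"  # silent fallback
    config._attn_implementation_autoset = True
    return config


# The from_pretrained-time call picks SDPA; the later init-time call is a no-op:
cfg = SimpleNamespace()
autoset_attn_implementation(cfg)   # sets "sdpa" and marks the config as autoset
autoset_attn_implementation(cfg)   # returns early, nothing is re-dispatched
print(cfg._attn_implementation)    # "sdpa"
```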

Collaborator

@ArthurZucker ArthurZucker left a comment


Modeling changes are just perfect 😉 before we have a big refactor this is super super welcome

Comment on lines +1598 to +1606
for key in config:
    if isinstance(getattr(config, key), PretrainedConfig):
        sub_config = getattr(config, key)
        curr_attn_implementation = (
            requested_attn_implementation
            if not isinstance(requested_attn_implementation, dict)
            else requested_attn_implementation.get(key, None)
        )
        sub_config._attn_implementation_internal = curr_attn_implementation
Collaborator

If this is done to set the attn_implementation of the sub-configs, I am wondering if we can't have a way to use is_composition to simply say (see the sketch below):

  • if we are a composition, then we call the _autoset_attn_implementation function on our composite members
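A hedged sketch of that recursive idea (is_composition is the existing PretrainedConfig class attribute; the helper itself and how it walks sub-configs are assumptions for illustration, not library code):

```python
from transformers import PretrainedConfig


def autoset_recursively(config: PretrainedConfig, requested=None) -> None:
    # A composition dispatches on each of its composite members first,
    # instead of the modeling code extracting implementations out of a dict.
    if getattr(config, "is_composition", False):
        for value in vars(config).values():
            if isinstance(value, PretrainedConfig):
                autoset_recursively(value, requested)
    config._attn_implementation = requested
```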

Today we have a problem with is_composition: we have this

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        if config_dict.get("model_type") == "mllama":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

which IMO we should never have to do, and we copy-paste this everywhere.

This means there is something wrong with how we handle composition!

@ArthurZucker
Collaborator

If we find a clean way to auto-set the attn implementation, it would be better in the future!

@zucchini-nlp
Member Author

zucchini-nlp commented Oct 21, 2024

merging tomorrow, the tests pass locally in most cases, except for cases not touched by this PR. For example, some FA2/SDPA equivalence tests are also failing for me on the main branch

@zucchini-nlp zucchini-nlp merged commit 21d5025 into huggingface:main Oct 22, 2024
23 of 25 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* first try

* codestyle

* idefics2 is happy

* [run-slow] llava, llava_next, video_llava, vipllava, llava_next_video, idefics, idefics2, kosmos2, fuyu, blip, blip_2, instructblip, instructblipvideo, paligemma

* fix-copies

* [run-slow] llava, llava_next, video_llava, vipllava, llava_next_video, idefics, idefics2, kosmos2, fuyu, blip, blip_2, instructblip, instructblipvideo

* blip-2 needs to init vision from config

* when was this removed O_o

* minor fix

* tests

* this way?

* tests

* model-agnostic code

* codestyle

* add tests for idefics

* modify general test for VLMs

* no generation test for vlm yet!

* no generation test here also

* warn in ViT-SDPA if output attn

* add more tests

* user can pass dict as attn impl

* repo consistency

* update

* musicgen

* no prints

* forgot speech enc-dec and clip

* how many composite models we have?

* musicgen melody is same as musicgen

* +siglip

* fix tests + add some more

* remove idefics custom overridden code

* make idefics2 automappable

* nits

* skip tests

* doctests

* Update src/transformers/models/idefics2/configuration_idefics2.py

Co-authored-by: amyeroberts <[email protected]>

* Update tests/models/clip/test_modeling_clip.py

Co-authored-by: amyeroberts <[email protected]>

* Update tests/models/idefics2/test_modeling_idefics2.py

Co-authored-by: amyeroberts <[email protected]>

* Update tests/models/idefics2/test_modeling_idefics2.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/configuration_utils.py

Co-authored-by: amyeroberts <[email protected]>

* major update, no need for automap

* clean up

* add FA2 test

* more tests

* style

* skip tests

* why did these started failing now?

* no attributes for FA2 needed

* one tiny test

* address comment about FA2 false warning

* style

* add new models and resolve conflicts

* fix copies

* let it be this way for now, come back tomorrow to review

* some more fixes

* update

* more updates

* update

* fix copies

* style and tests

* another big update

* fix tests

* fix tests

* update

* another update

* fix tests

* fix copies

* fix tests

---------

Co-authored-by: amyeroberts <[email protected]>

Successfully merging this pull request may close these issues.

Correct check for SDPA in Vision Language Models
7 participants