merge decoder and decoder with past to stateful for seq2seq #1078
base: main
Conversation
Looks great @eaidova 🔥
Ticket: 159473. Optimum-intel PR: huggingface/optimum-intel#1078. This PR switches optimum-intel in tests to the stateful seq2seq branch. Tests check both stateful and with-past decoders. Once the optimum-intel PR is merged I'll switch the version back to master.
LGTM !
Great PR, thanks @eaidova! Left two comments to make sure stateful-compatible models are exported as expected; good to merge once resolved.
Co-authored-by: Ilyas Moutawwakil <[email protected]>
Failed flux tests are not related; they are caused by updated models on the HF Hub. I prepared a PR to fix this issue.
What does this PR do?
This PR introduces a stateful approach, similar to the one used for decoder-only models, for encoder-decoder architectures such as Whisper, T5, etc.
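A minimal usage sketch of the existing optimum-intel seq2seq entry point (the model id and prompt are illustrative; any flag that toggles the stateful decoder is not shown here). With this PR the exported pipeline is expected to contain a single stateful decoder instead of separate decoder and decoder-with-past models:

```python
# Sketch: export a seq2seq model to OpenVINO and generate with it.
from transformers import AutoTokenizer
from optimum.intel import OVModelForSeq2SeqLM

model_id = "google-t5/t5-small"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True)

inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```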
Background
Decoders in seq2seq models have additional key-value cache pairs produced by the decoder cross-attention over the encoder state. Within a generation cycle the encoder is called once, so these states are identical on every decoder inference step, unlike the self-attention cache, whose sequence length grows on each step. For efficiency, these values should be calculated only once, during the first decoder step, but that leads to differences in the model graph or requires condition blocks (if we want to have a single model). The current optimum-intel export solution uses 2 decoders, which maintains optimal performance but increases pipeline memory consumption due to full weight duplication in memory for the decoders.
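To make the difference concrete, here is a toy sketch (pure NumPy, not the exported graph; shapes and helper names are illustrative): the self-attention cache grows with every generated token, while the cross-attention cache is computed once from the encoder output and then stays fixed.

```python
import numpy as np

def compute_cross_kv(encoder_hidden_states, num_heads=8, head_dim=64):
    # Cross-attention K/V are projections of the fixed encoder output,
    # so their shape never changes during generation (zeros as stand-ins).
    batch, enc_len, _ = encoder_hidden_states.shape
    return np.zeros((batch, num_heads, enc_len, head_dim))

def decoder_step(self_kv, cross_kv, encoder_hidden_states):
    if cross_kv is None:
        # Only needed on the first decoder call.
        cross_kv = compute_cross_kv(encoder_hidden_states)
    # Self-attention cache grows by one position per generated token.
    batch, num_heads, _, head_dim = self_kv.shape
    new_entry = np.zeros((batch, num_heads, 1, head_dim))
    self_kv = np.concatenate([self_kv, new_entry], axis=2)
    return self_kv, cross_kv

encoder_out = np.zeros((1, 24, 512))   # encoder is called once per generation
self_kv = np.zeros((1, 8, 0, 64))      # empty self-attention cache
cross_kv = None
for step in range(3):
    self_kv, cross_kv = decoder_step(self_kv, cross_kv, encoder_out)
    print(step, self_kv.shape, cross_kv.shape)
# self-attention cache: (1, 8, 1, 64), (1, 8, 2, 64), (1, 8, 3, 64)
# cross-attention cache: (1, 8, 24, 64) on every step
```

The `if cross_kv is None` branch is exactly the kind of condition that either changes the graph between the first and subsequent steps, or requires a conditional block if only one decoder model is exported.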
This PR lifts that limitation and exports only one decoder, which conditionally calculates the cross_attn cache on demand during generation. Additionally, it moves cache management to the plugin level, which simplifies model usage and opens more possibilities for memory and performance optimizations on the runtime side. Our experiments show a 20-30% performance boost compared with the model with 2 stateless decoders.
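Conceptually, "stateful" here means the KV cache lives inside the OpenVINO InferRequest rather than being passed in and out by the pipeline. A minimal sketch of that pattern at the raw runtime level (the model path is illustrative, and other decoder inputs such as encoder hidden states or beam_idx are omitted for brevity):

```python
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("openvino_decoder_model.xml", "CPU")  # illustrative path
request = compiled.create_infer_request()

request.reset_state()  # start a new sequence: clears the internal KV cache
for step in range(3):
    next_token = np.array([[step]], dtype=np.int64)  # placeholder token ids
    request.infer({"input_ids": next_token})         # cache tensors stay inside `request`
```

Because the cache never crosses the application/plugin boundary, the runtime can manage its layout and memory itself, which is what makes the memory and performance optimizations mentioned above possible.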
Before submitting