Generate: Deprecate returning legacy cache by default; Handle `use_cache=False` #32863

gante · 2024-08-17T12:24:01Z

What does this PR do?

Another step towards using Cache everywhere 💪

This PR makes the following [Cache+generate]-related changes:

- Don't initialize a cache when `use_cache=False` (fixes "Cache updating when use_cache = False" #32843)
- `generate` tests now explicitly pass `use_cache`, instead of setting it in `model.config` 🤢 We were relying on a LOT of side effects, and missing the incorrect case mentioned in "Cache updating when use_cache = False" #32843
- Add sanity checks on cache-related parameters, in `generation_config`
- Add a deprecation cycle on the default cache return type, so we start returning a `Cache` instance by default on `generate`
- Isolate all cache initialization logic in `generate` into a single function, and reorganize the logic by blocks
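
To make the "sanity checks on cache-related parameters" item concrete, here is a minimal, self-contained sketch. It is not the code added in this PR: `ToyGenerationConfig` and its `validate` method are hypothetical stand-ins, the field names are chosen for illustration, and the rule itself (warn when cache options are set while `use_cache=False`) is an assumption about what such a check might do.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyGenerationConfig:
    """Hypothetical stand-in for a generation config; not the real class."""

    use_cache: bool = True
    cache_implementation: Optional[str] = None
    return_legacy_cache: Optional[bool] = None

    def validate(self) -> None:
        # Assumed rule: cache options have no effect when caching is off,
        # so flag them instead of silently ignoring them.
        if self.use_cache is False:
            for name in ("cache_implementation", "return_legacy_cache"):
                if getattr(self, name) is not None:
                    warnings.warn(
                        f"`{name}` is set, but `use_cache=False`, so it will have no effect.",
                        UserWarning,
                    )
```

Calling `ToyGenerationConfig(use_cache=False, cache_implementation="static").validate()` emits one warning in this sketch, while the same call with `use_cache=True` stays silent.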

HuggingFaceDocBuilderDev · 2024-08-17T13:01:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2024-08-17T15:16:50Z

src/transformers/generation/configuration_utils.py

@@ -130,9 +130,29 @@ class GenerationConfig(PushToHubMixin):
            [this paper](https://arxiv.org/pdf/1610.02424.pdf) for more details.
        penalty_alpha (`float`, *optional*):
            The values balance the model confidence and the degeneration penalty in contrastive search decoding.
+        dola_layers (`str` or `List[int]`, *optional*):


Moved `dola_layers` up to this documentation section ("Parameters that control the generation strategy used"), where it makes more sense.

gante · 2024-08-17T15:17:29Z

src/transformers/generation/configuration_utils.py

+            `'high'` to improve short-answer tasks. Check the [documentation](https://github.com/huggingface/transformers/blob/main/docs/source/en/generation_strategies.md)
+            or [the paper](https://arxiv.org/abs/2309.03883) for more details.
+
+        > Parameters that control the cache


New cache-related docs section in `GenerationConfig`; all cache-related flags were moved here.

gante · 2024-08-17T15:17:46Z

src/transformers/generation/configuration_utils.py

@@ -544,8 +539,9 @@ def validate(self, is_init=False):
            raise ValueError(f"`max_new_tokens` must be greater than 0, but is {self.max_new_tokens}.")
        if self.pad_token_id is not None and self.pad_token_id < 0:
            warnings.warn(
-                f"`pad_token_id` should be positive but got {self.pad_token_id}. This will cause errors when batch generating, if there is padding. "
-                "Please set `pad_token_id` explicitly by `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation, and ensure your `input_ids` input does not have negative values."
+                f"`pad_token_id` should be positive but got {self.pad_token_id}. This will cause errors when batch "


(The original line exceeded 120 characters.)

gante · 2024-08-17T15:17:58Z

src/transformers/generation/configuration_utils.py

@@ -675,6 +671,14 @@ def validate(self, is_init=False):
                        group_error_prefix
                        + "`diversity_penalty` should be greater than `0.0`, otherwise your groups will be identical."
                    )
+            # DoLa generation
+            if self.dola_layers is not None and (self.repetition_penalty is None or self.repetition_penalty < 1.2):


gante · 2024-08-17T15:19:27Z

src/transformers/generation/utils.py

@@ -136,27 +136,23 @@ class GenerateDecoderOnlyOutput(ModelOutput):
        sequences (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            The generated sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter
            if all batches finished early due to the `eos_token_id`.
-        scores (`tuple(torch.FloatTensor)` *optional*, returned when `output_scores=True` is passed or when `config.output_scores=True`):
+        scores (`tuple(torch.FloatTensor)` *optional*, returned when `output_scores=True`):


In our docs we often mention that there are two ways to parameterize generate (generation_config or pass arg to generate). I don't think we need to be verbose here.

Also, setting through config is deprecated 😉

gante · 2024-08-17T15:20:38Z

src/transformers/generation/utils.py

+            Returns the model cache, used to speed up decoding. Different models have a different cache format, check
+            the model's documentation. Usually, a [`~cache_utils.Cache`] instance.


Rewrote this one.

The old description was outdated (legacy cache), and we now know that different models have different caches, so we shouldn't try to be specific here. The model class docs can be more precise, so let's redirect users there.

gante · 2024-08-17T15:21:13Z

src/transformers/generation/utils.py

@@ -328,6 +312,7 @@ class GenerateBeamEncoderDecoderOutput(ModelOutput):
    past_key_values: Optional[Tuple[Tuple[Tuple[torch.FloatTensor]]]] = None


+# TODO (joao): remove the equivalent classes and typing shortcuts below in v5


(These aliases made sense in the past, not anymore. They are, however, hard to deprecate!)

gante · 2024-08-17T15:23:00Z

src/transformers/generation/utils.py

@@ -1497,6 +1482,127 @@ def _supports_default_dynamic_cache(self) -> bool:
        """
        return self._supports_cache_class and "jamba" not in self.__class__.__name__.lower()

+    def _prepare_cache_for_generation(


New function, moving the cache logic out of `generate`. I've organized the logic in blocks, putting the cases where we DON'T prepare a new cache at the top.

It does essentially the same thing as before, except for "Quick escape route 2", which is new. I also added the warning in "Quick escape route 3".
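
For readers who want a picture of the "escape routes first" layout, here is a minimal sketch. It is not the function from this PR: `prepare_cache` and `ToyCache` are hypothetical names, and the two early-return conditions shown (a user-supplied cache, and `use_cache=False`) are illustrative rather than a faithful copy of the numbered escape routes in the diff.

```python
from typing import Any, Dict


class ToyCache:
    """Hypothetical cache container; stands in for a real cache class."""

    def __init__(self) -> None:
        self.entries: list = []


def prepare_cache(model_kwargs: Dict[str, Any], use_cache: bool) -> Dict[str, Any]:
    """Return model_kwargs, adding a fresh cache only when one is needed."""
    # Escape route: the caller passed a cache, so reuse it untouched.
    if model_kwargs.get("past_key_values") is not None:
        return model_kwargs

    # Escape route: caching is disabled, so no cache is created at all.
    if not use_cache:
        return model_kwargs

    # Default path: nothing ruled a cache out, so build a new one.
    model_kwargs["past_key_values"] = ToyCache()
    return model_kwargs
```

Keeping the "no new cache" cases at the top means the default path at the bottom never has to re-check them.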

gante · 2024-08-17T15:25:03Z

src/transformers/generation/utils.py

-            if isinstance(result, ModelOutput) and hasattr(result, "past_key_values"):
-                if isinstance(result.past_key_values, (DynamicCache, EncoderDecoderCache)):
-                    result.past_key_values = result.past_key_values.to_legacy_cache()
+        # Convert to legacy cache format if requested


This logic is expanded to handle a deprecation cycle
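
As a rough illustration of what a deprecation cycle on a return type can look like, here is a self-contained sketch. Everything in it is hypothetical (`ToyCache`, `finalize_cache`, the `return_legacy` flag and the warning text); it only shows the general pattern of keeping the old default while warning that it will change, and is not the logic from the diff.

```python
import warnings
from typing import List, Optional, Tuple


class ToyCache:
    """Hypothetical new-style cache that can export a legacy tuple format."""

    def __init__(self, layers: List[Tuple[int, int]]) -> None:
        self.layers = layers

    def to_legacy(self) -> Tuple[Tuple[int, int], ...]:
        return tuple(self.layers)


def finalize_cache(cache: ToyCache, return_legacy: Optional[bool] = None):
    """Pick the returned cache format during an assumed deprecation window."""
    if return_legacy is None:
        # Unset flag: keep the old behaviour for now, but announce the change.
        warnings.warn(
            "The default return type will change to a cache object; "
            "set `return_legacy` explicitly to silence this warning.",
            FutureWarning,
        )
        return_legacy = True
    return cache.to_legacy() if return_legacy else cache
```

Users who set the flag explicitly get the format they asked for and never see the warning; only callers relying on the default are told that it is about to change.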

gante · 2024-08-17T15:26:32Z

tests/generation/test_utils.py

@@ -194,6 +194,7 @@ def _greedy_generate(
        output_attentions=False,
        output_hidden_states=False,
        return_dict_in_generate=False,
+        use_cache=True,


Changes in this file: `use_cache` is now passed to `generate` directly, instead of relying on `model.config.use_cache=False` and its side effects.

Also added a check to confirm that the cache is `None` when we pass `use_cache=False`.

ArthurZucker

Cool! Let's make sure slow tests all pass as well here!

ArthurZucker · 2024-08-22T13:55:44Z

src/transformers/generation/utils.py

+        # TODO(joao): support static caches in assisted generation. assisted generation needs to roll back caches,
+        # which is only supported in dynamic caches atm


Let's create an issue and leave it up to the community in the meantime!
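
For context on the TODO above, here is a toy sketch of what "rolling back" a growable cache means in assisted generation. It is an illustration only: `ToyDynamicCache`, `crop` and `verify_draft` are hypothetical names, and the cache stores plain integers rather than key/value tensors.

```python
from typing import List


class ToyDynamicCache:
    """Hypothetical growable cache: one entry per processed token."""

    def __init__(self) -> None:
        self.tokens: List[int] = []

    def update(self, new_tokens: List[int]) -> None:
        self.tokens.extend(new_tokens)

    def crop(self, max_length: int) -> None:
        # Rolling back is just truncation when storage grows with the sequence.
        del self.tokens[max_length:]


def verify_draft(cache: ToyDynamicCache, draft: List[int], num_accepted: int) -> None:
    """Append drafted tokens, then discard the ones the main model rejected."""
    start = len(cache.tokens)
    cache.update(draft)
    cache.crop(start + num_accepted)
```

In this toy version, entries for rejected draft tokens are simply dropped from the end of the list; the TODO notes that this kind of rollback is only supported in dynamic caches at the moment.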

gante changed the title ~~Generate: Update cache initialization~~ Generate: Deprecate returning legacy cache by default; Handle use_cache=False Aug 17, 2024

gante requested a review from ArthurZucker August 17, 2024 15:26

ArthurZucker approved these changes Aug 22, 2024

gante added 8 commits August 22, 2024 15:08

tmp

13c07d0

organize cache init

a6611e7

fix conflict

69bf5f4

update tests

d3c3e5a

handle corner cases

bc5b50a

special models

8958d49

whisper is special

914a7ea

make fixup :D

e8492bf

gante force-pushed the update_cache_kwargs branch from a442522 to e8492bf Compare August 22, 2024 15:09

PR comments

550f7a6

gante merged commit a26de15 into huggingface:main Aug 22, 2024
23 checks passed

gante deleted the update_cache_kwargs branch August 22, 2024 19:01

gante mentioned this pull request Aug 23, 2024

Fix: StaticCache & inputs_embeds #32932

Merged

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024

Generate: Deprecate returning legacy cache by default; Handle `use_ca…

210ba5b

…che=False` (huggingface#32863)

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024

Generate: Deprecate returning legacy cache by default; Handle `use_ca…

4e0112a

…che=False` (huggingface#32863)

itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024

Generate: Deprecate returning legacy cache by default; Handle `use_ca…

2226b66

…che=False` (huggingface#32863)

gcervantes8 mentioned this pull request Oct 1, 2024

Whisper Scoring Model Saving Errors due to Config+GenerationConfig #33845

Open

4 tasks
