[generation] Support cache-cropping methods #35591
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
As it solves the issue for #35168 with no major code changes on the KVPress side, LGTM.
My only concern is that it is slightly breaking if users were somehow relying on this right-side slicing feature before.
@ArthurZucker, this draft PR makes a small modeling change to enable an advanced cache feature, as used e.g. in Nvidia's KVPress library -- see the PR header for full details. If you agree with these changes, I will propagate them to other models before requesting a final review :)
Hey! Sorry but I don't think it makes sense to do this!
It's a very niche case, and is not transparent for people who just want to see how llama works!
@ArthurZucker makes sense. In fact, downstream libraries can implement their own custom attention implementation and crop the mask as they wish there. We would be able to unblock this cool use case in other libs if we merge the non-modeling change, i.e. make the … EDIT: chatted on Slack, moving forward with the non-model changes
Force-pushed from 3a58fca to cf263b8
@@ -1037,7 +1037,7 @@ def _update_causal_mask(
         target_length = (
             attention_mask.shape[-1]
             if isinstance(attention_mask, torch.Tensor)
-            else past_seen_tokens + sequence_length + 1
+            else past_seen_tokens + sequence_length
This comment applies to all models: the `+1` is not needed, since we discard its corresponding data in the attn forward pass anyway (e.g. here).
By removing it, we should expect a minimal improvement w.r.t. memory and inference time. We also ensure that the causal mask has exactly the length of the (cached) input, which is a requirement for the downstream use case this PR wants to enable.
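For context, here is a minimal sketch of the slicing step referenced above; the names `causal_mask` and `key_states` and the tensor layout are illustrative, not the verbatim modeling code:

```python
import torch


def slice_mask_to_kv_length(causal_mask: torch.Tensor, key_states: torch.Tensor) -> torch.Tensor:
    # causal_mask: [batch, 1, query_len, target_len]; key_states: [batch, heads, kv_len, head_dim].
    # The attention forward only consumes the first kv_len mask columns, so the extra
    # column produced by the old `+ 1` was always discarded at this point.
    return causal_mask[:, :, :, : key_states.shape[-2]]
```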
Btw, I am working with Mllama now and remembered it has a weird cache where cross-attention layers have a very large length. Can we make sure mllama generation tests don't fail after this change?
@zucchini-nlp correct, it does change. I'm adding a test to catch more models like this. EDIT: mllama is the only model that fails this test :) btw, …
That is too much for the … Currently I am removing the skip from many generation tests that checked … For other models: on the VLM side that's the only one, and maybe ImageGPT is a bit peculiar. On the LLM side, all I can think of is the Hybrid Cache models, where the sliding-window length is fixed and can be lower/higher than the static layers.
Force-pushed from 0a27d11 to d17ee05
What does this PR do?
Adds the requirements to solve #35168
[EDITED after the discussion ending on this comment]
New dynamic cache compression methods may crop the cache differently at each layer. Doing so breaks our implicit assumption about the shape of the dynamic cache, namely that all layers have the same length. As we can see in the issue tagged above, cropping layers to different lengths currently raises an exception.
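As an illustration only, here is a sketch of the failing pattern. It assumes a tiny test checkpoint and the `DynamicCache.key_cache` / `value_cache` per-layer tensor lists, and the per-layer lengths are made up (a real compression method such as KVPress would choose them):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Tiny checkpoint used purely for illustration.
model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
cache = DynamicCache()
model(**inputs, past_key_values=cache, use_cache=True)  # prefill populates the cache

# Crop each layer to a *different* length, mimicking a per-layer compression decision.
for i in range(len(cache.key_cache)):
    keep = max(1, cache.key_cache[i].shape[-2] - i)
    cache.key_cache[i] = cache.key_cache[i][..., :keep, :]
    cache.value_cache[i] = cache.value_cache[i][..., :keep, :]

# Resuming generation with this unevenly cropped cache is what raised the exception,
# since the causal mask assumed every layer shares the same cached length.
```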
We can enable it with small code changes in downstream libraries if we prepare the causal mask based on the maximum sequence length seen across all layers in the cache. A downstream library would then have to implement a custom attention forward pass to left-crop the attention mask accordingly (e.g. using the length of the `key`); see the sketch after the test list below.

The following tests were run, with no regressions compared to `main`:

RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -vv
RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py -vv
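To make the downstream side concrete, below is a hedged sketch of the mask cropping that such a custom attention forward could apply before computing attention weights. This is not KVPress code; the helper name and the right-aligned reading of "left-crop" are assumptions:

```python
from typing import Optional

import torch


def left_crop_mask(attention_mask: Optional[torch.Tensor], key_states: torch.Tensor) -> Optional[torch.Tensor]:
    """Keep only the rightmost mask columns matching this layer's (possibly compressed) key length.

    attention_mask: [batch, 1, query_len, max_cache_len], built from the longest layer in the cache.
    key_states:     [batch, num_kv_heads, kv_len, head_dim] for the current layer.
    """
    if attention_mask is None:
        return None
    kv_len = key_states.shape[-2]
    # Drop the leftmost columns so the mask width matches this layer's kv_len,
    # instead of the default right-side slice attention_mask[..., :kv_len].
    return attention_mask[:, :, :, -kv_len:]
```

A downstream library would call something like this inside its own attention implementation (e.g. one registered via `ALL_ATTENTION_FUNCTIONS`), so that each layer sees a mask matching its own cropped cache length.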