
Make Cache a subclass of torch.Tensor #35792

Open · wants to merge 17 commits into main

Conversation

IlyasMoutawwakil (Member)

What does this PR do?

Both TorchScript tracing and torch dynamo/FX place restrictions on input types (TorchScript more so), which makes export fail when one torch module (the model) passes another (the cache) around as its input. Making Cache a subclass of torch.Tensor bypasses these issues and, in my opinion, makes more sense, since the Cache class has no forward and is just a container of torch tensors.
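
To illustrate the input-type restriction (an editorial sketch, not code from this PR; the container and model below are hypothetical stand-ins): torch.jit.trace only accepts tensors, or nested tuples/lists/dicts of tensors, as example inputs, so handing it a plain Python container object is expected to fail.

    import torch

    class PlainContainer:
        # hypothetical stand-in for a cache-like object that is not a Tensor
        def __init__(self, tensors):
            self.tensors = tensors

    class ToyModel(torch.nn.Module):
        def forward(self, x, cache):
            return x + cache.tensors[0]

    x = torch.randn(2, 3)
    cache = PlainContainer([torch.randn(2, 3)])

    try:
        torch.jit.trace(ToyModel(), (x, cache))
    except Exception as e:
        # tracing rejects example inputs that are not tensors (or nested
        # containers of tensors), which is the failure this PR works around
        print(type(e).__name__, e)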

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gante (Member) left a comment


In principle LGTM. I'm calling in our torch.export <> transformers expert to review and double-check that these changes are also okay for that goal 🤗

Question: the Cache object holds a list of tensors, usually a pair of tensors per layer. In some cases, different tensors of a cache can be on different devices. Would this conflict with the new inheritance?

Double-checks:

  1. Have you confirmed that slow llama tests and slow cache tests have no regressions with respect to main? (RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py -vv and RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -vv)
  2. Have you confirmed that llama + static cache + compilation preserves throughput? (can share a script if needed :) )
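
For reference, here is a minimal sketch of what such a throughput check could look like (an editorial illustration, not the script mentioned above; the checkpoint and token counts are placeholders):

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

    # static cache + compiled forward, following the documented torch.compile recipe
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    inputs = tokenizer("The theory of relativity states", return_tensors="pt").to(model.device)
    for _ in range(2):  # warmup, triggers compilation
        model.generate(**inputs, max_new_tokens=64, do_sample=False)

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/s")

Running it on this branch and on main with the same settings gives a rough before/after comparison.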

src/transformers/cache_utils.py
{},
proxy_factory_fn=create_cache_proxy_factory_fn(StaticCache),
)
# def create_cache_proxy_factory_fn(orig_cache_cls: Type[Cache]) -> Callable[[Node], HFCacheProxy]:
Member:
This is for optimum and you're part of optimum, so I'm assuming it's okay :D

Member Author (IlyasMoutawwakil):
Yeah, I'm not sure why this was needed either; tagging @echarlaix @mht-sharma for more info.

Contributor:
Not sure either

Collaborator:
Adding @michaelbenayoun, who worked on this.

@gante (Member) commented Jan 20, 2025

@guangy10 as requested on Slack, have a look if you're available 🙏

@guangy10 (Contributor)

For correctness testing: we don't have extensive tests, but we do have some correctness guarantees for supported models via test_export_static_cache (pointer). Can you run the slow tests on this PR?

Also, I'm not exactly sure the StaticCache will function as expected, because with nn.Module the cache is registered as a mutable buffer and lifted to a graph input during export, and I'm curious how that works with a tensor subclass. It seems tensor subclasses do not directly support buffer registration the way nn.Module does. Can we compare the exported graph of the nn.Module solution vs. the tensor subclass solution?

Alternatively, since the motivation is to handle legacy TorchScript tracing (I assume traffic on this path will keep decreasing over time), would it be a cleaner separation to create a dedicated Cache subclass for it, while keeping the one for PyTorch 2.0+ as an nn.Module? That way there is no need to maintain compatibility with the TorchScript solution.
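
To make the buffer-lifting point concrete, here is a minimal, hypothetical sketch (a toy stand-in, not the real StaticCache) of how an nn.Module-registered buffer is lifted to a graph input by torch.export and its in-place mutation recorded:

    import torch
    from torch.export import export

    class ToyCacheModule(torch.nn.Module):
        # hypothetical nn.Module-based cache: the storage is a registered buffer
        def __init__(self, seq_len=8, dim=4):
            super().__init__()
            self.register_buffer("key_cache", torch.zeros(seq_len, dim))

        def forward(self, new_keys):
            self.key_cache.copy_(new_keys)  # in-place mutation of the buffer
            return self.key_cache + 1

    ep = export(ToyCacheModule(), (torch.randn(8, 4),))
    # the buffer shows up among the lifted inputs, and its mutation is tracked
    print(ep.graph_signature)

Comparing this signature against what the tensor-subclass version produces would answer the question above.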

@IlyasMoutawwakil (Member Author) commented Jan 22, 2025

Question: the Cache object holds a list of tensors, usually a pair of tensors per layer. In some cases, different tensors of a cache can be on different devices. Would this conflict with the new inheritance?

Shouldn't be an issue, as we're not using _make_subclass() but rather _make_wrapper_subclass(); the difference is explained by @albanD:

These two functions do quite different things. The main difference is that when you do _make_subclass(), the current object is a honest to goodness Tensor with data in its storage and everything. When you do _make_wrapper_subclass(), the current object has no data and it is expected that some field on the Tensor will be another Tensor (hence the outer one being called wrapper) that contains real data.
(from https://dev-discuss.pytorch.org/t/whats-the-difference-between-torch-tensor-make-subclass-and-torch-tensor-make-wrapper-subclass/1839)

One example is the QuantizedTensor subclass, which has two dtypes (a public one, qt.dtype, and an internal one, qt._data.dtype).
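
A minimal sketch of that difference (an editorial illustration with hypothetical class names; the actual Cache implementation in this PR will differ):

    import torch

    class RealSubclass(torch.Tensor):
        pass

    class WrapperSubclass(torch.Tensor):
        @staticmethod
        def __new__(cls, inner):
            # the wrapper holds no storage of its own; it only mirrors the metadata
            return torch.Tensor._make_wrapper_subclass(
                cls, inner.shape, dtype=inner.dtype, device=inner.device
            )

        def __init__(self, inner):
            self.inner = inner  # the real data lives on an attribute

        @classmethod
        def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
            # ops on the wrapper itself are out of scope for this illustration
            raise NotImplementedError(f"{func} is not supported in this sketch")

    data = torch.arange(6.0).reshape(2, 3)
    real = torch.Tensor._make_subclass(RealSubclass, data)  # a genuine tensor sharing data's storage
    wrapped = WrapperSubclass(data)                          # a storage-less shell around data

    print(torch.equal(real, data))             # True: `real` carries the data itself
    print(type(wrapped), wrapped.inner.shape)  # the wrapper's data is only on .inner

Because the wrapper carries no storage of its own, each inner tensor can keep its own device and dtype, which is why caches spread across devices are not a problem.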

Have you confirmed that slow llama tests and slow cache tests have no regressions with respect to main? (RUN_SLOW=1 py.test tests/models/llama/test_modeling_llama.py -vv and RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -vv)
Have you confirmed that llama + static cache + compilation preserves throughput? (can share a script if needed :) )

Running them right now (by the way, is there a way to trigger them on the CI?). So far I was only running the llama fast tests and the llama + ExecuTorch integration tests.

@IlyasMoutawwakil (Member Author) commented Jan 22, 2025

Edit: confirmed that these two tests fail on main as well.

Running RUN_SLOW=1 pytest tests/models/llama/test_modeling_llama.py -vv gives two errors, which I guess are related to the machine I'm testing on (an A100 vs. the A10 used in the CI):

FAILED tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_llama_3_1_hard - AssertionError: 'Tell[74 chars]ical social and political upheaval in France t[557 chars]s.\n' != 'Tell[74 chars]ical political...
FAILED tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits_bf16 - AssertionError: False is not true

In the first, "social and political" is reversed to "political and social":

E       AssertionError: 'Tell[74 chars]ical social and political upheaval in France t[557 chars]s.\n' != 'Tell[74 chars]ical political and social upheaval in France t[557 chars]s.\n'
E       Diff is 1259 characters long. Set self.maxDiff to None to see it.

In the second, the assertion is not verbose enough:

>       self.assertTrue(
            torch.allclose(
                EXPECTED_MEAN[self.cuda_compute_capability_major_version].to(torch_device),
                out.logits.float().mean(-1),
                atol=1e-2,
                rtol=1e-2
            )
        )
E       AssertionError: False is not true

adding some verbosity:
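(presumably via a msg argument to assertTrue; the snippet below is a hypothetical reconstruction, not the exact change)

    expected = EXPECTED_MEAN[self.cuda_compute_capability_major_version].to(torch_device)
    got = out.logits.float().mean(-1)
    self.assertTrue(
        torch.allclose(expected, got, atol=1e-2, rtol=1e-2),
        msg=f"Expected: {expected}\nGot: {got}",
    )

which yields: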

E       AssertionError: False is not true : Expected: tensor([[-6.5208, -4.1218, -4.9377, -3.2536,  0.8127, -2.9811,  1.2918, -3.3848]],
E              device='cuda:0')
E       Got: tensor([[-6.5081, -4.1175, -4.9761, -3.1678,  0.8199, -3.0029,  1.2809, -3.3309]],
E              device='cuda:0')
