Add Support for Gradient Checkpointing #759
Conversation
- overwrite gradient_checkpointing_enable to provide our ForwardContext during the recomputation of values during backpropagation - 2 bugs remaining: bottleneck adapter for models with the legacy implementation (BERT) & Parallel. Parallel has the problem that we manipulate the batch dimension, which currently leads to an error
force-pushed from 5edfd4d to 94df2fe
docs & style & fixes - albert: skip unsupported tests - deberta(V2): fix embedding bug with in-place operations - deberta: fix LoRAMergedLinear bug with device mismatch
Thanks a lot for digging into this and enabling compatibility w adapters, this looks great! Just a couple of small comments before we're good to merge
@@ -49,6 +50,60 @@ def forward(self, hidden_states, input_tensor):
        return hidden_states


# Copied from transformers.models.deberta.modeling_deberta.DebertaEmbeddings with DebertaLayerNorm->LayerNorm,Deberta->DebertaV2
class DebertaV2EmbeddingsWithAdapters(DebertaV2Embeddings):
to clarify: is this fixing something specific to adapter training or just generally for the model with gradient checkpointing? If the latter, I think we shouldn't copy the class here but try to fix upstream in Transformers directly (same for Deberta-v1)
It's not specific to our library. It is a bug in HF Transformers that occurs when one calls model.enable_input_require_grads() before training. Input grads are, AFAIK, required when training adapters; at least HF advises doing this. This is from the docstring of the enable_input_require_grads function:
Enables the gradients for the input embeddings. This is useful for fine-tuning adapter weights while keeping the model weights fixed.
I will open a PR in the HF Transformers repo this week to fix this bug for Deberta(V2). Since it's an HF Transformers bug, it doesn't work with PEFT currently either; the following script throws the same error we encountered:
from transformers import DebertaConfig, DebertaForSequenceClassification
from peft import get_peft_model, LoraConfig, TaskType
import torch
# Create a minimal DeBERTa config
config = DebertaConfig(
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
relative_attention=True,
pos_att_type="p2c|c2p"
)
# PEFT model
model = DebertaForSequenceClassification(config)
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, peft_config)
# Enable input gradients
model.enable_input_require_grads()
model.train()
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Create a simple input tensor (contents don't matter) & forward pass
batch_size = 2
seq_length = 10
input_ids = torch.ones((batch_size, seq_length), dtype=torch.long).to(device)
attention_mask = torch.ones_like(input_ids).to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
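As a purely generic illustration of the in-place issue mentioned in the commit above ("deberta(V2): fix embedding bug with in-place operations"), and not the actual Transformers or adapters code, the pattern behind such a fix is to replace an in-place tensor update with its out-of-place equivalent, so autograd can track the computation once the embedding output requires gradients:

import torch

def apply_mask(embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not taken from Transformers.
    # In-place variant: `embeddings *= mask` mutates a tensor that autograd may
    # still need, which can raise an in-place modification error once
    # enable_input_require_grads() makes the embedding output require gradients.
    # Out-of-place variant: allocate a new tensor instead.
    return embeddings * mask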
config = self.config()
state_dict_after_training = {}

for train_with_checkpointing in [True, False]:
do we need to test both True and False here given we have other train tests for standard training?
We train once with and once without gradient checkpointing and compare the state dicts afterwards to ensure that gradient checkpointing leads to the same weight updates. I will add a short comment to explain this.
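For illustration, a comparison along these lines could look as follows; build_model and train_one_epoch are hypothetical stand-ins for the test utilities, not the actual test code:

import torch

def check_checkpointing_matches_standard_training(build_model, train_one_epoch):
    # Hypothetical helpers: build_model() returns a freshly initialized model,
    # train_one_epoch(model) runs a short, deterministic training loop.
    state_dicts = {}
    for train_with_checkpointing in [True, False]:
        torch.manual_seed(42)  # identical initialization and data order in both runs
        model = build_model()
        if train_with_checkpointing:
            model.gradient_checkpointing_enable()
        train_one_epoch(model)
        state_dicts[train_with_checkpointing] = {
            k: v.detach().clone() for k, v in model.state_dict().items()
        }
    # Checkpointing only changes how activations are obtained during backprop,
    # so both runs should end up with (numerically) identical weights.
    for key, value in state_dicts[True].items():
        assert torch.allclose(value, state_dicts[False][key], atol=1e-5), key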
@@ -1617,6 +1621,70 @@ def save_pretrained(
        # Remove adapters config
        del self.config.adapters

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
Can we add a small note on where this method is overridden from?
Yes, will do this 👍
Hey @calpt, can you quickly review my replies to your comments and approve the PR if everything is alright, so I can merge this? We need to merge this PR first, then add all new tests to the test refactoring #740, and then we can merge #763 (because Julian has already merged the test refactoring into his PR). So, currently, this PR is blocking everything we want to have in the next release.
Add Support for Gradient Checkpointing
This PR adds support for gradient checkpointing. Gradient checkpointing is a technique that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them. This is particularly useful when training large models. Because we recompute values during backpropagation, we need to preserve the original ForwardContext in this phase. I solved this by overwriting the gradient_checkpointing_enable function so that the checkpoint function receives the current ForwardContext as the backward pass context manager.
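A minimal sketch of what such an override could look like, assuming PyTorch's non-reentrant checkpointing with a context_fn, Transformers' internal _set_gradient_checkpointing hook, and a ForwardContext.get_context() accessor; the actual PR code may differ:

import functools
from contextlib import nullcontext

import torch.utils.checkpoint
from adapters.context import ForwardContext  # assumed import path


# Method override sketch (would live on the adapter model class/mixin).
def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
    if gradient_checkpointing_kwargs is None:
        # context_fn is only supported with non-reentrant checkpointing.
        gradient_checkpointing_kwargs = {"use_reentrant": False}

    def make_contexts():
        # The first context manager wraps the original forward pass, the second
        # wraps the recomputation that happens during the backward pass.
        context = ForwardContext.get_context()  # assumed accessor
        return nullcontext(), (context if context is not None else nullcontext())

    checkpoint_fn = functools.partial(
        torch.utils.checkpoint.checkpoint,
        context_fn=make_contexts,
        **gradient_checkpointing_kwargs,
    )
    # Transformers' internal hook that installs the checkpointing function on all modules.
    self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=checkpoint_fn)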