Integrate Habana flash attention into Llama2-70B finetune #596

Merged: 2 commits from mandy/llama2_fa into main on Dec 14, 2023

Conversation

mandy-li (Collaborator)

As per the title.

mandy-li requested a review from regisss as a code owner on December 13, 2023, 01:06.
mandy-li added the run-test label (Run CI for PRs from external contributors) on Dec 13, 2023.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss (Collaborator) left a comment

LGTM!

I ran the command line from the example with and without --use_flash_attention:

  • Without:
    max_memory_allocated (GB)   =      73.87
    memory_allocated (GB)       =       21.6
    
  • With:
    max_memory_allocated (GB)   =      76.01
    memory_allocated (GB)       =      26.91
    

Memory consumption should be lower with flash attention, no?
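
For context on the two figures quoted in each run: they are the current and peak device allocations on the Gaudi card. A minimal sketch of how they can be read, assuming `habana_frameworks.torch.hpu` exposes CUDA-style memory counters (`memory_allocated` / `max_memory_allocated`), as recent SynapseAI releases do; verify the names against your version:

```python
# Minimal sketch: reading the two HPU memory counters quoted above.
# Assumes habana_frameworks.torch.hpu exposes CUDA-style memory APIs
# (memory_allocated / max_memory_allocated); verify for your release.
import habana_frameworks.torch.hpu as hthpu

GB = 1024 ** 3

def report_hpu_memory() -> None:
    # Bytes currently allocated on the HPU, converted to GB.
    print(f"memory_allocated (GB)     = {hthpu.memory_allocated() / GB:.2f}")
    # Peak bytes allocated since process start (or last peak-stats reset).
    print(f"max_memory_allocated (GB) = {hthpu.max_memory_allocated() / GB:.2f}")
```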

mandy-li requested a review from a user on December 13, 2023, 21:32.
mandy-li (Collaborator, Author)

@regisss, that is a good observation. With recompute disabled in flash attention, performance should be better, but memory consumption is higher. I made another commit that adds a recompute argument to the script so users can configure that trade-off (see the sketch after the table below).

Here are performance and memory usage comparisons:

| Configuration | train_runtime (s) | train_samples_per_second | max_memory_allocated (GB) | memory_allocated (GB) |
|---|---|---|---|---|
| w/o Habana Flash Attention | 2669.7156 | 2.458 | 73.99 | 21.6 |
| w/ Habana Flash Attention, recompute=False | 2625.176 | 2.499 | 76.39 | 26.91 |
| w/ Habana Flash Attention, recompute=True | 2839.2379 | 2.358 | 67.42 | 21.53 |
| w/ Habana Flash Attention, recompute=True, PT_HPU_SDPA_BATCH_NUMHEADS_SLICE=0 | 2675.4069 | 2.45 | 77.83 | 31.93 |
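
To make the recompute trade-off concrete: with recompute enabled, the fused kernel recomputes attention intermediates in the backward pass instead of keeping them resident, which is consistent with the lower max_memory_allocated but higher train_runtime in the recompute=True row. Below is a minimal sketch of how such a toggle can wrap Habana's fused SDPA kernel; the `FusedSDPA` import and the `sdp_kernel(enable_recompute=...)` context manager are assumptions to check against your SynapseAI release, not a quote of this PR's diff:

```python
# Minimal sketch: gating Habana's fused SDPA kernel with a recompute toggle.
# FusedSDPA and ht.sdp_kernel(enable_recompute=...) are assumed to exist as
# in recent SynapseAI releases; verify names/signatures for your version.
import habana_frameworks.torch.hpu as ht
from habana_frameworks.torch.hpex.kernels import FusedSDPA


def fused_attention(query, key, value, attn_mask, recompute: bool = True):
    # recompute=True: drop attention intermediates in the forward pass and
    # recompute them in backward: lower peak memory, slightly slower step.
    with ht.sdp_kernel(enable_recompute=recompute):
        # Positional args: (q, k, v, attn_mask, dropout_p, is_causal).
        return FusedSDPA.apply(query, key, value, attn_mask, 0.0, False)
```

The last table row additionally sets PT_HPU_SDPA_BATCH_NUMHEADS_SLICE=0 in the environment, which, as the name suggests, appears to control how the kernel slices work across the batch and num-heads dimensions; that knob lives outside the Python API.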

regisss (Collaborator) left a comment
LGTM!

regisss merged commit 28a9646 into main on Dec 14, 2023 (9 checks passed).
regisss deleted the mandy/llama2_fa branch on December 14, 2023.