
training benchmarking question: unable to reproduce the training speed #70

Open
tong-zeng opened this issue May 17, 2023 · 0 comments
Hi @s-JoL, thank you very much for your efforts in making llama open to the community.
I ran into some questions while benchmarking the training speed:

My environment:
One node with 8× A100 (80 GB), 128 CPU cores, 2 TB RAM, Ubuntu 20.04, CUDA 11.7, PyTorch 2.0, xFormers 0.0.19

Problem: the GPUs run out of memory when running the utils/speed_test/lightning/run.py script. I have also tried the accelerate script at utils/speed_test/accelerate/run.sh, but it OOMs as well.

According to the lightning and accelerate scripts, it seems you are using 8 GPUs:

# excerpt from utils/speed_test/lightning/run.py
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

batch_size = 2
seq_length = 2048
vocab_size = 32000
total_step = 100
use_activation_ckpt = False

# ZeRO stage 2: shard gradients and optimizer states, keep parameters replicated
strategy = DeepSpeedStrategy(
    stage=2,
    offload_optimizer=False,
    offload_parameters=False,
    process_group_backend="nccl",
)
trainer = pl.Trainer(
    limit_train_batches=total_step,
    max_epochs=1,
    devices=8,            # 8 GPUs on a single node
    accelerator="gpu",
    strategy=strategy,
    precision=16,         # fp16 mixed precision
    enable_checkpointing=False,
)
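
For context, here is my rough back-of-the-envelope estimate of the per-GPU memory under this config. This is only a sketch under my own assumptions (a ~7B-parameter model, standard ZeRO stage-2 partitioning, Adam with fp16 mixed precision), so please correct me if any of them are wrong:

# Rough per-GPU memory estimate -- my assumptions, not taken from the repo:
# ~7B parameters, ZeRO stage 2 (fp16 params replicated; gradients and
# Adam states sharded across GPUs), fp32 master weights + m + v = 12 B/param.
n_params = 7e9
n_gpus = 8

fp16_params = 2 * n_params              # replicated on every GPU
fp16_grads = 2 * n_params / n_gpus      # sharded by ZeRO-2
optim_states = 12 * n_params / n_gpus   # sharded by ZeRO-2

per_gpu_gib = (fp16_params + fp16_grads + optim_states) / 1024**3
print(f"~{per_gpu_gib:.1f} GiB per GPU before activations")  # roughly 24 GiB

If that is roughly right, there should still be ~50 GiB per A100 left for activations and buffers, so I suspect it is the activation memory of the forward/backward pass (with use_activation_ckpt = False) that pushes it over, but I do not see why that would differ between our setups.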

I assume these configs work on your side, but I get OOM with exactly the same script. So I wonder: are there any other factors that could cause the difference? Am I missing anything?
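
In case it helps narrow things down, these are the memory-related knobs I can think of toggling. This is only a sketch based on the Lightning / DeepSpeed documentation, not something taken from your scripts, and each of them trades speed for memory, so they would no longer reproduce the benchmark numbers as-is:

# Possible memory-saving variations (my own sketch, not from the repo):
use_activation_ckpt = True           # same flag the speed_test script exposes

strategy = DeepSpeedStrategy(
    stage=3,                         # also shard parameters, not only grads/optimizer states
    offload_optimizer=True,          # push optimizer states to CPU RAM (2 TB available here)
    offload_parameters=True,         # requires stage 3
    process_group_backend="nccl",
)

But since the goal is to reproduce the reported training speed with the original settings, I would rather understand why the original config runs out of memory on my machine.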

This is a screenshot taken right after model initialization:
[screenshot]

However, the GPUs run out of memory once training starts:
[screenshots]

Thank you!
