
training benchmarking question: unable to reproduce the training speed #70

Open
tong-zeng opened this issue May 17, 2023 · 0 comments
Hi @s-JoL, thank you very much for your efforts in making llama open to the community.
I ran into some questions while benchmarking the training speed:

My environment:
One node with 8× A100 (80 GB), 128 CPU cores, 2 TB RAM, Ubuntu 20.04, CUDA 11.7, PyTorch 2.0, xFormers 0.0.19

Problem: the GPUs run out of memory when running the utils/speed_test/lightning/run.py script. I have also tried the accelerate script at utils/speed_test/accelerate/run.sh, but it OOMs as well.

According to the lightning and accelerate scripts, it seems you are using 8 GPUs:

# excerpt from utils/speed_test/lightning/run.py
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

batch_size = 2
seq_length = 2048
vocab_size = 32000
total_step = 100
use_activation_ckpt = False

# ZeRO stage 2: shard gradients and optimizer states, keep parameters replicated
strategy = DeepSpeedStrategy(
    stage=2,
    offload_optimizer=False,
    offload_parameters=False,
    process_group_backend="nccl",
)
trainer = pl.Trainer(
    limit_train_batches=total_step,
    max_epochs=1,
    devices=8,            # 8 GPUs on a single node
    accelerator="gpu",
    strategy=strategy,
    precision=16,         # fp16 mixed precision
    enable_checkpointing=False,
)
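
For context, here is my rough back-of-the-envelope estimate of the per-GPU memory under this config. This is only a sketch under my own assumptions (a ~7B-parameter model, standard ZeRO stage-2 partitioning, Adam with fp16 mixed precision), so please correct me if any of them are wrong:

# Rough per-GPU memory estimate -- my assumptions, not taken from the repo:
# ~7B parameters, ZeRO stage 2 (fp16 params replicated; gradients and
# Adam states sharded across GPUs), fp32 master weights + m + v = 12 B/param.
n_params = 7e9
n_gpus = 8

fp16_params = 2 * n_params              # replicated on every GPU
fp16_grads = 2 * n_params / n_gpus      # sharded by ZeRO-2
optim_states = 12 * n_params / n_gpus   # sharded by ZeRO-2

per_gpu_gib = (fp16_params + fp16_grads + optim_states) / 1024**3
print(f"~{per_gpu_gib:.1f} GiB per GPU before activations")  # roughly 24 GiB

If that is roughly right, there should still be ~50 GiB per A100 left for activations and buffers, so I suspect it is the activation memory of the forward/backward pass (with use_activation_ckpt = False) that pushes it over, but I do not see why that would differ between our setups.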

I assume these configs work on your side, but I get OOM with exactly the same script. So I wonder: are there any other factors that could cause the difference? Am I missing anything?
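
In case it helps narrow things down, these are the memory-related knobs I can think of toggling. This is only a sketch based on the Lightning / DeepSpeed documentation, not something taken from your scripts, and each of them trades speed for memory, so they would no longer reproduce the benchmark numbers as-is:

# Possible memory-saving variations (my own sketch, not from the repo):
use_activation_ckpt = True           # same flag the speed_test script exposes

strategy = DeepSpeedStrategy(
    stage=3,                         # also shard parameters, not only grads/optimizer states
    offload_optimizer=True,          # push optimizer states to CPU RAM (2 TB available here)
    offload_parameters=True,         # requires stage 3
    process_group_backend="nccl",
)

But since the goal is to reproduce the reported training speed with the original settings, I would rather understand why the original config runs out of memory on my machine.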

This is a screenshot taken right after model initialization:
[screenshot]

However, the GPUs run out of memory once training starts:
[screenshots]

Thank you!
