Slow XLA training performance. #8541
Comments
Hi, in your script I didn't see anything that measures time. If you are measuring the time of the entire script, then in XLA's case it would include the time of tracing and compilation.
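One way to separate compilation cost from steady-state throughput is to time each epoch individually and synchronize the device before stopping the clock. A minimal sketch, assuming a `train_one_epoch` callable and an `xla` flag like in the script discussed below (both placeholder names, not from the original script):

```python
import time

import torch
import torch_xla.core.xla_model as xm  # only needed for the XLA path


def timed_epochs(train_one_epoch, num_epochs, xla):
    """Time each epoch separately so the first (compile-heavy) epoch
    can be excluded when comparing steady-state performance."""
    for epoch in range(num_epochs):
        start = time.perf_counter()
        train_one_epoch()
        # Make sure all queued device work has finished before reading the clock.
        if xla:
            xm.wait_device_ops()
        else:
            torch.cuda.synchronize()
        print(f"epoch {epoch}: {time.perf_counter() - start:.2f}s")
```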
Hi, just sharing some numbers here:
PyTorch CUDA device:
PyTorch/XLA device:
After the first epoch, in which compilation takes place, the PyTorch native device is still 1.7x faster on my GPU. Note that even though compilation is no longer an issue after that, tracing continues until the last epoch. One way to eliminate tracing completely would be to use dynamo.
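For reference, PyTorch/XLA exposes its dynamo integration through the `openxla` backend of `torch.compile`. A rough sketch of how a training step is typically compiled with it; `Net` and `train_loader` are placeholders for the MNIST model and data loader in the original script:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = Net().to(device)  # Net: placeholder for the MNIST model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def train_step(data, target):
    optimizer.zero_grad()
    loss = F.nll_loss(model(data), target)
    loss.backward()
    optimizer.step()
    return loss


# Compile the whole step with the dynamo/openxla backend so Python-level
# tracing happens once instead of on every iteration.
train_step = torch.compile(train_step, backend="openxla")

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    loss = train_step(data, target)
```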
I tried modifying the code so as to have only one compiled graph per iteration (i.e. each iteration of the inner loop would trigger only one graph execution). However, the performance was even worse, around …

```python
# ...
for data, target in tqdm(train_loader):
    data, target = data.to(device), target.to(device)
    # ...
    optimizer.step()
    # Moved the mark_step above, so that the `loss` value wouldn't trigger
    # another compilation.
    if xla:
        xm.mark_step()
    running_loss += loss.item()
```
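One thing worth noting about the loop above: `loss.item()` forces a device-to-host transfer every iteration, so the host still waits on the device each step. A hedged sketch of a common workaround, accumulating the loss on the device and materializing it only once per epoch (variable names follow the snippet above):

```python
# Keep the running loss on the device; a single .item() at the end of the
# epoch is the only host sync needed for loss reporting.
running_loss = torch.zeros((), device=device)
for data, target in tqdm(train_loader):
    data, target = data.to(device), target.to(device)
    # ... forward / backward ...
    optimizer.step()
    running_loss += loss.detach()  # stays on the device, no host transfer
    if xla:
        xm.mark_step()
epoch_loss = running_loss.item() / len(train_loader)  # one device-to-host sync
```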
❓ Questions and Help
I'm evaluating PyTorch/XLA for training, but noticed a big degradation in performance compared to the native PyTorch device. Is this a known problem, or is there a problem with the way I use PyTorch/XLA? I tested a simple MNIST training example, comparing the performance of the PyTorch CUDA device and the XLA CUDA device. The native CUDA device is twice as fast.
I'd appreciate any thoughts, suggestions, or links to known performance issues. Thanks!
Environment
Note: there is no difference in the performance measurements with the latest 2.5.0.
How To Reproduce
Run the test program with `xla = True` and with `xla = False`.
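The test program itself isn't reproduced here; a minimal sketch of what the `xla` toggle presumably controls (device selection, plus the per-step `mark_step` shown in the loop above):

```python
import torch

xla = True  # flip to False to run on the native CUDA device

if xla:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
else:
    device = torch.device("cuda")

# Build the MNIST model, optimizer, and train_loader as usual, then run the
# training loop shown above, calling xm.mark_step() once per iteration when
# xla is True.
```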