Slow XLA training performance. #8541
Comments
Hi, in your script I didn't see anything that measures time. If you are measuring the time of the entire script, then in XLA's case it would include the time of tracing and compilation.
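One way to separate compilation cost from steady-state throughput is to time each epoch individually and synchronize the device before stopping the clock. A minimal sketch, assuming a `train_one_epoch` callable and an `xla` flag like in the script discussed below (both placeholder names, not from the original script):

```python
import time

import torch
import torch_xla.core.xla_model as xm  # only needed for the XLA path


def timed_epochs(train_one_epoch, num_epochs, xla):
    """Time each epoch separately so the first (compile-heavy) epoch
    can be excluded when comparing steady-state performance."""
    for epoch in range(num_epochs):
        start = time.perf_counter()
        train_one_epoch()
        # Make sure all queued device work has finished before reading the clock.
        if xla:
            xm.wait_device_ops()
        else:
            torch.cuda.synchronize()
        print(f"epoch {epoch}: {time.perf_counter() - start:.2f}s")
```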
Hi, just sharing some numbers here:
PyTorch CUDA device:
PyTorch/XLA device:
After the first epoch, in which compilation takes place, the PyTorch native device is still 1.7x faster on my GPU. Note that even though compilation is no longer an issue after that, tracing continues until the last epoch. One way to eliminate tracing completely would be to use dynamo.
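For reference, PyTorch/XLA exposes its dynamo integration through the `openxla` backend of `torch.compile`. A rough sketch of how a training step is typically compiled with it; `Net` and `train_loader` are placeholders for the MNIST model and data loader in the original script:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = Net().to(device)  # Net: placeholder for the MNIST model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def train_step(data, target):
    optimizer.zero_grad()
    loss = F.nll_loss(model(data), target)
    loss.backward()
    optimizer.step()
    return loss


# Compile the whole step with the dynamo/openxla backend so Python-level
# tracing happens once instead of on every iteration.
train_step = torch.compile(train_step, backend="openxla")

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    loss = train_step(data, target)
```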
I tried modifying the code so as to have only one compiled graph per iteration (i.e. each iteration of the inner loop would trigger only one graph execution). However, the performance was even worse, around …

```python
# ...
for data, target in tqdm(train_loader):
    data, target = data.to(device), target.to(device)
    # ...
    optimizer.step()
    # Moved the mark_step above, so that the `loss` value wouldn't trigger
    # another compilation.
    if xla:
        xm.mark_step()
    running_loss += loss.item()
```
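One thing worth noting about the loop above: `loss.item()` forces a device-to-host transfer every iteration, so the host still waits on the device each step. A hedged sketch of a common workaround, accumulating the loss on the device and materializing it only once per epoch (variable names follow the snippet above):

```python
# Keep the running loss on the device; a single .item() at the end of the
# epoch is the only host sync needed for loss reporting.
running_loss = torch.zeros((), device=device)
for data, target in tqdm(train_loader):
    data, target = data.to(device), target.to(device)
    # ... forward / backward ...
    optimizer.step()
    running_loss += loss.detach()  # stays on the device, no host transfer
    if xla:
        xm.mark_step()
epoch_loss = running_loss.item() / len(train_loader)  # one device-to-host sync
```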
❓ Questions and Help
I'm evaluating PyTorch/XLA for training, but noticed a big degradation in performance compared to the native PyTorch device. Is this a known problem, or is there a problem with the way I use PyTorch/XLA? I tested a simple MNIST training example, comparing the performance of the PyTorch CUDA device and the XLA CUDA device. The native CUDA device is twice as fast.
I'd appreciate any thoughts, suggestions, or links to known performance issues. Thanks!
Environment
Note: there is no difference in the performance measurements with the latest 2.5.0.
How To Reproduce
Run the test program with `xla = True` and with `xla = False`.
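The test program itself isn't reproduced here; a minimal sketch of what the `xla` toggle presumably controls (device selection, plus the per-step `mark_step` shown in the loop above):

```python
import torch

xla = True  # flip to False to run on the native CUDA device

if xla:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
else:
    device = torch.device("cuda")

# Build the MNIST model, optimizer, and train_loader as usual, then run the
# training loop shown above, calling xm.mark_step() once per iteration when
# xla is True.
```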