
use env to skip PJRT initialize #8609

Merged: 1 commit merged into master on Jan 23, 2025

Conversation

@zpcore (Collaborator) commented Jan 22, 2025

We skip the PJRT MegaScale initialization by controlling it with an environment variable.

This is a temporary fix and is intended to be rolled back.

See #8609 (comment) for the detailed motivation.

@zpcore requested review from tengyifei and bhavya01 on Jan 22, 2025
@tengyifei (Collaborator) left a comment

Suggest landing this after the libtpu change is approved.

Review thread on torch_xla/experimental/custom_kernel.py (resolved)
@zpcore (Collaborator, Author) commented Jan 22, 2025

@tengyifei, shall we cherry-pick this PR to the 2.6 release?

@tengyifei (Collaborator)

@zpcore cherry-picking is fine with me.

@zpcore (Collaborator, Author) commented Jan 23, 2025

The test failed; I think it is due to pytorch/pytorch#142859.

That PR has since been reverted.

@tengyifei (Collaborator)

Ack

@tengyifei merged commit 557d9f3 into master on Jan 23, 2025
11 of 12 checks passed
@tengyifei (Collaborator)

@zpcore thanks. The next step is to follow the process in #8455 to create a cherry-pick PR.

@bhavya01 (Collaborator)

Retrospective LGTM!

zpcore added a commit that referenced this pull request Jan 23, 2025
@zpcore deleted the piz/multipod_hack branch on Jan 23, 2025
@miladm (Collaborator) commented Jan 31, 2025

Thanks @zpcore. Can we please add enough detail to PR descriptions to help folks without context understand the intent of the contribution?

@zpcore (Collaborator, Author) commented Jan 31, 2025

> Thanks @zpcore. Can we please add enough detail to PR descriptions to help folks without context understand the intent of the contribution?

The issue in multipod runs is that MegaScale XLA (MXLA) triggers device discovery when we initialize the PJRT runtime with the TPU backend. With the introduction of Pallas kernels, we trigger MXLA an extra time when calling jax.jit(), so device discovery is executed more than once. Every time device discovery runs, every device is assigned an ID; after the second discovery the device IDs no longer match, which causes device communication to hang.

The hacky way to fix the issue is to use an environment variable so that device discovery is not triggered when calling jax.jit. This PR works together with the fix we made internally in the libtpu source code:

// If SKIP_MEGASCALE_PJRT_CLIENT is set (to any value), skip wrapping the
// TPU client in a MegaScale client.
const char* skip_megascale_pjrt_client =
    std::getenv("SKIP_MEGASCALE_PJRT_CLIENT");
bool skip_megascale = false;
if (skip_megascale_pjrt_client != nullptr) {
  skip_megascale = true;
}
// Only multi-slice jobs get a MegaScalePjRtClient, and the environment
// variable can now opt out even then.
if (absl::GetFlag(FLAGS_megascale_num_slices) != 1 && !skip_megascale) {
  client = xla::MegaScalePjRtClient::CreateMegaScalePjRtClient(
      std::move(tpu_client));
  ...
}
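
On the PyTorch/XLA side, the change then only needs to make the variable visible to libtpu while the Pallas path brings up JAX. A minimal sketch, assuming a temporary guard around the JAX calls (the actual change in this PR lives in torch_xla/experimental/custom_kernel.py, and the helper name here is hypothetical):

import os
from contextlib import contextmanager

@contextmanager
def _skip_megascale_pjrt_client():
    # Hypothetical helper: tell the patched libtpu not to create a
    # MegaScalePjRtClient (and so not to re-run device discovery)
    # while JAX initializes its own PJRT client for the Pallas path.
    old = os.environ.get("SKIP_MEGASCALE_PJRT_CLIENT")
    os.environ["SKIP_MEGASCALE_PJRT_CLIENT"] = "1"
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("SKIP_MEGASCALE_PJRT_CLIENT", None)
        else:
            os.environ["SKIP_MEGASCALE_PJRT_CLIENT"] = old

Wrapping the jax.jit call in such a guard means only the initial torch_xla client creation performs MXLA device discovery.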

With this fix, MegaScalePjRtClient will only be created where intended, e.g. in

devices = torch_xla._XLAC._xla_get_devices()

where we call runtime::GetComputationClient() and initialize the client.
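
Putting the pieces together, the intended flow in a multipod job would look roughly like this (a sketch under the assumptions above, not the literal PR code):

import torch_xla

# One-time PJRT initialization: in a multipod run this goes through
# MegaScalePjRtClient and performs the single device discovery that
# assigns every device its ID.
devices = torch_xla._XLAC._xla_get_devices()

# Later Pallas kernels jitted through JAX run with
# SKIP_MEGASCALE_PJRT_CLIENT set (see the sketch above), so JAX gets a
# plain TPU client and the IDs assigned above stay consistent.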
