Fix the torchax llama405b OOM at model init time #24

tengyifei · 2025-01-10T18:48:18Z

Instead of holding a large global `weight_jax` array, we hold the meta tensor and create a local JAX array whose size and dtype correspond to those of a shard.

The program still runs out of host memory later, while compiling the training step, but that will be addressed separately.
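For readers unfamiliar with the pattern, here is a minimal sketch of the idea in plain JAX. It is not the code from this PR. It assumes a `jax.ShapeDtypeStruct` as a stand-in for the torch meta tensor, a hypothetical one-axis mesh named `fsdp`, a hypothetical helper `init_sharded_param`, and zero-filled placeholder shards. The point is that the host only ever allocates shard-sized buffers, never one buffer of the full global shape.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec


def init_sharded_param(meta, sharding):
    """Build a global jax.Array from shard-sized pieces only.

    `meta` carries just a shape and a dtype (a stand-in for a torch meta
    tensor), so no buffer of the full global size is created on the host.
    """
    # Shape of one shard under this sharding, derived from the global shape.
    shard_shape = sharding.shard_shape(meta.shape)

    def make_shard(index):
        # `index` is the tuple of slices this shard covers in the global
        # array. It is unused here because every shard is a zero-filled
        # placeholder (an assumption of this sketch).
        return np.zeros(shard_shape, dtype=meta.dtype)

    return jax.make_array_from_callback(meta.shape, sharding, make_shard)


# Hypothetical setup: shard dimension 0 across all available devices.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("fsdp",))
sharding = NamedSharding(mesh, PartitionSpec("fsdp", None))

# Shape and dtype only; no data is allocated for the global array here.
meta = jax.ShapeDtypeStruct((8 * len(devices), 4), jnp.float32)
param = init_sharded_param(meta, sharding)
```

`param` reports the global shape, while each addressable shard holds only its own slice. Real weights would replace the zero-filled callback with whatever loads or initializes the slice described by `index`.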

tengyifei requested a review from qihqi January 10, 2025 18:48

qihqi approved these changes Jan 11, 2025

tengyifei merged commit 04cd0a9 into main Jan 11, 2025
6 checks passed

tengyifei deleted the yifeit/torchax-param-oom branch January 26, 2025 07:10
