This PR adds a new sharding strategy, `shard_sequence`, for the transformer processor. The current implementation (`shard_heads`) alternates between sharding across the sequence and sharding across the heads for the sliding-window attention mechanism, which requires two all-to-all communication steps per layer. The `shard_sequence` strategy simplifies this by keeping a sequence shard on each GPU and computing the sliding-window attention locally. This requires a halo exchange to transfer the overlapping window segments (halos) between neighbouring sequence shards.
Instead of two all-to-all communication steps per layer, the halo exchange requires only a single point-to-point communication between neighbouring GPUs, reducing communication time and improving the scalability of model sharding across multiple GPUs.
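The sketch below illustrates the idea with batched point-to-point sends/receives in `torch.distributed`. It is not the code in this PR: the function name `halo_exchange`, the `(sequence, channels)` shard layout, and the flat 1-D rank ordering of sequence shards are illustrative assumptions.

```python
import torch
import torch.distributed as dist


def halo_exchange(x: torch.Tensor, halo: int, group=None) -> torch.Tensor:
    """Pad a local sequence shard with halos from its neighbouring ranks.

    x:    local shard of shape (seq_shard, channels); ranks are assumed to
          hold consecutive, non-overlapping chunks of the full sequence.
    halo: number of overlapping tokens needed by the sliding window.
    Returns the shard concatenated with the received left/right halos.
    """
    rank = dist.get_rank(group)
    world = dist.get_world_size(group)
    left, right = rank - 1, rank + 1  # neighbouring shards, no wrap-around

    recv_left = x.new_empty(halo, x.size(1))
    recv_right = x.new_empty(halo, x.size(1))

    ops = []
    if left >= 0:
        # send my first `halo` tokens left, receive the left rank's last tokens
        ops.append(dist.P2POp(dist.isend, x[:halo].contiguous(), left, group))
        ops.append(dist.P2POp(dist.irecv, recv_left, left, group))
    if right < world:
        # send my last `halo` tokens right, receive the right rank's first tokens
        ops.append(dist.P2POp(dist.isend, x[-halo:].contiguous(), right, group))
        ops.append(dist.P2POp(dist.irecv, recv_right, right, group))

    if ops:  # world size 1: nothing to exchange
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    parts = []
    if left >= 0:
        parts.append(recv_left)
    parts.append(x)
    if right < world:
        parts.append(recv_right)
    return torch.cat(parts, dim=0)
```

A full implementation would also handle the backward pass (reducing halo gradients back to the ranks that own those tokens), which is omitted here.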
The following benchmarking results show that a 2-neighbour all-to-all (orange) is the best communication strategy for implementing the halo exchange; it consistently outperforms the old head-sharding strategy (blue):
This is an isolated fwd+bwd pass of 16 transformer layers with o96 input shapes and 1024 channels.
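For reference, a hedged sketch of how the halo exchange could be expressed as a single `all_to_all` in which only the two neighbouring ranks exchange non-empty buffers (the "2-neighbour all-to-all" variant benchmarked above). The assumptions are the same as in the previous sketch and the names are illustrative, not the repository's code.

```python
import torch
import torch.distributed as dist


def halo_exchange_all_to_all(x: torch.Tensor, halo: int, group=None) -> torch.Tensor:
    """Halo exchange via one all_to_all; non-neighbours send/receive 0-length buffers."""
    rank = dist.get_rank(group)
    world = dist.get_world_size(group)
    channels = x.size(1)

    empty = x.new_empty(0, channels)
    send = [empty] * world
    recv = [x.new_empty(0, channels) for _ in range(world)]

    if rank > 0:  # left neighbour receives my first `halo` tokens
        send[rank - 1] = x[:halo].contiguous()
        recv[rank - 1] = x.new_empty(halo, channels)
    if rank < world - 1:  # right neighbour receives my last `halo` tokens
        send[rank + 1] = x[-halo:].contiguous()
        recv[rank + 1] = x.new_empty(halo, channels)

    dist.all_to_all(recv, send, group=group)

    parts = []
    if rank > 0:
        parts.append(recv[rank - 1])
    parts.append(x)
    if rank < world - 1:
        parts.append(recv[rank + 1])
    return torch.cat(parts, dim=0)
```

One possible reason this variant performs well is that both neighbour transfers are issued in a single collective call, though the actual scheduling depends on the NCCL version and network topology.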
For a full training run on n320 input with o96 hidden, we get the following increases in throughput (aligning with the benchmark results):
mlflow