This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

Reduced CPU overhead in precompute_float8_dynamic_scale_for_fsdp #331

Closed
wants to merge 1 commit

Conversation

awgu (Contributor) commented Jul 25, 2024:

Stack from ghstack (oldest at bottom):

Description
For Llama3-8B on 8xH100, profiling with with_stack=True (which does add overhead), the precompute_float8_dynamic_scale_for_fsdp CPU time decreases from 24 ms to 15 ms.

Before:
[Screenshot 2024-07-25 at 10:16:38 AM: profiler trace before]

After:
[Screenshot 2024-07-25 at 10:17:00 AM: profiler trace after]
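
For reference, here is a minimal sketch of how such a CPU trace can be captured with torch.profiler; the toy train_step below is a hypothetical stand-in, not the actual Llama3-8B setup used for the numbers above.

import torch
from torch.profiler import ProfilerActivity, profile

def train_step() -> None:
    # Hypothetical stand-in workload; replace with a real training step.
    x = torch.randn(1024, 1024)
    _ = (x @ x).sum()

# with_stack=True records Python call stacks per op (this adds CPU overhead,
# as noted above); the resulting trace is where a function like
# precompute_float8_dynamic_scale_for_fsdp shows up with attributed CPU time.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    train_step()
prof.export_chrome_trace("trace.json")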

Test Plan

(pytorch-3.10) [[email protected] /data/users/andgu/float8_experimental (precompute_float8)]$ pytest test/test_fsdp2/test_fsdp2.py 
========================================================= test session starts =========================================================
platform linux -- Python 3.10.13, pytest-7.3.2, pluggy-1.3.0
rootdir: /data/users/andgu/float8_experimental
plugins: xdoctest-1.1.0, hypothesis-5.35.1, xdist-3.3.1, shard-0.1.2, rerunfailures-13.0, flakefinder-1.1.0, cpp-2.3.0
collected 8 items                                                                                                                     
Running 8 items in this shard

test/test_fsdp2/test_fsdp2.py ........                                                                                          [100%]

========================================================== warnings summary ===========================================================
test/test_fsdp2/test_fsdp2.py::TestFloat8MultiThread::test_fp32_fp8_multi_module_parity
test/test_fsdp2/test_fsdp2.py::TestFloat8MultiThread::test_fp32_fp8_single_module_parity
  /data/users/andgu/float8_experimental/float8_experimental/float8_linear_utils.py:272: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead.
    all_reduced_amax_tensor = all_reduce(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================== 8 passed, 2 warnings in 121.90s (0:02:01) ==============================================

Differential Revision: D60236258

awgu added a commit that referenced this pull request Jul 25, 2024
ghstack-source-id: 9a22b865feaf67bb83910a99503975d09170fa06
Pull Request resolved: #331
@facebook-github-bot added the CLA Signed label on Jul 25, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
float8_linear.weight._local_tensor._precomputed_scale = (
    scale._local_tensor.squeeze()
)
local_scale_tensor = scale_tensor.to_local()
awgu (Contributor Author) commented:
We have to call to_local() here because DTensor does not support [i] int indexing. Int indexing might not be semantically clear if the DTensor is sharded; I think indexing should be on the local tensor.
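
As a side note, a minimal single-rank sketch of the to_local() pattern (this assumes the torch.distributed._tensor API available at the time; the shapes and names below are illustrative, not the actual FSDP2 scale layout):

import os
import torch
import torch.distributed as dist
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

# One scale per weight, held as a DTensor (illustrative only).
scale_tensor = distribute_tensor(torch.rand(4), mesh, [Shard(0)])

# Index the local shard instead of the DTensor itself, since int indexing
# on a (possibly sharded) DTensor is unsupported / semantically unclear.
local_scale_tensor = scale_tensor.to_local()
scale_for_weight_0 = local_scale_tensor[0]

dist.destroy_process_group()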

@awgu marked this pull request as ready for review on July 25, 2024, 14:20.
awgu (Contributor Author) commented Jul 25, 2024:
@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -57,18 +57,16 @@ def precompute_float8_dynamic_scale_for_fsdp(module: nn.Module) -> None:

     # inf-norm is equivalent to max(abs(w))
     max_weights = torch._foreach_norm(weights, ord=math.inf)  # Partial
-    amax_tensor = torch.vstack(max_weights)  # Partial
+    amax_tensor = torch.stack(max_weights)  # Partial
A reviewer (Contributor) commented:
This is a TIL for me; I like vstack because semantically I think about gluing lego blocks together lol. Does it assert some contiguousness that causes it to be less performant?

awgu (Contributor Author) replied:
From offline: vstack incurs a per-tensor reshape, each of which redispatches through DTensor dispatch, which is what makes it slow. stack goes through DTensor dispatch only once.
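
To illustrate on plain tensors (a sketch; the DTensor dispatch cost itself is not reproduced here, but the extra per-tensor promotion that vstack performs is):

import math
import torch

weights = [torch.randn(16, 16) for _ in range(4)]

# Per-weight amax via the inf-norm; each result is a 0-D tensor.
max_weights = torch._foreach_norm(weights, ord=math.inf)

# vstack promotes every 0-D amax to 2-D first (one reshape per tensor);
# under DTensor, each of those reshapes is a separate dispatch.
amax_vstack = torch.vstack(max_weights)  # shape (4, 1)

# stack combines the list along a new dim in a single op -> one dispatch.
amax_stack = torch.stack(max_weights)    # shape (4,)

print(amax_vstack.shape, amax_stack.shape)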

facebook-github-bot (Contributor) commented:
@awgu merged this pull request in 701647b.

Labels: CLA Signed, Merged
4 participants