forked from NVIDIA/apex
Enable --fast_layer_norm for ROCm #94
Open
hubertlu-tw wants to merge 23 commits into master from fastlayernorm
Conversation
Reopening
…acy tests to fail in the fwd and bwd pass. Making CTAS_PER_ROW=1 for all cases doesn't seem to affect performance and allows all fwd pass tests to pass. In addition, bwd pass tests fail for hidden_size >= 8192; with CTAS_PER_ROW=1 for all cases in bwd, tests fail when hidden_size >= 16K. This commit makes fast_layer_norm capable of being used for inference cases. Additional debugging is required for fast_layer_norm bwd.
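The CTAS_PER_ROW tuning described above can be sketched as a launch-parameter heuristic: how many thread blocks (CTAs) cooperate on a single row, with a switch that forces one CTA per row as the workaround does. The function name and the multi-CTA thresholds below are illustrative assumptions, not the actual kernel configuration:

```python
def ctas_per_row(hidden_size, force_single_cta=False):
    """Pick how many thread blocks (CTAs) cooperate on one row.

    force_single_cta=True mirrors the workaround described above:
    always use one CTA per row, trading potential parallelism on
    very wide rows for correct results on ROCm.
    (Hypothetical heuristic for illustration only.)
    """
    if force_single_cta:
        return 1
    # Illustrative multi-CTA schedule for wide rows.
    if hidden_size >= 16384:
        return 4
    if hidden_size >= 8192:
        return 2
    return 1

# With the workaround, every hidden size maps to a single CTA per row.
assert all(ctas_per_row(h, force_single_cta=True) == 1
           for h in (1024, 8192, 16384, 65536))
```

The point of the sketch is only the shape of the trade-off: flattening the schedule to one CTA per row removes the cross-CTA reduction path that appears to misbehave on ROCm for wide rows.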
…rm/apex into fastlayernorm
What is this blocked by?
…eed to revert this change later if needed.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
… helps with decoding impact of kernel parameters.
…eported performances.
…ed_layer_norm. Updated the framework for fast_layer_norm to be the same as fused_layer_norm. Renamed the variables gamma and beta to the unique names gamma_ and beta_ to address some runtime errors. Added a patch to call fused_layer_norm bwd when hidden_size > 12K, since the tests fail for these hidden sizes. This is currently only a patch and needs to be debugged; there is potential for even better fast_layer_norm performance. NOTE that fast_layer_norm fwd is still called for hidden_sizes > 12K.
…n_sizes > 12K, since the fast_layer_norm patch calls fused_layer_norm for bwd when hidden_size > 12K
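The patch described in the commits above amounts to a dispatch rule: use the fast backward kernel up to a hidden-size threshold and fall back to the fused implementation beyond it. A minimal sketch of that rule, with placeholder names rather than the actual apex API:

```python
# 12K threshold taken from the commit message above; the function and
# implementation names are hypothetical placeholders for illustration.
HIDDEN_SIZE_BWD_THRESHOLD = 12 * 1024

def select_backward_impl(hidden_size):
    """Return which backward kernel the patched dispatcher would call."""
    if hidden_size > HIDDEN_SIZE_BWD_THRESHOLD:
        # Known-good fallback while the fast bwd kernel is debugged
        # for very wide rows.
        return "fused_layer_norm_bwd"
    return "fast_layer_norm_bwd"

assert select_backward_impl(8192) == "fast_layer_norm_bwd"
assert select_backward_impl(16384) == "fused_layer_norm_bwd"
```

Note that, as the commit message says, the forward pass is not gated this way: fast_layer_norm fwd runs at every hidden size.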
Hi all, I've updated all files with the appropriate changes, and ALL fast_layer_norm tests now pass for fwd and bwd. Please let me know if there are any concerns. Thanks! The performance uplift can be found on the "fwd + bwd all" sheet [here]. Let me know if you need access to the above link.
To run the --fast_layer_norm unit tests,
As a reference, the following results were obtained on CUDA systems (with nvcr.io/nvidia/pytorch:22.08-py3 on an A100 node):
On ROCm, the run failed two checks in test_fast_layer_norm.py with NaN outputs when wtype or ctype is bf16. However, even when we skip the tests with a bf16 wtype or ctype, it still fails starting from hidden_size=8192 due to relerr > tol. Please find the results of the unit tests here: apex_fastlayernorm_unittest.txt
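The relerr > tol failures mentioned above come from comparing the kernel output against a reference layer norm under a relative-error tolerance. A minimal, self-contained sketch of that style of check; the error formula and tolerance here are assumptions for illustration, not the test suite's exact values:

```python
def layer_norm_ref(x, gamma, beta, eps=1e-5):
    """Plain-Python reference layer norm over one row."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    inv = (var + eps) ** -0.5
    return [(v - mu) * inv * g + b for v, g, b in zip(x, gamma, beta)]

def relerr(out, ref):
    """Relative L2 error between a kernel output and the reference."""
    num = sum((a - b) ** 2 for a, b in zip(out, ref)) ** 0.5
    den = sum(b * b for b in ref) ** 0.5
    return num / max(den, 1e-30)

x = [0.5, -1.0, 2.0, 0.0]
ref = layer_norm_ref(x, [1.0] * 4, [0.0] * 4)
assert relerr(ref, ref) == 0.0          # identical output passes trivially
assert relerr([v * 1.001 for v in ref], ref) < 0.01  # small drift passes
```

A NaN anywhere in the kernel output makes relerr NaN, so both failure modes reported above (NaN outputs on bf16, and growing error at hidden_size >= 8192) surface through this one comparison.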