
Enable --fast_layer_norm for ROCm #94

Open · wants to merge 23 commits into base: master
Conversation


@hubertlu-tw commented Sep 9, 2022

To run the --fast_layer_norm unit tests,

cd apex/contrib/test
pytest layer_norm/

As a reference, the following results were obtained on a CUDA system (with nvcr.io/nvidia/pytorch:22.08-py3 on an A100 node):

=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.13, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /apex_development/apex
plugins: cov-3.0.0, pythonpath-0.7.4, hypothesis-4.50.8
collected 4 items

layer_norm/test_fast_layer_norm.py E...                                                                                                                              [100%]

================================================================================== ERRORS ==================================================================================
_________________________________________________________________________ ERROR at setup of test_ __________________________________________________________________________
file /apex_development/apex/apex/contrib/test/layer_norm/test_fast_layer_norm.py, line 128
  def test_(S, B, hidden_size, itype, wtype, ctype=fp32):
E       fixture 'S' not found
>       available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, cov, doctest_namespace, monkeypatch, no_cover, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/apex_development/apex/apex/contrib/test/layer_norm/test_fast_layer_norm.py:128
========================================================================= short test summary info ==========================================================================
ERROR layer_norm/test_fast_layer_norm.py::test_
======================================================================== 3 passed, 1 error in 3.61s ========================================================================

On ROCm, the tests failed the following two checks with NaN outputs:

print(f"dg: relerr={re_dg:.4e} mse={mse_dg:.4e}")
print(f"db: relerr={re_db:.4e} mse={mse_db:.4e}")

in test_fast_layer_norm.py when wtype or ctype is bf16.

However, when we skip the tests with bf16 wtype or ctype, the remaining tests fail starting from hidden_size=8192 due to relerr > tol. Please find the results of the unit tests here: apex_fastlayernorm_unittest.txt
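The relerr/mse checks above compare the ROCm kernel's gradients against an fp32 reference. As a hedged sketch of what such a comparison involves (the metric definitions here are my assumptions, not the exact ones in test_fast_layer_norm.py), the snippet below simulates bf16 rounding by truncating the float32 mantissa and reports the resulting relative error and MSE:

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Simulate bfloat16 rounding by truncating the float32 mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def relerr(test, ref):
    """L2 relative error (assumed definition, for illustration only)."""
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den

def mse(test, ref):
    """Mean squared error."""
    return sum((t - r) ** 2 for t, r in zip(test, ref)) / len(ref)

ref = [0.1 * i for i in range(1024)]       # pretend fp32 reference gradient
test = [to_bf16(v) for v in ref]           # same values at bf16 precision
print(f"dg: relerr={relerr(test, ref):.4e} mse={mse(test, ref):.4e}")
```

With healthy bf16 rounding the relative error stays around 1e-3; a NaN anywhere in `test` would instead propagate into both metrics, which is how the failing checks surface the bug.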

@aspanday aspanday closed this Mar 8, 2023
@aspanday aspanday deleted the fastlayernorm branch March 8, 2023 23:34
@aspanday

Reopening

@aspanday aspanday reopened this Mar 14, 2023
…acy tests to fail in fwd and bwd pass.

Making CTAS_PER_ROW=1 for all cases doesn't seem to affect performance and allows all fwd pass tests to pass. In addition, bwd pass tests fail
for hidden_size >= 8192; with CTAS_PER_ROW=1 for all cases in bwd, tests fail when hidden_size >= 16K.
This commit allows fast_layer_norm to be used for inference cases.
Additional debugging is required for the fast_layer_norm bwd pass.
@aspanday aspanday requested a review from abhinavvishnu April 7, 2023 19:58
@amathews-amd

What is this blocked by?
cc: @jeffdaily @hubertlu-tw @sunway513 @dllehr-amd @aspanday

aspanday added 9 commits May 3, 2023 14:17
… helps with decoding impact of kernel parameters.
…ed_layer_norm.

Updated the framework for fast_layer_norm to be the same as fused_layer_norm.
Renamed the variables gamma and beta to the unique names gamma_ and beta_ to address some runtime errors.
Added a patch to call the fused_layer_norm bwd when hidden_size > 12K, since the tests fail for these hidden sizes. This is currently only a patch and needs to be debugged; there is potential for even better fast_layer_norm performance. NOTE that the fast_layer_norm fwd is still called for hidden sizes > 12K.
…n_sizes > 12K since the fast_layer_norm patch calls fused_layer_norm for bwd when hidden_sizes > 12K
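The fallback described in these commits amounts to a size-based dispatch in the backward pass. A minimal sketch of the idea, assuming a 12K threshold as stated above (function names here are illustrative stubs, not apex's actual API):

```python
# Threshold taken from the PR discussion: fast_layer_norm bwd currently
# fails for hidden_size > 12K, so fall back to fused_layer_norm there.
FAST_LN_BWD_MAX_HIDDEN = 12 * 1024

def fast_layer_norm_bwd(grad_out):
    """Stub standing in for the fast_layer_norm backward kernel."""
    return "fast"

def fused_layer_norm_bwd(grad_out):
    """Stub standing in for the fused_layer_norm backward kernel."""
    return "fused"

def layer_norm_bwd_dispatch(grad_out, hidden_size):
    # Route large hidden sizes to the known-good fused kernel.
    if hidden_size > FAST_LN_BWD_MAX_HIDDEN:
        return fused_layer_norm_bwd(grad_out)
    return fast_layer_norm_bwd(grad_out)

print(layer_norm_bwd_dispatch(None, 16384))  # falls back to the fused path
```

Note that, as the commit message says, the forward pass keeps the fast kernel for all hidden sizes; only the backward pass is rerouted.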
@aspanday

aspanday commented May 3, 2023

Hi all,

I've updated all files with appropriate changes and ALL fast_layer_norm tests pass for fwd and bwd now. Please let me know if there are any concerns. Thanks!

Performance uplift can be found on the "fwd + bwd all" sheet [here](https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/FastLayerNorm/FastLayerNormBwd%20Ops%20breakdown.xlsx?d=w04019b500104408fbb5a8f4f37bdae5f&csf=1&web=1&e=aYfRlX).

Let me know if you need access to the above link.
