Revert "[NVIDIA ] Remove references to deprecated XLA flags."
laurentes authored Apr 30, 2024
1 parent 0bb6d13 commit cd06f1f
Showing 6 changed files with 17 additions and 12 deletions.
4 changes: 2 additions & 2 deletions README.md

````diff
@@ -128,7 +128,7 @@ The workflow to use the Profile Guided Latency Estimator workflow in XLA/GPU is:
 You could do so by setting:
 
 ```bash
-export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true"
+export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_async_all_gather=true --xla_gpu_enable_async_reduce_scatter=true"
 ```
 
 - 2. Collect and post process a profile by using JAX profiler, saving the extracted instruction latencies into a binary protobuf file.
@@ -172,7 +172,7 @@ After this step, you will get a `profile.pb` file under the `rundir` printed in
 You need to pass the `profile.pb` file to the `--xla_gpu_pgle_profile_file_or_directory_path` flag.
 
 ```bash
-export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_pgle_profile_file_or_directory_path=/path/to/profile/profile.pb"
+export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_async_all_gather=true --xla_gpu_enable_async_reduce_scatter=true --xla_gpu_pgle_profile_file_or_directory_path=/path/to/profile/profile.pb"
 ```
 
 To enable logging in the XLA and check if the profile is good, set the logging level to include `INFO`:
````
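The README hunks above describe a two-pass PGLE workflow: run once with the scheduler flags enabled to collect a profile, then run again pointing XLA at the resulting `profile.pb`. A minimal sketch of the environment setup, where the commented-out training command and the profile path are illustrative placeholders, not names from this repository:

```shell
# Pass 1: enable the latency-hiding scheduler (plus the re-added async
# collective flags) and collect a profile with the JAX profiler.
export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_async_all_gather=true --xla_gpu_enable_async_reduce_scatter=true"
# python train.py ...   # illustrative: profiling produces profile.pb under the run directory

# Pass 2: re-run with the extracted instruction latencies fed back to the scheduler.
export XLA_FLAGS="$XLA_FLAGS --xla_gpu_pgle_profile_file_or_directory_path=/path/to/profile/profile.pb"
echo "$XLA_FLAGS"
```

Appending to `$XLA_FLAGS` rather than reassigning it keeps the pass-1 flags active for the second run, which is what the README's second snippet spells out in full.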
5 changes: 3 additions & 2 deletions paxml/contrib/gpu/scripts_gpu/benchmark_gpt_multinode.sh

```diff
@@ -28,9 +28,10 @@ export VOCAB_PATH=gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model
 
 export XLA_PYTHON_CLIENT_MEM_FRACTION=${XLA_PYTHON_CLIENT_MEM_FRACTION:-0.85}
 BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
---xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_enable_async_all_gather=true
+--xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_all_reduce_combine_threshold_bytes=51200
---xla_gpu_graph_level=0"}
+--xla_gpu_graph_level=0 --xla_gpu_enable_async_all_reduce=true"}
 export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
```
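Each of the changed scripts wires its defaults through the same `${BASE_XLA_FLAGS:-...}` idiom: the baked-in flags apply only when the caller has not set `BASE_XLA_FLAGS`, and anything already in `XLA_FLAGS` is appended after the base set. A trimmed sketch of the pattern (the two default flags shown are a subset of the scripts' full list):

```shell
# Start from unset variables so the demonstration is deterministic.
unset BASE_XLA_FLAGS XLA_FLAGS

# ${VAR:-default} keeps a caller-supplied BASE_XLA_FLAGS intact and only
# falls back to the defaults when the variable is unset or empty.
BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_graph_level=0"}

# Per-run extras in XLA_FLAGS (if any) are appended after the base set.
export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
echo "$XLA_FLAGS"
```

Setting `BASE_XLA_FLAGS` before invoking a script therefore replaces the defaults wholesale, while exporting `XLA_FLAGS` appends to whichever base set is active.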
5 changes: 3 additions & 2 deletions paxml/contrib/gpu/scripts_gpu/run_lambada_singlenode.sh

```diff
@@ -27,9 +27,10 @@ LOG_DIR=$6
 
 export VOCAB_PATH=$VOCAB_PATH
 BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
---xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_enable_async_all_gather=true
+--xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_all_reduce_combine_threshold_bytes=51200
---xla_gpu_graph_level=0"}
+--xla_gpu_graph_level=0 --xla_gpu_enable_async_all_reduce=true"}
 export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
```
5 changes: 3 additions & 2 deletions paxml/contrib/gpu/scripts_gpu/run_llama_boolq_multiprocess.sh

```diff
@@ -27,9 +27,10 @@ CONFIG=${7:-LLaMA7B}
 export VOCAB_PATH=$VOCAB_PATH
 export XLA_PYTHON_CLIENT_MEM_FRACTION=${XLA_PYTHON_CLIENT_MEM_FRACTION:-0.85}
 BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
---xla_gpu_simplify_all_fp_conversions --xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_simplify_all_fp_conversions --xla_gpu_enable_async_all_gather=true
+--xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_all_reduce_combine_threshold_bytes=51200
---xla_gpu_graph_level=0"}
+--xla_gpu_graph_level=0 --xla_gpu_enable_async_all_reduce=true"}
 export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
 
 ## LLaMA currently incompatible with TE
```
5 changes: 3 additions & 2 deletions paxml/contrib/gpu/scripts_gpu/run_pile_multinode.sh

```diff
@@ -27,9 +27,10 @@ LOG_DIR=${6:-"test_logdir"}
 export VOCAB_PATH=$VOCAB_PATH
 export XLA_PYTHON_CLIENT_MEM_FRACTION=${XLA_PYTHON_CLIENT_MEM_FRACTION:-0.85}
 BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
---xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_enable_async_all_gather=true
+--xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_all_reduce_combine_threshold_bytes=51200
---xla_gpu_graph_level=0"}
+--xla_gpu_graph_level=0 --xla_gpu_enable_async_all_reduce=true"}
 export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
```
5 changes: 3 additions & 2 deletions paxml/contrib/gpu/scripts_gpu/run_pile_singlenode.sh

```diff
@@ -27,9 +27,10 @@ LOG_DIR=${6:-"test_logdir"}
 export VOCAB_PATH=$VOCAB_PATH
 
 BASE_XLA_FLAGS=${BASE_XLA_FLAGS:-"--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
---xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_enable_async_all_gather=true
+--xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_highest_priority_async_stream=true
 --xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_all_reduce_combine_threshold_bytes=51200
---xla_gpu_graph_level=0"}
+--xla_gpu_graph_level=0 --xla_gpu_enable_async_all_reduce=true"}
 export XLA_FLAGS="$BASE_XLA_FLAGS ${XLA_FLAGS:-}"
```
