Metadata logging fix #544

Merged Oct 16, 2023 (33 commits)
Commits
0f20678
set max split size
priyakasimbeg Sep 26, 2023
6cf192a
tune max split size
priyakasimbeg Sep 28, 2023
b2f8ff9
typo
priyakasimbeg Sep 28, 2023
70625b0
add back deleted block
priyakasimbeg Sep 28, 2023
179abba
undo disable torch compile for conformer
priyakasimbeg Sep 28, 2023
a2beafb
remove whitespace
priyakasimbeg Sep 28, 2023
255d835
remove training whitespace
priyakasimbeg Sep 28, 2023
b7f4cbc
isort fix
priyakasimbeg Sep 28, 2023
bb29602
formatting
priyakasimbeg Sep 28, 2023
3738f35
print step hint
priyakasimbeg Sep 28, 2023
0e4dd85
make pytorch cuda alloc config specific to conformer
priyakasimbeg Oct 6, 2023
da89a8b
tune max split size
priyakasimbeg Oct 6, 2023
416b88d
fix
priyakasimbeg Oct 7, 2023
4600d78
reduce max split size
priyakasimbeg Oct 7, 2023
7a764e1
move env var
priyakasimbeg Oct 7, 2023
1dbf3e4
logging
priyakasimbeg Oct 7, 2023
04f5c94
debugging
priyakasimbeg Oct 7, 2023
b0b9f40
debugging
priyakasimbeg Oct 7, 2023
318202e
debug logging
priyakasimbeg Oct 7, 2023
3cec8c5
update
priyakasimbeg Oct 7, 2023
4fc6e1c
update_logging
priyakasimbeg Oct 7, 2023
557bf0d
fix
priyakasimbeg Oct 7, 2023
2598d39
fix
priyakasimbeg Oct 7, 2023
9418f4f
fix
priyakasimbeg Oct 7, 2023
931337d
remove logging
priyakasimbeg Oct 9, 2023
aeed475
revert checkpoint utils debugging
priyakasimbeg Oct 9, 2023
7098843
extend max_allowed_runtime_sec for conformer
priyakasimbeg Oct 9, 2023
cb68dba
Merge branch 'dev' into conformer_oom_debugging_2
priyakasimbeg Oct 9, 2023
24edc3b
Merge branch 'dev' into conformer_oom_debugging_2
priyakasimbeg Oct 11, 2023
09ceeec
remove conformer oom fixes from this branch
priyakasimbeg Oct 11, 2023
a0b624e
lint
priyakasimbeg Oct 11, 2023
061d5b3
pr feedback
priyakasimbeg Oct 13, 2023
a4bb0f0
isort
priyakasimbeg Oct 13, 2023
tune max split size
priyakasimbeg committed Sep 28, 2023
commit 6cf192a870be0e5340190eafa51a46661c3b2d6a
16 changes: 5 additions & 11 deletions submission_runner.py
@@ -43,18 +43,14 @@
from algorithmic_efficiency.pytorch_utils import sync_ddp_time
from algorithmic_efficiency.workloads import workloads


# Hide any GPUs from TensorFlow. Otherwise TF might reserve memory and make
# it unavailable to JAX.
tf.config.set_visible_devices([], 'GPU')

# disable only for deepspeech if it works fine for other workloads.
os.environ['XLA_FLAGS'] = '--xla_gpu_enable_triton_gemm=false'

# os.environ["CUDA_VISIBLE_DEVICES"]='0'
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

# TODO(znado): make a nicer registry of workloads that lookup in.
BASE_WORKLOADS_DIR = workloads.BASE_WORKLOADS_DIR
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:256'

# Workload_path will be appended by '_pytorch' or '_jax' automatically.
WORKLOADS = workloads.WORKLOADS
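
Note on the hunk above: `PYTORCH_CUDA_ALLOC_CONF` is read when PyTorch's CUDA caching allocator initializes, so it has to be in the environment before the first CUDA allocation. `max_split_size_mb` stops the allocator from splitting blocks larger than the given size, which can reduce fragmentation. A later commit in this PR scopes the setting to the conformer workload. A minimal sketch of that idea follows. The `MAX_SPLIT_SIZE_MB` table and both helper names are hypothetical and not from the repository; only the 256 MB value appears in the diff.

```python
import os

# Hypothetical per-workload table; only the value 256 appears in the diff.
MAX_SPLIT_SIZE_MB = {'librispeech_conformer': 256}


def alloc_conf_for(workload):
  """Return a PYTORCH_CUDA_ALLOC_CONF value for `workload`, or None."""
  size_mb = MAX_SPLIT_SIZE_MB.get(workload)
  if size_mb is None:
    return None
  return f'max_split_size_mb:{size_mb}'


def apply_alloc_conf(workload, environ=os.environ):
  """Set the allocator config in `environ` only for listed workloads.

  Call this before anything allocates CUDA memory; the allocator reads the
  variable once at initialization.
  """
  conf = alloc_conf_for(workload)
  if conf is not None:
    environ['PYTORCH_CUDA_ALLOC_CONF'] = conf
  return conf
```

Passing a plain dict as `environ` makes the helper easy to check without touching the real process environment.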
@@ -209,9 +205,8 @@ def train_once(
model_params, model_state = workload.init_model_fn(
model_init_rng, dropout_rate, aux_dropout_rate)
if FLAGS.framework == 'pytorch' and FLAGS.torch_compile:
compile_error_workloads = ['ogbg', 'criteo1tb']
eager_backend_workloads = [
'librispeech_conformer', 'librispeech_deepspeech'
compile_error_workloads = ['ogbg', 'criteo1tb', 'librispeech_conformer']
eager_backend_workloads = ['librispeech_deepspeech'
]
aot_eager_backend_workloads = []
if FLAGS.workload in compile_error_workloads:
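
Note on the hunk above: as of this commit, `librispeech_conformer` moves from the eager-backend list to the list of workloads where compilation is skipped. The diff does not show what follows the membership checks, so the sketch below is an assumption about how such lists could map to a `torch.compile` backend. The function name `select_compile_backend` and the `'inductor'` fallback are illustrative choices, not the repository's code; `'eager'`, `'aot_eager'` and `'inductor'` are existing `torch.compile` backend names.

```python
# Lists copied from the new side of the diff above.
COMPILE_ERROR_WORKLOADS = ['ogbg', 'criteo1tb', 'librispeech_conformer']
EAGER_BACKEND_WORKLOADS = ['librispeech_deepspeech']
AOT_EAGER_BACKEND_WORKLOADS = []


def select_compile_backend(workload):
  """Return a torch.compile backend name, or None to skip compilation.

  Assumed mapping: workloads known to fail under compilation are skipped,
  listed workloads get a debugging backend, everything else uses the
  default 'inductor' backend.
  """
  if workload in COMPILE_ERROR_WORKLOADS:
    return None
  if workload in EAGER_BACKEND_WORKLOADS:
    return 'eager'
  if workload in AOT_EAGER_BACKEND_WORKLOADS:
    return 'aot_eager'
  return 'inductor'
```

A caller would then do something like `model = torch.compile(model, backend=backend)` only when the returned backend is not `None`.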
@@ -422,7 +417,6 @@ def train_once(
_reset_cuda_mem()

train_state['last_step_end_time'] = get_time()

metrics = {'eval_results': eval_results, 'global_step': global_step}

if log_dir is not None: