dev -> main #670

Merged
42 commits merged from dev into main, Mar 5, 2024

Changes from 1 commit
Commits (42)
a245b3c
Reduce BS for the variant
pomonam Feb 20, 2024
8733487
add logs for prize qualification baseline runs
priyakasimbeg Feb 24, 2024
9ffeba9
add script to package logs
priyakasimbeg Feb 24, 2024
b470197
add documentation
priyakasimbeg Feb 27, 2024
9c03902
formatting
priyakasimbeg Feb 27, 2024
7f3e59f
Merge pull request #656 from mlcommons/imagenet_variant_bs
priyakasimbeg Feb 27, 2024
bd12900
remove debugging statemetns
priyakasimbeg Feb 27, 2024
8ce8772
Fix WMT jax config for decoding
runame Feb 28, 2024
74124d5
Fix deepspeech model_state when batchnorm is not used
runame Feb 28, 2024
a4dff63
Fix lint
runame Feb 28, 2024
dfbe642
Merge pull request #658 from runame/fix-bugs
priyakasimbeg Feb 28, 2024
3398aac
Merge pull request #657 from mlcommons/prize_qualificaton_logs
priyakasimbeg Feb 28, 2024
d425994
add self tuning options to docker startup and run_workloads.py
priyakasimbeg Feb 29, 2024
e7edb75
add self_tuning options to run_workloads script
priyakasimbeg Feb 29, 2024
0635590
add metadata files
priyakasimbeg Feb 29, 2024
0f2322f
fix startup script
priyakasimbeg Feb 29, 2024
4df4dc4
fix
priyakasimbeg Feb 29, 2024
2b36824
fix
priyakasimbeg Feb 29, 2024
83af93b
fix
priyakasimbeg Feb 29, 2024
5eb2f6f
fix
priyakasimbeg Feb 29, 2024
df53b18
fix
priyakasimbeg Feb 29, 2024
216f2e0
fix self tuning prize qualification baselines
priyakasimbeg Feb 29, 2024
06ac708
fix
priyakasimbeg Feb 29, 2024
0c0f469
reduce eval interval duration
priyakasimbeg Feb 29, 2024
202fd2c
fix typo in workload entry for criteo1tb_embed_init
priyakasimbeg Mar 1, 2024
e956ac8
fix criteo1tb resnet variant
priyakasimbeg Mar 1, 2024
2d7c27a
formatting
priyakasimbeg Mar 1, 2024
3fba843
Merge pull request #666 from mlcommons/criteo_variants_debugging
priyakasimbeg Mar 2, 2024
77f1367
Merge pull request #665 from mlcommons/criteo_evals
priyakasimbeg Mar 2, 2024
84d4d31
fix fastmri datasetup
priyakasimbeg Mar 2, 2024
7d5229d
Merge pull request #661 from mlcommons/self_tuning_docker
priyakasimbeg Mar 2, 2024
d8a098d
Merge pull request #669 from mlcommons/fastmri_data_setup_fix
priyakasimbeg Mar 2, 2024
7f46ae0
fix networkx package dependency version
priyakasimbeg Mar 2, 2024
5201633
add CL updates
priyakasimbeg Mar 2, 2024
12ebb82
Merge pull request #671 from mlcommons/env_fix
priyakasimbeg Mar 2, 2024
ea01c05
Add dropout_rng in fastmri model_init
runame Mar 3, 2024
e23a889
Fix yapf
runame Mar 3, 2024
ab9e3fb
Merge pull request #672 from runame/fix-rng
priyakasimbeg Mar 3, 2024
4347a8c
tuning search space fix
priyakasimbeg Mar 3, 2024
31f5620
Merge pull request #675 from mlcommons/docker_startup_fix
priyakasimbeg Mar 3, 2024
75a0db6
update date
priyakasimbeg Mar 5, 2024
24632ad
Merge pull request #680 from mlcommons/cl_03_02
priyakasimbeg Mar 5, 2024
add self tuning options to docker startup and run_workloads.py
priyakasimbeg committed Feb 29, 2024

commit d425994e5a21cca20c30c7345bb7c78fa29a5a6b
42 changes: 36 additions & 6 deletions docker/scripts/startup.sh
@@ -50,6 +50,7 @@ HOME_DIR=""
RSYNC_DATA="true"
OVERWRITE="false"
SAVE_CHECKPOINTS="true"
TUNING_RULESET="external"

# Pass flag
while [ "$1" != "" ]; do
@@ -107,6 +108,10 @@ while [ "$1" != "" ]; do
shift
HOME_DIR=$1
;;
--tuning_ruleset)
shift
TUNING_RULESET=$1
;;
--num_tuning_trials)
shift
NUM_TUNING_TRIALS=$1
@@ -157,6 +162,7 @@ VALID_WORKLOADS=("criteo1tb" "imagenet_resnet" "imagenet_resnet_silu" "imagenet_
"librispeech_deepspeech_tanh" \
"librispeech_deepspeech_no_resnet" "librispeech_deepspeech_norm_and_spec_aug"
"fastmri_layernorm" "ogbg_gelu" "ogbg_silu" "ogbg_model_size")
VALID_RULESETS=("self" "external")

# Set data and experiment paths
ROOT_DATA_BUCKET="gs://mlcommons-data"
@@ -167,14 +173,21 @@ EXPERIMENT_DIR="${HOME_DIR}/experiment_runs"

if [[ -n ${DATASET+x} ]]; then
if [[ ! " ${VALID_DATASETS[@]} " =~ " $DATASET " ]]; then
echo "Error: invalid argument for dataset (d)."
echo "Error: invalid argument $DATASET for dataset (d)."
exit 1
fi
fi

if [[ -n ${WORKLOAD+x} ]]; then
if [[ ! " ${VALID_WORKLOADS[@]} " =~ " $WORKLOAD " ]]; then
echo "Error: invalid argument for workload (w)."
echo "Error: invalid argument $WORKLOAD for workload (w)."
exit 1
fi
fi

if [[ -n ${TUNING_RULESET+x} ]]; then
if [[ ! " ${VALID_RULESETS[@]} " =~ " $TUNING_RULESET " ]]; then
echo "Error: invalid argument $TUNING_RULESET gtfor tuning ruleset (tuning_ruleset)."
exit 1
fi
fi
@@ -243,6 +256,10 @@ if [[ ! -z ${SUBMISSION_PATH+x} ]]; then
if [[ ${FRAMEWORK} == "pytorch" ]]; then
TORCH_COMPILE_FLAG="--torch_compile=true"
fi

# Flags for rulesets
if [[ ${TUNING_RULESET} == "external" ]]; then
TUNING_SEARCH_SPACE_FLAG="--submission_path=${SUBMISSION_PATH}"

# The TORCH_RUN_COMMAND_PREFIX is only set if FRAMEWORK is "pytorch"
COMMAND="${COMMAND_PREFIX} submission_runner.py \
@@ -256,13 +273,26 @@ if [[ ! -z ${SUBMISSION_PATH+x} ]]; then
--experiment_name=${EXPERIMENT_NAME} \
--overwrite=${OVERWRITE} \
--save_checkpoints=${SAVE_CHECKPOINTS} \
${NUM_TUNING_TRIALS_FLAG} \
${HPARAM_START_INDEX_FLAG} \
${HPARAM_END_INDEX_FLAG} \
${RNG_SEED_FLAG} \
${MAX_STEPS_FLAG} \
${SPECIAL_FLAGS} \
${TORCH_COMPILE_FLAG} 2>&1 | tee -a ${LOG_FILE}"
${TORCH_COMPILE_FLAG}"

if [[ ${TUNING_RULESET} == "external" ]]; then
COMMAND="${COMMAND} \
${TUNING_RULESET_FLAG} \
${TUNING_SEARCH_SPACE_FLAG} \
${NUM_TUNING_TRIALS_FLAG} \
${HPARAM_START_INDEX_FLAG} \
${HPARAM_END_INDEX_FLAG}"

else
COMMAND="${COMMAND} \
${TUNING_RULESET_FLAG}"
fi

COMMAND = "$COMMAND 2>&1 | tee -a ${LOG_FILE}"

echo $COMMAND > ${LOG_FILE}
echo $COMMAND
eval $COMMAND
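
For reference, the new --tuning_ruleset option defaults to "external" and is validated against VALID_RULESETS=("self" "external"). A minimal usage sketch follows (not taken from the repository docs): only --tuning_ruleset and --num_tuning_trials appear in this diff, and whatever dataset, workload, framework, and submission flags the script otherwise requires are assumed to be passed as before.

# Sketch: self-tuning run; no external hyperparameter-search flags are appended to COMMAND.
bash docker/scripts/startup.sh --tuning_ruleset self

# Sketch: external-tuning run (the default), bounding the number of tuning trials.
bash docker/scripts/startup.sh --tuning_ruleset external --num_tuning_trials 5
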
4 changes: 4 additions & 0 deletions scoring/run_workloads.py
@@ -50,6 +50,10 @@
False,
'Whether or not to actually run the docker containers. '
'If False, simply print the docker run commands. ')
flags.DEFINE_enum('tuning_ruleset',
'external',
enum_values=['external', 'self'],
help='Can be either external or self.')
flags.DEFINE_integer('num_studies', 5, 'Number of studies to run')
flags.DEFINE_integer('study_start_index', None, 'Start index for studies.')
flags.DEFINE_integer('study_end_index', None, 'End index for studies.')
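
The scoring harness exposes the matching switch. A hedged invocation sketch, using only flags defined in this diff; any other flags run_workloads.py requires (e.g. docker image or framework selection) are assumed to be supplied as usual.

# Sketch only: flags beyond those shown in this diff are omitted.
python3 scoring/run_workloads.py --tuning_ruleset=self --num_studies=5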