Fix preemptibles and maxRetries on GCP Batch [AN-274] [AN-377] (#7684)
mcovarr authored Feb 7, 2025
1 parent ab3f9b5 commit 648f65a
Showing 35 changed files with 724 additions and 210 deletions.
CHANGELOG.md (2 changes: 1 addition & 1 deletion)
@@ -27,7 +27,7 @@ be found [here](https://cromwell.readthedocs.io/en/stable/backends/HPC/#optional
- The `genomics` configuration entry was renamed to `batch`, see [ReadTheDocs](https://cromwell.readthedocs.io/en/stable/backends/GCPBatch/) for more information.
- Fixes a bug with not being able to recover jobs on Cromwell restart.
- Fixes machine type selection to match the Google Cloud Life Sciences backend, including default n1 non shared-core machine types and correct handling of `cpuPlatform` to select n2 or n2d machine types as appropriate.
-- Fixes the preemption error handling, now, the correct error message is printed, this also handles the other potential exit codes.
+- Fixes preemption and maxRetries behavior. In particular, once a task has exhausted its allowed preemptible attempts, the task will be scheduled again on a non-preemptible VM.
- Fixes error message reporting for failed jobs.
- Fixes the "retry with more memory" feature.
- Fixes the reference disk feature.
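To make the fixed semantics concrete, here is a minimal WDL sketch (the task, workflow, and image names are illustrative, not from this commit). With `preemptible: 2`, Cromwell schedules up to two attempts on Spot VMs; once both have been preempted, the next attempt runs on a standard VM. As the `papi_preemptible_and_max_retries` test below demonstrates, `maxRetries` is a separate budget for ordinary (non-preemption) failures, so preemptions do not consume it.

```wdl
version 1.0

task spot_friendly {
  command <<<
    echo "hello from a VM that may be preempted"
  >>>
  runtime {
    docker: "ubuntu:latest"
    # Up to 2 attempts on Spot VMs; after both are preempted,
    # the next attempt is scheduled on a standard VM.
    preemptible: 2
    # Separate budget for retrying ordinary, non-preemption failures.
    maxRetries: 1
  }
}

workflow spot_demo {
  call spot_friendly
}
```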
@@ -1,6 +1,6 @@
name: checkpointing
testFormat: workflowsuccess
-backends: [Papiv2, GCPBATCH]
+backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: checkpointing/checkpointing.wdl
@@ -0,0 +1,70 @@
version 1.0

workflow checkpointing {
call count { input: count_to = 100 }
output {
String preempted = count.preempted
}
}

task count {
input {
Int count_to
}

meta {
volatile: true
}

command <<<
# Read from the my_checkpoint file if there's content there:
FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }')
FROM_CKPT=${FROM_CKPT:-1}

# We don't want any single VM to run the entire count, so work out the max counter value for this attempt:
MAX="$(($FROM_CKPT + 66))"

INSTANCE_NAME=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
echo "Discovered instance: $INSTANCE_NAME"

# Run the counter:
echo '--' >> my_checkpoint
for i in $(seq $FROM_CKPT ~{count_to})
do
echo $i
echo $i ${INSTANCE_NAME} $(date) >> my_checkpoint

# If we're over our max, "preempt" the VM by simulating a maintenance event:
if [ "${i}" -gt "${MAX}" ]
then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
sleep 60
fi

sleep 1
done

# Prove that we got preempted at least once:
FIRST_INSTANCE=$(cat my_checkpoint | head -n1 | awk '{ print $2 }')
LAST_INSTANCE=$(cat my_checkpoint | tail -n1 | awk '{ print $2 }')
if [ "${FIRST_INSTANCE}" != "LAST_INSTANCE" ]
then
echo "GOTPREEMPTED" > preempted.txt
else
echo "NEVERPREEMPTED" > preempted.txt
fi
>>>

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
preemptible: 3
checkpointFile: "my_checkpoint"
}

output {
File checkpoint_log = "my_checkpoint"
String preempted = read_string("preempted.txt")
}
}
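The workflow above leans on Cromwell's checkpointing support: the file named by `checkpointFile` is periodically copied to cloud storage while the task runs and is restored into the working directory of the next attempt, which is why the counter can resume mid-count after the simulated preemption. A stripped-down sketch of that contract (the task name `resumable` and the image are illustrative):

```wdl
version 1.0

task resumable {
  command <<<
    # Resume from the last checkpointed value; default to 0 on a fresh start.
    N=$(tail -n1 state.txt 2>/dev/null || echo 0)
    echo $((N + 1)) >> state.txt
  >>>
  runtime {
    docker: "ubuntu:latest"
    preemptible: 3
    checkpointFile: "state.txt"
  }
  output {
    File state = "state.txt"
  }
}
```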

This file was deleted.

@@ -0,0 +1,13 @@
name: gcpbatch_checkpointing
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: checkpointing/gcpbatch_checkpointing.wdl
}

metadata {
workflowName: checkpointing
status: Succeeded
"outputs.checkpointing.preempted": "GOTPREEMPTED"
}
@@ -11,5 +11,5 @@ metadata {
"calls.required_files.check_it.executionStatus": "Done"
"calls.required_files.do_it.executionStatus": "Failed"
"calls.required_files.do_it.retryableFailure": "false"
"calls.required_files.do_it.failures.0.message": ~~"failed"
"calls.required_files.do_it.failures.0.message": ~~"Job exited without an error, exit code 0. Batch error code 0. Job failed with an unknown reason"
}
@@ -0,0 +1,13 @@
name: gcpbatch_papi_preemptible_and_max_retries
testFormat: workflowfailure
backends: [GCPBATCH]

files {
workflow: papi_preemptible_and_max_retries/gcpbatch_papi_preemptible_and_max_retries.wdl
}

metadata {
workflowName: papi_preemptible_and_max_retries
status: Failed
"papi_preemptible_and_max_retries.delete_self.-1.attempt": 3
}
@@ -0,0 +1,28 @@
name: gcpbatch_preemptible_and_memory_retry
testFormat: workflowfailure
# The original version of this test was tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own
# VM deletion as a preemption event. However GCP Batch perhaps more correctly diagnoses VM deletion as a weird
# non-preemption event. The GCPBATCH version of this test uses `gcloud beta compute instances simulate-maintenance-event`
# to simulate a preemption in a way that GCP Batch actually perceives as a preemption.
backends: [GCPBATCH]

files {
workflow: retry_with_more_memory/gcpbatch/preemptible_and_memory_retry.wdl
options: retry_with_more_memory/retry_with_more_memory.options
}

metadata {
workflowName: preemptible_and_memory_retry
status: Failed
"failures.0.message": "Workflow failed"
"failures.0.causedBy.0.message": "stderr for job `preemptible_and_memory_retry.imitate_oom_error_on_preemptible:NA:3` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.preemptible": "true"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.executionStatus": "RetryableFailure"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.runtimeAttributes.memory": "1 GB"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.preemptible": "false"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.executionStatus": "RetryableFailure"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.runtimeAttributes.memory": "1 GB"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.preemptible": "false"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.executionStatus": "Failed"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.runtimeAttributes.memory": "1.1 GB"
}
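Two things are worth pulling out of the assertions above. First, the two retry budgets compose: the preemption retry (attempt 1 → 2) leaves memory at 1 GB, and only the OOM retry (attempt 2 → 3) applies the memory multiplier. Second, the preemption-simulation pattern these GCPBATCH tests share can be read as a standalone sketch (assuming, as the tests do, that `gcloud` is available via the cloud-sdk image and the VM's service account may call the Compute API):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Query this VM's identity from the GCE metadata server.
md() { curl -s -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/$1"; }
name=$(md instance/name)
zone=$(basename "$(md instance/zone)")

# Ask GCE to simulate a maintenance event on this VM; on a Spot VM,
# GCP Batch observes this as a genuine preemption.
gcloud beta compute instances simulate-maintenance-event "$name" --zone="$zone" -q
sleep 60  # give the preemption time to land before the command exits
```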
@@ -0,0 +1,11 @@
name: gcpbatch_preemptible_basic
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: preemptible_basic/gcpbatch_preemptible_basic.wdl
}

metadata {
status: Succeeded
}
@@ -14,5 +14,5 @@ metadata {
workflowName: requester_pays_localization
status: Failed
"failures.0.message": "Workflow failed"
"failures.0.causedBy.0.message": ~~"failed"
"failures.0.causedBy.0.message": ~~"The job was stopped before the command finished. Batch error code 0. Job failed with an unknown reason"
}
@@ -1,5 +1,5 @@
name: gcpbatch_retry_with_more_memory
-testFormat: workflowfailure
+testFormat: workflowsuccess
backends: [GCPBATCH]

files {
@@ -9,13 +9,10 @@ files {

metadata {
workflowName: retry_with_more_memory
-status: Failed
-"failures.0.message": "Workflow failed"
-"failures.0.causedBy.0.message": "stderr for job `retry_with_more_memory.imitate_oom_error:NA:3` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."
+status: Succeeded
"retry_with_more_memory.imitate_oom_error.-1.1.executionStatus": "RetryableFailure"
"retry_with_more_memory.imitate_oom_error.-1.1.runtimeAttributes.memory": "1 GB"
"retry_with_more_memory.imitate_oom_error.-1.2.executionStatus": "RetryableFailure"
"retry_with_more_memory.imitate_oom_error.-1.2.runtimeAttributes.memory": "1.1 GB"
"retry_with_more_memory.imitate_oom_error.-1.3.executionStatus": "Failed"
"retry_with_more_memory.imitate_oom_error.-1.3.runtimeAttributes.memory": "1.2100000000000002 GB"
"outputs.retry_with_more_memory.memory_output": "1.2100000000000002 GB"
}
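For context on the memory progression asserted above: the retryable error keys come from Cromwell configuration, and the growth factor from the `memory_retry_multiplier` workflow option. Neither file's full contents appear in this diff, so the following is a sketch inferred from the assertions; the odd `1.2100000000000002` is simply two applications of a 1.1 multiplier in double-precision arithmetic (`python -c "print(1.1 * 1.1)"` prints exactly that).

```hocon
# Cromwell config (HOCON) sketch: stderr substrings that mark a failed
# attempt as retryable with more memory, matching the keys quoted in the
# old assertion above.
system.memory-retry-error-keys = ["OutOfMemory", "Killed"]
```

The referenced `retry_with_more_memory.options` file would then set `"memory_retry_multiplier": 1.1`, multiplying the task's memory by 1.1 on each memory retry: 1 GB → 1.1 GB → 1.21 GB.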
@@ -1,7 +1,7 @@
name: papi_preemptible_and_max_retries
testFormat: workflowfailure
-# faking own preemption doesn't work on GCP Batch
-backends: [Papiv2, GCPBATCH_TESTING_PAPIV2_QUIRKS]
+# Faking own preemption has to be done differently on GCP Batch
+backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: papi_preemptible_and_max_retries/papi_preemptible_and_max_retries.wdl
@@ -0,0 +1,31 @@
version 1.0

task delete_self {

command {
preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

# Simulate a maintenance event on ourselves if running on a preemptible VM, otherwise delete ourselves.
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

if [ "$preemptible" = "TRUE" ]; then
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
sleep 60
else
# We need to actually delete ourselves if the VM is not preemptible; simulated maintenance events don't seem to
# precipitate the demise of on-demand VMs.
gcloud compute instances delete $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
fi
}

runtime {
preemptible: 1
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
maxRetries: 1
}
}

workflow papi_preemptible_and_max_retries {
call delete_self
}
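Given `preemptible: 1` and `maxRetries: 1`, the attempt count of 3 asserted in the test config above breaks down as follows (a reading of the test, not literal Cromwell output):

```
attempt 1: Spot VM     -> simulated maintenance event -> treated as a preemption,
                          retried without consuming maxRetries
attempt 2: standard VM -> deletes itself              -> ordinary failure, consumes maxRetries: 1
attempt 3: standard VM -> deletes itself              -> ordinary failure, workflow fails at attempt 3
```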
@@ -1,8 +1,10 @@
name: preemptible_and_memory_retry
testFormat: workflowfailure
-# The original version of this test seems to have been tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own VM deletion as a preemption event. GCP Batch perhaps more correctly diagnoses the VM deletion as a weird non-preemption happening, but that frustrates the logic of this test.
-# Disabling this as it's not possible to induce a real preemption.
-backends: [Papiv2, GCPBATCH_TESTING_PAPIV2_QUIRKS]
+# The original version of this test was tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own
+# VM deletion as a preemption event. However GCP Batch perhaps more correctly diagnoses VM deletion as a weird
+# non-preemption event. The GCPBATCH version of this test uses `gcloud beta compute instances simulate-maintenance-event`
+# to simulate a preemption in a way that GCP Batch actually perceives as a preemption.
+backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: retry_with_more_memory/preemptible_and_memory_retry.wdl
@@ -0,0 +1,11 @@
name: preemptible_basic
testFormat: workflowsuccess
backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: preemptible_basic/preemptible_basic.wdl
}

metadata {
status: Succeeded
}
@@ -0,0 +1,33 @@
version 1.0

task delete_self_if_preemptible {

command <<<
# Prepend date, time and pwd to xtrace log entries.
PS4='\D{+%F %T} \w $ '
set -o errexit -o nounset -o pipefail -o xtrace

preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

# Perform a maintenance event on this VM if it is preemptible, which should cause it to be preempted.
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
sleep 60
fi

>>>

runtime {
preemptible: 1
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
}
}


workflow preemptible_basic {
call delete_self_if_preemptible
}
@@ -9,7 +9,7 @@ task delete_self_if_preemptible {
# Delete self if running on a preemptible VM. This should produce an "error 10" which Cromwell should treat as a preemption.
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then

fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

@@ -25,6 +25,6 @@ task delete_self_if_preemptible {
}


-workflow error_10_preemptible {
+workflow preemptible_basic {
call delete_self_if_preemptible
}
@@ -12,13 +12,14 @@ task imitate_oom_error_on_preemptible {

preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

-# Delete self if running on a preemptible VM
+# Simulate a maintenance event on ourselves if running on a preemptible VM
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

-gcloud compute instances delete $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
+gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
+sleep 60
fi

# Should reach here on the second attempt
@@ -2,12 +2,21 @@ version 1.0

task imitate_oom_error {
command {
printf "Exception in thread "main" java.lang.OutOfMemoryError: testing\n\tat Test.main(Test.java:1)\n" >&2 && (exit 1)
# As a simulation of an OOM condition, do not create the 'foo' file. Cromwell should still be able to delocalize important detritus.
# touch foo
echo "$MEM_SIZE $MEM_UNIT"

# Current bashes do not do floating point arithmetic, Python to the rescue.
LESS=$(python -c "print($MEM_SIZE < 1.21)")

if [[ "$LESS" = "True" ]]
then
printf "Exception in thread "main" java.lang.OutOfMemoryError: testing\n\tat Test.main(Test.java:1)\n" >&2
exit 1
fi

echo "$MEM_SIZE $MEM_UNIT" > memory_output.txt
}
output {
-File foo = "foo"
+String memory_output = read_string("memory_output.txt")
}
runtime {
docker: "python:latest"
Expand All @@ -19,4 +28,8 @@ task imitate_oom_error {

workflow retry_with_more_memory {
call imitate_oom_error

output {
String memory_output = imitate_oom_error.memory_output
}
}
docs/RuntimeAttributes.md (8 changes: 7 additions & 1 deletion)
@@ -328,7 +328,13 @@
}
```


In GCP Batch, preempted jobs can be identified in job metadata (`gcloud batch jobs describe`) by a `statusEvent` with a description that looks like:
```
Job state is set from RUNNING to FAILED for job projects/abc/locations/us-central1/jobs/job-abc.Job
failed due to task failure. Specifically, task with index 0 failed due to the
following task event: "Task state is updated from RUNNING to FAILED on zones/us-central1-b/instances/8675309
due to Spot VM preemption with exit code 50001."
```
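To pull just those events for a job, something like the following should work (the job name and location are placeholders; `--format` projections are standard gcloud):

```
gcloud batch jobs describe job-abc --location=us-central1 \
  --format="yaml(status.statusEvents)"
```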


### `bootDiskSizeGb`