Containerization makes VEBA portable, so it can run on virtually any system, including cloud resources such as AWS or Google Cloud. This guide covers using the VEBA containers specifically with AWS, which involves the following steps:
- Set up AWS infrastructure
- Create and register a job definition
- Submit a job to the queue using the job definition
Setting up the AWS infrastructure is out of scope for this tutorial, but essentially you need to do the following:
- Set up AWS EFS (Elastic File System) via Terraform to read/write/mount data
- Compile database in EFS
- Create a compute environment
- Create a job queue linked to the compute environment (a minimal CLI sketch of these last two steps follows this list)
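For reference, a managed Fargate compute environment and a linked job queue can be created with the AWS CLI roughly as shown below; the environment name, queue name, subnet ID, and security group ID are placeholders you would replace with your own values (the EFS/Terraform and database steps are not covered here).

# Create a managed Fargate compute environment (subnet and security group IDs are placeholders)
aws batch create-compute-environment \
    --compute-environment-name veba-fargate-ce \
    --type MANAGED \
    --compute-resources '{"type": "FARGATE", "maxvCpus": 256, "subnets": ["subnet-xxx"], "securityGroupIds": ["sg-xxx"]}'

# Create a job queue linked to that compute environment
aws batch create-job-queue \
    --job-queue-name veba-fargate-queue \
    --priority 1 \
    --compute-environment-order '[{"order": 1, "computeEnvironment": "veba-fargate-ce"}]'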
Once the job queue is properly set up, the next steps are to create a job definition and then submit a job to the queue using that definition.
The preferred way to submit jobs to AWS Batch is with JSON job definitions that run on Fargate.
Here is a template you can use for a job definition. It pulls the jolespin/veba_preprocess Docker image, mounts EFS directories as volumes within the Docker container, and runs the preprocess.py module of VEBA on a sample called S1.
{
"jobDefinitionName": "preprocess__S1",
"type": "container",
"containerProperties": {
"image": "jolespin/veba_preprocess:2.4.2",
"command": [
"preprocess.py",
"-1",
"/volumes/input/Fastq/S1_1.fastq.gz",
"-2",
"/volumes/input/Fastq/S1_2.fastq.gz",
"-n",
"1",
"-o",
"/volumes/output/veba_output/preprocess",
"-p",
"16"
"-x",
"/volumes/database/Contamination/chm13v2.0/chm13v2.0"
],
"jobRoleArn": "arn:aws:iam::xxx:role/ecsTaskExecutionRole",
"executionRoleArn": "arn:aws:iam::xxx:role/ecsTaskExecutionRole",
"volumes": [
{
"name": "efs-volume-database",
"efsVolumeConfiguration": {
"fileSystemId": "fs-xxx",
"transitEncryption": "ENABLED",
"rootDirectory": "databases/veba/VDB_v8/"
}
},
{
"name": "efs-volume-input",
"efsVolumeConfiguration": {
"fileSystemId": "fs-xxx",
"transitEncryption": "ENABLED",
"rootDirectory": "path/to/efs/input/"
}
},
{
"name": "efs-volume-output",
"efsVolumeConfiguration": {
"fileSystemId": "fs-xxx",
"transitEncryption": "ENABLED",
"rootDirectory": "path/to/efs/output/"
}
}
],
"mountPoints": [
{
"sourceVolume": "efs-volume-database",
"containerPath": "/volumes/database",
"readOnly": true
},
{
"sourceVolume": "efs-volume-input",
"containerPath": "/volumes/input",
"readOnly": true
},
{
"sourceVolume": "efs-volume-output",
"containerPath": "/volumes/output",
"readOnly": false
}
],
"environment": [],
"ulimits": [],
"resourceRequirements": [
{
"value": "16.0",
"type": "VCPU"
},
{
"value": "8000",
"type": "MEMORY"
}
],
"networkConfiguration": {
"assignPublicIp": "ENABLED"
},
"fargatePlatformConfiguration": {
"platformVersion": "LATEST"
},
"ephemeralStorage": {
"sizeInGiB": 40
}
},
"platformCapabilities": [
"FARGATE"
]
}
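Before registering, it can help to confirm the file is well-formed JSON. This is an optional check that assumes jq is installed locally (python -m json.tool works as well):
jq . /path/to/preprocess/S1.json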
Now register the job definition:
FILE=/path/to/preprocess/S1.json
aws batch register-job-definition --cli-input-json file://${FILE}
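If you generate one JSON file per sample, a simple loop registers them all; this is just a sketch assuming the per-sample files live together in one directory:
for FILE in /path/to/preprocess/*.json; do
    aws batch register-job-definition --cli-input-json file://${FILE}
done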
The next step is to submit the job to the queue. Here the job name is the same as the registered job definition name (preprocess__S1), so one variable serves for both.
QUEUE="some-aws-job-queue-name"
JOB_NAME="preprocess__S1"
aws batch submit-job --job-definition ${JOB_NAME} --job-name ${JOB_NAME} --job-queue ${QUEUE}
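After submission, the job can be monitored with the standard AWS Batch commands; <job-id> below stands for the jobId returned by submit-job:
# List jobs currently running in the queue
aws batch list-jobs --job-queue ${QUEUE} --job-status RUNNING

# Inspect a specific job
aws batch describe-jobs --jobs <job-id>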