update readme to AWS #1

Open · wants to merge 45 commits into base: r1.6.1

45 commits
7ce9ea7
first update
yinshiyi Jun 21, 2024
c05c778
plan update
yinshiyi Sep 13, 2024
53d9a9e
add coverage script and make sure happy is run using coverage >2
yinshiyi Sep 17, 2024
11f4bc3
launch small instance
yinshiyi Sep 26, 2024
919c2c1
connect aws ec2
yinshiyi Oct 2, 2024
9f14838
oct 2 2024 progress
yinshiyi Oct 3, 2024
adf2e41
oct 3 2024 progress
yinshiyi Oct 3, 2024
b253057
oct 4 2024
yinshiyi Oct 4, 2024
a43f19f
finish making example
yinshiyi Oct 4, 2024
9c1bb65
add the transfer script
yinshiyi Oct 4, 2024
85bf27c
add shuffle path
yinshiyi Oct 7, 2024
b4f8941
add worker and update dependency
yinshiyi Oct 16, 2024
ddaf0a3
add upload script
yinshiyi Oct 16, 2024
f223138
add validation data
yinshiyi Oct 17, 2024
2c99126
add shuffle validation
yinshiyi Oct 17, 2024
db46bd8
add more cpu thread
yinshiyi Oct 17, 2024
07899f6
update shuffle code
yinshiyi Oct 17, 2024
2d9f0cf
testing shuffle
yinshiyi Oct 17, 2024
f30ad08
update file path
yinshiyi Oct 18, 2024
5dc1599
update dateflow to spark
yinshiyi Oct 23, 2024
05c0db2
add training command
yinshiyi Oct 23, 2024
e455e56
fix file path
yinshiyi Oct 23, 2024
311c6d4
add final eval
yinshiyi Oct 23, 2024
9e5bea8
add baseline and hap
yinshiyi Oct 24, 2024
3d91f9a
typo fix
yinshiyi Nov 7, 2024
cd6276d
update flags
yinshiyi Nov 7, 2024
84363f6
Update shuffle_tfrecords_beam.py
yinshiyi Nov 23, 2024
7efb829
add the real data to try
yinshiyi Dec 4, 2024
7a54cce
Update shuffle_tfrecords_beam.py
yinshiyi Dec 5, 2024
8a55673
Update shuffle_tfrecords_beam.py
yinshiyi Dec 5, 2024
85b58dd
Create beam_test.py
yinshiyi Dec 5, 2024
92a8c28
add test files
yinshiyi Dec 5, 2024
8fb4a2d
fix typo and update subversion to contain aws and boto3
yinshiyi Dec 5, 2024
f641344
create cluster programatically
yinshiyi Dec 5, 2024
25edaa7
beam spark local test file
yinshiyi Dec 5, 2024
0f9f527
emr cluster
yinshiyi Dec 6, 2024
fa757ac
docker file
yinshiyi Dec 6, 2024
12ed0b9
Create test2024.sh
yinshiyi Dec 10, 2024
6b3fa4b
Create beam_test_2024.py
yinshiyi Dec 10, 2024
8a8c99d
Update beam_test_2024.py
yinshiyi Dec 10, 2024
ca85bfe
Update beam_test_2024.py
yinshiyi Dec 10, 2024
abb57fe
Update beam_test_2024.py
yinshiyi Dec 10, 2024
7b719ea
update notes
yinshiyi Dec 13, 2024
b27a451
to do list
yinshiyi Dec 13, 2024
be7b6fd
manually input the yarn address, future to dynamically capture the ad…
yinshiyi Jan 18, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -10,3 +10,4 @@
bazel-*

**/.ipynb_checkpoints
shuffle*
23 changes: 23 additions & 0 deletions baseline.sh
@@ -0,0 +1,23 @@
# Assumes BIN_VERSION, REF, BAM_CHR20, OUTPUT_DIR, DATA_DIR, TRUTH_VCF and
# TRUTH_BED are already set in the environment (see eval.sh / run_hap.sh).
sudo docker run --gpus 1 \
-v /home/${USER}:/home/${USER} \
google/deepvariant:"${BIN_VERSION}-gpu" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref "${REF}" \
--reads "${BAM_CHR20}" \
--regions "chr20" \
--output_vcf "${OUTPUT_DIR}/baseline.vcf.gz" \
--num_shards=4

time sudo docker run -it \
-v "${DATA_DIR}:${DATA_DIR}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${TRUTH_VCF}" \
"${OUTPUT_DIR}/baseline.vcf.gz" \
-f "${TRUTH_BED}" \
-r "${REF}" \
-o "${OUTPUT_DIR}/chr20-calling_general.happy.output" \
-l chr20 \
--engine=vcfeval \
--pass-only
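hap.py writes its output files under the `-o` prefix; a quick way to view the headline numbers (a sketch, assuming the prefix used above) is:

```bash
# The summary CSV contains per-type recall, precision, and F1.
column -t -s, "${OUTPUT_DIR}/chr20-calling_general.happy.output.summary.csv"
```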
38 changes: 38 additions & 0 deletions docs/add_alarm.sh
@@ -0,0 +1,38 @@
#!/bin/bash

# Variables
host="my-ec2-instance"
region="us-east-1"

# Step 1: Launch the EC2 instance
instance_id=$(aws ec2 run-instances \
--image-id ami-096ea6a12ea24a797 \
--count 1 \
--instance-type t4g.small \
  --security-group-ids sg-0b734813083db4ba2 \
--key-name gpu \
--block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=20} \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value='"${host}"'}]' \
--query "Instances[0].InstanceId" \
--output text \
--region $region \
--profile gpu)

echo "Launched EC2 instance with ID: $instance_id"

# Step 2: Create the CloudWatch alarm
aws cloudwatch put-metric-alarm \
--alarm-name "CPUUtilization-Low-${instance_id}" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 3600 \
--threshold 1 \
--comparison-operator LessThanOrEqualToThreshold \
--dimensions "Name=InstanceId,Value=${instance_id}" \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:940583394710:idle-instance-alarm \
--region $region \
--profile gpu

echo "Alarm created for instance: $instance_id"
97 changes: 79 additions & 18 deletions docs/deepvariant-training-case-study.md
@@ -27,40 +27,99 @@ accuracy compared to the WGS model as a baseline:
This tutorial is meant as an example for training; all the other processing in
this tutorial was done serially, with no pipeline optimization.

## BAM processing

Since PicoV3 data is very low coverage, we only keep the BAM regions with coverage >2,
so that at least three reads can vote toward a majority call. We restrict to those
regions with a BED file, following the approach used in the PacBio examples.

First, set up DeepVariant on AWS.

Collect the BAM files that have ground-truth data available.
Run variant calling with the generalized (WGS) model first to see its performance on the regions with coverage >2,
then check whether a retrained model can improve on that baseline.

```bash
BAM_CHR1="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr1.bam"
BAM_CHR20="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr20.bam"
BAM_CHR21="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr21.bam"
merged_bam="merged.bam"
minimum_coverage=2
coverage_bed="pass_threshold.bed"
# TRUTH_BED restricts the comparison to the GIAB high-confidence regions
TRUTH_BED="${DATA_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed"
# https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
samtools merge $merged_bam $BAM_CHR1 $BAM_CHR20 $BAM_CHR21
bedtools genomecov -ibam $merged_bam -bg | \
  awk -v OFS='\t' -v min_cov="$minimum_coverage" '$4 > min_cov {print $1, $2, $3}' | \
  bedtools intersect -a $TRUTH_BED -b - > $coverage_bed
```
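As a quick sanity check (not part of the original pipeline), count how many bases survive the coverage filter:

```bash
# Total bases in the truth regions with coverage above the threshold.
awk '{sum += $3 - $2} END {print sum, "bases pass the coverage filter"}' "$coverage_bed"
```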

## Request a machine

For this case study, we use a [GPU machine] with 16 vCPUs. You can request this
machine on Google Cloud using the following command:
machine on AWS using the following command:

```bash
# public.ecr.aws/aws-genomics/google/deepvariant:1.4.0
# https://cloud-images.ubuntu.com/locator/ec2/
# Replace the AMI ID with the correct Ubuntu 20.04 LTS AMI for your region and
# --key-name with your own key pair. p3 instances use Nvidia Tesla V100 GPUs,
# which is close to the Tesla P100 used in the original case study.
aws ec2 run-instances \
  --image-id ami-0c272455b0778ebeb \
  --count 1 \
  --instance-type p3.2xlarge \
  --key-name MyKeyPair \
  --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=300} \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value='"${USER}-deepvariant-vm"'}]' \
  --region us-west-2 \
  --iam-instance-profile Name=gpu \
  --placement AvailabilityZone=us-west-2b
```
```bash
# Just get a small instance to try things out first. Replace the AMI ID and
# --key-name with the correct values for your region and account.
# NOTE: t4g instances are arm64 (Graviton), so an arm64 AMI is required.
aws ec2 run-instances \
  --image-id ami-0c272455b0778ebeb \
  --count 1 \
  --instance-type t4g.small \
  --key-name gpu \
  --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=20} \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value='"${USER}-deepvariant-vm"'}]' \
  --region us-west-2 \
  --iam-instance-profile Name=gpu \
  --placement AvailabilityZone=us-west-2b
```
```bash
# this actually works
host="${USER}-deepvariant-vm"
zone="us-west1-b"

gcloud compute instances create ${host} \
--scopes "compute-rw,storage-full,cloud-platform" \
--maintenance-policy "TERMINATE" \
--accelerator=type=nvidia-tesla-p100,count=1 \
--image-family "ubuntu-2004-lts" \
--image-project "ubuntu-os-cloud" \
--machine-type "n1-standard-16" \
--boot-disk-size "300" \
--zone "${zone}" \
--min-cpu-platform "Intel Skylake"
region="us-east-1"
chmod 400 ~/gpu.pem
# this image id is not right
aws ec2 run-instances \
--image-id ami-096ea6a12ea24a797 \
--count 1 \
--instance-type t4g.small \
  --security-group-ids sg-0b734813083db4ba2 \
--key-name gpu \
--block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=20} \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value='"${host}"'}]' \
--region $region \
--profile gpu
```

After a minute or two, your VM should be ready and you can ssh into it using the
following command:

```bash
gcloud compute ssh ${host} --zone ${zone}
# Stop the instance when you are done to avoid idle charges:
aws ec2 stop-instances --instance-ids i-0e4f059f74edbb771 --profile gpu
# elastic IP
ssh -i "~/gpu.pem" [email protected]
ssh gpu
# ssh -i ~/gpu.pem ubuntu@${host}
```
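A sketch for capturing the instance's public DNS name dynamically instead of hard-coding it (assumes `instance_id` holds the ID returned by `run-instances` and the `gpu` profile used above):

```bash
# Look up the instance's public DNS name so it does not need to be hard-coded.
public_dns=$(aws ec2 describe-instances \
  --instance-ids "$instance_id" \
  --query "Reservations[0].Instances[0].PublicDnsName" \
  --output text \
  --profile gpu)
ssh -i ~/gpu.pem "ubuntu@${public_dns}"
```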

Once you have logged in, set the variables:

```bash
YOUR_PROJECT=REPLACE_WITH_YOUR_PROJECT
OUTPUT_GCS_BUCKET=REPLACE_WITH_YOUR_GCS_BUCKET

# might have to install gsutil so that the instance can download DeepVariant's standard input files
BUCKET="gs://deepvariant"
VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${VERSION}"
@@ -113,6 +172,8 @@ gsutil -m cp -r "${DATA_BUCKET}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Io
### Download extra packages

```bash
sudo snap install gh
gh auth login
sudo apt -y update
sudo apt -y install parallel
curl -O https://raw.githubusercontent.com/google/deepvariant/r1.6.1/scripts/install_nvidia_docker.sh
@@ -538,7 +599,7 @@ time sudo docker run -it \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${TRUTH_VCF}" \
"${OUTPUT_DIR}/test_set.vcf.gz" \
-f "${TRUTH_BED}" \
-f "${TRUTH_BED}" \ # this is important in my study to make sure coverage >2
-r "${REF}" \
-o "${OUTPUT_DIR}/chr20-calling.happy.output" \
-l chr20 \
@@ -588,7 +649,7 @@ sudo docker run --gpus all \
--output_vcf "${OUTPUT_DIR}/baseline.vcf.gz" \
--num_shards=${N_SHARDS}
```

Run hap.py on the baseline VCF to get the baseline metrics.
Baseline:

| Type | TRUTH.TP | TRUTH.FN | QUERY.FP | METRIC.Recall | METRIC.Precision | METRIC.F1_Score |
23 changes: 23 additions & 0 deletions eval.sh
@@ -0,0 +1,23 @@
BASE="/home/${USER}/data/training-case-study"
OUTPUT_DIR="${BASE}/output"
model="/home/${USER}/data/model/model.ckpt"
TRAINING_DIR="${OUTPUT_DIR}/training_dir"
BIN_VERSION="1.4.0"
INPUT_DIR="${BASE}/input"
LOG_DIR="${OUTPUT_DIR}/logs"
DATA_DIR="${INPUT_DIR}/data"
REF="${DATA_DIR}/ucsc_hg19.fa"
BAM_CHR1="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr1.bam"
BAM_CHR20="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr20.bam"

sudo docker run --gpus 1 \
-v /home/${USER}:/home/${USER} \
google/deepvariant:"${BIN_VERSION}-gpu" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--customized_model "${TRAINING_DIR}/model.ckpt-50000" \
--ref "${REF}" \
--reads "${BAM_CHR20}" \
--regions "chr20" \
--output_vcf "${OUTPUT_DIR}/test_set.vcf.gz" \
--num_shards=4
2 changes: 2 additions & 0 deletions index_too_old.sh
@@ -0,0 +1,2 @@
# "The index file is older than the data file" warning:
# solve it by touch-ing the index files so they are newer than the data files.
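A minimal sketch of that fix (the file names here are illustrative):

```bash
# Refresh index timestamps so tools stop warning that the index is older than the data file.
touch "${DATA_DIR}"/*.bam.bai "${DATA_DIR}"/*.fa.fai
```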
54 changes: 54 additions & 0 deletions make_example.sh
@@ -0,0 +1,54 @@
# NOTE: the DeepVariant Docker image is built for the amd64 architecture; running it on an arm64 (Graviton) machine fails with a platform-mismatch error.

YOUR_PROJECT=takara
OUTPUT_GCS_BUCKET=REPLACE_WITH_YOUR_GCS_BUCKET
# might have to install gsutil so that the instance can download DeepVariant's standard input files
BUCKET="gs://deepvariant"
VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${VERSION}"

MODEL_BUCKET="${BUCKET}/models/DeepVariant/${VERSION}/DeepVariant-inception_v3-${VERSION}+data-wgs_standard"
GCS_PRETRAINED_WGS_MODEL="${MODEL_BUCKET}/model.ckpt"

OUTPUT_BUCKET="${OUTPUT_GCS_BUCKET}/customized_training"
TRAINING_DIR="${OUTPUT_BUCKET}/training_dir"

BASE="/home/ubuntu/data/training-case-study"
DATA_BUCKET=gs://deepvariant/training-case-study/BGISEQ-HG001

INPUT_DIR="${BASE}/input"
BIN_DIR="${INPUT_DIR}/bin"
DATA_DIR="${INPUT_DIR}/data"
OUTPUT_DIR="${BASE}/output2"
LOG_DIR="${OUTPUT_DIR}/logs"
SHUFFLE_SCRIPT_DIR="${HOME}/deepvariant/tools"

REF="${DATA_DIR}/ucsc_hg19.fa"
BAM_CHR1="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr1.bam"
BAM_CHR20="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr20.bam"
BAM_CHR21="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr21.bam"
TRUTH_VCF="${DATA_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer_chrs_FIXED.vcf.gz"
TRUTH_BED="${DATA_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed"

N_SHARDS=7
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${BIN_DIR}"
mkdir -p "${DATA_DIR}"
mkdir -p "${LOG_DIR}"

( time seq 0 $((N_SHARDS-1)) | \
parallel --halt 2 --line-buffer \
sudo docker run \
-v ${HOME}:${HOME} \
${DOCKER_IMAGE} \
make_examples \
--mode training \
--ref "${REF}" \
--reads "${BAM_CHR1}" \
--examples "${OUTPUT_DIR}/training_set.with_label.tfrecord@${N_SHARDS}.gz" \
--truth_variants "${TRUTH_VCF}" \
--confident_regions "${TRUTH_BED}" \
--task {} \
--regions "'chr1'" \
--channels "insert_size" \
) 2>&1 | tee "${LOG_DIR}/training_set.with_label.make_examples.log"
22 changes: 22 additions & 0 deletions run_hap.sh
@@ -0,0 +1,22 @@
BAM_CHR1="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr1.bam"
BAM_CHR20="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr20.bam"
BAM_CHR21="${DATA_DIR}/BGISEQ_PE100_NA12878.sorted.chr21.bam"
TRUTH_VCF="${DATA_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer_chrs_FIXED.vcf.gz"
TRUTH_BED="${DATA_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed"
REF="${DATA_DIR}/ucsc_hg19.fa"


sudo docker pull jmcdani20/hap.py:v0.3.12

time sudo docker run -it \
-v "${DATA_DIR}:${DATA_DIR}" \
-v "${OUTPUT_DIR}:${OUTPUT_DIR}" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${TRUTH_VCF}" \
"${OUTPUT_DIR}/test_set.vcf.gz" \
-f "${TRUTH_BED}" \
-r "${REF}" \
-o "${OUTPUT_DIR}/chr20-calling.happy.output" \
-l chr20 \
--engine=vcfeval \
--pass-only
26 changes: 26 additions & 0 deletions shuffle.sh
@@ -0,0 +1,26 @@
#
# git clone https://github.com/apache/beam-starter-python.git
# cd beam-starter-python
# python3 -m venv env
# source env/bin/activate

# pip3 install setuptools --upgrade
# pip3 install apache_beam # installed 2.59.0
# pip3 install tensorflow # For parsing tf.Example in shuffle_tfrecords_beam.py.

# Using python-snappy made the job crash on the local server, so it is left out:
# python-snappy
# python3 -m pip install snappy
# source ../beam-starter-python/shiyi/bin/activate
YOUR_PROJECT=takara
BASE="/home/syin/lol/data/training-case-study"
OUTPUT_DIR="${BASE}/output2"
time python3 tools/shuffle_tfrecords_beam.py \
--project="${YOUR_PROJECT}" \
--input_pattern_list="${OUTPUT_DIR}"/training_set.with_label.tfrecord-?????-of-00007.gz \
--output_pattern_prefix="${OUTPUT_DIR}/training_set.with_label.shuffled" \
--output_dataset_name="HG001" \
--output_dataset_config_pbtxt="${OUTPUT_DIR}/training_set.dataset_config.pbtxt" \
--job_name=shuffle-tfrecords \
--runner=DirectRunner \
--direct_num_workers=32
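After the shuffle finishes, the dataset config it writes is plain text and can be inspected directly; it should point at the shuffled tfrecords and record the number of examples:

```bash
# Inspect the generated dataset configuration.
cat "${OUTPUT_DIR}/training_set.dataset_config.pbtxt"
```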
13 changes: 13 additions & 0 deletions shuffle_validation.sh
@@ -0,0 +1,13 @@
YOUR_PROJECT=takara
BASE="/home/syin/lol/data/training-case-study"
OUTPUT_DIR="${BASE}/output"
time python3 tools/shuffle_tfrecords_beam.py \
--project="${YOUR_PROJECT}" \
--input_pattern_list="${OUTPUT_DIR}"/validation_set.with_label.tfrecord-?????-of-?????.gz \
--output_pattern_prefix="${OUTPUT_DIR}/2/validation_set.with_label.shuffled" \
--output_dataset_name="HG001" \
--output_dataset_config_pbtxt="${OUTPUT_DIR}/2/validation_set.dataset_config.pbtxt" \
--job_name=shuffle-tfrecords \
--runner=DirectRunner \
--direct_num_workers=0
# --direct_running_mode=multi_threading \