-
Notifications
You must be signed in to change notification settings - Fork 10
WGS Benchmarking
The run scripts for WGS were moved from the root WGS directory to inside Mouse
and Human
respectfully. The path structure was changed in the NEAT call to reflect this.
The target files were removed, and the target coverage was reduced to 10x for the simulation.
The NEAT package. was used. This package can generate fastq, 'gold standard' (truth) VCF, and 'gold standard' (truth) BAM files. The package inserts SNP and InDels into a provided fasta reference. In simulating insertions of SNPs and InDels, NEAT can be provided a VCF file with variants, or can generate SNP and InDel regions de novo. In both cases, NEAT outputs a 'truth' VCF files that contains a total accounting of the modified genome positions, which is used for validating variant calling tools. For this project, NEAT was allowed to add variants de novo.
The main issue with NEAT is that it generates mutations genome wide when you use a target BED file. This required a two step simulation run. The first is used to generated an array of mutations, the second is re-running the simulation using just those mutations that are present in the target regions.
The second issue is that NEAT under-simulates coverage based on your request. In my experience, it was ~2x lower than expectation (e.g., if you want 60x as for 120x).
Steps:
- Run NEAT to get fastq and gold truth files.
- Run pipeline
- Validate results.
Validation results are available here
The following GitHub repository was used:
git clone https://github.com/ncsa/NEAT.git
# Specifically code from the following commit was used:
# https://github.com/ncsa/NEAT/commit/1d7b6146b95814ce18b99bcf2fac27d286df3300
conda activate cleanenv
cp /projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.fa reference_input/
samtools faidx reference_input/Mus_musculus.GRCm38.dna.toplevel.fa
samtools faidx reference_input/Mus_musculus.GRCm38.dna.toplevel.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT > reference_input/Mus_musculus.GRCm38.dna.primary_CHR_Only.fa
cd reference_input/
faidx --split-files Mus_musculus.GRCm38.dna.primary_CHR_Only.fa
rm Mus_musculus.GRCm38.dna.primary_CHR_Only.fa.fai Mus_musculus.GRCm38.dna.toplevel.fa Mus_musculus.GRCm38.dna.toplevel.fa.fai Mus_musculus.GRCm38.dna.primary_CHR_Only.fa
#!/bin/bash
for i in `ls reference_input/*.fa`
do
chrname=`basename ${i%.fa}`
sbatch --job-name mouse_$chrname --output mouse_$chrname-%j sim_reads_mouse.sh $i
done
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=72:00:00
#SBATCH --mem=50G
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH -p compute
#SBATCH -q batch
cd $SLURM_SUBMIT_DIR
module load singularity
echo $1
filename=`basename ${1%.fa}`
python /projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/NEAT/gen_reads.py \
-r $1 \
-R 150 \
-o Mus_musculus.GRCm38_simVar_10x_$filename \
-c 10 \
-E 0.001 \
--vcf \
--pe 350 30
conda activate cleanenv
gunzip *golden*
ls *.vcf | xargs -n1 bgzip
for i in `ls *vcf.gz`
do
tabix -f -p vcf $i
done
ls *.vcf.gz > vcfout.list
bcftools merge -m none --file-list vcfout.list -Oz -o Mus_musculus.GRCm38_simVar_10x_ALLchr_golden.vcf.gz
vcf-sort Mus_musculus.GRCm38_simVar_10x_ALLchr_golden.vcf.gz > Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf
cat *read1.fq.gz > Mus_musculus.GRCm38_simVar_10x_allChr_R1_001.fastq.gz
cat *read2.fq.gz > Mus_musculus.GRCm38_simVar_19x_allChr_R2_001.fastq.gz
bgzip Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf
tabix -f -p vcf Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf.gz
No changes were needed to the human fasta files extracted from the WES generation, and they were copied over.
#!/bin/bash
for i in `ls reference_input/chr*.fa`
do
chrname=`basename ${i%.fa}`
sbatch --job-name human_$chrname --output human_$chrname-%j sim_reads_human.sh $i
done
#!/bin/bash
#SBATCH --job-name=Human_NeatGenReads_WES
#SBATCH --ntasks=1
#SBATCH --time=72:00:00
#SBATCH --mem=50G
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH -p compute
#SBATCH -q batch
#SBATCH --output=slurm-%j.out
cd $SLURM_SUBMIT_DIR
module load singularity
filename=`basename ${1%.fa}`
python /projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/NEAT/gen_reads.py \
-r $1 \
-R 150 \
-o Homo_sapiens.GRCh38_simVar_10x_$filename \
-c 10 \
-E 0.001 \
--pe 350 30 \
--vcf
conda activate cleanenv
gunzip *golden*
ls *.vcf | xargs -n1 bgzip
for i in `ls *vcf.gz`
do
tabix -f -p vcf $i
done
ls *.vcf.gz > vcfout.list
bcftools merge -m none --file-list vcfout.list -Oz -o Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden.vcf.gz
vcf-sort Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden.vcf.gz > Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf
cat *read1.fq.gz > Homo_sapiens.GRCh38_simVar_10x_allChr_R1_001.fastq.gz
cat *read2.fq.gz > Homo_sapiens.GRCh38_simVar_10x_allChr_R2_001.fastq.gz
bgzip Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf
tabix -f -p vcf Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf.gz
mv *_golden* gold_truth_vcf/
mv *allChr_R* simulated_5x_AllChr
mv *_chr*_r* simulated_5x_individual_chr