Skip to content

WGS Benchmarking

Mike Lloyd edited this page Mar 17, 2022 · 3 revisions

WGS Benchmarking Data

The run scripts for WGS were moved from the root WGS directory to inside Mouse and Human respectfully. The path structure was changed in the NEAT call to reflect this.

The target files were removed, and the target coverage was reduced to 10x for the simulation.

The NEAT package. was used. This package can generate fastq, 'gold standard' (truth) VCF, and 'gold standard' (truth) BAM files. The package inserts SNP and InDels into a provided fasta reference. In simulating insertions of SNPs and InDels, NEAT can be provided a VCF file with variants, or can generate SNP and InDel regions de novo. In both cases, NEAT outputs a 'truth' VCF files that contains a total accounting of the modified genome positions, which is used for validating variant calling tools. For this project, NEAT was allowed to add variants de novo.

The main issue with NEAT is that it generates mutations genome wide when you use a target BED file. This required a two step simulation run. The first is used to generated an array of mutations, the second is re-running the simulation using just those mutations that are present in the target regions.

The second issue is that NEAT under-simulates coverage based on your request. In my experience, it was ~2x lower than expectation (e.g., if you want 60x as for 120x).

Steps:

  1. Run NEAT to get fastq and gold truth files.
  2. Run pipeline
  3. Validate results.

Validation Results

Validation results are available here

Generation of Data

The following GitHub repository was used:

git clone https://github.com/ncsa/NEAT.git

# Specifically code from the following commit was used: 
# https://github.com/ncsa/NEAT/commit/1d7b6146b95814ce18b99bcf2fac27d286df3300

Mouse

Generating FASTA inputs

conda activate cleanenv

cp /projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.fa reference_input/

samtools faidx reference_input/Mus_musculus.GRCm38.dna.toplevel.fa

samtools faidx reference_input/Mus_musculus.GRCm38.dna.toplevel.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT > reference_input/Mus_musculus.GRCm38.dna.primary_CHR_Only.fa

cd reference_input/

faidx --split-files Mus_musculus.GRCm38.dna.primary_CHR_Only.fa

rm Mus_musculus.GRCm38.dna.primary_CHR_Only.fa.fai Mus_musculus.GRCm38.dna.toplevel.fa Mus_musculus.GRCm38.dna.toplevel.fa.fai Mus_musculus.GRCm38.dna.primary_CHR_Only.fa

Submission Script

#!/bin/bash

for i in `ls reference_input/*.fa`
do
	chrname=`basename ${i%.fa}`
	sbatch --job-name mouse_$chrname --output mouse_$chrname-%j sim_reads_mouse.sh $i
done

sim_reads_mouse.sh

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --time=72:00:00
#SBATCH --mem=50G
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH -p compute
#SBATCH -q batch

cd $SLURM_SUBMIT_DIR

module load singularity

echo $1

filename=`basename ${1%.fa}`

python /projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/NEAT/gen_reads.py \
        -r $1 \
        -R 150 \
        -o Mus_musculus.GRCm38_simVar_10x_$filename \
	-c 10 \
        -E 0.001 \
        --vcf \
        --pe 350 30

Post File Build

conda activate cleanenv

gunzip *golden*
ls *.vcf | xargs -n1 bgzip

for i in `ls *vcf.gz`
do
tabix -f -p vcf $i
done

ls *.vcf.gz > vcfout.list

bcftools merge -m none --file-list vcfout.list -Oz -o Mus_musculus.GRCm38_simVar_10x_ALLchr_golden.vcf.gz

vcf-sort Mus_musculus.GRCm38_simVar_10x_ALLchr_golden.vcf.gz > Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf

cat *read1.fq.gz >  Mus_musculus.GRCm38_simVar_10x_allChr_R1_001.fastq.gz

cat *read2.fq.gz >  Mus_musculus.GRCm38_simVar_19x_allChr_R2_001.fastq.gz

bgzip Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf

tabix -f -p vcf Mus_musculus.GRCm38_simVar_10x_ALLchr_golden_sorted.vcf.gz

Human

No changes were needed to the human fasta files extracted from the WES generation, and they were copied over.

Submission Script

#!/bin/bash

for i in `ls reference_input/chr*.fa`
do
	chrname=`basename ${i%.fa}`
	sbatch --job-name human_$chrname --output human_$chrname-%j sim_reads_human.sh $i
done

sim_reads_human.sh

#!/bin/bash

#SBATCH --job-name=Human_NeatGenReads_WES
#SBATCH --ntasks=1
#SBATCH --time=72:00:00
#SBATCH --mem=50G
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH -p compute
#SBATCH -q batch
#SBATCH --output=slurm-%j.out

cd $SLURM_SUBMIT_DIR

module load singularity

filename=`basename ${1%.fa}`

python /projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/NEAT/gen_reads.py \
        -r $1 \
        -R 150 \
        -o Homo_sapiens.GRCh38_simVar_10x_$filename \
	-c 10 \
        -E 0.001 \
        --pe 350 30 \
	--vcf

Post File Build

conda activate cleanenv

gunzip *golden*
ls *.vcf | xargs -n1 bgzip

for i in `ls *vcf.gz`
do
tabix -f -p vcf $i
done

ls *.vcf.gz > vcfout.list

bcftools merge -m none --file-list vcfout.list -Oz -o Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden.vcf.gz

vcf-sort Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden.vcf.gz > Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf

cat *read1.fq.gz >  Homo_sapiens.GRCh38_simVar_10x_allChr_R1_001.fastq.gz

cat *read2.fq.gz >  Homo_sapiens.GRCh38_simVar_10x_allChr_R2_001.fastq.gz

bgzip Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf

tabix -f -p vcf Homo_sapiens.GRCh38_simVar_10x_ALLchr_golden_sorted.vcf.gz

mv *_golden* gold_truth_vcf/

mv *allChr_R* simulated_5x_AllChr

mv *_chr*_r* simulated_5x_individual_chr

Clone this wiki locally