Skip to content

Tutorial

Medhat edited this page May 12, 2022 · 32 revisions

What you will learn

  • Build a basic understanding of Princess workflow
  • Analyzing long-reads like PacBio HiFi or Oxford Nanopore (ONT)
  • Sequence data QC
  • Call Single Nucleotide Variations (SNPs), short Insertion and Deletions (Indels)
  • Call Structural Variants (SVs)
  • Phase SNPs and SVs
  • Get familiar with VCF files

Problems that you will be able to solve

Starting with a capture sample, we will identify SNVs, indels, SVs, and their haplotype with minimum installation and no prior expertise with bioinformatics tools optimization.

Requirements

  • Some essential experience with the Linux system and command line will be required

Installation

Princess was tested on CentOS release 6.7, Conda version 4.7.12 is installed: for more information about Installing Conda press here To download same Conda version here

Install Conda

Create working directory

mkdir princess_tutorial
cd princess_tutorial

Create environment

conda create --name princess_env python=3.7

Install Snakemake and the tools we will need

conda activate princess_env
conda install snakemake=5.7.1
conda install pyyaml
conda install bcftools # for SNVs and SVs in depth analysis
conda install bedtools # for bed files intersection

Download Princess

git clone https://github.com/MeHelmy/princess.git
chmod +x install.sh
./install.sh

Test installation

python princess -h

Downloading data

Running the analysis

python princess all -r ont -d analysis -f hs37d5_mainchr.fa -s $PWD/MdaMb231_brPanel_MinION.12.fastq.gz -e --printshellcmds --dry-run

Questions

How many folders can you see in the analysis directory, what are they called?

benchmark log PrincessLog.txt result snake_log statistics

In the ‘benchmark/align/minimap’ directory, can you identify the run time of the mapping process?

awk '{print $2}' stat.benchmark.txt

In the directory ‘benchmark/sv/minimap’ can you detect the run time of Sniffles and the CPU time?

awk '{print $2}' sv.benchmark.txt

awk '{print $(NF)}' sv.benchmark.txt

What do you learn from the statistics output?

How many reads, bases, mean read length, and max read?

cd statistics/raw_reads cat reads_stat.txt

Reads: 28396
Bases: 409038723
Mean read length: 14404.800781800253
Median: 7733.5
Max: 178579
N50: 29994

Aligning statistics

cd statitics/minimap
cat data.stat| grep ^SN | cut -f 2-

What is the number of “reads mapped”? Compare to the number of raw reads from statistics/raw_reads/reads_stat.txt.
What is the “average quality” of mapped reads? The number of reads with a mapping quality of zero, “reads MQ0”?

SV basic statistics

cd statitics/sv

What is the distribution of SV sizes and how many Deletion (DEL) with sizes between 1000-10000bp

cat data.stat

How many deletions were detected in chromosome 5?

cat data.stat_CHR

SNVs statistics

What is the number of SNPs, Indels, and how many records (variants)?

grep "^SN" snp.txt

Do we have “multiallelic SNP sites”?


 



 

So far all the analysis we got is directly from princess, now let us use other tools to get more information about our results.

Select the heterozygote SNPs, How many are they?

cd analysis/result bcftools view -v snps -i 'GT="het"' minimap.phased.SNVs.vcf.gz | wc -l # ~25079

Number of Phased SNPs

bcftools view -p minimap.phased.SNVs.vcf.gz | wc -l

Are there any SNPs in gene BRAF?

Note: gene is located on chromosome 16 starts at 67952369 and ends at 67976758

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz

How many of those could not we phase?

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools view -H -P | wc -l

For the same gene, How many phase blocks are there? And what is the phase block number?

bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools query -f '[%PS]\n' | sort | uniq -c # 15 67965575

Do we identified an SV in this location 5:57679993-57686103?

bcftools view -r 5:57679993-57686103 minimap.SVs.phased.vcf.gz | less

what is the SV type, is it phased?

Bonus: If you identified SV in the previous location, try to take a look in the IGV using the bam file minimap.hap.bam and color reads by tag HP

Clone this wiki locally