-
Notifications
You must be signed in to change notification settings - Fork 8
Tutorial
- Build a basic understanding of Princess workflow
- Analyzing long-reads like PacBio HiFi or Oxford Nanopore (ONT)
- Sequence data QC
- Call Single Nucleotide Variations (SNPs), short Insertion and Deletions (Indels)
- Call Structural Variants (SVs)
- Phase SNPs and SVs
- Get familiar with VCF files
Starting with a capture sample, we will identify SNVs, indels, SVs, and their haplotype with minimum installation and no prior expertise with bioinformatics tools optimization.
- Some essential experience with the Linux system and command line will be required
mkdir princess_tutorial
cd princess_tutorial
conda create --name princess_env python=3.7
conda activate princess_env
.
conda install snakemake=5.7.1
.
conda install pyyaml
conda install bcftools
# for SNVs and SVs in depth analysis
conda install bedtools
# for bed files intersection
conda install mosdepth
# Coverage analysis
git clone https://github.com/MeHelmy/princess.git
chmod +x install.sh
./install.sh
python princess -h
python princess all -r ont -d analysis -f hs37d5_mainchr.fa -s $PWD/MdaMb231_brPanel_MinION.12.fastq.gz -e --printshellcmds --dry-run
benchmark log PrincessLog.txt result snake_log statitics
awk '{print $2}' stat.benchmark.txt
awk '{print $2}' sv.benchmark.txt
awk '{print $(NF)}' sv.benchmark.txt
cd statistics/raw_reads
cat reads_stat.txt
Reads: 28396
Bases: 409038723
Mean read length: 14404.800781800253
Median: 7733.5
Max: 178579
N50: 29994
cd statitics/minimap
cat data.stat| grep ^SN | cut -f 2-
What is the number of “reads mapped”? compare to the number of raw reads from statitics/raw_reads/reads_stat.txt.
What is the “average quality” of mapped reads? The number of reads with a mapping quality of zero, “reads MQ0”?
cd statitics/sv
cat data.stat
cat data.stat_CHR
grep "^SN" snp.txt
So far all the analysis we got is directly from princess, now let us use other tools to get more informatin about our results.
cd analysis/result
bcftools view -v snps -i 'GT="het"' minimap.phased.SNVs.vcf.gz | wc -l
# ~25079
bcftools view -p minimap.phased.SNVs.vcf.gz | wc -l
Note: gene is located on chromosome 16 starts at 67952369 and ends at 67976758
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools view -H -P | wc -l
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools query -f '[%PS]\n' | sort | uniq -c
# 15 67965575
- How many Deletions do we detect in the SV file
- Are there other SVs? What is their count?
From the result file, how many SVs did we detect?
Answer:
Grep -v “^#” sniffles.vcf
What is the number of each identified SV?
Bcftools query -f ‘%SVTYPE\n’ vcf | sort | uniq -c
- Is gene XX affected by any SV? If so, what kind of SV it is?
- Target vs. target ration
Starting from the bam xx file from the result, using the bed file xx to identify number of reads aligned in targeted regions:
Answer:
Mosedpth -by -f …..