gnotate
gnotate is a format and an annotation engine. The name is short for "genome annotation".
vcfanno is good at providing flexible annotation of VCFs with VCFs, BEDs, and other tabix-able files. It is very fast and parallelizes reasonably well. However, given a very dense file like gnomad whole genomes, it must parse a lot of data (the whole genomes file is > 600GB!).
Often, a user only requires 1 or 2 fields from that file. gnotate facilitates the extraction and encoding of a single field from a VCF/BCF into a compressed, reduced format. It stores each chromosome in 2(*) separate files:
- a 64 bit integer that encodes:
  - position up to 2^28 (which is more than enough for the longest human chromosome)
  - encoded REF and ALT allele
  - FILTER (a boolean indicating a non-PASS filter)
- a 32 bit float that encodes a single value from the VCF.
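The packing described in the list above can be sketched in a few lines of Python. This is only an illustration: the text does not give the actual bit layout, so the field order, the 3-bit length fields, the 2-bits-per-base allele code, and the little-endian byte order are all assumptions, and `encode_variant`, `decode_variant`, and `pack_record` are hypothetical names rather than part of slivar.

```python
import struct

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"
MAX_POS = (1 << 28) - 1  # 28 bits for the position, as stated in the text
MAX_ALLELE_LEN = 7       # assumed: 3 bits each for len(REF) and len(ALT)


def encode_variant(pos, ref, alt, non_pass):
    """Pack a variant into one unsigned 64-bit integer (assumed layout).

    Layout, high to low bits: position (28) | len(REF) (3) | len(ALT) (3) |
    bases (28, 2 bits each) | FILTER flag (1).
    Returns None when the alleles do not fit in this layout.
    """
    if not 0 <= pos <= MAX_POS:
        raise ValueError("position does not fit in 28 bits")
    if not (1 <= len(ref) <= MAX_ALLELE_LEN and 1 <= len(alt) <= MAX_ALLELE_LEN):
        return None
    if any(b not in BASE_CODE for b in ref + alt):
        return None
    bases = 0
    for b in ref + alt:
        bases = (bases << 2) | BASE_CODE[b]  # 2 bits per base
    return (pos << 35) | (len(ref) << 32) | (len(alt) << 29) | (bases << 1) | int(non_pass)


def decode_variant(key):
    """Invert encode_variant: return (pos, ref, alt, non_pass)."""
    non_pass = bool(key & 1)
    bases = (key >> 1) & ((1 << 28) - 1)
    alt_len = (key >> 29) & 0b111
    ref_len = (key >> 32) & 0b111
    pos = key >> 35
    n = ref_len + alt_len
    seq = "".join(CODE_BASE[(bases >> (2 * (n - 1 - i))) & 0b11] for i in range(n))
    return pos, seq[:ref_len], seq[ref_len:], non_pass


def pack_record(key, value):
    """Serialize the key as a uint64 and the value as a float32 (byte order assumed)."""
    return struct.pack("<Q", key), struct.pack("<f", value)
```

Because the position occupies the highest bits, sorting the integers sorts the variants by position, which is the property the lookup relies on. Under this sketch each variant costs 12 bytes: 8 for the key and 4 for the value.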
With this encoding, we can store the popmax_AF from the union of gnomad genomes and exomes in 1.5GB, a > 400X reduction.
In addition, this format allows for extremely rapid annotation. The encoded positions are sortable, so, given a query position, gnotate does a binary search and finds variants that match on position, REF, and ALT.
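A minimal sketch of that lookup, using Python's standard `bisect` module over a sorted list of integer keys. The key layout in `make_key` is an assumption made for this example (position in the high bits, FILTER flag in the lowest bit), and `make_key` and `lookup` are hypothetical names, not slivar functions.

```python
from bisect import bisect_left

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def make_key(pos, ref, alt, non_pass=False):
    """Build a sortable integer key (assumed layout, short ACGT alleles only)."""
    bases = 0
    for b in ref + alt:
        bases = (bases << 2) | CODES[b]
    return (pos << 35) | (len(ref) << 32) | (len(alt) << 29) | (bases << 1) | int(non_pass)


def lookup(keys, values, pos, ref, alt):
    """Binary-search sorted `keys`; return (value, non_pass) or None."""
    query = make_key(pos, ref, alt)  # FILTER bit 0, so it sorts at or just before a match
    i = bisect_left(keys, query)
    # Compare with the FILTER bit masked so PASS and non-PASS records both match.
    if i < len(keys) and (keys[i] | 1) == (query | 1):
        return values[i], bool(keys[i] & 1)
    return None


# Build a tiny sorted index with a parallel list of values.
records = sorted([
    (make_key(100, "A", "T"), 0.01),
    (make_key(100, "A", "G", non_pass=True), 0.5),
    (make_key(250, "AC", "A"), 0.002),
])
keys = [k for k, _ in records]
values = [v for _, v in records]
```

Each query costs O(log n) comparisons on plain integers, with no text parsing at query time.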
In a small percentage of cases, the REF+ALT length is too long to store (along with the position) in a 64 bit integer. For those, the variants are stored in a text file with a pointer in the encoded list that indicates there is a large variant at that position. gnotate handles all this internally.
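One way to picture the fallback for long variants is sketched below. Everything specific here is assumed for illustration: the marker is taken to be a key with the position set and all allele bits zero, and the text file is taken to be tab-separated `pos`, `ref`, `alt`, `value` lines. The real on-disk format is not described in this text, and `long_marker`, `load_long_variants`, and `lookup_long` are hypothetical names.

```python
import io
from bisect import bisect_left


def long_marker(pos):
    """Assumed marker key: position in the high bits, all allele bits zero."""
    return pos << 35


def load_long_variants(handle):
    """Read hypothetical 'pos<TAB>ref<TAB>alt<TAB>value' lines into a dict."""
    table = {}
    for line in handle:
        pos, ref, alt, value = line.rstrip("\n").split("\t")
        table[(int(pos), ref, alt)] = float(value)
    return table


def lookup_long(keys, long_table, pos, ref, alt):
    """Consult the side table only if the sorted keys hold a marker at `pos`."""
    marker = long_marker(pos)
    i = bisect_left(keys, marker)
    if i < len(keys) and keys[i] == marker:
        return long_table.get((pos, ref, alt))
    return None


# Example: one long deletion at position 500, flagged by a marker key.
long_table = load_long_variants(io.StringIO("500\tACGTACGTACGT\tA\t0.125\n"))
keys = sorted([long_marker(500), (100 << 35) | 12345])
```

The marker keeps the common path a pure integer search, and the slower text lookup runs only at positions known to carry a long variant.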
A gnotate file is simply a zip file with this information encoded.
The following command will make a gnotate zip file of the controls_nhomalt field (number of homozygous alternates in gnomad controls) using the exomes and genomes files.
slivar make-gnotate --prefix gnomad-num-hom-alt --field controls_nhomalt \
gnomad.exomes.r2.1.sites.vcf.bgz \
gnomad.genomes.r2.1.sites.chr*.vcf.bgz
The output will be gnomad-num-hom-alt.zip. To make a similar zip for the maximum allele frequency among the populations in gnomad (popmax_AF), we do:
slivar make-gnotate --prefix gnomad-popmax-AF --field popmax_AF \
gnomad.exomes.r2.1.sites.vcf.bgz \
gnomad.genomes.r2.1.sites.chr*.vcf.bgz
This will create gnomad-popmax-AF.zip.