
gnotate

Brent Pedersen edited this page Feb 25, 2019 · 9 revisions

gnotate is a format and an annotation engine. It is short for "genome annotation".

motivation

vcfanno is good at providing flexible annotation of VCFs with VCFs, BEDs, and other tabix-able files. It is very fast and parallelizes reasonably well. However, given a very dense file like gnomad whole genomes, it must parse a lot of data (the whole genomes file is > 600GB!).

Often, a user only requires 1 or 2 fields from that file. gnotate facilitates the extraction and encoding of a single field from a VCF/BCF into a compressed, reduced format. It stores each chromosome in 2(*) separate files:

  • a 64-bit integer per variant that encodes:
    • position up to 2^28 (which is more than enough for the longest human chromosome)
    • the encoded REF and ALT alleles
    • FILTER (a boolean indicating a non-PASS filter)
  • a 32-bit float per variant that encodes a single value from the VCF.
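
To make the idea of packing a variant into a single integer concrete, here is a minimal sketch. The bit layout is an assumption chosen for illustration (28 bits of position, 3 bits each for REF and ALT length, 2 bits per base for up to 14 bases, 1 FILTER bit); it is not gnotate's actual on-disk layout, and the function names are hypothetical.

```python
import struct

# Hypothetical layout, high bits to low bits (NOT gnotate's real format):
#   position (28 bits) | REF length (3) | ALT length (3) | bases (28) | FILTER (1)
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}
MAX_ALLELE_LEN = 7  # largest length that fits in a 3-bit field


def encode_variant(pos, ref, alt, non_pass):
    """Pack position, REF, ALT and a FILTER flag into one integer below 2**64."""
    if not 0 <= pos < (1 << 28):
        raise ValueError("position does not fit in 28 bits")
    if len(ref) > MAX_ALLELE_LEN or len(alt) > MAX_ALLELE_LEN:
        raise ValueError("REF/ALT too long for this toy layout")
    bases = 0
    for b in ref + alt:
        if b not in CODE:
            raise ValueError("only A, C, G, T are supported in this sketch")
        bases = (bases << 2) | CODE[b]
    value = pos
    value = (value << 3) | len(ref)
    value = (value << 3) | len(alt)
    value = (value << 28) | bases
    value = (value << 1) | int(non_pass)
    return value


def decode_variant(value):
    """Inverse of encode_variant: returns (pos, ref, alt, non_pass)."""
    non_pass = bool(value & 1)
    value >>= 1
    bases = value & ((1 << 28) - 1)
    value >>= 28
    alt_len = value & 7
    value >>= 3
    ref_len = value & 7
    pos = value >> 3
    n = ref_len + alt_len
    seq = "".join(BASES[(bases >> (2 * (n - 1 - i))) & 3] for i in range(n))
    return pos, seq[:ref_len], seq[ref_len:], non_pass


def encode_value(x):
    """Pack the annotation value as a 4-byte little-endian float32."""
    return struct.pack("<f", x)
```

Because the position sits in the highest bits, sorting the integers also sorts the variants by position, which is what makes a binary search over them possible. Alleles that exceed the length limit raise an error in this sketch; gnotate instead falls back to a separate text file for them.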

With this encoding, we can store the popmax_AF from the union of gnomad genomes and exomes in 1.5GB, a > 400X reduction.

In addition, this format allows for extremely rapid annotation. The encoded positions are sortable, so, given a query position, gnotate does a binary search and finds variants that match on position, REF, and ALT.
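
The lookup can be sketched as a binary search over a sorted array of keys with a parallel array of values. The key function below is a toy assumption that handles single-base variants only and omits the FILTER bit; it is not gnotate's real encoding, and all names are hypothetical.

```python
from array import array
from bisect import bisect_left

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def toy_key(pos, ref, alt):
    """Toy sortable key for single-base variants: position in the high bits."""
    return (pos << 4) | (BASE_CODE[ref] << 2) | BASE_CODE[alt]


def build_index(records):
    """Build parallel sorted arrays from (pos, ref, alt, value) records."""
    pairs = sorted((toy_key(p, r, a), v) for p, r, a, v in records)
    keys = array("Q", (k for k, _ in pairs))    # unsigned 64-bit keys
    values = array("f", (v for _, v in pairs))  # 32-bit float values
    return keys, values


def lookup(keys, values, pos, ref, alt):
    """Binary-search for an exact position/REF/ALT match; None if absent."""
    key = toy_key(pos, ref, alt)
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

Each query costs O(log n) comparisons against the key array, and the matching value is read from the same index in the value array.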

In a small percentage of cases, the REF+ALT length is too long to store (along with the position) in a 64 bit integer. For those, the variants are stored in a text file with a pointer in the encoded list that indicates there is a large variant at that position. gnotate handles all this internally.

A gnotate file is simply a zip file with this information encoded.
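
Since the file is an ordinary zip archive, its contents can be listed with standard tools. The sketch below uses Python's `zipfile` module. The member names inside a gnotate zip are not described here, so the helper (a hypothetical name) only reports whatever it finds.

```python
import zipfile


def list_members(path):
    """Return (name, uncompressed size in bytes) for each member of a zip."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

For example, `list_members("gnomad-popmax-AF.zip")` would show what was written for each chromosome.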

Usage

Creating gnotate files.

The following command will make a gnotate zip file of the controls_nhomalt field (number of homozygous alternates in gnomad controls) using the exomes and genomes files.

slivar make-gnotate --prefix gnomad-num-hom-alt --field controls_nhomalt \
    gnomad.exomes.r2.1.sites.vcf.bgz \
    gnomad.genomes.r2.1.sites.chr*.vcf.bgz

The output will be gnomad-num-hom-alt.zip. To make a similar zip for the maximum allele frequency among the populations in gnomad (popmax_AF), we do:

slivar make-gnotate --prefix gnomad-popmax-AF --field popmax_AF \
    gnomad.exomes.r2.1.sites.vcf.bgz \
    gnomad.genomes.r2.1.sites.chr*.vcf.bgz

This will create gnomad-popmax-AF.zip.

Annotating with gnotate files.
