Octopus 0.7.0
This is a major release since v0.6.3-beta and is the first non-beta release. Highlights include:
- The pair HMM used for the core haplotype likelihood model has been completely re-written to support AVX2 and AVX-512 instruction sets. This can improve performance on machines supporting these instructions. The HMM also now supports variable band-widths and 32-bit integer scores, which are needed to evaluate long reads.
- Evidence BAMs are now annotated with supporting haplotype(s) and other information. Automatic 'splitting' by haplotype is gone but there is a [script] provided to do this.
- Octopus is now paired and linked read aware! Reads are assumed paired by default, but can be assumed unpaired or linked with the `--read-linkage` option. This improves accuracy and phasing for most analyses.
- Random forests now store the annotations used for training as meta information in the forest file, allowing different annotations to be used for different forests. Note that this change makes previous forest versions incompatible with this version. It also means that a modified ranger must be used for training (the main ranger package does not store variable names in the meta info).
- Allele-level annotations (e.g. `AD`) are now supported; they can be requested with the `--annotations` option.
- The phasing algorithm has been completely re-written to improve accuracy and to allow discontiguous phase sets, which can frequently occur in some analyses (e.g. linked reads, or somatic phasing).
- Calling from PacBio CCS reads is now supported - although improvements are still needed, especially regarding runtime. See the PacBio CCS config.
- The haplotype generator now supports 'backtracking' - where a block of partially resolved haplotypes is buffered, and then restored when downstream haplotypes have also been partially resolved. This can lead to long haplotypes much faster than keeping all haplotypes in the tree simultaneously. Backtracking is turned off by default, but can be enabled using the `--backtrack-level` option.
- Mixing of distinct sample ploidies is now supported by the population calling model.
- Overflows on `QUAL` and `GQ` have been reduced, allowing for much greater ranges on these statistics.
- The use of the `*` `ALT` allele has been brought in line with the updated VCF v4.3 specification. The `--legacy` option has therefore been removed.
- New `RFGQ_ALL` `INFO` measure for random forest filtered runs - the empirical probability (Phred) of all genotypes being correct (derived from each `FORMAT` `RFGQ`). Use this for filtering tumour-normal calls etc.
- Handling of ALT supplementary alignments (for GRCh38 etc) has been improved, resulting in better accuracy.
- Polyploid calling is now much faster, especially when the `--max-genotypes` option is used (recommended for anything over triploid).
- The local re-assembler now automatically considers the average region depth when evaluating bubbles, resulting in fewer spurious candidate variants.
- The local re-assembler no longer allows cyclic graphs by default, resulting in far fewer spurious candidates with very little loss in sensitivity. Cyclic graphs can be re-enabled with the `--allow-cycles` option.
- Haplotypes (i.e. phased `GT` entries) are now reported in a consistent manner - always lexicographical (w.r.t. the implied haplotype). This breaks the previous rule that somatic haplotypes always appeared after germline ones - somatic haplotypes are now identified with the `HSS` `FORMAT` annotation.
- The way genotypes are represented has been completely re-written, resulting in some nice runtime performance improvements for all calling models.
- The way filtering measures are calculated has been re-written, resulting in a nice runtime performance improvement for filtering.
- The way Octopus identifies 'uncallable' regions that tend to slow down analysis has been much improved, resulting in much better runtimes.
- Automatic dependency installation in the installation script has been much improved, and is now the recommended way to install Octopus on all operating systems.
- Many bug fixes.
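
The `RFGQ_ALL` measure above is described as a Phred-scaled probability that all genotypes are correct, derived from each `FORMAT` `RFGQ`. As a rough illustration of how such a combination could work, the sketch below multiplies the per-sample probabilities of being correct and converts the result back to Phred scale. It assumes the per-sample values are independent, which these notes do not state, and the function names are hypothetical, so treat it as one plausible reading and not as Octopus's actual calculation.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled quality to the probability of an error."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p_error: float, cap: float = 10000.0) -> float:
    """Convert an error probability to Phred scale.

    Returns `cap` when the error probability is zero (or rounds to zero).
    """
    if p_error <= 0.0:
        return cap
    return -10.0 * math.log10(p_error)


def combined_rfgq(rfgqs: list[float]) -> float:
    """Phred-scaled probability that at least one genotype is wrong.

    ASSUMPTION for this sketch: per-sample RFGQ values are independent,
    so the probability that all are correct is the product of each
    sample's probability of being correct.
    """
    p_all_correct = 1.0
    for q in rfgqs:
        p_all_correct *= 1.0 - phred_to_error(q)
    return error_to_phred(1.0 - p_all_correct)
```

Under this assumption the combined value can never exceed the lowest per-sample `RFGQ`, which matches the intuition for tumour-normal filtering: one poorly supported genotype pulls the whole call down. Very high inputs lose precision in floating point and fall back to the cap.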