D4 support!
This release adds support for writing d4 files. See Aaron's poster here
d4 is awesome
d4 is a toolset and format written by Hao Hou from the Quinlan Lab.
mosdepth provides many options while calculating depth because it is slow to re-parse the per-base.bed.gz files. In many cases, it's faster to re-parse a cram file than to scan large regions from the per-base bed files. In addition, writing per-base.bed.gz has always been a bottleneck in mosdepth, even after it was optimized somewhat in the last release.
This release includes a static d4utils binary for Linux (below) that lets users manipulate d4 files.
d4 is much faster to write:
Here are mosdepth run times on a smallish cram test-case:
- mosdepth without per-base: 5.9s
- mosdepth with per-base bed.gz: 24.8s
- mosdepth with per-base d4: 7.7s
Note that using d4 output greatly mitigates the cost of writing the per-base output.
With d4, mosdepth can write per-base output for a 23X CRAM in 2m15s.
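To make the comparison concrete, here is a small sketch that subtracts the no-per-base baseline from the two per-base timings listed above. The numbers are the ones quoted in this release note; the variable and function names are only illustrative, and the result applies to this one smallish test case, not to mosdepth in general.

```python
# Wall-clock times in seconds, copied from the timings listed above.
times = {
    "no per-base": 5.9,
    "per-base bed.gz": 24.8,
    "per-base d4": 7.7,
}

baseline = times["no per-base"]


def overhead(label):
    """Extra seconds spent writing per-base output, relative to the baseline."""
    return round(times[label] - baseline, 1)


bed_overhead = overhead("per-base bed.gz")
d4_overhead = overhead("per-base d4")

print(f"bed.gz adds {bed_overhead}s on top of the {baseline}s baseline")
print(f"d4 adds {d4_overhead}s on top of the {baseline}s baseline")
print(f"bed.gz overhead is about {bed_overhead / d4_overhead:.1f}x the d4 overhead")
```

On this test case, the per-base output costs roughly 19 seconds as bed.gz and under 2 seconds as d4.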
d4 output is much more useful.
Once the d4 file is created, it is much faster to access. d4 includes command-line utilities to view, get stats on, and manipulate d4 files. These will eventually replace much of the functionality in mosdepth, such as quantize, histogram (dist.txt), regions.bed.gz, etc., since the operations are so fast.
why not bigwig
I made several pull requests to Devon Ryan's excellent BigWig library to improve speed and attempt to reduce memory usage: #41, #42, #43.
I also wrote a bigwig library for nim that uses libBigWig and used that to prototype bigwig output for mosdepth. However, bigwig output increased the memory usage in mosdepth so dramatically that it was not viable.
We will show in the coming manuscript (and see the poster) that d4 is much faster to create and use than bigwig and results in smaller file sizes.