Releases: brentp/mosdepth
bugfix
D4 support!
This release adds support for writing d4 files. See Aaron's poster here
d4 is awesome
`d4` is a toolset and format written by Hao Hou from the Quinlan Lab.
`mosdepth` provides many options while calculating depth because it is slow to re-parse the per-base.bed.gz files. In many cases, it's faster to re-parse a cram file than to scan large regions from the per-base bed files. In addition, writing per-base.bed.gz has always been a bottleneck in mosdepth, even after it was optimized somewhat in the last release.
This release has a static d4utils binary for linux below that will allow users to manipulate d4 files.
d4 is much faster to write:
Here are mosdepth run times on a smallish cram test-case:
- mosdepth without per-base: 5.9s
- mosdepth with per-base bed.gz: 24.8s
- mosdepth with per-base d4: 7.7s
Note that using `d4` output greatly mitigates the cost of writing the per-base output.
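To make the comparison concrete, the sketch below works out how much wall-clock time each per-base format adds on top of the run without per-base output. It uses only the three timings listed above; the variable names are illustrative.

```python
# Wall-clock times reported above for the small CRAM test case (seconds).
baseline = 5.9       # mosdepth without per-base output
with_bed_gz = 24.8   # mosdepth with per-base bed.gz
with_d4 = 7.7        # mosdepth with per-base d4

# Extra time attributable to writing the per-base output in each format.
bed_overhead = round(with_bed_gz - baseline, 1)
d4_overhead = round(with_d4 - baseline, 1)

# How many times smaller the d4 overhead is than the bed.gz overhead.
overhead_ratio = round(bed_overhead / d4_overhead, 1)

print(f"bed.gz adds {bed_overhead}s, d4 adds {d4_overhead}s")
print(f"d4 per-base overhead is about {overhead_ratio}x smaller")
```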
With d4, mosdepth can write per-base output for a 23X CRAM in 2m15s.
d4 output is much more useful.
Once the d4 file is created, it is much faster to access. d4 includes command line utilities to view, get stats, and manipulate d4 files. These will eventually replace much of the functionality in mosdepth, such as `quantize`, histogram (`dist.txt`), and `regions.bed.gz`, since the operations are so fast.
why not bigwig
I made several pull requests to Devon Ryan's excellent BigWig library to improve speed and attempt to reduce memory usage: #41, #42, #43.
I also wrote a bigwig library for nim that uses libBigWig and used that to prototype bigwig output for `mosdepth`. However, bigwig output dramatically increased the memory usage in `mosdepth` such that it was not viable.
We will show in the coming manuscript (and see the poster) that `d4` is much faster to create and use than `bigwig` and results in smaller file sizes.
speed and region.dist.txt coverage
0.2.9
- modifies region.dist.txt to contain the aggregate coverage of each window when -b (integer) is specified (otherwise region.dist.txt and global.dist.txt are identical with -b (integer))
- improve speed by ~30% when using per-base output with better int2str method (see below for more details)
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`mosdepth_v028 -x $exome` | 231.300 ± 8.175 | 222.166 | 242.883 | 1.73 ± 0.07 |
`mosdepth_v029 -x $exome` | 184.653 ± 7.520 | 176.238 | 192.636 | 1.38 ± 0.07 |
`mosdepth_v028 -x -t 4 $exome` | 170.924 ± 3.811 | 166.359 | 175.284 | 1.28 ± 0.04 |
`mosdepth_v029 -x -t 4 $exome` | 133.504 ± 3.151 | 129.220 | 138.062 | 1.00 |
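The version-to-version gain can be read off the table directly. The sketch below takes the mean times from the table and computes, for each thread setting, the speedup of v0.2.9 over v0.2.8 and the fraction of run time saved. The dictionary keys and function names are illustrative, not part of mosdepth.

```python
# Mean run times in seconds, copied from the benchmark table above.
mean_seconds = {
    ("v028", "default"): 231.300,
    ("v029", "default"): 184.653,
    ("v028", "-t 4"): 170.924,
    ("v029", "-t 4"): 133.504,
}

def speedup(threads):
    """Ratio of old mean time to new mean time (values above 1 mean faster)."""
    return mean_seconds[("v028", threads)] / mean_seconds[("v029", threads)]

def time_saved_pct(threads):
    """Percentage of the old run time that the new version no longer needs."""
    old = mean_seconds[("v028", threads)]
    new = mean_seconds[("v029", threads)]
    return 100.0 * (old - new) / old

for threads in ("default", "-t 4"):
    print(f"{threads}: {speedup(threads):.2f}x faster, "
          f"{time_saved_pct(threads):.1f}% less time")
```

On these numbers the speedup is roughly 1.25x with default threads and 1.28x with `-t 4`, in the same range as the approximate figure quoted in the release notes.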
fix indexing
0.2.8
- fix off-by-one error in CSI index (but not data) of output bed files (#98)
htslib 1.10
this release updates mosdepth to work with htslib 1.10 and the static binary is built with htslib 1.10.
this fixes several bugs opened for mosdepth.
0.2.7
- small optimizations
- exit with 1 on bad help #80
- fix check on remote bam (brentp/hts-nim#48)
- fix erroneous assert #99
- update static binary to htslib 1.10 (this fixes other bugs reported and closed in mosdepth)
median and summary file
- this release adds a new `*.mosdepth.summary.txt` output file added by @danielecook. It reports some statistics for each chromosome.
- it also adds a `--median` flag to be applied to the regions given in `--by`. The default is to use mean. This mode is recommended for a more stable estimate of depth.
- fix for #54 for quantize.
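As a rough illustration of why the median gives a more stable estimate, the sketch below compares the two statistics on a made-up region with a single pileup spike. It assumes the median is taken over the per-base depths within each region; the depth values are hypothetical.

```python
from statistics import mean, median

# Hypothetical per-base depths for one --by region. The single value of 500
# mimics a pileup artifact (e.g. a repeat) inside an otherwise ~30X region.
depths = [30, 31, 29, 30, 32, 30, 500, 31, 30, 29]

region_mean = mean(depths)      # pulled far upward by the one outlier
region_median = median(depths)  # stays at the typical depth

print(f"mean={region_mean}, median={region_median}")
```

One outlier base moves the mean well away from the typical depth, while the median is unaffected.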
To get started, use:
wget https://github.com/brentp/mosdepth/releases/download/v0.2.6/mosdepth && chmod +x ./mosdepth && ./mosdepth -h
That is the (recommended) static binary. To use one that depends on your local htslib (libhts.so), download this binary
static build. pair overlap edge-cases.
0.2.5
- remove dependency on PCRE (this makes it easier to run on many older systems)
- don't double count fully overlapping reads (thanks to @jaudoux for the fix in #73)
- static binary : the binary is completely static but will not allow access over S3/Http
wget https://github.com/brentp/mosdepth/releases/download/v0.2.5/mosdepth && chmod +x ./mosdepth && ./mosdepth -h
should work on all linux 64 bit systems.
fast mode
this release adds a `--fast-mode` flag that makes mosdepth almost twice as fast. It does not look at mate overlap and it doesn't look at insertion or deletion events in the cigar; it will still show large deletions with coverage changes and it still skips soft clipped portions of reads. This behavior is likely desirable in many cases and yields up to a 2X speedup.
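The difference between the two modes can be sketched for a single read. The function below is a simplified Python illustration, not mosdepth's Nim implementation: it assumes the default mode counts only aligned (M/=/X) bases, while fast mode counts the whole reference span from alignment start to end. Soft clips are skipped in both cases because they lie outside the alignment span. Mate-overlap correction is not modelled.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance the reference position
COVERING = set("M=X")          # operations counted as coverage in careful mode

def covered_intervals(start, cigar, fast_mode=False):
    """Return 0-based half-open reference intervals counted as covered."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if fast_mode:
        # Only the alignment start and end are used; internal I/D are ignored.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        return [(start, start + ref_len)]
    intervals = []
    pos = start
    for n, op in ops:
        if op in COVERING:
            if intervals and intervals[-1][1] == pos:
                # Contiguous with the previous block (e.g. across an insertion).
                intervals[-1] = (intervals[-1][0], pos + n)
            else:
                intervals.append((pos, pos + n))
        if op in REF_CONSUMING:
            pos += n
    return intervals

careful = covered_intervals(100, "5S20M2I10M3D15M")
fast = covered_intervals(100, "5S20M2I10M3D15M", fast_mode=True)
print(careful, fast)
```

In the example, the careful result has a 3-base gap at the deletion, while the fast result is one interval over the full span.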
0.2.4
- Add optional `--include-flag` to allow counting only reads that have some bits in the specified flag set. This will only be used rarely, e.g. to count only supplementary reads, use `-F 0 --include-flag 2048`.
- Fix case when only a single argument was given to --quantize
- add --read-groups option to allow specifying that only certain read-groups should be used in the depth calculation. (#60)
- add --fast-mode that does not look at internal cigar operations like (I)nsertions or (D)eletions, but does consider soft and
hard-clips at the end of the alignment. Also does not correct for mate overlap. This makes mosdepth as much as 2X faster for
CRAM and is likely the desired mode for people using the depth for CNV or general coverage values as drops in coverage
due to CIGAR operations are often not of interest for coverage-based analyses.
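The flag filtering described in the first item can be sketched as two bitwise tests on the SAM FLAG field. This is an illustration under the assumption that a read is dropped if it has any excluded bit set and, when an include flag is given, kept only if it has at least one of those bits set. The function name is hypothetical, and 1796 is used here as mosdepth's default exclude flag.

```python
SUPPLEMENTARY = 0x800  # SAM flag bit 2048

def keep_read(flag, exclude_flag=1796, include_flag=0):
    """Decide whether a read with this SAM flag contributes to depth."""
    if flag & exclude_flag:
        return False          # any excluded bit set: skip the read
    if include_flag and not (flag & include_flag):
        return False          # include flag given but none of its bits set
    return True

# Equivalent of `-F 0 --include-flag 2048`: only supplementary reads count.
print(keep_read(2048 + 16, exclude_flag=0, include_flag=SUPPLEMENTARY))
print(keep_read(99, exclude_flag=0, include_flag=SUPPLEMENTARY))
```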
large chroms and region.dist bug.
0.2.3
- fix bug in region.dist with chromosomes in bam header, but without any reads. (thanks @vladsaveliev for reporting)
- support for chromosomes larger than 2^29. (thanks @kaspernie for reporting #41)
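For context on the 2^29 figure: the classic BAI/tabix binning scheme uses a 14-bit minimum shift and 5 levels, so it can address at most 2^29 bases, whereas a CSI index can add levels. The sketch below computes that limit and the number of levels a CSI index would need for a given chromosome length. Whether this is the exact cause of the issue fixed here is an assumption; the function names are illustrative.

```python
BAI_MIN_SHIFT = 14  # smallest bin covers 2^14 bases
BAI_DEPTH = 5       # number of binning levels below the root

def max_indexable_length(min_shift, depth):
    """Largest coordinate span a binning index with these parameters can address."""
    return 1 << (min_shift + 3 * depth)

def csi_depth_for(length, min_shift=14):
    """Smallest number of levels whose span is at least `length` bases."""
    depth = 0
    while max_indexable_length(min_shift, depth) < length:
        depth += 1
    return depth

print(max_indexable_length(BAI_MIN_SHIFT, BAI_DEPTH))  # the BAI-style limit
print(csi_depth_for(2**29 + 1))                        # one extra level needed
```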
dist changes!
This contains a bugfix for a very rare (but major) bug that occurs when successive chromosomes have the same length. The data from the first chrom was not cleared and then polluted the counts for the subsequent chrom. Thanks to Kate B. for reporting and providing a simple test-case.
It also changes the dist output file name(s). Before, only a single dist file was created. Now, there will always be a `$prefix.mosdepth.global.dist.txt` and, if `--by` is specified, there will also be a `$prefix.mosdepth.region.dist.txt`. Thanks to Alistair W for suggesting.
See below for more details.
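For readers unfamiliar with the dist files, the sketch below shows the kind of cumulative distribution they hold: for each depth, the proportion of bases covered at that depth or higher. It is a simplified illustration on a made-up list of per-base depths; the function name and data are hypothetical, and the real files also carry a chromosome column.

```python
from collections import Counter

def cumulative_dist(depths):
    """Return (depth, proportion of bases with coverage >= depth), highest first."""
    counts = Counter(depths)
    total = len(depths)
    rows = []
    at_least = 0
    for d in range(max(depths), -1, -1):
        at_least += counts[d]          # bases at exactly d join the running total
        rows.append((d, at_least / total))
    return rows

# Hypothetical per-base depths for a tiny 8-base region.
rows = cumulative_dist([0, 1, 1, 2, 2, 2, 3, 3])
for depth, proportion in rows:
    print(depth, proportion)
```

Computing this over all bases corresponds to the global file, and restricting it to bases inside the `--by` regions corresponds to the region file.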
0.2.2
- fix overflow with huge intervals to --by
- NOTE change to output file name of `*.dist.txt`. A file named `$prefix.mosdepth.global.dist.txt` will always be created and `$prefix.mosdepth.region.dist.txt` will be created if `--by` is specified. Previously, there was only a single file named `$prefix.mosdepth.dist.txt`, which no longer exists. This allows users to, for example, use --by to see coverage of gene regions for WGS, and to see the global WGS coverage and the coverage in their genes of interest.
- fix bug that would manifest with consecutive chromosomes of the same length. chromosomes other than the first of a given length would have incorrect values.
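The class of bug described in the last item can be reproduced in a few lines. The sketch below is a Python illustration, not mosdepth's Nim code: it assumes a depth buffer that is reallocated only when the chromosome length changes, so two consecutive chromosomes of equal length share the buffer and the second inherits the first one's counts unless the buffer is cleared. All names and data are hypothetical.

```python
def depth_arrays(chroms, clear_between=True):
    """chroms: list of (name, length, [(start, end), ...]) with half-open reads."""
    buf = []
    out = {}
    for name, length, reads in chroms:
        if len(buf) != length:
            buf = [0] * length          # new buffer, starts clean
        elif clear_between:
            for i in range(length):     # same length: reuse, but reset first
                buf[i] = 0
        for start, end in reads:
            for i in range(start, end):
                buf[i] += 1
        out[name] = list(buf)
    return out

chroms = [("chrA", 5, [(0, 3)]), ("chrB", 5, [(2, 5)])]
fixed = depth_arrays(chroms)
buggy = depth_arrays(chroms, clear_between=False)
print(fixed["chrB"], buggy["chrB"])
```

With clearing disabled, the second chromosome's depths include the reads from the first, which matches the symptom that only chromosomes after the first of a given length were wrong.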