Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Slightly speed up various cram decoding functions (samtools#1580)
None of this is huge, but it all adds up. - bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation. // Old time inst cyc // gcc -O2 12.36 78936832183 36853852204 // gcc -O3 12.37 78713347525 36867027825 // clang13 -O2 12.43 77451926728 37012866717 // clang13 -O3 12.32 77627221907 36691623424 // gcc12 -O2 12.43 78895089091 37081260172 // gcc12 -O3 12.36 78505904437 36829216967 // New // gcc -O2 12.47 78832021505 37200597109 + // gcc -O3 12.14 76499369401 36390334338 -- // clang13 -O2 12.38 76678460761 36920111561 ~ // clang13 -O3 12.26 76678023071 36548488492 ~ // gcc12 -O2 12.38 78581694397 36880034181 - // gcc12 -O3 12.15 76356625541 36293921439 -- - Improve the MD/NM generation in CRAM decoding. With decode_md=1 (default) by decode changed from 12.91s to 12.57s With decode_md=0 it's 11.92, so that's 1/3rd of the overhead removed. - Changed the block_resize to resize in slightly smaller chunks and to use integer maths. - Reduce excessive pointer redirection in cram_decode_seq. Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too. Comparisons with Dev(/D) and this commit (/4) on Revio (re/) and NovaSeq (nv/) with a variety of compilers and optimisations. Figures are cycle counts from perf stat Xeon E5-2660 Xeon Gold 6142 re/D gcc12-O2 85699982958 74752510144 re/4 gcc12-O2 82265084038 71947558666 -3.7/3.7 re/D gcc12-O3 85837077212 74392223354 re/4 gcc12-O3 82024293685 71861154116 -4.4/3.4 re/D clang12-3 85608876213 73934329619 re/4 clang12-3 84390364926 73961392095 -1.4/0 re/D clang12-2 86861787827 74255338533 re/4 clang12-2 83186843797 72421845542 -4.2/2.5; better than O3 nv/D gcc12-O2 36694089398 31444641828 nv/4 gcc12-O2 34949122875 30061074125 -4.8/-4.4 nv/D gcc12-O3 36528573980 30792932748 nv/4 gcc12-O3 35069572111 30066058127 -4.0/2.4 nv/D clang12-3 37906764004 32459168883 nv/4 clang12-3 36344679534 30786987972 -4.1/-5.2 nv/D clang12-2 38443827308 32304948037 nv/4 clang12-2 36361384580 31022553379 -5.4/-4.0 Benchmarks on 10 million NovaSeq records, showing billions of cycles as more robust than CPU time. EPYC 7543 before after gcc(7) -O2 28.6 28.3 -1.0 gcc12 -O2 28.2 28.3 +0.4 clang7 -O2 30.2 28.2 -6.6 clang13 -O2 29.9 28.2 -5.7 gcc(7) -O3 28.7 28.2 -1.7 gcc12 -O3 28.0 27.2 -2.9 clang7 -O3 30.1 28.3 -6.0 clang13 -O3 29.7 28.3 -4.7 Xeon Gold 6142 before after gcc(7) -O2 32.8 30.5 -7.0 gcc12 -O2 31.8 30.1 -5.3 clang7 -O2 33.1 29.9 -9.7 clang13 -O2 34.1 30.8 -9.7 gcc(7) -O3 32.7 30.2 -7.6 gcc12 -O3 31.6 29.1 -7.9 clang7 -O3 34.3 30.0 -12.5 clang13 -O3 33.3 30.9 -7.2
- Loading branch information