Loading reference genome takes a lot of time #56

demis001 · 2024-08-09T14:37:03Z

Describe the bug
I am trying it for the first time. Can you tell me why loading a reference genome takes so long?

BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.5.0

Minimally Reproducible Example

Reference genome: hg38

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 7419966 sequences (1000000194 bp)...
[M::process] read 7432338 sequences (1000000207 bp)...
[M::mem_pestat] # candidate unique pairs: 380317
[M::mem_pestat] (25, 50, 75) percentile: (93, 314, 418)

It has been performing the aforementioned process for the last four hours,  still no data in the output folder

3.  Paired-end data

for SAMPLE in cat ./sample.txt
do
#FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz
#FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz
#trim_galore -j 100 -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20
# --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2}

 # Step 1
 echo $SAMPLE
 FL1=${TRIMDIR}/${SAMPLE}_R1_val_1.fq.gz
 FL2=${TRIMDIR}/${SAMPLE}_R2_val_2.fq.gz
 #  mapping
 biscuit align -b 1 -@ 100 -R  "@RG\tID:proper_pair\tSM:${SAMPLE}" ${GDIR}/hg38_r109.fa ${FL1} ${FL2} \
     | samtools sort  -@ 100 -n   -o ${BAMOUT}/${SAMPLE}_paired_sorted.bam -O BAM -
 dupsifter --stats-output  --remove-dups ${BAMOUT}/${SAMPLE}_dupsifter_report.txt \
     -o ${BAMOUT}/${SAMPLE}_paired_sorted_dedup.bam ${GDIR}/hg38_r109.fa  ${BAMOUT}/${SAMPLE}_paired_sorted.bam

done


*System info: *

Ubuntu 24.04
CPU: 100
RAM: 1.5 TB
DISK: ALL SSD

I am running 12 paired-end libraries (the adapter trimming was completed in less than an hour for all these 12 paired-end libraries on this system. 

@demis001

The text was updated successfully, but these errors were encountered:

jamorrison · 2024-08-12T14:04:29Z

Hi @demis001,

Thanks for reaching out. Did your alignments finish over the weekend? Based on the output you sent, the alignment had started already (the [M::process] lines are printed after the reference has been loaded and reads are starting to be processed).

As for runtime, one issue that may affect how long the alignments take is the quality of the input data. For example, samples with high fractions of poly-G reads from the NextSeq/NovaSeq sequencers will take significantly longer than a comparable number of reads with a low fraction of poly-G reads.

With your data being piped straight to samtools sort, I'm not surprised that output wasn't seen for some time as samtools was collecting reads to start sorting, particularly with using all threads for both biscuit and samtools. If you need to rerun your alignments, I would probably suggest giving biscuit most of the threads available (say 85 or 90 in your case) and giving the remaining threads to samtools. This will ensure the two tools aren't competing with each other for resources.

On a side note, based on what you sent, the --stats-output and --remove-dups options in your dupsifter command will need to be switched prior to running dupsifter (i.e., dupsifter --remove-dups --stats-output dupsifter.stats.txt ...). You may have already encountered this issue, but I wanted to make you aware of this if you hadn't gotten to deduplicating yet.

Best,
Jacob

demis001 assigned jamorrison Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loading reference genome takes a lot of time #56

Loading reference genome takes a lot of time #56

demis001 commented Aug 9, 2024 •

edited

Loading

jamorrison commented Aug 12, 2024

Loading reference genome takes a lot of time #56

Loading reference genome takes a lot of time #56

Comments

demis001 commented Aug 9, 2024 • edited Loading

jamorrison commented Aug 12, 2024

demis001 commented Aug 9, 2024 •

edited

Loading