Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading reference genome takes a lot of time #56

Open
demis001 opened this issue Aug 9, 2024 · 1 comment
Open

Loading reference genome takes a lot of time #56

demis001 opened this issue Aug 9, 2024 · 1 comment
Assignees

Comments

@demis001
Copy link

demis001 commented Aug 9, 2024

Describe the bug
I am trying it for the first time. Can you tell me why loading a reference genome takes so long?

BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.5.0

Minimally Reproducible Example

  1. Reference genome: hg38
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 7419966 sequences (1000000194 bp)...
[M::process] read 7432338 sequences (1000000207 bp)...
[M::mem_pestat] # candidate unique pairs: 380317
[M::mem_pestat] (25, 50, 75) percentile: (93, 314, 418)

It has been performing the aforementioned process for the last four hours,  still no data in the output folder

3.  Paired-end data

for SAMPLE in cat ./sample.txt
do
#FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz
#FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz
#trim_galore -j 100 -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20
# --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2}

 # Step 1
 echo $SAMPLE
 FL1=${TRIMDIR}/${SAMPLE}_R1_val_1.fq.gz
 FL2=${TRIMDIR}/${SAMPLE}_R2_val_2.fq.gz
 #  mapping
 biscuit align -b 1 -@ 100 -R  "@RG\tID:proper_pair\tSM:${SAMPLE}" ${GDIR}/hg38_r109.fa ${FL1} ${FL2} \
     | samtools sort  -@ 100 -n   -o ${BAMOUT}/${SAMPLE}_paired_sorted.bam -O BAM -
 dupsifter --stats-output  --remove-dups ${BAMOUT}/${SAMPLE}_dupsifter_report.txt \
     -o ${BAMOUT}/${SAMPLE}_paired_sorted_dedup.bam ${GDIR}/hg38_r109.fa  ${BAMOUT}/${SAMPLE}_paired_sorted.bam

done


*System info: *

Ubuntu 24.04
CPU: 100
RAM: 1.5 TB
DISK: ALL SSD

I am running 12 paired-end libraries (the adapter trimming was completed in less than an hour for all these 12 paired-end libraries on this system. 

@demis001
@jamorrison
Copy link

Hi @demis001,

Thanks for reaching out. Did your alignments finish over the weekend? Based on the output you sent, the alignment had started already (the [M::process] lines are printed after the reference has been loaded and reads are starting to be processed).

As for runtime, one issue that may affect how long the alignments take is the quality of the input data. For example, samples with high fractions of poly-G reads from the NextSeq/NovaSeq sequencers will take significantly longer than a comparable number of reads with a low fraction of poly-G reads.

With your data being piped straight to samtools sort, I'm not surprised that output wasn't seen for some time as samtools was collecting reads to start sorting, particularly with using all threads for both biscuit and samtools. If you need to rerun your alignments, I would probably suggest giving biscuit most of the threads available (say 85 or 90 in your case) and giving the remaining threads to samtools. This will ensure the two tools aren't competing with each other for resources.

On a side note, based on what you sent, the --stats-output and --remove-dups options in your dupsifter command will need to be switched prior to running dupsifter (i.e., dupsifter --remove-dups --stats-output dupsifter.stats.txt ...). You may have already encountered this issue, but I wanted to make you aware of this if you hadn't gotten to deduplicating yet.

Best,
Jacob

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants