You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
I am trying it for the first time. Can you tell me why loading a reference genome takes so long?
BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.5.0
Minimally Reproducible Example
Reference genome: hg38
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 7419966 sequences (1000000194 bp)...
[M::process] read 7432338 sequences (1000000207 bp)...
[M::mem_pestat] # candidate unique pairs: 380317
[M::mem_pestat] (25, 50, 75) percentile: (93, 314, 418)
It has been performing the aforementioned process for the last four hours, still no data in the output folder
3. Paired-end data
for SAMPLE in cat ./sample.txt
do
#FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz
#FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz
#trim_galore -j 100 -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20
# --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2}
*System info: *
Ubuntu 24.04
CPU: 100
RAM: 1.5 TB
DISK: ALL SSD
I am running 12 paired-end libraries (the adapter trimming was completed in less than an hour for all these 12 paired-end libraries on this system.
@demis001
The text was updated successfully, but these errors were encountered:
Thanks for reaching out. Did your alignments finish over the weekend? Based on the output you sent, the alignment had started already (the [M::process] lines are printed after the reference has been loaded and reads are starting to be processed).
As for runtime, one issue that may affect how long the alignments take is the quality of the input data. For example, samples with high fractions of poly-G reads from the NextSeq/NovaSeq sequencers will take significantly longer than a comparable number of reads with a low fraction of poly-G reads.
With your data being piped straight to samtools sort, I'm not surprised that output wasn't seen for some time as samtools was collecting reads to start sorting, particularly with using all threads for both biscuit and samtools. If you need to rerun your alignments, I would probably suggest giving biscuit most of the threads available (say 85 or 90 in your case) and giving the remaining threads to samtools. This will ensure the two tools aren't competing with each other for resources.
On a side note, based on what you sent, the --stats-output and --remove-dups options in your dupsifter command will need to be switched prior to running dupsifter (i.e., dupsifter --remove-dups --stats-output dupsifter.stats.txt ...). You may have already encountered this issue, but I wanted to make you aware of this if you hadn't gotten to deduplicating yet.
Describe the bug
I am trying it for the first time. Can you tell me why loading a reference genome takes so long?
BISCUIT version
Program: BISCUIT (BISulfite-seq CUI Toolkit)
Version: 1.5.0
Minimally Reproducible Example
for SAMPLE in
cat ./sample.txt
do
#FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz
#FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz
#trim_galore -j 100 -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20
# --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2}
done
The text was updated successfully, but these errors were encountered: