Dorado duplex using older dna model for V10? #1210

Open · Patricie34 opened this issue Jan 6, 2025 · 7 comments

Labels: duplex (Issues related to duplex basecalling), question (Issue is a question)

Comments

@Patricie34

Hi everyone,

I previously discussed an issue here regarding merging data from samples sequenced using V10 and V14 chemistries (#1160).

My goal was to perform dorado duplex basecalling, map the reads to a reference genome, and then merge the BAM files before variant calling. However, I encountered an error when trying to use dorado duplex with the model [email protected].
I assume duplex basecalling is not available for these older models. Is that correct?

If so, could I use duplex basecalling for V14 samples with [email protected], simplex basecalling for V10 samples, and then proceed with mapping and merging?

Additionally, I would like to ask for advice on performing duplex basecalling on Kubernetes.
Is there a way to speed it up?

Thank you very much!

Regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

> I assume duplex basecalling is not available for these older models. Is that correct?

You're correct; we do not have duplex models for the v10 condition. My apologies, I should have spotted in your original issue that this would be a problem.

> If so, could I use duplex basecalling for V14 samples with [email protected], simplex basecalling for V10 samples, and then proceed with mapping and merging?

Your pipeline sounds reasonable: basecall the V14 data in duplex mode and the V10 data in simplex mode, then map and merge the results.
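
For illustration, here's a minimal command-line sketch of that pipeline (directory and file names are hypothetical placeholders; the two models are the ones mentioned in this thread):

# Duplex basecalling for the V14 data (output is aligned because --reference is given)
dorado duplex [email protected] v14_pod5s/ --reference ref.fa > v14.duplex.bam

# Simplex basecalling for the V10 data
dorado basecaller [email protected] v10_pod5s/ --reference ref.fa > v10.simplex.bam

# Sort and merge the aligned BAMs before variant calling
samtools sort -o v14.sorted.bam v14.duplex.bam
samtools sort -o v10.sorted.bam v10.simplex.bam
samtools merge merged.bam v14.sorted.bam v10.sorted.bam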

> Additionally, I would like to ask for advice on performing duplex basecalling on Kubernetes.
> Is there a way to speed it up?

Please check out this documentation on duplex performance.

Best regards,
Rich

@HalfPhoton added the question and duplex labels Jan 6, 2025
@Patricie34
Author

Patricie34 commented Jan 14, 2025

Hi @HalfPhoton,

Thank you very much for your response. Over the last few days, I have been working on setting up duplex basecalling on Kubernetes. I followed the documentation, as you advised: I first split the Pod5 files by channel and then ran duplex basecalling.

Unfortunately, the basecalling process is failing after three days, probably due to memory limits. I have only 1 GPU available for this. Do you have any suggestions on how to make it work?

The documentation mentions running one job per batch of, for example, 100 channels. Does this mean that if I have, for instance, 2000 Pod5 files for one sample after splitting by channel, I should run around 20 jobs, each processing 100 files?

I am using the Nextflow pipeline. Below is the code for these two processes:

process SPLIT_POD5 {
    tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
    publishDir "${params.outDir}/${sample.name}/duplex", mode: 'copy'
    label "l_cpu"
    label "xl_mem"

    input:
    tuple val(name), val(sample), val(pod5_files)

    output:
    tuple val(name), val(sample), path("split_by_channel"), emit: split_pod5
    path "summary.tsv", emit: summary

    script:
    """
    echo "SPLIT_POD5 on ${name}"
    pip install pod5
    mkdir split_by_channel

    # List read_id/channel pairs, then split the pod5s into one file per channel
    pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
    pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
    """
}

process DORADO_DUPLEX {
    tag "DORADO_DUPLEX on ${sample.name} using ${task.cpus} CPUs and GPU ${sample.gpu}"
    publishDir "${params.outDir}/${sample.name}/mapped/duplex", mode: 'copy'
    container 'ontresearch/dorado:latest'
    accelerator 1, type: "${sample.gpu}"
    label "xl_cpu"
    label "xxxl_mem"
    maxForks 2

    input:
    tuple val(name), val(sample), path(split_pod5s)

    output:
    tuple val(name), val(sample), path("${sample.name}.calls.duplex.bam"), path("${sample.name}.summary.tsv")

    script:
    def model = get_model(sample.basecall_qual, sample.chemistry)
    """
    echo "DORADO_DUPLEX on ${sample.name}"
    echo "Model: ${model}"

    echo "Downloading model..."
    dorado download --model ${model}

    echo "Basecalling without methylation..."
    dorado duplex ${model} ${split_pod5s} \
        --reference ${sample.ref} \
        --verbose \
        -o ${sample.name}.calls.duplex.bam

    dorado summary ${sample.name}.calls.duplex.bam > ${sample.name}.summary.tsv

    echo "Finished DORADO_DUPLEX on ${sample.name}"
    """
}

Thank you!

Best regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

I believe the issue here is that you're basecalling all of the split pod5 files together in one job. The purpose of splitting the pod5 files by channel is to improve pair-searching efficiency by presenting only plausible pair candidates (reads from the same channel) instead of all reads. This requires you to be more specific with the dorado call, so that only data from one (or a few) channels is loaded in each process.

What you need to do is run process DORADO_DUPLEX on collections of per-channel pod5 files taken from SPLIT_POD5::split_by_channel.

The size of these collections is up to you and mostly depends on how long you want each process to run. There are diminishing returns when calling single per-channel pod5 files, as the Nextflow process / dorado load time becomes a significant fraction of the overall wall time. I'd recommend trying collections in the range of 20 to 200, for example as sketched below.
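
A minimal Nextflow sketch of forming such collections (hypothetical channel wiring; assumes SPLIT_POD5 emits a list of per-channel pod5 files):

// Group per-channel pod5 files into collections of 100,
// emitting one tuple per collection
SPLIT_POD5.out.split_pod5
    .flatMap { name, sample, pod5s ->
        pod5s.collate(100).collect { chunk -> tuple(name, sample, chunk) }
    }
    .set { pod5_collections }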

Best regards,
Rich

@Patricie34
Author

Patricie34 commented Jan 16, 2025

Hi @HalfPhoton,

Thank you for your advice. I’ve already tried it this way, but unfortunately, it always ended with the following error:

[error] Zero positional arguments expected, did you mean --dump_stats_file VAR

That’s why I directly used the entire split_by_channel directory; only this way did it actually work.

Do you have any idea what could be causing the issue? I’m guessing it might have something to do with how the input split_pod5 files are being streamed from the previous step, but I haven’t been able to figure it out.

I’ve tried different approaches and always ended up with the same error. My guess is that it’s more likely a problem with Nextflow than with the dorado command itself.

The code I used is the following:

process SPLIT_POD5 {
    tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
    publishDir "${params.outDir}/${sample.name}/nano/duplex", mode: 'copy'
    label "l_cpu"
    label "xl_mem"

    input:
    tuple val(name), val(sample), val(pod5_files)

    output:
    tuple val(name), val(sample), path("split_by_channel/*.pod5"), emit: split_pod5
    path "summary.tsv", emit: summary

    script:
    """
    echo "SPLIT_POD5 on ${name}"
    pip install pod5
    mkdir split_by_channel
    pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
    pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
    """
}

process DORADO_DUPLEX {
    tag "DORADO_DUPLEX on ${sample.name} chunk of ${chunk_size} files using ${task.cpus} CPUs and GPU ${sample.gpu}"
    container 'ontresearch/dorado:latest'
    publishDir "${params.outDir}/${sample.name}/nano/mapped/duplex/calls", mode: 'copy'
    accelerator 1, type: "${sample.gpu}"
    label "xl_cpu"
    label "xxxl_mem"

    input:
    tuple val(name), val(sample), val(chunk_size), val(pod5_files)

    output:
    tuple val(name), val(sample), path("${sample.name}_chunk${task.index}.calls.duplex.bam"), path("${sample.name}_chunk${task.index}.summary.tsv")
    //tuple val(name), val(sample), path("${sample.name}_batch${batch_index}.methyl.duplex.bam"), path("${sample.name}_batch${batch_index}.methyl.summary.tsv"), optional: true

    script:
    def model = get_model(sample.basecall_qual, sample.chemistry)
    """
    echo "DORADO_DUPLEX on ${sample.name} chunk ${task.index} (${chunk_size} files)"
    echo "Model: ${model}"

    echo "Downloading model..."
    dorado download --model ${model}

    echo "Basecalling without methylation..."
    dorado duplex ${model} ${pod5_files} \
        --reference ${sample.ref} \
        --verbose \
        -o ${sample.name}_chunk${task.index}.calls.duplex.bam

    # Summarise the chunk's own BAM (the un-chunked name would not exist here)
    dorado summary ${sample.name}_chunk${task.index}.calls.duplex.bam > ${sample.name}_chunk${task.index}.summary.tsv

    echo "Finished DORADO_DUPLEX on ${sample.name} chunk ${task.index}"
    """
}

The input channel looks like this:

// Split the per-channel pod5 list into chunks of 200 files,
// emitting one tuple (name, meta, chunk size, chunk) per chunk
Split_pod5s.split_pod5
    .flatMap { name, meta, pod5s ->
        pod5s.collate(200).collect { chunk -> tuple(name, meta, chunk.size(), chunk) }
    }
    .set { chunked_pod5s }

Dorado_duplex = DORADO_DUPLEX(chunked_pod5s)

Thanks a lot for your help!

Best regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

> [error] Zero positional arguments expected, did you mean --dump_stats_file VAR

I'm guessing that this error was caused by multiple pod5 files being passed to dorado duplex: in dorado duplex ${model} ${pod5_files}, the ${pod5_files} variable expands to a whitespace-separated list of file paths, which is not supported.

All you might have to do is move the staged pod5 files into a subdirectory and pass that directory to dorado instead of the list of pod5 files, which I think is what's currently happening.
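
A minimal sketch of that staging step inside the DORADO_DUPLEX script block (the pod5_chunk directory name is hypothetical):

# Gather the staged per-channel pod5 files into a single directory
mkdir -p pod5_chunk
mv *.pod5 pod5_chunk/

# Point dorado at the directory rather than at a list of files
dorado duplex ${model} pod5_chunk \
    --reference ${sample.ref} \
    --verbose \
    -o ${sample.name}_chunk${task.index}.calls.duplex.bam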

Alternatively, you could batch together collections of pod5 files at the end of process::SPLIT_POD5 and have an output of directories like this:

output:
tuple val(name), val(sample), path("batches/batch_*"), emit: split_pod5

You can then run dorado on each of these directories in a separate process::DORADO_DUPLEX task, with some manipulation of the output of process::SPLIT_POD5, for example:
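
One way to fan out one DORADO_DUPLEX task per batch directory (a sketch; channel names are hypothetical):

// Emit one tuple per batch directory, so each DORADO_DUPLEX task
// receives a single directory of pod5 files
SPLIT_POD5.out.split_pod5
    .transpose()
    .set { batch_dirs }

DORADO_DUPLEX(batch_dirs)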

Best regards,
Rich

@Patricie34
Author

Hi @HalfPhoton,

That's actually what I did the first time – I used the subdirectory where the split POD5 files were located. However, the problem was that it ended up using all >2000 files. So, I thought I needed to separate them into batches of 20-200, as you suggested.
Does this mean I can simply set the --batchsize parameter to 200?
Otherwise, I’m not really sure how to accomplish this.

Best regards,

Patricie

@HalfPhoton
Collaborator

HalfPhoton commented Jan 17, 2025

To elaborate on creating subdirectories of batches: extend the bash script in process SPLIT_POD5 to move 20 files at a time into new subdirectories, incrementing a suffix in each subdirectory name.

... # the pod5 subset by channel stuff

mkdir batches
BATCH_SIZE=20
IDX=1
files=($(find "split_by_channel/" -maxdepth 1 -type f))

# Move BATCH_SIZE files at a time into a new subdirectory batches/batch_${IDX}
while [ ${#files[@]} -gt 0 ]; do
    # Create new directory
    sub="batches/batch_${IDX}"
    mkdir -p "${sub}"

    # Move up to BATCH_SIZE files (stop early if fewer remain)
    for ((i = 0; i < BATCH_SIZE && i < ${#files[@]}; i++)); do
        mv "${files[$i]}" "${sub}"
    done

    # Drop the moved files from the list
    files=("${files[@]:BATCH_SIZE}")
    ((IDX++))
done
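
Note: if this snippet is embedded directly in a Nextflow script block, the bash ${...} expansions (${sub}, ${#files[@]}, and so on) must be escaped as \${...} so that Groovy does not try to interpolate them.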
