Dorado duplex using older dna model for V10? #1210

Open · Patricie34 opened this issue Jan 6, 2025 · 7 comments

Labels: duplex (Issues related to duplex basecalling), question (Issue is a question)

Comments

@Patricie34

Hi everyone,

I previously discussed an issue here regarding merging data from samples sequenced using V10 and V14 chemistries (#1160).

My goal was to perform dorado duplex basecalling, map the reads to a reference genome, and then merge the BAM files before variant calling. However, I encountered an error when trying to use dorado duplex with the model [email protected].
I assume duplex basecalling is not available for these older models. Is that correct?

If so, could I use duplex basecalling for V14 samples with [email protected], simplex basecalling for V10 samples, and then proceed with mapping and merging?

Additionally, I would like to ask for advice on performing duplex basecalling on Kubernetes.
Is there a way to speed it up?

Thank you very much!

Regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

> I assume duplex basecalling is not available for these older models. Is that correct?

You're correct; we do not have duplex models for the v10 condition. My apologies, I should have spotted in your original issue that this would be a problem.

> If so, could I use duplex basecalling for V14 samples with [email protected], simplex basecalling for V10 samples, and then proceed with mapping and merging?

Your pipeline sounds reasonable: basecall the V14 data in duplex mode and the V10 data in simplex mode, then map and merge the results.
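
For illustration, here's a minimal command-line sketch of that pipeline (directory and file names are hypothetical placeholders; the two models are the ones mentioned in this thread):

# Duplex basecalling for the V14 data (output is aligned because --reference is given)
dorado duplex [email protected] v14_pod5s/ --reference ref.fa > v14.duplex.bam

# Simplex basecalling for the V10 data
dorado basecaller [email protected] v10_pod5s/ --reference ref.fa > v10.simplex.bam

# Sort and merge the aligned BAMs before variant calling
samtools sort -o v14.sorted.bam v14.duplex.bam
samtools sort -o v10.sorted.bam v10.simplex.bam
samtools merge merged.bam v14.sorted.bam v10.sorted.bam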

> Additionally, I would like to ask for advice on performing duplex basecalling on Kubernetes.
> Is there a way to speed it up?

Please check out this documentation on duplex performance.

Best regards,
Rich

@HalfPhoton added the question and duplex labels Jan 6, 2025
@Patricie34
Author

Patricie34 commented Jan 14, 2025

Hi @HalfPhoton,

Thank you very much for your response. Over the last few days, I have been working on setting up duplex basecalling on Kubernetes. I followed the documentation, as you advised: I first split the Pod5 files by channel and then ran duplex basecalling.

Unfortunately, the basecalling process is failing after three days, probably due to memory limits. I have only 1 GPU available for this. Do you have any suggestions on how to make it work?

The documentation mentions running one job per batch of, for example, 100 channels. Does this mean that if I have, for instance, 2000 Pod5 files for one sample after splitting by channel, I should run around 20 jobs, each processing 100 files?

I am using the Nextflow pipeline. Below is the code for these two processes:

process SPLIT_POD5 {
    tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
    publishDir "${params.outDir}/${sample.name}/duplex", mode: 'copy'
    label "l_cpu"
    label "xl_mem"

    input:
    tuple val(name), val(sample), val(pod5_files)

    output:
    tuple val(name), val(sample), path("split_by_channel"), emit: split_pod5
    path "summary.tsv", emit: summary

    script:
    """
    echo "SPLIT_POD5 on ${name}"
    pip install pod5
    mkdir split_by_channel

    # List read_id/channel pairs, then split the pod5s into one file per channel
    pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
    pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
    """
}

process DORADO_DUPLEX {
    tag "DORADO_DUPLEX on ${sample.name} using ${task.cpus} CPUs and GPU ${sample.gpu}"
    publishDir "${params.outDir}/${sample.name}/mapped/duplex", mode: 'copy'
    container 'ontresearch/dorado:latest'
    accelerator 1, type: "${sample.gpu}"
    label "xl_cpu"
    label "xxxl_mem"
    maxForks 2

    input:
    tuple val(name), val(sample), path(split_pod5s)

    output:
    tuple val(name), val(sample), path("${sample.name}.calls.duplex.bam"), path("${sample.name}.summary.tsv")

    script:
    def model = get_model(sample.basecall_qual, sample.chemistry)
    """
    echo "DORADO_DUPLEX on ${sample.name}"
    echo "Model: ${model}"

    echo "Downloading model..."
    dorado download --model ${model}

    echo "Basecalling without methylation..."
    dorado duplex ${model} ${split_pod5s} \
        --reference ${sample.ref} \
        --verbose \
        -o ${sample.name}.calls.duplex.bam

    dorado summary ${sample.name}.calls.duplex.bam > ${sample.name}.summary.tsv

    echo "Finished DORADO_DUPLEX on ${sample.name}"
    """
}

Thank you!

Best regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

I believe the issue here is that you're basecalling all of the split pod5 files together in one job. The purpose of splitting the pod5 files by channel is to improve pair-searching efficiency by presenting only plausible pair candidates (reads from the same channel) instead of all reads. This requires you to be more specific with the dorado call, so that only data from one (or a few) channels is loaded in each process.

What you need to do is run process DORADO_DUPLEX on collections of per-channel pod5 files taken from SPLIT_POD5::split_by_channel.

The size of these collections is up to you and mostly depends on how long you want each process to run. There are diminishing returns when calling single per-channel pod5 files, as the Nextflow process / dorado load time becomes a significant fraction of the overall wall time. I'd recommend trying collections in the range of 20 to 200, for example as sketched below.
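
A minimal Nextflow sketch of forming such collections (hypothetical channel wiring; assumes SPLIT_POD5 emits a list of per-channel pod5 files):

// Group per-channel pod5 files into collections of 100,
// emitting one tuple per collection
SPLIT_POD5.out.split_pod5
    .flatMap { name, sample, pod5s ->
        pod5s.collate(100).collect { chunk -> tuple(name, sample, chunk) }
    }
    .set { pod5_collections }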

Best regards,
Rich

@Patricie34
Author

Patricie34 commented Jan 16, 2025

Hi @HalfPhoton,

Thank you for your advice. I’ve already tried it this way, but unfortunately, it always ended with the following error:

[error] Zero positional arguments expected, did you mean --dump_stats_file VAR

That’s why I directly used the entire split_by_channel directory; only this way did it actually work.

Do you have any idea what could be causing the issue? I’m guessing it might have something to do with how the input split_pod5 files are being streamed from the previous step, but I haven’t been able to figure it out.

I’ve tried different approaches and always ended up with the same error. My guess is that it’s more likely a problem with Nextflow than with the dorado command itself.

The code I used is the following:

process SPLIT_POD5 {
    tag "SPLIT_POD5 on $sample.name using $task.cpus CPUs $task.memory"
    publishDir "${params.outDir}/${sample.name}/nano/duplex", mode: 'copy'
    label "l_cpu"
    label "xl_mem"

    input:
    tuple val(name), val(sample), val(pod5_files)

    output:
    tuple val(name), val(sample), path("split_by_channel/*.pod5"), emit: split_pod5
    path "summary.tsv", emit: summary

    script:
    """
    echo "SPLIT_POD5 on ${name}"
    pip install pod5
    mkdir split_by_channel
    pod5 view ${pod5_files} --include "read_id,channel" --output summary.tsv
    pod5 subset ${pod5_files} --summary summary.tsv --columns channel --output split_by_channel
    """
}

process DORADO_DUPLEX {
    tag "DORADO_DUPLEX on ${sample.name} chunk of ${chunk_size} files using ${task.cpus} CPUs and GPU ${sample.gpu}"
    container 'ontresearch/dorado:latest'
    publishDir "${params.outDir}/${sample.name}/nano/mapped/duplex/calls", mode: 'copy'
    accelerator 1, type: "${sample.gpu}"
    label "xl_cpu"
    label "xxxl_mem"

    input:
    tuple val(name), val(sample), val(chunk_size), val(pod5_files)

    output:
    tuple val(name), val(sample), path("${sample.name}_chunk${task.index}.calls.duplex.bam"), path("${sample.name}_chunk${task.index}.summary.tsv")
    //tuple val(name), val(sample), path("${sample.name}_batch${batch_index}.methyl.duplex.bam"), path("${sample.name}_batch${batch_index}.methyl.summary.tsv"), optional: true

    script:
    def model = get_model(sample.basecall_qual, sample.chemistry)
    """
    echo "DORADO_DUPLEX on ${sample.name} chunk ${task.index} (${chunk_size} files)"
    echo "Model: ${model}"

    echo "Downloading model..."
    dorado download --model ${model}

    echo "Basecalling without methylation..."
    dorado duplex ${model} ${pod5_files} \
        --reference ${sample.ref} \
        --verbose \
        -o ${sample.name}_chunk${task.index}.calls.duplex.bam

    # Summarise the chunk's own BAM (the un-chunked name would not exist here)
    dorado summary ${sample.name}_chunk${task.index}.calls.duplex.bam > ${sample.name}_chunk${task.index}.summary.tsv

    echo "Finished DORADO_DUPLEX on ${sample.name} chunk ${task.index}"
    """
}

The input channel looks like this:

// Split the per-channel pod5 list into chunks of 200 files,
// emitting one tuple (name, meta, chunk size, chunk) per chunk
Split_pod5s.split_pod5
    .flatMap { name, meta, pod5s ->
        pod5s.collate(200).collect { chunk -> tuple(name, meta, chunk.size(), chunk) }
    }
    .set { chunked_pod5s }

Dorado_duplex = DORADO_DUPLEX(chunked_pod5s)

Thanks a lot for your help!

Best regards,

Patricie

@HalfPhoton
Collaborator

Hi @Patricie34,

> [error] Zero positional arguments expected, did you mean --dump_stats_file VAR

I'm guessing that this error was caused by multiple pod5 files being passed to dorado duplex: in dorado duplex ${model} ${pod5_files}, the ${pod5_files} variable expands to a whitespace-separated list of file paths, which is not supported.

All you might have to do is move the staged pod5 files into a subdirectory and pass that directory to dorado instead of the list of pod5 files, which I think is what's currently happening.
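
A minimal sketch of that staging step inside the DORADO_DUPLEX script block (the pod5_chunk directory name is hypothetical):

# Gather the staged per-channel pod5 files into a single directory
mkdir -p pod5_chunk
mv *.pod5 pod5_chunk/

# Point dorado at the directory rather than at a list of files
dorado duplex ${model} pod5_chunk \
    --reference ${sample.ref} \
    --verbose \
    -o ${sample.name}_chunk${task.index}.calls.duplex.bam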

Alternatively, you could batch together collections of pod5 files at the end of process::SPLIT_POD5 and have an output of directories like this:

output:
tuple val(name), val(sample), path("batches/batch_*"), emit: split_pod5

You can then run dorado on each of these directories in a separate process::DORADO_DUPLEX task, with some manipulation of the output of process::SPLIT_POD5, for example:
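
One way to fan out one DORADO_DUPLEX task per batch directory (a sketch; channel names are hypothetical):

// Emit one tuple per batch directory, so each DORADO_DUPLEX task
// receives a single directory of pod5 files
SPLIT_POD5.out.split_pod5
    .transpose()
    .set { batch_dirs }

DORADO_DUPLEX(batch_dirs)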

Best regards,
Rich

@Patricie34
Author

Hi @HalfPhoton,

That's actually what I did the first time – I used the subdirectory where the split POD5 files were located. However, the problem was that it ended up using all >2000 files. So, I thought I needed to separate them into batches of 20-200, as you suggested.
Does this mean I can simply set the --batchsize parameter to 200?
Otherwise, I’m not really sure how to accomplish this.

Best regards,

Patricie

@HalfPhoton
Collaborator

HalfPhoton commented Jan 17, 2025

To elaborate on creating subdirectories of batches: extend the bash script in process SPLIT_POD5 to move 20 files at a time into new subdirectories, incrementing a suffix in each subdirectory name.

... # the pod5 subset by channel stuff

mkdir batches
BATCH_SIZE=20
IDX=1
files=($(find "split_by_channel/" -maxdepth 1 -type f))

# Move BATCH_SIZE files at a time into a new subdirectory batches/batch_${IDX}
while [ ${#files[@]} -gt 0 ]; do
    # Create new directory
    sub="batches/batch_${IDX}"
    mkdir -p "${sub}"

    # Move up to BATCH_SIZE files (stop early if fewer remain)
    for ((i = 0; i < BATCH_SIZE && i < ${#files[@]}; i++)); do
        mv "${files[$i]}" "${sub}"
    done

    # Drop the moved files from the list
    files=("${files[@]:BATCH_SIZE}")
    ((IDX++))
done
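
Note: if this snippet is embedded directly in a Nextflow script block, the bash ${...} expansions (${sub}, ${#files[@]}, and so on) must be escaped as \${...} so that Groovy does not try to interpolate them.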
