
Sensitivity on chunk-context parameter #193

Open

saroudant opened this issue Oct 19, 2024 · 2 comments

saroudant commented Oct 19, 2024

Dear Remora developers,

We are contacting you about an issue we are facing when training a base-analog classifier on a particular plasmid-like DNA fragment (mitochondrial DNA, to be specific). We reached out to you at an event a week ago and you suggested we contact you directly through GitHub.

We generated the following training dataset:

  • A control, without the base analog, covering around 75% of the sequence of our plasmid (fragments of around 3,000 bp).
  • A set of pure BrdU-labelled reads, around 3,000 bp each, spanning exactly the same sequences as the control.

This training data was generated with both pore10 and pore09, and we trained one classifier for each version, following these steps (a sketch of the invocations appears after the list):

  • Base-call all reads with the relevant “sup” base-caller (Dorado).
  • Down-sample regions to ensure a limited difference in coverage across the different regions of the contig.
  • Generate training chunks with ‘remora dataset prepare’. We crop 40 bp from each end of the reads by providing a BED file, use the k-mer level table provided by Nanopore, and focus on the motif T (‘--motif T 0’). Finally, we varied the chunk context, using 50, 100, and 200 (the same value on each side of the focus position).
  • Create a config file with ‘remora dataset make_config’, giving a weight of 2 to the control samples to limit false positives.
  • Train the model with ‘remora model train’, using the ConvLSTM model. We played with some hyper-parameters (mostly batch size and learning rate), but these made little difference.
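For concreteness, our invocations looked roughly like the following. This is a sketch, not our exact commands: file names, the levels table, the BED file, and the single-letter BrdU code ‘e’ are placeholders, and flags may vary with the Remora version.

```bash
# Chunk extraction per sample; --chunk-context takes the number of signal
# points kept before and after the focus base (shown here for the "50" run).
remora dataset prepare control_reads.pod5 control_mappings.bam \
    --output-path control_chunks \
    --refine-kmer-level-table levels.txt --refine-rough-rescale \
    --focus-reference-positions cropped_sites.bed \
    --motif T 0 \
    --chunk-context 50 50 \
    --mod-base-control

remora dataset prepare brdu_reads.pod5 brdu_mappings.bam \
    --output-path brdu_chunks \
    --refine-kmer-level-table levels.txt --refine-rough-rescale \
    --focus-reference-positions cropped_sites.bed \
    --motif T 0 \
    --chunk-context 50 50 \
    --mod-base e BrdU

# Weight the control dataset twice as heavily as the BrdU dataset.
remora dataset make_config train_dataset.jsn control_chunks brdu_chunks \
    --dataset-weights 2 1

# ConvLSTM architecture shipped with the Remora source tree.
remora model train train_dataset.jsn \
    --model remora/models/ConvLSTM_w_ref.py \
    --device 0 \
    --output-path train_results
```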

We then used this classifier on real data, but observed stark differences depending on the chunk context. Specifically, a whole region of the plasmid, 600-700 bp in size, harbors the following predictions for a large number of reads (our inference invocation is sketched below):

  • BrdU-labelled when the chunk context is 100 or 200.
  • Unlabelled when the chunk context is 50.

On the rest of our sequence (16 kb), the performance is very similar. Perhaps importantly, this region falls within the regions covered by the training data.
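Our inference step followed the standard Remora route, roughly as below (paths are placeholders; ‘model_best.pt’ is the checkpoint Remora writes during training):

```bash
# Call BrdU per read with the trained model; per-base predictions are
# written to the MM/ML tags of the output BAM.
remora infer from_pod5_and_bam sample_reads.pod5 sample_mappings.bam \
    --model train_results/model_best.pt \
    --out-bam sample_infer.bam \
    --device 0
```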

Would you have any material that could help us shed some light on this phenomenon? We are unsure which results to trust, although an external tool on pore09 yields predictions similar to those of the chunk-context-50 model.

Thank you in advance for your help.
Best regards,
Chih-Yao (@chihyaochung) and Soufiane

saroudant changed the title from “Sensitivity on chunk-size parameter” to “Sensitivity on chunk-context parameter” on Oct 19, 2024
marcus1487 (Collaborator) commented

Sorry for the delay in responding. The behaviour with the longer chunks could certainly be due to overtraining. I think it would help if you could expand on the nature of the training and validation data.

One possibility: with training sets drawn from such a small reference, it can be quite important to match the number of canonical and modified training chunks at each reference site. If you happened to have more canonical or more modified chunks from these sites in training, the real issue could be a training-data bias rather than the chunk context itself.

Some other questions/thoughts that could help diagnose this issue: How was the training data generated? How confident are the modified labels? Do the validation reads match the training reads in modification context (e.g. is the training data completely modified while the validation data is sparsely modified)? If the training data is completely modified and the validation data is sparse, this can produce a strong bias in the model toward fully modified strands. More context on the training and testing datasets will help indicate next steps for investigating the chunk-context differences. A couple of quick checks are sketched below.
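Two quick checks along these lines, as a sketch with placeholder file names (adjust flags to your Remora version):

```bash
# 1. Check per-site coverage balance between control and BrdU training reads.
samtools depth -a -b cropped_sites.bed control_mappings.bam > control_depth.tsv
samtools depth -a -b cropped_sites.bed brdu_mappings.bam   > brdu_depth.tsv
# Columns after paste: chrom pos ctrl_depth chrom pos brdu_depth.
# Sites where control coverage is much lower than BrdU coverage sort first;
# strong imbalance at a site is a candidate source of label bias.
paste control_depth.tsv brdu_depth.tsv \
    | awk '{ print $1, $2, $3, $6, ($3 + 1) / ($6 + 1) }' \
    | sort -k5,5g | head

# 2. Score the trained model on a held-out Remora dataset
#    (subcommand available in recent Remora releases).
remora validate from_remora_dataset val_dataset.jsn \
    --model train_results/model_best.pt
```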

spoweekkk commented Jan 18, 2025

> (quoting saroudant's original report above in full)

Have you tried testing the model on an external validation dataset whose sequences differ from those you trained on? I ran into the same problem: when I test on another sequence, the model cannot distinguish control from modified bases at all. One simple way to set up such a split is sketched below.
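A minimal way to hold out part of the reference so validation chunks come only from sequence the model never saw, assuming a single 16 kb contig named ‘plasmid’ (names and coordinates are purely illustrative):

```bash
# Restrict training chunks to the first 12 kb; keep the remainder unseen.
printf 'plasmid\t0\t12000\n'     > train_regions.bed
printf 'plasmid\t12000\t16000\n' > heldout_regions.bed

# Pass train_regions.bed to `remora dataset prepare` via
# --focus-reference-positions, then prepare a second dataset from
# heldout_regions.bed and use it only for validation.
```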
