Question about remora training #200

spoweekkk · 2025-01-14T17:59:24Z

Hello, I'm trying to use Remora to train a model for a novel modification. I first trained on a single sequence (about 900 bp) in which every C was modified, together with its unmodified control, and the model performed well on that same sequence. However, when I used a different sequence as an external validation dataset, the model could not distinguish modified reads from control reads.

If I want to use Remora to recognize a particular modification, is it feasible to train on just one sequence? If not, how many sequences did you use to train models for classical modifications like 5mC? Could you give the scale for reference?

marcus1487 · 2025-01-17T19:50:27Z

Thank you for your question!

The process for training our internal models has been discussed in previous Nanopore Community Meeting and London Calling presentations. Let me know if you’d like specific links.

In summary, we use “randomers” where each strand has a unique sequence with a single modified base at a defined position. Our models are typically trained on datasets containing at least 50M unique sequences, often exceeding 100M or even 1B. Unfortunately, the code for generating these datasets is proprietary and not publicly available.
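To make the randomer idea concrete, here is a minimal sketch in plain Python. It is not the proprietary generation code mentioned above, and every detail is an assumption for illustration: the helper name `make_randomer`, the strand length of 41, placing the defined position at the center, and drawing flanking bases uniformly at random.

```python
import random


def make_randomer(length=41, canonical_base="C", rng=random):
    """Return (sequence, mod_position) for one hypothetical randomer strand.

    Only the base at mod_position would carry the modification; any other
    occurrence of canonical_base in the random flanks stays unmodified.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd so one base sits at the center")
    center = length // 2
    bases = [rng.choice("ACGT") for _ in range(length)]
    bases[center] = canonical_base  # the single defined (modifiable) position
    return "".join(bases), center


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(5):
        seq, pos = make_randomer(rng=rng)
        # Lowercase marks the one position that would be modified.
        print(seq[:pos] + seq[pos].lower() + seq[pos + 1:])
```

Each generated strand has a different sequence context around the one marked position, which is the property the description above relies on.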

This approach is designed to make our production models robust across diverse organisms and conditions, but it’s not without limitations. For your case:
• Training on fully modified strands and applying the model to sparsely modified strands (even with the same sequence) is unlikely to perform well.
• Training on a single sequence and validating on another sequence is similarly challenging, as the model won’t generalize well.
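One way to see why a single training sequence generalizes poorly is to count how many distinct sequence contexts it contains. The sketch below is only an illustration under stated assumptions: it uses a 9-mer window centered on C (the window Remora models actually use may differ), and random 900 bp strings stand in for the real training and validation sequences. The helper name `contexts_around` is hypothetical.

```python
import random


def contexts_around(seq, base="C", flank=4):
    """Collect the distinct k-mers (k = 2 * flank + 1) centered on `base`."""
    found = set()
    for i in range(flank, len(seq) - flank):
        if seq[i] == base:
            found.add(seq[i - flank:i + flank + 1])
    return found


rng = random.Random(0)  # seeded only to make the demo repeatable
# Stand-ins for the real sequences (assumption: uniform random bases).
train_seq = "".join(rng.choice("ACGT") for _ in range(900))
other_seq = "".join(rng.choice("ACGT") for _ in range(900))

train_ctx = contexts_around(train_seq)
other_ctx = contexts_around(other_seq)
possible = 4 ** 8  # number of 9-mers with a fixed central C

if __name__ == "__main__":
    print(f"contexts seen in training: {len(train_ctx)} of {possible}")
    shared = len(train_ctx & other_ctx)
    print(f"validation contexts also seen in training: {shared} of {len(other_ctx)}")
```

A 900 bp sequence can contribute fewer than 900 such contexts out of 65,536 possible, so a second sequence will mostly present contexts the model never saw during training.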

For community models, I recommend using training data that closely matches your target conditions (e.g., sequence context, modification density, organism). If you can share more details about your final target, I’d be happy to provide specific suggestions to improve your training approach.

spoweekkk · 2025-01-17T20:05:06Z

Thanks a lot for your reply. I would appreciate the links you mentioned above. Could you share them?
Actually, I'd like to detect a novel modification that has not previously been detected at single-base resolution, so I originally wanted to train Remora and detect it in genome-wide nanopore signal. But it seems that the pipeline doesn't work.

marcus1487 · 2025-01-17T20:17:08Z

Thank you for following up! Here are some presentations that provide insights into our approaches to model training and modified base detection:
• NCM 2021
• NCM 2022
• NCM 2023
• London Calling 2024
• NCM 2024

It’s exciting that you’re working on detecting a novel modification—this is an ambitious and meaningful goal! While it is absolutely possible to train Remora models for genome-wide detection, achieving robust performance does come with significant challenges. Success will require expertise in multiple areas, including wet lab protocols for generating high-quality data, bioinformatics for preprocessing and analysis, and machine learning for model training and evaluation.

This process is complex and often iterative, so don’t be discouraged if initial pipelines don’t work as expected. If you have specific questions or challenges, feel free to share them—I’d be happy to provide more targeted advice to help you move forward.

spoweekkk · 2025-01-18T15:30:08Z

(In reply to marcus1487's list of presentations: NCM 2021, NCM 2022, NCM 2023, London Calling 2024, NCM 2024.)

Greatly appreciated! I have watched the presentations you provided above, and they helped me a lot. I still have one question: I noticed that the randomers are centered on the modified base during training. Taking 5mC as an example, does this mean that only the central C is modified while the other C bases in the context remain unmodified? That seems different from the input to Remora training, which needs a fully modified dataset and a control. Is the source code not publicly available?
