Question about remora training #200

Open
spoweekkk opened this issue Jan 14, 2025 · 4 comments

Comments

@spoweekkk

Hello, I'm trying to use Remora to train a model for a novel modification. I first trained on a single sequence (about 900 bp) in which every C was modified, together with its unmodified control, and the model performed well on that same sequence. However, when I used a different sequence as an external validation dataset, the model could not distinguish modified reads from control reads.

Is it feasible to train Remora to recognize a modification using just one sequence? If not, how many sequences did you use to train models for classical modifications such as 5mC? Could you give the scale as a reference?

@marcus1487
Collaborator

Thank you for your question!

The process for training our internal models has been discussed in previous Nanopore Community Meeting and London Calling presentations. Let me know if you’d like specific links.

In summary, we use “randomers” where each strand has a unique sequence with a single modified base at a defined position. Our models are typically trained on datasets containing at least 50M unique sequences, often exceeding 100M or even 1B. Unfortunately, the code for generating these datasets is proprietary and not publicly available.
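As a rough illustration of the randomer idea described above, here is a minimal Python sketch that builds unique random sequences with the base of interest fixed at one defined position. It is not the proprietary dataset code: the function names, the length of 21, and the position index are hypothetical, and the sketch only marks the defined position without saying anything about the modification state of other bases.

```python
import random

BASES = "ACGT"

def make_randomer(length=21, mod_pos=10, mod_base="C", rng=random):
    """Return a random sequence with `mod_base` fixed at index `mod_pos`.

    Hypothetical helper for illustration only.
    """
    seq = [rng.choice(BASES) for _ in range(length)]
    seq[mod_pos] = mod_base  # the single defined position of interest
    return "".join(seq)

def make_randomer_set(n, length=21, mod_pos=10, mod_base="C", seed=0):
    """Return `n` unique randomers as a sorted list."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add(make_randomer(length, mod_pos, mod_base, rng))
    return sorted(seen)

randomers = make_randomer_set(5)
for seq in randomers:
    print(seq)
```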

This approach is designed to make our production models robust across diverse organisms and conditions, but it’s not without limitations. For your case:
• Training on fully modified strands and applying the model to sparsely modified strands (even with the same sequence) is unlikely to perform well.
• Training on a single sequence and validating on another sequence is similarly challenging, as the model won’t generalize well.

For community models, I recommend using training data that closely matches your target conditions (e.g., sequence context, modification density, organism). If you can share more details about your final target, I’d be happy to provide specific suggestions to improve your training approach.
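One way to check whether a model generalizes across sequences, rather than memorizing one, is to hold out whole reference sequences for validation. The sketch below is plain Python and does not use any Remora API; the `(read_id, reference_name, label)` record layout and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_by_reference(reads, val_fraction=0.2, seed=0):
    """Split reads so that no reference sequence appears in both sets.

    `reads` is an iterable of (read_id, reference_name, label) tuples
    (a hypothetical layout). Returns (train_reads, val_reads).
    """
    by_ref = defaultdict(list)
    for read in reads:
        by_ref[read[1]].append(read)

    refs = sorted(by_ref)
    if len(refs) < 2:
        # With a single sequence there is nothing left to hold out.
        raise ValueError("need at least two reference sequences to split")

    random.Random(seed).shuffle(refs)
    n_val = max(1, round(len(refs) * val_fraction))
    n_val = min(n_val, len(refs) - 1)  # always keep one reference for training

    val = [r for ref in refs[:n_val] for r in by_ref[ref]]
    train = [r for ref in refs[n_val:] for r in by_ref[ref]]
    return train, val
```

A large gap between accuracy on the training references and on the held-out references would suggest the model has learned the sequence rather than the modification signal.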

@spoweekkk
Author

spoweekkk commented Jan 17, 2025

Thanks a lot for your reply. Could you share the links you mentioned above?
Actually, I'd like to detect a novel modification that has not previously been detected at single-base resolution, so I originally wanted to train Remora and detect it in genome-wide nanopore signal. But it seems that the pipeline doesn't work.

@marcus1487
Collaborator

Thank you for following up! Here are some presentations that provide insights into our approaches to model training and modified base detection:

It’s exciting that you’re working on detecting a novel modification—this is an ambitious and meaningful goal! While it is absolutely possible to train Remora models for genome-wide detection, achieving robust performance does come with significant challenges. Success will require expertise in multiple areas, including wet lab protocols for generating high-quality data, bioinformatics for preprocessing and analysis, and machine learning for model training and evaluation.

This process is complex and often iterative, so don’t be discouraged if initial pipelines don’t work as expected. If you have specific questions or challenges, feel free to share them—I’d be happy to provide more targeted advice to help you move forward.

@spoweekkk
Author

spoweekkk commented Jan 18, 2025

Greatly appreciated! I have watched the presentations you provided above, and they helped a lot. I still have one question: I noticed that the randomers are centered on the modified base during training. Taking 5mC as an example, does that mean only the central C is modified while the other C bases in the context remain unmodified? That seems different from the input to Remora training, which needs a fully modified dataset and a control. Also, is the source code not publicly available?
