Question about remora training #200
Comments
Thank you for your question! The process for training our internal models has been discussed in previous Nanopore Community Meeting and London Calling presentations. Let me know if you'd like specific links. In summary, we use "randomers": each strand has a unique sequence with a single modified base at a defined position. Our models are typically trained on datasets containing at least 50M unique sequences, often exceeding 100M or even 1B. Unfortunately, the code for generating these datasets is proprietary and not publicly available. This approach is designed to make our production models robust across diverse organisms and conditions, but it is not without limitations. For your case: when training community models, I recommend using training data that closely matches your target conditions (e.g., sequence context, modification density, organism). If you can share more details about your final target, I'd be happy to provide specific suggestions to improve your training approach.
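To make the randomer idea above concrete, here is a minimal sketch of what such a design could look like in silico. It is not the proprietary generator mentioned above; the function names, the 61-base length, and the choice to place the modified base at the center are all assumptions made for illustration only.

```python
import random


def make_randomer(length=61, mod_base="C", rng=None):
    """Return (sequence, mod_position) for one hypothetical randomer.

    Every position is drawn uniformly from ACGT, then the center position
    is forced to `mod_base`. In a real experiment only that center base
    would carry the modification; here we just record its position.
    """
    rng = rng or random.Random()
    center = length // 2
    bases = [rng.choice("ACGT") for _ in range(length)]
    bases[center] = mod_base
    return "".join(bases), center


def make_randomer_set(n, length=61, mod_base="C", seed=0):
    """Generate n distinct randomer sequences (duplicates are discarded)."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seq, _ = make_randomer(length=length, mod_base=mod_base, rng=rng)
        seen.add(seq)
    return sorted(seen)


if __name__ == "__main__":
    for seq in make_randomer_set(3, length=21, seed=1):
        print(seq)
```

Because the flanking bases are random, other C bases can appear in the context; in this sketch they are treated as unmodified, and only the recorded center position is the labeled site.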
Thanks a lot for your reply. I would like the links you mentioned above. Could you share them?
Thank you for following up! Here are some presentations that provide insights into our approaches to model training and modified base detection:
It's exciting that you're working on detecting a novel modification. This is an ambitious and meaningful goal! While it is absolutely possible to train Remora models for genome-wide detection, achieving robust performance comes with significant challenges. Success will require expertise in multiple areas, including wet lab protocols for generating high-quality data, bioinformatics for preprocessing and analysis, and machine learning for model training and evaluation. This process is complex and often iterative, so don't be discouraged if initial pipelines don't work as expected. If you have specific questions or challenges, feel free to share them, and I'd be happy to provide more targeted advice to help you move forward.
Greatly appreciated! I have watched the presentations you provided above, and they helped me a lot. I still have one question: I noticed that, during training, the randomers are centered on the modified base. Taking 5mC as an example, does this mean that only the central C is modified while the other C bases in the context remain unmodified? That seems different from the input to Remora training, which needs a fully modified dataset and a control. Is the source code not open to the public?
Hello, I'm trying to use Remora to train a model for a novel modification. I first trained on just one sequence (about 900 bp) in which every C was modified, together with its unmodified control, and the model performed very well on that same sequence. However, when I used a different sequence as an external validation dataset, the model could not distinguish the modified sample from the control.
Is it feasible to train Remora to recognize a modification using just one sequence? If not, how many sequences did you use to train models for classical modifications such as 5mC? Could you give the scale as a reference?
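One rough way to see why a single 900 bp sequence may not generalize is to count how many distinct sequence contexts around C it contains. The sketch below assumes a 9-base window centered on the C of interest; that window size and the helper names are illustrative choices, not Remora parameters. With 9-base windows there are 4^8 = 65,536 possible C-centered contexts, while a 900 bp sequence has at most 892 windows in total, so it can cover under 1.4% of them.

```python
def centered_contexts(seq, base="C", k=9):
    """Return the set of distinct k-mers in seq whose center base is `base`."""
    half = k // 2
    return {
        seq[i - half : i + half + 1]
        for i in range(half, len(seq) - half)
        if seq[i] == base
    }


def context_coverage(seq, base="C", k=9):
    """Fraction of all possible `base`-centered k-mers that occur in seq."""
    possible = 4 ** (k - 1)  # the center is fixed, the other k-1 bases are free
    return len(centered_contexts(seq, base, k)) / possible


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(900))
    print(f"{context_coverage(seq):.4%} of C-centered 9-mer contexts covered")
```

Running the same function on both the training sequence and the external validation sequence, and intersecting the two context sets, gives a quick view of how little overlap the model has seen.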