-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
help combining bismark pipeline with methylated UMIs #680
Comments
Hi there, apologies for the slow response. UMIbam is indeed a tool for UMI-aware deduplication, kind of a slightly more capable version of What you need here is a way of transferring the UMI-sequence from the beginning of both reads to the readIDs, and then also remove the 2 additional bp, Trim Galore already has dedicated methods of doing this for the the
It should be easy enough to implement this additional trimming, you could for example use the code for |
It seems like the --clock is most similar to what I am doing because I have UMIs on both ends. I have no "fixed" bases so that part of the code could be removed. Hopefully I can find someone to implement this type of code change. Do the following steps sound right:
Am I missing anything important on the trimming step? For deduplication, I would use UMIbam instead of deduplicate_bismark in my pipeline. |
I think you can start with Steps 2+3 (this can be done in one go), and then proceed with Step 4 (which will remove adapter and quality). If you add both barcodes from R1 and R2 as a string after a colon together,
If you can't find someone to apply a local fix I can probably take a look for you. |
I just got my data and I want to confirm how the RID would need to be formatted so that the current
So the UMI for R1 in this case is GNATA and the UMI for R2 is CNCTG. Not a great example because of the N's. You had previously said it needed to be formatted as
So does this mean my RIDs should look like the following:
Or do I also need to remove the P7 and P5 barcodes from the RID? |
Yes that's correct, it will take take the last element after the last |
Hi Felix! I will be the one helping @shawpa with this task. Right now, the functionality we are looking to implement is to remove the first N bases of R1 and R2 and then appending those to each of the read ids as Annie showed above. We are also looking to be able to receive either a list or file of "valid" UMIs and if either R1 or R2 has a UMI that is not valid, throw away the pair. I've used cutadapt/trim_galore a bunch, but never added functionality. I would love to add the functionality and make it open source, what is the best way of going about this? |
Hi Zach, This is a very interesting idea, incidentally we have talked about Twist libraries and also libraries from Watchmaker Genomics that will apparently have a similar in-line barcoding strategy for both R1 and R2. As it turns out, it seem I may have a real interest in making this become reality myself :) My first instinct would probably be to have such a 5' inline barcode transfer function (for a variable number of nucleotides) set up as a stand-alone function (for both SE and PE), and potentially one that can be called as part of the trimming process. The latter would require a little more thinking though.... Adding functionality to validate UMIs against a UMI-allowed or whitelist seems like a rather particular, application-specific step and I wonder if this isn't best handled afterwards using a post-processing script of some sort? I could probably have a go at UMI/inline-barcode transfer function as a stand-alone process (similar to e.g. |
Hey Felix, sorry for the late response. If you are up for it, then by all means give it a go! I agree about validating UMI as a post-processing script. We will handle that separately. If there is anything I can do to help, please let me know. Thank you for help |
I have a whole genome em-seq methylation library with dual indexes as well as methylated UMI's at the beginning of each R1 and R2. These are from Twist and were used to make EM-seq libraries. I am not sure how I need to modify my current bismark pipeline to use the --barcode option when I do the deduplication. I saw in a previous thread that you have a program called UMI-Grinder but I'm not sure where it fits (if at all) in the pipeline. Normally I do the following:
The UMIs are going to be the first 5 bases of each read1 and read2. The next 2 bases are supposed to be skipped and then my genomic sequencing should begin. I looked at the UMI-Grinder documention but the input seems to be a bam. Not sure how I get to a bam from a fastq without aligning and I can't align because the UMI's at the beginning of the read. Thank you for your help.
The text was updated successfully, but these errors were encountered: