-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue 1566 - basemod API and hardclips #1567
Conversation
The change seems reasonable as an interim measure. Would you mind if I did some squashing/rebasing and pushed back to your branch? (I can do it elsewhere if not, no worries, but your PR will show as closed instead of merged when we get around to it.) There is the potential for an improvement, but I'm unsure if it really matters. Specifically if vendors used I'm unsure of exactly how we cure this long term. The fundamental problem of storing sequence base specific information in a tag is keeping it up to date, given tags are optional and lots of tools manipulate BAM files without knowing about all the latest tag types. This is where nearly all the complexity comes from in handling base mods. (We also couldn't put it in the SEQ field as BAM wouldn't accept that.) We could say that tools that do the hard-clipping also do the trimming of the MM/ML tags, but how would you be able to distinguish between tools which have been updated to cope with this tag, and those which haven't? You simply can't. So this seems like a very unwise route to go down. Unfortunately it also means that even if we wanted to be good and get tools that hard-clip to do the right thing, there isn't really a right thing available. So it is indeed feeling like the only solution which is viable is not to count every C'th (or Gth, Ath, etc) base and just case every base - ie the Thoughts? |
Thanks @jkbonfield, fine to squash and rebase in any way that's convenient for you. I'll come back with some thoughts for the points you raised a little later. |
I'm back now with some comments. I do think hardclips should be handled somehow but this isn't an urgent problem since users that care about modifications at supplementary alignments can tell minimap2 to use softclips instead of hardclips with the
I agree we can't ask the SAM producers to update the MM tags but is it worth considering a tool that fixes the MM tag for hardclipped reads? There's precedent with I'm not sure how I feel about switching everything to @cjw85 thoughts? |
I like the idea of a samtools command that fixes up "broken" MM tags. It does immediately raise the need to then encode some extra state as you say. What a nightmare! This would then I think necessitate a need to beef up the API in order to serve clients with the correct interpretation of all the information. I'm in two minds about |
Indeed. One reason I haven't started working on the API again is I'm not 100% sure I know the final requirements yet! All help and suggestions welcomed here. It's hard to know what fix can be applied to a read with hard clipping and MM, unless we extend the MM syntax and the fixmates equiv command turn adds the additional flags as appropriate. So that needs both spec update and API updates. Or it uses N+m style syntax and we just define what that means with hardclips. |
There is ongoing discussion about this on twitter: https://twitter.com/Psy_Fer_/status/1632943849386692608 This appears to show that some tools are doing hard-clipping and creating MM tags in the same pipeline (I assume just the two steps in that order). So this PR would make that correct looking data fail. We're kind of stuck, but I wonder whether the right thing is just to sress that the MM tags count along the sequence as recorded in the SEQ record (albeit in original orientation) - this is already stated in the spec - and for the implementations to have sanity checks to ensure that out of bounds data is reported. So rather than just rejecting the data, for reads with hard-clipping we could check if Nth C base is beyond the end of the recorded seq. It's not ideal from a performance viewpoint, and we may erroneously accept some duff data, but given a sufficient number of sequences it seems highly unlikely that they'd all accidentally pass the checks. Long term we need to figure out a way of recording the difference between MM-corrected and MM-uncorrected (and now broken). I can't think how to do that unless it's a third party piece of data. Eg embedding the sequence length as an error check. If we modify SEQ len in any way and don't make a corresponding edit to MM, then MM is now invalid. It's disappointing as it'd be yet another format change, but I cannot see any other robust way to do it unless it's an additional tag. Eg MZ:i:<seq_len> for Modification-siZe check. Thoughts? |
I'm going to close this as we know it will affect some pipelines which already do the right thing. So stuck between a rock and a hard place of detecting erroneous data vs breaking corrected data. I'll work on an MZ:i alternative tag instead to act as a validation. |
This PR checks whether an aligned read has hard clipped bases, and if so returns -1 from bam_parse_basemod. Fixes #1566