Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy the tag type text from SAM spec to SAMtags (PR #804) #804

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

SAM section 1.5 clearly defines the standard SAM tag types along with the expanded codes used in the B byte-array type. This text has been copied into the SAMtags document to remove a rather woolly definition there. The text describing lower-case tags has been removed, because this was already discussed in further detail at the end of the SAMtags document (and is itself somewhat woolly when it comes to half-upper half-lower combinations due to the addition of draft tags).

Fixes #798

SAM section 1.5 clearly defines the standard SAM tag types along with the
expanded codes used in the B byte-array type.  This text has been
copied into the SAMtags document to remove a rather woolly definition
there.  The text describing lower-case tags has been removed, because
this was already discussed in further detail at the end of the SAMtags
document (and is itself somewhat woolly when it comes to half-upper
half-lower combinations due to the addition of draft tags).

Fixes samtools#798
Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of 7f4af4b: SAMtags (diff).

@zaeleus
Copy link

zaeleus commented Jan 22, 2025

Why not link to § 1.5 "The alignment section: optional fields", rather than duplicating the serialization format here? The introduction notes that this is a companion document, which associates the two together anyway.

Alternatively, you could replace the type codes with generalized types, e.g., i => integer, Z => string, B,I => uint32_t[], etc. This would remove the need to describe the format.

@jmarshall
Copy link
Member

jmarshall commented Jan 28, 2025

I agree with @zaeleus that it'd be better to have a less woolly1 summary of the types that are actually used in the table and a pointer to SAMv1.pdf “for the complete details” rather than duplicate the complete details here.

I don't think we should use informal types such as integer, string, uint32_t[]. The reason the types are listed in the quick-reference table is that these fields need to be of exactly the specified type, so we do need to be precise.

Footnotes

  1. Less woolly than the current summary in SAMtags.pdf

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jan 28, 2025

If this is woolly then the SAM specification is woolly too as this is literally just replacing the poor text originally here with the SAM specification text. Or maybe I just misunderstand and you're saying this is better than the old text? I simply tightened up what was there before as it had very vague terms ("character" could permit UTF-8, and "hexadecimal array" could imply comma-separated lists of hex values).

We do need to be careful about sizing though. SAM is not specifically tied to 32-bit (signed or otherwise), and we've exploited this in the past when we've gone from 32-bit coordinates to 64-bit coordinates. The BAM serialisation has issues, but the SAM encoding does not and we shouldn't needlessly paint ourselves into corners by over specifying as it restricts migration to better encodings. (Sadly CRAM made those same mistakes though and I regret not fixing them for CRAM 3.0)

Arguably however, and maybe contentiously, if we wish to only have one canonical source for the meaning of the type fields then it perhaps should be SAMtags. The reason it was split in the first place was that the tag encoding is shared between SAM, BAM and CRAM (and now other things like extended FASTQs and GFA). Hence keeping it in its own document.

@zaeleus
Copy link

zaeleus commented Jan 28, 2025

SAM is not specifically tied to 32-bit (signed or otherwise), and we've exploited this in the past when we've gone from 32-bit coordinates to 64-bit coordinates.

SAM positions are tied to 32-bit signed integers:

Col Field Type Regexp/Range Brief description
4 POS Int [0, 231-1] 1-based leftmost mapping POSition

SAM data field values, however, are not:

The number of digits in an integer optional field is not explicitly limited in SAM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: New items
Development

Successfully merging this pull request may close these issues.

Somewhat unclear references to types in SAMtags
3 participants