-
Notifications
You must be signed in to change notification settings - Fork 7
A Multi String Burrows Wheeler Transform Primer
A Burrows-Wheeler Transform (BWT) is a permutation of the characters in a string with some interesting properties:
- The BWT is a reversible permutation of a single string that enables compression through run-length encoding (Burrows and Wheeler, 1994).
- The BWT implicitly represents a suffix array . Suffix arrays store the lexicographic order of every suffix in a single string. This enables searching the string for arbitrary patterns in O(N*log(N)) time for a string of length N (Abouelhodaa, Kurtzb, and Ohlebusch, 2004).
- An auxiliary data structure called the FM-index can be built for a BWT in linear time. This data structure enables arbitrary searches for k-length patterns (k-mers) in O(k) time (Ferragina and Manzini, 2000).
In its most general form, the Multi-String Burrows-Wheeler Transform (MSBWT) is a variant of the Burrows-Wheeler Transform (BWT) that allows for a string collection as opposed to a single string. All of the properties regarding compression and query capability that applied to a BWT are adapted to still work for a MSBWT. The end result is a representation of string collection that is compressed and searchable.
Given a sequencing dataset, we treat each read as a string in a string collection. By building the MSBWT of this string collection, we obtain a compressed, lossless representation of the dataset that can be quickly searched for arbitrary length patterns. Thus the problem shifts to identifying meaningful patterns with a genomic sequencing context. Here is a list of the core benefits from using an MSBWT:
- Data-centered approach - all strings are encoded in the MSBWT; there is no loss of string information and the entire original string collection can be retrieved from the MSBWT if necessary
- Compression - once encoded in a MSBWT, run-length encoding can be used to reduce the size of the MSBWT; compressing the MSBWT does not interfere with query capabilities (in practice, compression actually tends to speed up the search speeds)
- Arbitrary pattern searches - after building the MSBWT and FM-index, any pattern of any length can be searched for; in other words, the MSBWT and FM-index are not fixed to a particular pattern length as some other k-mer based methods are
- Read extraction - given the results from a particular pattern search, the MSBWT can be used to retrieve all reads that contain that particular pattern; this enables many different types of tools centered around direct access to sequencing data
Additionally, here is a list of published applications of the MSBWT to bioinformatics:
- De novo assembly - using k-mer frequencies from an FM-index to build a graphical representation of the data and then assemble it into longer contigs (Simpson and Durbin, 2010)
- Splice junction detection - by comparing the k-mer frequencies in a DNA dataset v. an RNA datasets, one can identify candidate k-mer patterns that might be indicative of splice junctions (Cox et al, 2012)
- Short-read correction - using the MSBWT k-mer frequencies to correct errors in DNA-seq (Greenstein, Holt, and McMillan, 2015)
Holt, J., & McMillan, L. (2014). Merging of multi-string BWTs with applications. Bioinformatics, btu584.
Holt, J., & McMillan, L. (2014, September). Constructing burrows-wheeler transforms of large string collections via merging. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 464-471). ACM.
James Holt - [email protected]