-
Notifications
You must be signed in to change notification settings - Fork 7
Queries with the Python API
Once a MSBWT has been constructed, the primary use case is to query the structure for particular k-length strings called k-mers. This wiki page will describe how to load a MSBWT and then perform simple queries through the python interface.
Assuming a MSBWT has already been created, we need only the path to the MSBWT directory to load the structure. Note, the first time you load the MSBWT, it will generate several files corresponding to the FM-index data structure and save them in the MSBWT directory. As a result, it may take longer to load for the first time. Below we import the msbwt package into python and then load our MSBWT.
from MUSCython import MultiStringBWTCython as MSBWT
msbwt = MSBWT.loadBWT('/path/to/my/BWT')
The fundamental query for a MSBWT is a k-mer count. In short, given a specific k-mer, the query returns the number of occurrences of that k-mer in all strings (i.e. reads) in the MSBWT. Below is an example function call of the count routine for a short k-mer.
#the forward counts for 3-mer 'AAA'
forwardCount = msbwt.countOccurrencesOfSeq('AAA')
Additionally, the reverse-complement string can be generated and queried with this API:
#the reverse complement counts for 3-mer 'AAA'; this is the same as the forward counts for 'TTT'
revCompCount = msbwt.countOccurrencesOfSeq(MSBWT.reverseComplement('AAA'))
Without going into much detail, the MSBWT performs queries in reverse order. In other words, if we wish to search for k-mer 'ACT', the MSBWT will first search for 'T', then 'CT', and finally 'ACT'. We expose this property of the MSBWT through another function. This function returns a "range" in the MSBWT that can also be used to solve for k-mer counts in place of the previous method.
tLow, tHigh = msbwt.findIndicesOfStr('T')
ctLow, ctHigh = msbwt.findIndicesOfStr('C', (tLow, tHigh))
actLow, actHigh = msbwt.findIndicesOfStr('A', (ctLow, ctHigh))
If you want to retrieve counts this way, simply taking the difference in the high and low will get the same value as msbwt.countOccurrencesOfSeq('ACT').
actCount = actHigh-actLow
Note there is often more than one way to get the same result through chaining queries.
#the below method returns the same result
actLow, actHigh = msbwt.findIndicesOfStr('AC', (tLow, tHigh))
Holt, J., & McMillan, L. (2014). Merging of multi-string BWTs with applications. Bioinformatics, btu584.
Holt, J., & McMillan, L. (2014, September). Constructing burrows-wheeler transforms of large string collections via merging. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 464-471). ACM.
James Holt - [email protected]