Skip to content

Queries with the Python API

Leonard McMillan edited this page May 23, 2017 · 3 revisions

Queries with the Python API

Once a MSBWT has been constructed, the primary use case is to query the structure for particular k-length strings called k-mers. This wiki page will describe how to load a MSBWT and then perform simple queries through the python interface.

Loading the MSBWT

Assuming a MSBWT has already been created, we need only the path to the MSBWT directory to load the structure. Note, the first time you load the MSBWT, it will generate several files corresponding to the FM-index data structure and save them in the MSBWT directory. As a result, it may take longer to load for the first time. Below we import the msbwt package into python and then load our MSBWT.

from MUSCython import MultiStringBWTCython as MSBWT
msbwt = MSBWT.loadBWT('/path/to/my/BWT')

Performing Basic Queries

The fundamental query for a MSBWT is a k-mer count. In short, given a specific k-mer, the query returns the number of occurrences of that k-mer in all strings (i.e. reads) in the MSBWT. Below is an example function call of the count routine for a short k-mer.

#the forward counts for 3-mer 'AAA'
forwardCount = msbwt.countOccurrencesOfSeq('AAA')

Additionally, the reverse-complement string can be generated and queried with this API:

#the reverse complement counts for 3-mer 'AAA'; this is the same as the forward counts for 'TTT'
revCompCount = msbwt.countOccurrencesOfSeq(MSBWT.reverseComplement('AAA'))

Chaining Queries

Without going into much detail, the MSBWT performs queries in reverse order. In other words, if we wish to search for k-mer 'ACT', the MSBWT will first search for 'T', then 'CT', and finally 'ACT'. We expose this property of the MSBWT through another function. This function returns a "range" in the MSBWT that can also be used to solve for k-mer counts in place of the previous method.

tLow, tHigh = msbwt.findIndicesOfStr('T')
ctLow, ctHigh = msbwt.findIndicesOfStr('C', (tLow, tHigh))
actLow, actHigh = msbwt.findIndicesOfStr('A', (ctLow, ctHigh))

If you want to retrieve counts this way, simply taking the difference in the high and low will get the same value as msbwt.countOccurrencesOfSeq('ACT').

actCount = actHigh-actLow 

Note there is often more than one way to get the same result through chaining queries.

#the below method returns the same result
actLow, actHigh = msbwt.findIndicesOfStr('AC', (tLow, tHigh))