The raw input data consists of short sequences (reads), each around 100 bp long. Prior to the actual data analysis, it is generally advised to clean and trim the reads. These steps are rather mechanical and fairly easy to do, so we will skip ahead to the more interesting parts of the analysis workflow.
The workflow, of course, depends on the research objectives. Some of the typical questions that researchers try to answer are:
- What are the different species of bacteria that are present in the sample?
- In what relative abundance are they present?
- Are there differences in the bacterial community profiles among sites?
- What kinds of antibiotic resistance genes are present? In what proportion?
- What kinds of mobile genetic elements are present? In what proportion?
One approach is to first assemble the reads, that is, to stitch them together to reconstruct the genomes that were fragmented during the sequencing process. This is a challenging and computationally demanding task, so we will not delve into this approach in this workshop.
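What we will explore is the assembly-free approach, in which the reads are compared directly against reference databases to answer the questions posed above. In particular, we will look at the first three questions: which species are present, in what relative abundance, and whether the community profiles differ among sites.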
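
To make the assembly-free idea concrete, here is a minimal toy sketch in Python. It is not the method used by any particular tool: the two "reference genomes", the reads, the k-mer length, and all function names are made up for illustration, and real workflows rely on dedicated classifiers and curated databases. Each read is assigned to the reference with which it shares the most k-mers, the assignments are turned into relative abundances, and two sites are compared using the Bray-Curtis dissimilarity (chosen here only as one common example of a distance between community profiles).

```python
from collections import Counter

K = 4  # toy k-mer length (assumption); real tools use much longer k-mers

# Hypothetical reference "database": species name -> reference sequence.
REFERENCES = {
    "species_A": "ATGCGTACGTTAGCCGATAG",
    "species_B": "TTGACCGGTAACCTTGGAAC",
}


def kmers(seq, k=K):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=K):
    """Map each k-mer to the set of species whose reference contains it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify(read, index, k=K):
    """Assign a read to the species sharing the most k-mers with it.

    Reads with no hits, or with a tie between the top two species,
    are reported as "unclassified".
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return "unclassified"
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "unclassified"
    return ranked[0][0]


def relative_abundance(reads, index):
    """Fraction of reads assigned to each label (Questions 1 and 2)."""
    counts = Counter(classify(read, index) for read in reads)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles (Question 3)."""
    names = set(p) | set(q)
    diff = sum(abs(p.get(n, 0.0) - q.get(n, 0.0)) for n in names)
    total = sum(p.get(n, 0.0) + q.get(n, 0.0) for n in names)
    return diff / total


index = build_index(REFERENCES)

# Made-up reads for two sampling sites.
site_1 = ["GCGTACGTTA", "TAGCCGATAG", "CGTACGTTAG", "GACCGGTAAC"]
site_2 = ["CCTTGGAAC", "TTGACCGGTA", "ATGCGTACG", "GGGGGGGG"]

profile_1 = relative_abundance(site_1, index)
profile_2 = relative_abundance(site_2, index)

print(profile_1)
print(profile_2)
print(bray_curtis(profile_1, profile_2))  # 0.5 for this toy data
```

In this toy data, three of the four reads from the first site match `species_A`, while the second site is dominated by `species_B` and contains one read that matches nothing in the database. A dissimilarity of 0 would mean identical profiles and 1 would mean no shared taxa.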