Feature addition request #37

methylnick · 2018-11-02T00:09:25Z

Thinking of adding a sample contamination check into the pipeline to get an assessment on sample purity.

This will become an increasing issue for those working in microbiome/host genomics, and also for xenograft experiments (e.g. human/mouse).

One tool I have used is FastQ Screen:
https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

This is just a suggestion; I am sure there are other equivalent tools.

serine · 2018-11-14T02:25:06Z

Adding another tool that relates to this thread: http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools

I think this tool can sample a FASTQ file and screen the reads for contamination from different species.

pansapiens · 2018-12-11T00:54:27Z

I've been experimenting with mash screen: https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes

For all these tools you need some kind of reference database(s) to screen against, which can involve different amounts of mucking around to set up properly depending on the tool (e.g. Bowtie indices vs pre-computed Bloom filter databases vs pre-computed 'sketch' indices).
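To make the 'sketch' idea a bit more concrete, here is a toy illustration of containment screening. It is not how Mash is implemented (Mash uses canonical k-mers and a different hash function, and the function names, k-mer size and sketch size below are all assumptions chosen for illustration); it only shows why a small pre-computed sketch of a reference can be checked against a read set.

```python
import hashlib
from typing import Set


def kmer_hashes(seq: str, k: int = 7) -> Set[int]:
    """Hash every k-mer of a sequence to a 64-bit integer (toy: no canonical k-mers)."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(seq: str, k: int = 7, size: int = 50) -> Set[int]:
    """Bottom-`size` sketch: keep only the smallest k-mer hashes of the reference."""
    return set(sorted(kmer_hashes(seq, k))[:size])


def containment(reference_sketch: Set[int], reads: str, k: int = 7) -> float:
    """Fraction of the reference sketch hashes that also occur among the read k-mers."""
    if not reference_sketch:
        return 0.0
    read_hashes = kmer_hashes(reads, k)
    return len(reference_sketch & read_hashes) / len(reference_sketch)
```

The sketch is computed once per reference and is much smaller than the genome, which is roughly why a single sketch file can cover a large collection of genomes.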

mash screen

Pros: single (relatively) small reference database (RefSeq genomes) is provided, simplifying setting up a pretty comprehensive screening db. Pretty fast.

Cons: no MultiQC plugin (yet)
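Until there is a MultiQC plugin, the tab-separated output of mash screen is easy enough to post-process by hand. The sketch below assumes the column order described in the Mash tutorial (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment); the function name, the `ScreenHit` type and the identity threshold are hypothetical, and the command in the comment uses placeholder file names, so check `mash screen -h` for your version.

```python
from typing import List, NamedTuple

# Assumed invocation (placeholder file names):
#   mash screen -w -p 4 refseq.genomes.k21s1000.msh reads.fastq.gz > screen.tab


class ScreenHit(NamedTuple):
    identity: float
    shared_hashes: str        # e.g. "991/1000"
    median_multiplicity: int
    p_value: float
    query_id: str
    query_comment: str


def parse_mash_screen(text: str, min_identity: float = 0.9) -> List[ScreenHit]:
    """Parse `mash screen` output and keep hits at or above `min_identity`."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        hit = ScreenHit(
            identity=float(fields[0]),
            shared_hashes=fields[1],
            median_multiplicity=int(fields[2]),
            p_value=float(fields[3]),
            query_id=fields[4],
            query_comment="\t".join(fields[5:]),
        )
        if hit.identity >= min_identity:
            hits.append(hit)
    # Highest identity first, so the most likely contaminants are at the top.
    return sorted(hits, key=lambda h: h.identity, reverse=True)
```

Something like `parse_mash_screen(open("screen.tab").read())` would then give a short list of candidate organisms to report alongside the other QC output.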

fastq_screen

Pros: MultiQC support. Database download is now simple (but huge and slow) since they've added the --get_genomes option (it used to involve more mucking around, which is why I previously felt RNAsik should explore another option). The reference databases provided by fastq_screen are probably better than the mash RefSeq database for routine screening, since the fastq_screen databases are built for the task and include common contaminants, adapters etc. as well as model organisms.

Cons: The precomputed reference database might not be as comprehensive as the mash RefSeq database for detecting more obscure organisms. Bowtie / BWA dependency, though this is not a big issue now that we are recommending conda as the supported deployment method.
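For reference, a pipeline wrapper around fastq_screen could look roughly like the sketch below. This is not RNAsik code: the helper name, defaults and file names are hypothetical, and the options used (--aligner, --subset, --threads, --outdir, --conf) should be checked against `fastq_screen --help` for the installed version. The reference genomes would be fetched once beforehand with `fastq_screen --get_genomes`.

```python
import shlex
from typing import List, Optional


def fastq_screen_cmd(
    fastq: str,
    conf: Optional[str] = None,
    aligner: str = "bowtie2",
    subset: int = 100000,
    threads: int = 4,
    outdir: str = "fastq_screen",
) -> List[str]:
    """Build an argument list for a fastq_screen run on one FASTQ file."""
    cmd = [
        "fastq_screen",
        "--aligner", aligner,
        "--subset", str(subset),   # screen a subsample of reads to keep it quick
        "--threads", str(threads),
        "--outdir", outdir,
    ]
    if conf:
        # Config file listing the reference databases to screen against.
        cmd += ["--conf", conf]
    cmd.append(fastq)
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(fastq_screen_cmd("sample_R1.fastq.gz", conf="fastq_screen.conf")))
```

Building the command as a list avoids shell-quoting problems with odd file names.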

biobloom

Pros: MultiQC support.

Cons: As far as I can tell, no precomputed reference databases are provided.

pansapiens · 2019-03-07T00:55:24Z

Another option to consider (designed more for human data with potential microbial contamination):

PathSeq

http://software.broadinstitute.org/pathseq/
https://software.broadinstitute.org/gatk/documentation/article?id=10913

Doesn't appear to be in bioconda, so probably a non-starter :/
