Feature addition request #37

Open
methylnick opened this issue Nov 2, 2018 · 3 comments

Comments

@methylnick

I'm thinking of adding a sample contamination check to the pipeline to assess sample purity.

This will become an increasing issue for those working in microbiome/host genomics, and also for xenograft experiments (human/mouse), for example.

One tool I have used is FastQ Screen:
https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

This is just a suggestion; I am sure there are other equivalent tools.

@serine
Collaborator

serine commented Nov 14, 2018

Adding another tool that relates to this thread: http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools

I think this tool can sample reads from a FASTQ file and BLAST them to detect contamination from different species.

@pansapiens
Contributor

pansapiens commented Dec 11, 2018

I've been experimenting with mash screen: https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes

For all these tools you need some kind of reference database(s) to screen against, which can involve different amounts of mucking around to set up properly depending on the tool (e.g. Bowtie indices vs pre-computed Bloom filter databases vs pre-computed 'sketch' indices).

mash screen

Pros: a single, relatively small reference database (RefSeq genomes) is provided, which simplifies setting up a fairly comprehensive screening database. Pretty fast.

Cons: no MultiQC plugin (yet)
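
In the absence of a MultiQC plugin, the mash screen output is simple enough to summarise by hand. Below is a rough sketch, not anything RNAsik does today. It assumes the tab-separated column order described in the mash tutorial (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment); the function name, the `min_identity` threshold and the example rows are made up for illustration.

```python
import io
from typing import List, NamedTuple, TextIO


class MashHit(NamedTuple):
    identity: float
    shared_hashes: int
    total_hashes: int
    median_multiplicity: float
    p_value: float
    query_id: str
    query_comment: str


def parse_mash_screen(handle: TextIO, min_identity: float = 0.9) -> List[MashHit]:
    """Parse `mash screen` output and return hits sorted by identity (best first).

    Assumed columns: identity, shared/total hashes, median multiplicity,
    p-value, query ID, query comment (tab-separated).
    """
    hits = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        shared, total = fields[1].split("/")
        hit = MashHit(
            identity=float(fields[0]),
            shared_hashes=int(shared),
            total_hashes=int(total),
            median_multiplicity=float(fields[2]),
            p_value=float(fields[3]),
            query_id=fields[4],
            query_comment=fields[5] if len(fields) > 5 else "",
        )
        # min_identity is an arbitrary illustrative cutoff, not a recommendation
        if hit.identity >= min_identity:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.identity, reverse=True)


# Made-up rows, only to show the expected shape of the input.
EXAMPLE = (
    "0.99\t800/1000\t12\t0\tgenomeA.fna.gz\tSpecies A\n"
    "0.85\t40/1000\t1\t1e-20\tgenomeB.fna.gz\tSpecies B\n"
    "0.95\t300/1000\t3\t0\tgenomeC.fna.gz\tSpecies C\n"
)

if __name__ == "__main__":
    for hit in parse_mash_screen(io.StringIO(EXAMPLE)):
        print(f"{hit.identity:.3f}\t{hit.shared_hashes}/{hit.total_hashes}\t{hit.query_comment}")
```

A table like this could be written next to the other QC outputs until proper report support exists.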

fastq_screen

Pros: MultiQC support. Database download is now simple (but huge and slow) since they've added the --get_genomes option (it used to involve more mucking around, which is why I previously felt RNAsik should explore another option). The reference databases provided by fastq_screen are probably better than the mash RefSeq database for routine screening, since the fastq_screen databases are built for the task and include common contaminants, adapters, etc., as well as model organisms.

Cons: Precomputed reference database might not be as comprehensive as the mash RefSeq database for detecting more obscure organisms. Bowtie / BWA dependency, which is not a big issue now that we are recommending conda as the supported deployment method.
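
To make the fastq_screen option a bit more concrete, here is a minimal sketch of how a wrapper might assemble the command line. It only builds the argument list and does not run anything. The helper name and the file paths are hypothetical, and the flags (`--conf`, `--aligner`, `--subset`, `--threads`, `--outdir`) are as I understand them from the fastq_screen documentation, so check `fastq_screen --help` for your installed version before relying on them.

```python
from typing import List, Optional


def fastq_screen_cmd(
    fastq: str,
    conf: str,
    aligner: str = "bowtie2",
    subset: int = 100000,
    threads: int = 1,
    outdir: Optional[str] = None,
) -> List[str]:
    """Build a fastq_screen argument list (hypothetical helper, flags assumed)."""
    cmd = [
        "fastq_screen",
        "--conf", conf,             # config pointing at the reference indices
        "--aligner", aligner,       # e.g. bowtie2 or bwa
        "--subset", str(subset),    # number of reads to sample for screening
        "--threads", str(threads),
    ]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd.append(fastq)
    return cmd


if __name__ == "__main__":
    # Paths are placeholders; the config would come from a prior
    # `fastq_screen --get_genomes` download.
    print(" ".join(fastq_screen_cmd("sample_R1.fastq.gz", "fastq_screen.conf", threads=4)))
```

The resulting list can be passed to `subprocess.run` or templated into a pipeline step.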

biobloom

Pros: MultiQC support.

Cons: As far as I can tell, no precomputed reference databases are provided.

@pansapiens
Contributor

Another option to consider (designed more for human data with potential microbial contamination):

PathSeq

http://software.broadinstitute.org/pathseq/
https://software.broadinstitute.org/gatk/documentation/article?id=10913

Doesn't appear to be in bioconda, so probably a non-starter :/


3 participants