Skip to content

Latest commit

 

History

History
94 lines (77 loc) · 5.49 KB

README.md

File metadata and controls

94 lines (77 loc) · 5.49 KB

FUNGI_ITS_TAXONOMY

This is a pipeline implemented in nextflow to identify fungus through ITS2 metabarcoding generated in S5 IonTorrent. The outputs (one csv and one html report) are stored in the final_report folder inside the output directory.

Prerequisites

  • OS Ubuntu
  • Docker tutorial
  • Nextflow (version 23.10.0) tutorial
  • Nextflow deals with image download and container run by itself.

Database

The database used for fungi taxonomy included in this repository was download from UNITE QIIME release_29.11.2022 for Fungi. It includes the trainned classifier for quay.io/qiime2/core:2023.7. Files are stored in the arquivos_db folder in gz format. The pipeline itself decompress the files.

Pipeline

The pipeline is executed through quay.io/qiime2/core:2023.7 image and includes a python script for data preprocessing and a R script to generate the html report through quarto based on ngsReports package. Those files are stored in bin directory.

Parameters

  • Due to variability in the length of the sequences for ITS,we opted for not using the parameter --p-trunc-len = 0 tutorial;
  • As the pattern shown by Ion S5 fastq files quality, the default for parameters p_quality_cutoff_5end and p_quality_cutoff_3end are 20;
  • p_error_rate = 0.2
  • p_minimum_length = 80
  • p_max_ee = 2
  • Percent identity for qiime feature-classifier classify-consensus-vsearch = 0.99
  • Percent identity for qiime feature-classifier classify-consensus-blast = 0.99
  • Percent identity for qiime feature-classifier classify-sklearn = 0.99
  • Percent identity for qiime feature-classifier classify-sklearn --p-reads-per-batch = 10000
  • Primer sequence for qiime cutadapt trim-single --i-demultiplexed-sequences --p-front GGAAGTAAAAGTCGTAACAAGG

Usage

Help message

nextflow run github/its_pipeline/main.nf --help

Usage:
 The typical command for running the pipeline is as follows:

 nextflow run main.nf --primer_seq 'primer_sequence' \
 	--fastq_folder '/path/to/my/fastq_files' \
 	--trainned_classifier 'full_path/and_name/to/my/trainned_classifier' \
 	--ref_reads 'full_path/to/reference_reads.fasta' \
 	--tax_file 'full_path/to/tax_file.txt' \
 	--outdir 'output-folder-name' \
 	--threads "15"

 If using default parameters, you only need to provide "fastq_folder" and a csv file (named samplesheet.csv)
 containing the sample names in the first column named "sample", reads1 in the second column named "r1" and
 a third column empty column "r2". The .csv fie must be placed inside the `--fastq_folder`.




 Mandatory arguments:
  --fastq_folder                 Folder with fastq files (full path required)
  --primer_seq                   Primer sequence ("AACTCCG") [must be surrounded with quotes]
                                 [Default: "GGAAGTAAAAGTCGTAACAAGG"]
  --trainned_classifier          full_path/to/trainned_classifier.qza (full path required)
                                 [Default: trainned_qiime-2023.7_ver9_99_s_all_29.11.2022_dev.qza.gz]
  --ref_reads                    Full path to reference reads file (.fasta) (full path required)
                                 [Default: sh_refs_qiime_ver9_99_s_all_29.11.2022_dev.fasta.gz]
  --tax_file                     Full path to taxonomy file (.txt) (full path required)
                                 [Default: sh_taxonomy_qiime_ver9_99_s_all_29.11.2022_dev.txt.gz]
  --outdir                       Output folder name

Optional arguments:
 --threads                      Number of CPUs to use [Default: 15]

   qiime cutadapt trim-single arguments
   --p_quality_cutoff_5end        Quality cut off in 5' end [Default: 20]
   --p_quality_cutoff_3end        Quality cut off in 3' end [Default: 20]
   --p_error_rate                 Maximum allowed error rate [Default: 0.2]
   --p_minimum_length             Discard reads shorter than specified value. Note, the 
                                  cutadapt default of 0 has been overridden, because
                                  that value produces empty sequence records [Default: 80]

   qiime dada2 denoise-single arguments
   --p_max_ee                     Reads with number of expected errors higher than
                                  this value will be discarded [Default: 2] 
   --p_trunc_len                  Position at which sequences should be truncated due
                                  to decrease in quality. This truncates the 3' end of
                                  the of the input sequences, which will be the bases
                                  that were sequenced in the last cycles. Reads that
                                  are shorter than this value will be discarded. If 0
                                  is provided, no truncation or length filtering will
                                  be performed [Default: 0]

Run the pipeline

You need to provide at least the --fastq_folder parameter.

nextflow run github/its_pipeline/main.nf --fastq_folder 'its_reads'

Output files

The default name for the output dir is "results". Inside the folder you'll find a directory called "final_report" (see an example in the final_report dir). It contains a ".html" file reporting quality control and denoise statistics, alpha-rarefaction curves and a taxonomy table comparing results from classify-consensus-vsearch, classify-consensus-blast and classify-sklearn.