Output files

maelyg edited this page Nov 6, 2023 · 55 revisions

Preprocessing and host-filtering

A quality check is performed on the raw fastq file using NanoPlot, a tool that produces general quality metrics such as the quality score distribution, read lengths and other summary statistics. A NanoPlot-report.html file will be saved under the nanoplot folder with the prefix raw. This report displays 6 plots as well as a table of summary statistics.

Example of output plots:

A preprocessed fastq file will be saved in the preprocessing output directory. At a minimum, its read names will be trimmed after the first whitespace, for compatibility with all downstream tools. This fastq file might additionally be trimmed of adapters and/or filtered based on quality and length (if PoreChopABI and/or Chopper were run).
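The read-name trimming described above can be illustrated with a minimal sketch. It is not the pipeline's own code: the function name `trim_read_names` is hypothetical, and the sketch assumes a well-formed FASTQ with exactly four lines per record.

```python
def trim_read_names(lines):
    """Yield FASTQ lines with each header cut at the first whitespace.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            # Header line: keep only the token before the first whitespace.
            yield line.split(None, 1)[0] + "\n"
        else:
            # Sequence, separator and quality lines are passed through unchanged.
            yield line
```

For example, a header such as `@read1 runid=abc ch=5` would become `@read1`, while the sequence and quality lines stay as they are.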

After adapter trimming, a PoreChopABI log will be saved under the preprocessing/porechop folder.

After quality/length trimming, a Chopper log file will be saved under the preprocessing/chopper folder.

If adapter trimming and/or quality/length trimming is performed, a second quality check will be performed on the processed fastq file and a NanoPlot-report.html file will be saved under the nanoplot folder with the prefix filtered.

If host filtering of reads is performed, an unaligned.fastq.gz file and an unaligned_reads_count.txt file listing the remaining read count will be saved under the host_filtering folder.

If the adapter trimming, quality filtering and/or host filtering options have been run, a QC report called run_qc_report_YYYYMMDD-HHMMSS.txt will be saved under the qc_report folder.
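When several QC reports accumulate in the qc_report folder, the timestamp in the file name can be used to order them. The sketch below parses the YYYYMMDD-HHMMSS part of the name; the function name `qc_report_timestamp` is hypothetical and not part of the pipeline.

```python
import re
from datetime import datetime


def qc_report_timestamp(filename):
    """Return the datetime encoded in a run_qc_report_YYYYMMDD-HHMMSS.txt name."""
    match = re.fullmatch(r"run_qc_report_(\d{8}-\d{6})\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected QC report name: {filename}")
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
```

Sorting a list of report names with `key=qc_report_timestamp` would then place the most recent report last.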

Example of report:

Sample raw_reads quality_filtered_reads host_filtered_reads percent_quality_filtered percent_host_filtered
MT010 315868 315081 200632 99.75 63.52
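The figures in this example are consistent with both percentages being expressed relative to raw_reads. That reading is inferred from the numbers above, not confirmed from the pipeline code. A minimal sketch that parses the report and recomputes the two percentages under that assumption is shown here; the function names are hypothetical, and the sketch assumes whitespace-separated columns and sample names without spaces.

```python
def parse_qc_report(text):
    """Parse a whitespace-separated QC report into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def recompute_percentages(row):
    """Recompute both percentages relative to raw_reads (an assumption, see above)."""
    raw = int(row["raw_reads"])
    pct_quality = round(100 * int(row["quality_filtered_reads"]) / raw, 2)
    pct_host = round(100 * int(row["host_filtered_reads"]) / raw, 2)
    return pct_quality, pct_host
```

Applied to the MT010 row, this reproduces the reported values of 99.75 and 63.52.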

Read classification mode

Nucleotide taxonomic classification (Kraken 2 and Bracken)

After running Kraken2, three files will be saved under the read_classification/kraken folder. The Kraken2 standard output file classifies each sequence. Check the Kraken2 manual for details about each table field.
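As a rough illustration of how the per-sequence Kraken2 output can be read, the sketch below assumes the default five tab-separated fields described in the Kraken2 manual (classification status C/U, sequence ID, taxonomy ID, sequence length, LCA mapping). The function names are hypothetical, and the taxonomy ID is kept as a string because its format changes if Kraken2 was run with name reporting enabled.

```python
from collections import Counter


def parse_kraken2_line(line):
    """Split one line of Kraken2 standard output into its five fields."""
    status, seq_id, taxid, length, lca = line.rstrip("\n").split("\t", 4)
    return {
        "classified": status == "C",
        "seq_id": seq_id,
        "taxid": taxid,
        "length": int(length),  # single-end reads only; paired reads use "len1|len2"
        "lca": lca,
    }


def count_classified_by_taxid(lines):
    """Count classified reads per taxonomy ID, ignoring unclassified reads."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = parse_kraken2_line(line)
        if record["classified"]:
            counts[record["taxid"]] += 1
    return counts
```

Passing the lines of the .kraken2 file to `count_classified_by_taxid` gives a quick per-taxon read tally that can be compared against the Kraken report.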

De novo assembly mode

In this mode, the assembly created by Canu will be saved under the assembly/canu folder. In the contig headers, the length of each contig and the number of reads that contributed to it are specified.
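The contig length and read count can be pulled out of the headers programmatically. The sketch below assumes the headers keep Canu's usual `key=value` layout, for example `>tig00000001 len=4361 reads=12 class=contig`. The pipeline may rename or reformat headers, so this layout is an assumption, and `parse_contig_header` is a hypothetical helper.

```python
def parse_contig_header(header):
    """Return (contig name, length, read count) from a Canu-style FASTA header.

    Assumes space-separated key=value attributes including len= and reads=.
    """
    name, *tokens = header.lstrip(">").split()
    attributes = dict(token.split("=", 1) for token in tokens if "=" in token)
    return name, int(attributes["len"]), int(attributes["reads"])
```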

The Canu log is also saved under the same folder.

An example of a successful Canu log is included below:

-- Finished on Wed Oct 25 22:37:27 2023 (furiously fast) with 50236.869 GB free disk space
----------------------------------------
-- Finished stage 'generateOutputs', reset canuIteration.
--
-- Assembly 'MT483' finished in 'MT483'.
--
-- Summary saved in 'MT483.report'.
--
-- Sequences saved:
--   Contigs       -> 'MT483.contigs.fasta'
--   Unassembled   -> 'MT483.unassembled.fasta'
--
-- Read layouts saved:
--   Contigs       -> 'MT483.contigs.layout'.
--
-- Bye.
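A quick way to screen many Canu logs is to look for the "Assembly ... finished" line shown in the example above. This sketch assumes that a successful run always prints that line in the same form, which is inferred from this one example only; `canu_finished_assembly` is a hypothetical helper.

```python
import re

# Matches lines such as: -- Assembly 'MT483' finished in 'MT483'.
FINISHED_LINE = re.compile(
    r"^-- Assembly '([^']+)' finished in '([^']+)'\.\s*$", re.MULTILINE
)


def canu_finished_assembly(log_text):
    """Return the assembly name if the log reports completion, otherwise None."""
    match = FINISHED_LINE.search(log_text)
    return match.group(1) if match else None
```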

Similarly, a fasta file containing the Flye assembly and a log of the assembly process will be saved under assembly/flye. If the Flye assembly ran successfully, it will output some assembly statistics at the end of the log, as per the example below:

[2023-10-30 11:56:06] root: INFO: Assembly statistics:

	Total length:	105982
	Fragments:	38
	Fragments N50:	2768
	Largest frg:	4361
	Scaffolds:	3
	Mean coverage:	89
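These statistics can be collected from the log into a dictionary, for instance to compare samples. The sketch below assumes the statistics block always follows an "Assembly statistics:" line and uses the "name: integer" layout of the example above; `parse_flye_stats` is a hypothetical helper.

```python
import re


def parse_flye_stats(log_text):
    """Return the final assembly statistics of a Flye log as a dict of integers.

    Returns an empty dict when no "Assembly statistics:" line is present,
    which may indicate that the assembly did not complete.
    """
    _, found, tail = log_text.rpartition("Assembly statistics:")
    if not found:
        return {}
    stats = {}
    for line in tail.splitlines():
        match = re.match(r"^\s*([A-Za-z0-9 ]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2))
    return stats
```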

If a final primer check was performed, then a SampleName_assembly_filtered.fa file will be saved under the assembly folder along with the log of the cutadapt filtering step.

If a blast homology search of the contigs was performed against a database, the results will be saved under the assembly/blastn folder. All the top hits derived for each contig are listed under the file SampleName_assembly_blastn_top_hits.txt. This file contains the following 26 columns.

If virus or viroid hits are present in the top hits file, they will be listed under the SampleName_assembly_blastn_top_viral_hits.txt file.
If multiple contigs are recovered for the same viral species, the best hit will be listed under SampleName_assembly_blastn_top_viral_spp_hits.txt.
The SampleName_assembly_viral_spp_abundance.txt will list the number of contigs recovered for each viral species.
In the example below, 2 contigs were recovered matching to the Tomato spotted wilt orthotospovirus:

species	Count
Tomato spotted wilt orthotospovirus	2

Finally, the SampleName_assembly_queryid_list_with_viral_match.txt will list each unique accession ID detected in the sample, the viral species it corresponds to, the number of contigs matching it, and their IDs.
In the example below, 2 contigs were recovered, each matching a different accession number. The 2 accessions correspond to 2 separate segments (L and E) of the tomato spotted wilt orthotospovirus.

species	sacc	count	qseqid
Tomato spotted wilt orthotospovirus	OM112200	1	contig_3
Tomato spotted wilt orthotospovirus	OM112202	1	contig_4
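The per-species contig counts can be cross-checked against this table. The sketch below assumes the file is tab-separated with the four columns shown above and that each row carries a single contig ID in qseqid, as in the example. How several IDs are delimited within one row is not specified here, so that case is not handled. `contigs_per_species` is a hypothetical helper.

```python
import csv
import io


def contigs_per_species(table_text):
    """Count distinct contig IDs per species in a queryid_list_with_viral_match table."""
    contigs = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # A set avoids double counting a contig that matches several accessions.
        contigs.setdefault(row["species"], set()).add(row["qseqid"])
    return {species: len(ids) for species, ids in contigs.items()}
```

For the example above this gives a count of 2 for Tomato spotted wilt orthotospovirus, in agreement with the viral_spp_abundance example.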

Clustering mode

Blast to reference mode

Results folder structure

results
├── MT001
│   ├── host_filtering
│   │   ├── MT001_unaligned.fastq.gz
│   │   └── MT001_unaligned_reads_count.txt
│   ├── nanoplot
│   │   ├── MT001_filtered_LengthvsQualityScatterPlot_dot.html
│   │   ├── MT001_filtered_NanoPlot-report.html
│   │   ├── MT001_filtered_NanoStats.txt
│   │   ├── MT001_filtered_Non_weightedHistogramReadlength.html
│   │   ├── MT001_filtered_Non_weightedLogTransformed_HistogramReadlength.html
│   │   ├── MT001_filtered_WeightedHistogramReadlength.html
│   │   ├── MT001_filtered_WeightedLogTransformed_HistogramReadlength.html
│   │   ├── MT001_filtered_Yield_By_Length.html
│   │   ├── MT001_raw_LengthvsQualityScatterPlot_dot.html
│   │   ├── MT001_raw_NanoPlot-report.html
│   │   ├── MT001_raw_NanoStats.txt
│   │   ├── MT001_raw_Non_weightedHistogramReadlength.html
│   │   ├── MT001_raw_Non_weightedLogTransformed_HistogramReadlength.html
│   │   ├── MT001_raw_WeightedHistogramReadlength.html
│   │   ├── MT001_raw_WeightedLogTransformed_HistogramReadlength.html
│   │   └── MT001_raw_Yield_By_Length.html
│   ├── preprocessing
│   │   ├── MT001_preprocessed.fastq.gz
│   │   └── porechop
│   │       └── MT001_porechop.log
│   ├── assembly
│   │   ├── blastn
│   │   │   ├── MT001_assembly_blastn_top_hits.txt
│   │   │   ├── MT001_assembly_blastn_top_viral_hits.txt
│   │   │   ├── MT001_assembly_blastn_top_viral_spp_hits.txt
│   │   │   ├── MT001_assembly_queryid_list_with_viral_match.txt
│   │   │   └── MT001_assembly_viral_spp_abundance.txt
│   │   ├── canu
│   │   │   ├── MT001_canu_assembly.fasta
│   │   │   ├── MT001_canu.fastq
│   │   │   └── MT001.canu.log
│   │   ├── MT001_canu_assembly_filtered.fa
│   │   └── MT001_cutadapt.log
│   └── read_classification
│       ├── bracken
│       │   ├── MT001_bracken_report.txt
│       │   ├── MT001_bracken_report_viral_filtered.txt
│       │   └── MT001_bracken_report_viral.txt
│       ├── homology_search
│       │   ├── MT001_read_classification_blastn_top_hits.txt
│       │   ├── MT001_read_classification_blastn_top_viral_hits.txt
│       │   ├── MT001_read_classification_blastn_top_viral_spp_hits.txt
│       │   ├── MT001_read_classification_queryid_list_with_viral_match.txt
│       │   └── MT001_read_classification_viral_spp_abundance.txt
│       ├── kaiju
│       │   ├── MT001_kaiju.krona
│       │   ├── MT001_kaiju_name.tsv
│       │   ├── MT001_kaiju_summary.tsv
│       │   ├── MT001_kaiju_summary_viral_filtered.tsv
│       │   └── MT001_kaiju_summary_viral.tsv
│       ├── kraken
│       │   ├── MT001.kraken2
│       │   ├── MT001_kraken_report.txt
│       │   └── MT001_seq_ids.txt
│       └── krona
│           └── MT001_krona.html
└── qc_report
    └── run_qc_report_20231009-114823.txt
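Given this layout, output files of one kind can be gathered across samples with a short script. The sketch below assumes every directory under results other than qc_report is a sample folder, as in the tree above; `collect_by_sample` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def collect_by_sample(results_dir, pattern):
    """Map each sample folder name to the sorted paths matching a glob pattern.

    The pattern is relative to the sample folder,
    e.g. "assembly/blastn/*_top_viral_hits.txt".
    """
    collected = {}
    for sample_dir in sorted(Path(results_dir).iterdir()):
        # Skip the run-level QC folder and any stray files.
        if not sample_dir.is_dir() or sample_dir.name == "qc_report":
            continue
        collected[sample_dir.name] = sorted(sample_dir.glob(pattern))
    return collected
```

For example, `collect_by_sample("results", "nanoplot/*_raw_NanoStats.txt")` would return one entry per sample pointing at its raw NanoStats file, with an empty list for samples where the file is absent.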