Code repository for sedimentary ancient DNA (sedaDNA) shotgun sequencing (metagenomics) data analysis.
Contributed by: Sisi Liu, Lars Harms, Christiane Böckel, Kathleen R. Stoof-Leichsenring
This repository contains 11 bash scripts located in the bash_scripts directory:
- 00-bowtie2-build-bac0.sh
- 01-fastqc-clumpify-fastp-dedupe.sh
- 02-bowtie2-a1.sh
- 03-bowtie2-a2.sh
- 04-bowtie2-a3.sh
- 05-merge-sort.sh
- 06-metaDMG.sh
- 07-post-metadmg-lca.sh
- 08-combine_lca.sh
- 09-mismatch.sh
- 10-combine_dmgout.sh
In addition, there are 5 external scripts in the external_scripts directory. They are written in Python and R and are separate from the bash scripts:
- combine_dmgout.R
- combine_lca.R
- dedup_sam.py
- mismatch.R
- post-metadmg-lca.R
In addition, there are 10 files in the external_files directory: a description of the sources of the raw shotgun sequencing data (raw_shotgun_data_sources.txt), the taxonomic reference data (taxonomic_reference_database.xlsx), and the age-depth models of 8 lake cores (Age-depth_*_shotgun.csv).
This section describes the features and functionality of the data analysis in detail.
Before running the scripts, make sure to install the required dependencies.
I. Shotgun sequencing data quality check -> deduplication -> adapter trimming and merging of paired-end reads in parallel -> deduplication -> quality check
- Input: raw paired-end FASTQ files. Each sequencing ID ${FILEBASE} has two files, ${FILEBASE}.R1.fastq.gz and ${FILEBASE}.R2.fastq.gz (or ${FILEBASE}_R1.fastq.gz and ${FILEBASE}_R2.fastq.gz, depending on the sequencing company).
- Script: bash_scripts/01-fastqc-clumpify-fastp-dedupe.sh
- Output for next step (alignment): *fastp_dedupe_merged.fq.gz
- Source data for taxonomic reference database establishment: external_files/taxonomic_reference_sources.txt
- Script for taxonomic reference database establishment: bash_scripts/00-bowtie2-build-bac0.sh (bowtie2-build for establishing the Bacteria RefSeq database. Other databases use the same script with a different path-to-db and split size.)
- Input merged shotgun sequencing data: *fastp_dedupe_merged.fq.gz
- Script for alignment against taxonomic reference database: bash_scripts/02-bowtie2-a1.sh, bash_scripts/03-bowtie2-a2.sh, bash_scripts/04-bowtie2-a3.sh
- Output for next step (merge and sort alignments):
${FILEBASE}.$(basename $DB).bam (${FILEBASE} is the FASTQ file ID; $(basename $DB) is the taxonomic reference database name. In total, there are 147 alignment BAM files per sequencing file.)
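The naming scheme used in the steps above can be sketched in bash as follows. This is only an illustration, not code taken from the repository scripts: the helper names filebase_from_r1 and bam_name, the sample file name, and the database path are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; these helpers are NOT part of the repository.

# Recover ${FILEBASE} from an R1 file name, accepting both suffix
# conventions (.R1.fastq.gz and _R1.fastq.gz).
filebase_from_r1() {
    local name
    name=$(basename "$1")
    name=${name%.R1.fastq.gz}   # dot convention
    name=${name%_R1.fastq.gz}   # underscore convention
    printf '%s\n' "$name"
}

# Build the per-database alignment file name ${FILEBASE}.$(basename $DB).bam.
bam_name() {
    local filebase=$1 db=$2
    printf '%s.%s.bam\n' "$filebase" "$(basename "$db")"
}

# Hypothetical sample and database path, for demonstration only.
fb=$(filebase_from_r1 "raw/sampleA_R1.fastq.gz")
bam_name "$fb" "/path/to/db/refseq_bacteria.1"
# prints: sampleA.refseq_bacteria.1.bam
```

Looping bam_name over every database prefix would give one BAM name per database for each ${FILEBASE}.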
Motivation: ensure that the alignments are sorted by read ID. The SAM file is sorted instead of the BAM file because the header of the merged BAM file exceeds 2 GB.
- Input all alignments: ${FILEBASE}.$(basename $DB).bam
- Script for merge and sort: bash_scripts/05-merge-sort.sh
- Output for next step (taxonomic classification and ancient damage pattern analysis): ${FILEBASE}_L30.sorted.sam.gz
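The idea of sorting a SAM text file by read ID can be sketched with standard Unix tools. This is a minimal illustration on a toy file, under the assumption that header lines start with "@" and the read ID is the first tab-separated column, as in the SAM format. It is not the implementation in bash_scripts/05-merge-sort.sh, and the helper name sort_sam_by_readid is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT the logic of bash_scripts/05-merge-sort.sh.

# Print the header lines (starting with "@") unchanged, then the alignment
# records sorted by read ID (column 1). The sort is stable (-s), so records
# that share a read ID keep their original relative order.
sort_sam_by_readid() {
    local sam=$1
    grep '^@' "$sam"
    grep -v '^@' "$sam" | LC_ALL=C sort -s -t $'\t' -k1,1
}

# Toy SAM-like input with hypothetical read and reference names.
toy=$(mktemp)
printf '@HD\tVN:1.6\n'      >  "$toy"
printf 'read2\t0\trefA\n'   >> "$toy"
printf 'read1\t0\trefB\n'   >> "$toy"
printf 'read2\t0\trefC\n'   >> "$toy"

sort_sam_by_readid "$toy"
rm -f "$toy"
```

For a compressed file, the same function could be fed a decompressed copy (for example via gzip -dc into a temporary file); for very large inputs the memory and temporary-space settings of sort would need attention.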
Taxonomic profile: Wang et al., 2022; Ancient pattern: Michelsen et al., 2022
- Input sorted alignments: ${FILEBASE}_L30.sorted.sam.gz
- Script: bash_scripts/06-metaDMG.sh
- Output structure: see Michelsen et al., 2022
- Attach full lineage and key ranks based on tax_id: bash_scripts/07-post-metadmg-lca.sh and external_scripts/post-metadmg-lca.R
- Combine taxonomic classification results: bash_scripts/08-combine_lca.sh and external_scripts/combine_lca.R
- Calculate ATCG substitution frequencies and attach lineage information: bash_scripts/09-mismatch.sh and external_scripts/mismatch.R
- Combine C>T rate of metaDMGout.csv and attach lineage information: bash_scripts/10-combine_dmgout.sh and external_scripts/combine_dmgout.R
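The general idea behind the "combine" steps, stacking per-sample tables into one table, can be sketched in bash. In the repository this work is done by the R scripts listed above, so the following is only an assumption-laden illustration: the helper name combine_csv, the added "sample" column, and the n_reads column in the toy data are hypothetical, and the real output columns may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the repository combines results with R scripts.

# Concatenate per-sample CSV files into one table. The header is taken from
# the first file only, and a leading "sample" column (derived from the file
# name without directory and .csv suffix) records the origin of each row.
combine_csv() {
    awk -F',' -v OFS=',' '
        FNR == 1 {
            sample = FILENAME
            sub(/^.*\//, "", sample)    # strip directory
            sub(/\.csv$/, "", sample)   # strip extension
            if (NR == 1) print "sample", $0   # keep the first header only
            next                              # skip headers of later files
        }
        { print sample, $0 }
    ' "$@"
}

# Toy per-sample tables with hypothetical columns and values.
dir=$(mktemp -d)
printf 'tax_id,n_reads\n1,10\n' > "$dir/sampleA.csv"
printf 'tax_id,n_reads\n2,5\n'  > "$dir/sampleB.csv"

combine_csv "$dir/sampleA.csv" "$dir/sampleB.csv"
rm -rf "$dir"
```

The sketch assumes that all input files share the same header and contain no quoted commas; a table with quoted fields would need a proper CSV parser.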