Generic Amplicon Pipeline ReadMe

MikeWLloyd edited this page Apr 11, 2024 · 3 revisions

Amplicon Sequencing: General PCR / Target Panel Documentation

Amplicon Sequencing: General PCR / Target Panel Pipeline (--workflow amplicon_generic)

•	Step 1: Trim FASTQ reads    
•	Step 2: Map reads with BWA  
•	Step 3: Realignment with ABRA2 and base recalibration with GATK for FreeBayes  
•	Step 4: Base recalibration with GATK for HaplotypeCaller  
•	Step 5: Call variants with FreeBayes  
•	Step 6: Call variants with GATK HaplotypeCaller  
•	Step 7: Filter FreeBayes variants  
•	Step 8: Filter HaplotypeCaller variants (SNP and INDEL filtered separately)  
•	Step 9: Merge VCFs from FreeBayes and HaplotypeCaller  
•	Step 10: Annotate with dbSNP, COSMIC, SNPEFF, and DBNSFP information  
•	Step 11: MultiQC  
flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[FASTQC]
    p4[BWA_MEM]
    p5[PICARD_SORTSAM]

    p6[ABRA2]
    p7[GATK_BASERECALIBRATOR_ABRA2]
    p8[GATK_APPLYBQSR_ABRA2]
    p9[FREEBAYES]
    p10[GATK_INDEXFEATUREFILE]

    p11[GATK_BASERECALIBRATOR]
    p12[GATK_APPLYBQSR]
    p13[GATK_HAPLOTYPECALLER]

    o1([Genomic BAM]):::output
    p118[PICARD_COLLECTTARGETPCRMETRICS]
    p119[TARGET_COVERAGE_METRICS]

    o2([Raw Variant Calls - FreeBayes]):::output
    o3([Raw Variant Calls - HaplotypeCaller]):::output
    
    p14[GATK_VARIANTFILTRATION_FREEBAYES]
    p15[BCFTOOLS_NORM_FREEBAYES]
    p16[ADD_ALT_AF_FREEBAYES]
    
    p17[GATK_VARIANTFILTRATION_HAPLOTYPECALLER_SNP]
    p18[GATK_VARIANTFILTRATION_HAPLOTYPECALLER_INDEL]
    p19[GATK_MERGE_VCF]
    p20[GATK_VARIANTFILTRATION_HAPLOTYPECALLER]
    p21[BCFTOOLS_NORM_HAPLOTYPECALLER]
    p22[ADD_ALT_AF_HAPLOTYPECALLER]

    p23[BCFTOOLS_MERGECALLERS]
    o4([Merged Variant Calls]):::output

    p24[SNPSIFT_ANNOTATE_DBSNP]
    p25[SNPSIFT_ANNOTATE_COSMIC]
    p26[SNPEFF]
    p27[SNPSIFT_DBNSFP]
    p28[SNPEFF_ONEPERLINE]

    o5([Merged Annotated Variant Calls]):::output

    p29[MULTIQC]
    o6([MultiQC Report]):::output
    p30[PICARD_COLLECTTARGETPCRMETRICS]
    p31[TARGET_COVERAGE_METRICS]


    p0 --> |Raw Reads| p1
    subgraph alignment [  ]

    p1 --> p4
    p4 --> p5
    p5 --> p6
    p6 --> p7
    p7 --> p8


    p5 --> p11
    p11 --> p12
    p12 --> o1

    end

    subgraph calling [  ]
    p8 --> p9
    p9 --> p10
    p10 --> o2
    o2 --> p14
    p14 --> p15
    p15 --> p16

    o1 --> p13
    p13 --> o3
    o3 --> p17
    o3 --> p18
    p17 --> p19
    p18 --> p19
    p19 --> p20
    p20 --> p21
    p21 --> p22

    p16 --> p23
    p22 --> p23
    p23 --> o4
    o4 --> p24
    p24 --> p25
    p25 --> p26
    p26 --> p27
    p27 --> p28
    p28 --> o5

    end

    subgraph qc [  ]
    p1 --> p2
    p1 --> p29
    p2 --> p29
    p11 --> p29
    o1 --> p30
    o1 --> p31
    p30 --> p29
    p31 --> p29
    p29 --> o6
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

Parameters for Amplicon Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory where the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The working directory that Nextflow uses for all intermediary files and process execution. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Type of reads: paired end (PE) or single end (SE). Options: PE and SE.
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. The fastq_2 column is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
  • --download_data

    • Default: null
    • Comment: Requires --csv_input. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
  • --gen_org

    • Default: human
    • Comment: Options: human.
  • --multiqc_config

    • Default: /<PATH>
    • Comment: The path to amplicon.yaml, the configuration file used while running MultiQC.
  • --quality_phred

    • Default: 15
    • Comment: The minimum quality value a base must have to count as qualified. The default of 15 means that bases with a Phred quality score of Q15 or higher pass.
  • --unqualified_perc

    • Default: 40
    • Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
  • --detect_adapter_for_pe

    • Default: false
    • Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
  • --ref_fa

    • Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
    • Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis.
  • --ref_fa_indices

    • Default: '/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
    • Comment: Pre-compiled BWA index files.
  • --mismatch_penalty

    • Default: -B 8
    • Comment: The BWA penalty for a mismatch.
  • --amplicon_primer_intervals

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_primers.interval_list'
    • Comment: GATK interval file with primer positions for the specific amplicon panel, used to calculate coverage metrics. This file is specific to each IDT xGen amplicon panel and should be changed if samples are not derived from the xGen Sample ID panel. The file can be generated with [Picard BedToIntervalList](https://gatk.broadinstitute.org/hc/en-us/articles/13832706340763-BedToIntervalList-Picard).
  • --amplicon_target_intervals

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.interval_list'
    • Comment: GATK interval file with target positions for the specific amplicon panel, used to calculate coverage metrics. This file is specific to each IDT xGen amplicon panel and should be changed if samples are not derived from the xGen Sample ID panel. The file can be generated with [Picard BedToIntervalList](https://gatk.broadinstitute.org/hc/en-us/articles/13832706340763-BedToIntervalList-Picard).
  • --amplicon_rsid_targets

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.txt'
    • Comment: Amplicon SNP target file containing rsID and gene information. Used in generation of the final fingerprint report file.
  • --gold_std_indels

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
    • Comment: Used in GATK BaseRecalibrator.
  • --phase1_1000G

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
    • Comment: Used in GATK BaseRecalibrator.
  • --dbSNP

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
    • Comment: The dbSNP database contains known single nucleotide polymorphisms, and is used in the annotation of known variants.
  • --dbSNP_index

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
    • Comment: The dbSNP index file associated with the dbSNP VCF file.
  • --call_val

    • Default: 50
    • Comment: The minimum Phred-scaled confidence threshold at which variants should be called.
  • --ploidy_val

    • Default: '-ploidy 2'
    • Comment: Ploidy argument passed to GATK HaplotypeCaller. Not required for amplicon data, but present in the module.
  • --target_gatk

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.bed'
    • Comment: A bed file with amplicon target intervals as defined in the amplicon array used in the data. NOTE: This file MUST reflect the amplicon array used to generate your data.
  • --gen_ver

    • Default: "hg38"
    • Comment: snpEff genome version. Set to 'hg38' when --gen_org is human.
  • --snpEff_config

    • Default: "/projects/omics_share//human/GRCh38/genome/indices/snpEff_5_1/snpEff.config"
    • Comment: The configuration file used while running snpEff.
  • --tmpdir

    • Default: '/fastscratch/${USER}'
    • Comment: Temporary directory to store intermediate files generated outside of the standard Nextflow cache location.
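
To check in advance which file names the default --pattern and --extension values would pick up from --sample_folder, the matching can be emulated outside the pipeline. The Python sketch below is only an illustration: it assumes the pipeline joins the pattern and the extension into a single glob (not confirmed on this page), and `expand_braces` and `matches` are hypothetical helper names. Python's `fnmatch` does not understand `{1,2}`, so the sketch expands the braces first.

```python
import re
from fnmatch import fnmatchcase
from itertools import product


def expand_braces(glob):
    """Expand non-nested {a,b} alternatives into plain globs."""
    # re.split with a capture group yields: literal, alternatives, literal, ...
    parts = re.split(r"\{([^{}]*)\}", glob)
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


def matches(filename, pattern="*_R{1,2}*", extension=".fastq.gz"):
    """Return True if filename matches pattern + extension.

    Assumption: the pipeline concatenates --pattern and --extension.
    """
    return any(fnmatchcase(filename, g) for g in expand_braces(pattern + extension))


# Illustrative file names only.
for name in [
    "READ_NAME_R1_MoreText.fastq.gz",
    "READ_NAME_R2.fastq.gz",
    "READ_NAME_1.fastq.gz",
    "READ_NAME_R1.fq.gz",
]:
    print(name, matches(name))
```

Files that fail this check, for example reads named with `_1` / `_2` or ending in `.fq.gz`, would likely need a different --pattern or --extension value.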

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| atac_report.html | Nextflow autogenerated report. |
| trace.txt | Nextflow trace of processes. |
| multiqc | MultiQC report summarizing quality metrics across samples in the analysis run. |
| *freebayes_raw.vcf | Raw VCF with variant calls from FreeBayes. |
| *haplotypecaller_raw.vcf | Raw VCF with variant calls from GATK HaplotypeCaller. |
| *freebayes_altAF_filtered.vcf | Alt allele frequency filtered VCF with variant calls from FreeBayes. |
| *haplotypecaller_altAF_filtered.vcf | Alt allele frequency filtered VCF with variant calls from GATK HaplotypeCaller. |
| *mergedCallers_filtered_unannotated.vcf | Filtered unannotated variant calls merged between FreeBayes and HaplotypeCaller. |
| *mergedCallers_filtered_annotated.vcf | dbSNP, COSMIC, SNPEFF, and DBNSFP annotated filtered merged variant calls. |
| *snpsift_finalTable.txt | Annotated variant calls in tabular format. |
| bam | Directory containing alignments after base quality score recalibration (i.e., after ApplyBQSR). |
| stats | Directory containing all individual stats files output by the pipeline. |
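
When browsing a populated --pubdir, the wildcard naming conventions above can be used to label files programmatically. The following Python sketch is a hypothetical helper, not part of the pipeline; the example file names are invented, since the real prefix depends on the input sample names.

```python
from fnmatch import fnmatchcase

# Wildcard conventions copied from the output list above.
OUTPUT_CONVENTIONS = {
    "*freebayes_raw.vcf": "Raw FreeBayes variant calls",
    "*haplotypecaller_raw.vcf": "Raw GATK HaplotypeCaller variant calls",
    "*freebayes_altAF_filtered.vcf": "Alt allele frequency filtered FreeBayes calls",
    "*haplotypecaller_altAF_filtered.vcf": "Alt allele frequency filtered HaplotypeCaller calls",
    "*mergedCallers_filtered_unannotated.vcf": "Merged, filtered, unannotated calls",
    "*mergedCallers_filtered_annotated.vcf": "Merged, filtered, annotated calls",
    "*snpsift_finalTable.txt": "Annotated calls in tabular format",
}


def describe_output(filename):
    """Return the description for a file name, or None if no convention matches."""
    for pattern, description in OUTPUT_CONVENTIONS.items():
        if fnmatchcase(filename, pattern):
            return description
    return None


# Hypothetical file names for illustration.
for name in ["Sample_42_freebayes_raw.vcf", "Sample_42_snpsift_finalTable.txt", "notes.md"]:
    print(name, "->", describe_output(name))
```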

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifier for each individual sample.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run. For single-end data, only fastq_1 is provided.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
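
Before launching a run, it can help to confirm that a manifest has the required header and to see which lanes will be concatenated per sample. The Python sketch below is one way to do that check; it is not the pipeline's own parser, and `group_lanes` is a hypothetical helper name. Rows with a missing fastq_2 field are treated as single-end.

```python
import csv
import io
from collections import defaultdict

EXPECTED_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def group_lanes(manifest_text):
    """Validate the header and group (lane, fastq_1, fastq_2) tuples by sampleID."""
    reader = csv.reader(io.StringIO(manifest_text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {header}")
    samples = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) < 3 or len(row) > len(EXPECTED_HEADER):
            raise ValueError(f"Unexpected number of fields: {row}")
        # Pad a missing fastq_2 field so SE rows unpack cleanly.
        row = row + [""] * (len(EXPECTED_HEADER) - len(row))
        sample_id, lane, fastq_1, fastq_2 = row
        samples[sample_id].append((lane, fastq_1, fastq_2 or None))
    return dict(samples)


pe_manifest = """sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
"""

lanes = group_lanes(pe_manifest)
for sample_id, entries in lanes.items():
    print(sample_id, [lane for lane, _, _ in entries])
```

Samples listed with more than one lane, such as Sample_42 here, are the ones whose reads would be concatenated prior to analysis.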