ATAC-seq

Institut Curie - Nextflow ATAC-seq analysis pipeline

Introduction

This pipeline was built for ATAC-seq data analysis. It provides detailed quality control of all samples as well as a first downstream analysis, including peak calling and annotation.
It was developed with two modes. The first one (the default) detects open genomic regions and extends reads to full DNA fragments. The second one (--tn5sites) is designed to precisely detect transposase Tn5 insertion sites by centering the analysis on the 5' end of each R1/R2 read. The latter is mainly recommended for motif discovery or footprinting analysis.
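
As an illustration of what "centering the analysis on the 5' end" means, here is a minimal Python sketch that derives a Tn5 insertion site from one aligned read. It assumes the +4/-5 bp offsets commonly applied to ATAC-seq reads (the convention behind the `--ATACshift` option of deepTools `alignmentSieve`). The function name and the 0-based, half-open coordinates are illustrative choices, not taken from the pipeline code.

```python
def tn5_insertion_site(start: int, end: int, strand: str) -> int:
    """Return an estimated Tn5 insertion position for one aligned read.

    start, end: 0-based, half-open alignment coordinates (assumption).
    strand: "+" or "-".
    """
    if strand == "+":
        # The 5' end of a forward read is its leftmost base.
        five_prime = start
        return five_prime + 4  # conventional +4 bp offset
    if strand == "-":
        # The 5' end of a reverse read is its rightmost base.
        five_prime = end - 1
        return five_prime - 5  # conventional -5 bp offset
    raise ValueError(f"unknown strand: {strand!r}")


# A 50 bp read aligned to positions [100, 150)
print(tn5_insertion_site(100, 150, "+"))
print(tn5_insertion_site(100, 150, "-"))
```

In the default mode the whole fragment is used instead, so this per-read reduction only applies when --tn5sites is set.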

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with containers making installation trivial and results highly reproducible.

Pipeline Summary

  1. Run quality control of raw sequencing reads (FastQC)
  2. Align reads on reference genome (BWA / Bowtie2)
  3. Sort aligned reads (SAMTools)
  4. Mark duplicates (Picard)
  5. Library complexity analysis (Preseq)
  6. Filtering aligned BAM files (SAMTools)
    • reads mapped to mitochondrial DNA
    • reads mapped to blacklisted regions
    • reads marked as duplicates
    • reads that are not marked as primary alignments
    • reads that are unmapped
    • reads mapped with a low mapping quality
  7. Read shifting to correct for transposase insertion bias (deepTools)
  8. Compute fragment size distribution (Picard)
  9. Compute TSS enrichment for both nucleosome-free and nucleosome-bound regions (deepTools)
  10. Create normalized bigWig file (deepTools)
  11. Peak calling (MACS2, Genrich)
  12. Convert peaks files into bigBed file (UCSCtools)
  13. Peak annotation and QC (HOMER)
  14. Results summary (MultiQC)
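
The read-level filters listed in step 6 map onto standard SAM fields. The sketch below expresses them as a plain Python predicate, purely for illustration: the pipeline itself relies on SAMtools, and the function name, the mitochondrial contig names and the `min_mapq` parameter are assumptions. Blacklist filtering is left out because it requires an interval overlap test.

```python
# Bits of the SAM FLAG field, as defined in the SAM specification
UNMAPPED = 0x4      # read is unmapped
SECONDARY = 0x100   # not a primary alignment
DUPLICATE = 0x400   # marked as PCR or optical duplicate


def keep_read(flag: int, chrom: str, mapq: int,
              min_mapq: int = 0,
              mito_names: tuple = ("chrM", "MT")) -> bool:
    """Return True if a read passes the step 6 filters (blacklist excluded)."""
    # Drop unmapped, non-primary and duplicate reads in a single bit test.
    if flag & (UNMAPPED | SECONDARY | DUPLICATE):
        return False
    # Drop reads mapped to mitochondrial DNA (contig names are assumed).
    if chrom in mito_names:
        return False
    # Drop reads below the mapping quality threshold.
    return mapq >= min_mapq


print(keep_read(99, "chr1", 60))           # properly paired primary read
print(keep_read(99 | DUPLICATE, "chr1", 60))
print(keep_read(99, "chrM", 60))
```

Options such as --keepDups or --keepMito would correspond to switching off the matching test.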

Quick help

N E X T F L O W  ~  version 20.01.0
======================================================================
ATAC-seq v1.0.0
======================================================================

Usage:

nextflow run main.nf --reads '*_R{1,2}.fastq.gz' -profile conda --genomeAnnotationPath '/data/annotations/pipelines' --genome 'hg19'
nextflow run main.nf --samplePlan 'sample_plan.csv' -profile conda --genomeAnnotationPath '/data/annotations/pipelines' --genome 'hg19'

Mandatory arguments:
--reads [file]                     Path to input data (must be surrounded with quotes)
--samplePlan [file]                Path to sample plan file if '--reads' is not specified
--genome [str]                     Name of genome reference. See `--genomeAnnotationPath` to define the annotation path
-profile [str]                     Configuration profile to use. Can use multiple (comma separated)

Inputs:
--singleEnd [bool]                 Specifies that the input is single end reads
--fragmentSize [int]               Estimated fragment length used to extend single-end reads. Default: 200

References           If not specified in the configuration file or you wish to overwrite any of the references given by the --genome field
--fasta [file]                     Path to Fasta reference

Trimming:
--trimNextseq [int]                Instructs Trim Galore to apply the --nextseq=X option, to trim based on quality after removing poly-G tails. Default: 0
--skipTrimming [bool]              Skip the adapter trimming step. Default: false
--saveTrimmed [bool]               Save the trimmed FastQ files in the results directory. Default: false

Alignment:
--aligner [str]                    Alignment tool to use ['bwa-mem', 'star', 'bowtie2']. Default: 'bwa-mem'
--saveAlignedIntermediates [bool]  Save all intermediates mapping files. Default: false
--bwaIndex [file]                  Index for Bwa-mem aligner
--bowtie2Index [file]              Index for Bowtie2 aligner
--bwaOpts [str]                    Modify the Bwa-mem mapping parameters
--bowtie2Opts [str]                Modify the Bowtie2 mapping parameters

Filtering:
--mapq [int]                       Minimum mapping quality to consider. Default: false
--keepDups [bool]                  Do not remove duplicates after marking. Default: false
--keepSingleton [bool]             Keep unpaired reads. Default: false
--keepMito [bool]                  Do not filter reads mapped to the mitochondrial chromosome. Default: false
--blacklist [file]                 Path to black list regions (.bed).
--ignoreBlacklist [bool]           Do not filter blacklisted regions. Default: false
  
Calling:
--caller [str]                     Peak caller to use ['macs2','genrich']. Several tools can be specified (comma separated). Default: 'macs2'
--tn5sites [bool]                  Focus the analysis on Tn5 insertion sites (ie. work at the reads level and not at the fragment one). Default: false
--extsize [int]                    Value to use for extsize parameter during Macs calling. Shift parameter will be set up as extsize/2. Default: 73

Annotation:          If not specified in the configuration file or you wish to overwrite any of the references given by the --genome field
--geneBed [file]                   BED annotation file with gene coordinate.
--gtf [file]                       GTF annotation file. Used in HOMER peak annotation
--effGenomeSize [int]              Effective Genome size

Skip options:        All are false by default
--skipFastqc [bool]                Skips fastQC. Default: false
--skipShift [bool]                 Skips reads shifting to correct for transposase bias. Default: false
--skipPreseq [bool]                Skips preseq QC. Default: false
--skipDeepTools [bool]             Skips deeptools QC. Default: false
--skipPeakcalling [bool]           Skips peak calling. Default: false
--skipPeakanno [bool]              Skips peak annotation. Default: false
--skipMultiQC [bool]               Skips MultiQC step. Default: false

Other options:
--metadata [file]                  Path to metadata file for MultiQC report
--outdir [file]                    The output directory where the results will be saved
-name [str]                        Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

=======================================================
Available Profiles
  -profile test                    Run the test dataset
  -profile conda                   Build a new conda environment before running the pipeline. Use `--condaCacheDir` to define the conda cache path
  -profile multiconda              Build a new conda environment per process before running the pipeline. Use `--condaCacheDir` to define the conda cache path
  -profile path                    Use the installation path defined for all tools. Use `--globalPath` to define the installation path
  -profile multipath               Use the installation paths defined for each tool. Use `--globalPath` to define the installation path
  -profile docker                  Use the Docker images for each process
  -profile singularity             Use the Singularity images for each process. Use `--singularityPath` to define the installation path
  -profile cluster                 Run the workflow on the cluster, instead of locally

Quick run

The pipeline can be run on any infrastructure, either from a list of input files or from a sample plan, as follows.

Run the pipeline on a test dataset

See conf/test.conf to set up your test dataset.

nextflow run main.nf -profile test,conda --genomeAnnotationPath ANNOTATION_PATH

The genomeAnnotationPath is the directory where all annotation data are stored. This annotation path must follow the organisation described in conf/genomes.config.

Run the pipeline from a sample plan

nextflow run main.nf --samplePlan MY_SAMPLE_PLAN --genome 'hg19' --genomeAnnotationPath ANNOTATION_PATH --outdir MY_OUTPUT_DIR

Defining the '-profile'

By default (without any profile), Nextflow will execute the pipeline locally, expecting all tools to be available from your PATH variable.

In addition, we provide a few profiles that let you (i) use containers instead of a local installation and (ii) run the pipeline on a cluster instead of a local machine. Each profile is described in the help message (see above).

Here are a few examples of how to set the profile option.

## Run the pipeline locally, using a global environment where all tools are installed (built by conda, for instance)
-profile path --globalPath INSTALLATION_PATH

## Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityPath SINGULARITY_PATH

## Run the pipeline on the cluster, building a new conda environment
-profile cluster,conda --condaCacheDir CONDA_CACHE
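
When runs are launched from a script, the profile and path options above can be assembled programmatically. The following Python sketch builds a `nextflow run` command using only options listed in the help message. The helper name `build_command` and all paths are hypothetical placeholders.

```python
import shlex


def build_command(profiles, sample_plan, genome, annotation_path, outdir,
                  extra=None):
    """Assemble the argument list for one pipeline run.

    profiles: list of profile names, joined with commas as Nextflow expects.
    extra: optional dict of additional flags, e.g. {"--singularityPath": "..."}.
    """
    cmd = [
        "nextflow", "run", "main.nf",
        "-profile", ",".join(profiles),
        "--samplePlan", sample_plan,
        "--genome", genome,
        "--genomeAnnotationPath", annotation_path,
        "--outdir", outdir,
    ]
    for flag, value in (extra or {}).items():
        cmd += [flag, value]
    return cmd


# Cluster run with Singularity containers (placeholder paths)
cmd = build_command(
    profiles=["cluster", "singularity"],
    sample_plan="sample_plan.csv",
    genome="hg19",
    annotation_path="/data/annotations/pipelines",
    outdir="results",
    extra={"--singularityPath": "/path/to/singularity/images"},
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.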

Sample Plan

A sample plan is a CSV (comma-separated) file that lists all samples with their biological IDs. The sample plan is expected to be structured as follows:

SAMPLE_ID | SAMPLE_NAME | FASTQ_R1 [Path to R1.fastq file] | FASTQ_R2 [For paired end, path to Read 2 fastq]
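
A sample plan with this layout can be generated with a few lines of Python. This is a minimal sketch for paired-end data: the sample IDs, names and FASTQ paths are made up, and it assumes the file has no header row and that columns appear in the order shown above.

```python
import csv
import io


def write_sample_plan(samples, handle):
    """Write one comma-separated line per sample.

    samples: iterable of (sample_id, sample_name, fastq_r1, fastq_r2) tuples.
    handle: any writable text file object.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for sample_id, sample_name, fastq_r1, fastq_r2 in samples:
        writer.writerow([sample_id, sample_name, fastq_r1, fastq_r2])


# Hypothetical samples and paths, for illustration only
samples = [
    ("S1", "control_rep1", "/data/S1_R1.fastq.gz", "/data/S1_R2.fastq.gz"),
    ("S2", "treated_rep1", "/data/S2_R1.fastq.gz", "/data/S2_R2.fastq.gz"),
]

buffer = io.StringIO()
write_sample_plan(samples, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, open the target with `open("sample_plan.csv", "w", newline="")` and pass that handle; the resulting file is what `--samplePlan` expects.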

Full Documentation

  1. Installation
  2. Reference genomes
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

This pipeline was written by the bioinformatics platform of the Institut Curie (Clement Benoit, Nicolas Servant).

Contacts

For any questions, bug reports or suggestions, please use the issue tracker or contact the bioinformatics core facility.
