ddSeeker

Description

ddSeeker extracts cellular and molecular identifiers from single cell RNA sequencing experiments.

Input: R1 and R2 FASTQ files from a paired-end single cell sequencing experiment.

Output: one unmapped BAM (uBAM) file containing reads tagged with cell barcodes and Unique Molecular Identifiers (UMIs). The default tags are XC and XM for the cell barcode and the UMI, XE for errors in barcode identification, and XQ and Xq for the base quality of the cell barcode and the UMI, respectively (see footnote 1). Different tags can be set manually (see Additional options).

Errors in barcode identification

  • LX = both linkers not aligned correctly
  • L1 = linker 1 not aligned correctly
  • L2 = linker 2 not aligned correctly
  • I = indel in BC2
  • D = deletion in Phase Block or BC1
  • J = indel in BC3 or ACG trinucleotide
  • K = indel in UMI or GAC trinucleotide
  • B = one BC with more than 1 mismatch
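As a small illustration, the codes above can be kept in a lookup table when inspecting tagged reads. This is only a sketch: the names `ERROR_CODES`, `describe_error` and `tally_errors` are hypothetical helpers (not part of ddSeeker), and it assumes that each error tag value is exactly one of the codes listed above.

```python
from collections import Counter

# Codes and descriptions copied from the list above.
# Assumption: an XE tag value is exactly one of these strings.
ERROR_CODES = {
    "LX": "both linkers not aligned correctly",
    "L1": "linker 1 not aligned correctly",
    "L2": "linker 2 not aligned correctly",
    "I": "indel in BC2",
    "D": "deletion in Phase Block or BC1",
    "J": "indel in BC3 or ACG trinucleotide",
    "K": "indel in UMI or GAC trinucleotide",
    "B": "one BC with more than 1 mismatch",
}


def describe_error(code):
    """Return the description of an error code, or a fallback for unknown values."""
    return ERROR_CODES.get(code, "unrecognized code: " + repr(code))


def tally_errors(codes):
    """Count how often each error code occurs in an iterable of tag values."""
    return Counter(codes)
```

For example, `describe_error("L1")` returns the description of a linker 1 alignment failure, and `tally_errors` gives a per-code count similar in spirit to the error summary file.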

Additional options

  • Increase the number of CPU cores (faster analysis) with -c/--cores.
  • Manually set tags with --tag-bc, --tag-umi, --tag-bc-q, --tag-umi-q, --tag-error.
  • Print uncompressed SAM file to standard output (allowing direct feeding to other tools for filtering, sorting etc.) with -o/--output - (note the - sign).
  • Generate two csv files reporting the number of reads per cell and the distribution of error tags specifying the path with -s/--summary-prefix.
  • Create plots from the csv summary files using make_graphs.R (see Generate summary files and make graphs).
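Because `-o -` writes uncompressed SAM text to standard output, the tags can be read downstream with a few lines of plain Python. The sketch below relies only on the standard SAM layout (eleven mandatory tab-separated columns followed by `TAG:TYPE:VALUE` fields). The function names `parse_tags` and `summarize` are hypothetical, and the default tag names XC and XE are assumed to be unchanged.

```python
from collections import Counter


def parse_tags(sam_line):
    """Return the optional fields of one SAM alignment line as {tag: value}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def summarize(lines, bc_tag="XC", error_tag="XE"):
    """Count reads per cell barcode and per error code from SAM text lines.

    Assumption: the default ddSeeker tag names are in use; pass other
    names if they were changed with --tag-bc or --tag-error.
    """
    cells = Counter()
    errors = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        tags = parse_tags(line)
        if bc_tag in tags:
            cells[tags[bc_tag]] += 1
        if error_tag in tags:
            errors[tags[error_tag]] += 1
    return cells, errors
```

One way to use it would be to save it in a script (say, a hypothetical `count_tags.py`) that calls `summarize(sys.stdin)` and prints the counters, then pipe the output of `ddSeeker.py ... -o -` into that script.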

Installation & Usage

Clone the repository and add the folder to your PATH variable

git clone https://github.com/cgplab/ddSeeker.git
export PATH=<path_to_ddSeeker>:$PATH

Dependencies

We suggest installing the Python packages with pip, which should already be available if you are using Python 3 >= 3.4.

pip install biopython
pip install pysam

Examples

  • ddSeeker with 20 cores

    ddSeeker.py --input sampleA_R1.fastq.gz sampleA_R2.fastq.gz --output sampleA_tagged.bam --cores 20
    
  • Print to stdout and pipe to samtools for queryname sorting

    ddSeeker.py -i sampleA_R* -c 20 -o - | samtools sort -no sampleA_tagged_qsorted.bam
    

Test dataset

https://github.com/cgplab/ddSeeker_example_dataset

Generate summary files and make graphs

Requires R >= 3.4 and the tidyverse package. Three plots are generated: a dot plot of the error distribution, the absolute count of reads per cell, and the cumulative distribution of reads per cell. By default, the latter two report the whole set of barcodes in the csv file. To limit the report to a smaller number of barcodes, specify that number on the command line.

mkdir summary_folder
ddSeeker.py -i sampleA_R* -c 20 -o sampleA_tagged.bam -s summary_folder/sampleA
make_graphs.R summary_folder/sampleA 2000

Integrating single cell analysis pipelines

Several pipelines have been developed to perform single cell analysis. Below we describe the main steps required to integrate our tool with Drop-seq tools, scPipe and dropEst.

Drop-seq tools

Since Drop-seq tools was our choice for our analyses, we provide a ready-to-use bash script. Simply run

ddSeeker_dropSeq_tools.sh [options] sampleA_R1.fastq.gz sampleA_R2.fastq.gz

to produce aligned, tagged reads in BAM format. A table of counts can then be obtained using the DigitalExpression tool included in Drop-seq tools.

scPipe

scPipe requires one FASTQ file with cell barcodes and UMIs stored in the header of each read record. To change the output of ddSeeker use the option --pipeline scpipe.

ddSeeker.py -i <read1.fastq.gz> <read2.fastq.gz> -o <tagged_reads.fastq.gz> -c 20 --pipeline scpipe

In addition, set bc_len=18 and UMI_len=8 with the sc_exon_mapping() function.
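To make the two lengths concrete, the sketch below splits a read header into barcode, UMI and read name. It assumes the header layout `@<barcode>_<UMI>#<read name>` that scPipe uses for barcode-trimmed FASTQ files; whether ddSeeker writes exactly this layout should be checked against its actual output. `split_header` is a hypothetical helper, not part of ddSeeker or scPipe.

```python
BC_LEN = 18   # cell barcode length stated above
UMI_LEN = 8   # UMI length stated above


def split_header(header, bc_len=BC_LEN, umi_len=UMI_LEN):
    """Split a FASTQ header into (barcode, umi, read_name).

    Assumption: the header looks like '@<barcode>_<UMI>#<read name>'.
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    prefix, hash_sep, name = header[1:].rstrip("\n").partition("#")
    barcode, under_sep, umi = prefix.partition("_")
    if not hash_sep or not under_sep:
        raise ValueError("header does not match '@<barcode>_<UMI>#<name>'")
    if len(barcode) != bc_len or len(umi) != umi_len:
        raise ValueError("unexpected barcode or UMI length")
    return barcode, umi, name
```

A length check like this is a quick way to confirm that the values passed to `sc_exon_mapping()` agree with what is actually stored in the headers.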

dropEst

dropEst can work with tagged BAM files. Simply set the BamTags in the config.xml file so that they match the ddSeeker tags:

<BamTags>
    <cb>XC</cb>
    <umi>XM</umi>
</BamTags>

Citation

Romagnoli D*, Boccalini G*, Bonechi M, Biagioni C, Fassan P, Bertorelli R, De Sanctis V, Di Leo A, Migliaccio I, Malorni L, Benelli M. ddSeeker: a tool for processing Bio-Rad ddSEQ single cell RNA-seq data. BMC Genomics. 2018; 19:960

Link to the paper: ddSeeker: a tool for processing Bio-Rad ddSEQ single cell RNA-seq data

Footnotes

  1. Known issue. The cell barcode is composed of three shorter blocks. If any of the blocks is affected by an indel event, it may be difficult to determine which base was inserted or deleted (although each block is corrected by comparing it to a list of available blocks). Therefore, the quality string of the cell barcode may differ in length from the corrected cell barcode by one base; it still gives a general view of the quality of the barcode.