Vignette

mainciburu edited this page Mar 30, 2022 · 8 revisions

Introduction

WAT3R is a pipeline for the analysis of single-cell data enriched for the variable regions of T Cell Receptor (TCR) sequences. It was originally created to analyze data from the TREK-seq protocol, but it is flexible enough to be applied to other single-cell TCR enrichment methods.

Installation

WAT3R is installed via Docker. It runs inside a self-contained Docker container, which aids reproducibility and includes every dependency the pipeline needs, along with some test data.

First, obtain a local Docker image for WAT3R. This can be done in two ways.

A) Pull the image from Docker Hub

A prebuilt image is available on Docker Hub. To pull it to your local computer, use:

docker pull mainciburu/wat3r

B) Build the image from source

All the files needed to build the WAT3R image are available on GitHub. First, clone the GitHub repository to your local machine. Then, change your working directory to the top-level folder of the repository, where the Dockerfile is located. Finally, build the image with:

docker build -t mainciburu/wat3r .

Once you have a local image of WAT3R, start a new interactive container to run your analysis.

docker run -it \
	--name WAT3R \
	--mount 'type=bind,source=<source>,target=/workdir' \
	--workdir '/workdir' \
	mainciburu/wat3r \
	bash

The --mount argument mounts a local folder of your choice into the container at /workdir; replace <source> with the path to that folder.
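A bind mount given with --mount needs an existing folder specified by its absolute path, so it can help to resolve the path before starting the container. The following is a minimal sketch; the helper name mount_spec and the folder ~/wat3r_data are hypothetical and not part of WAT3R.

```shell
# Print the value to pass to --mount for a given local folder.
# mount_spec is a hypothetical helper, not part of WAT3R or Docker.
mount_spec() {
    # cd + pwd turns a relative path into an absolute one and
    # fails if the folder does not exist.
    abs_dir=$(cd "$1" && pwd) || return 1
    printf 'type=bind,source=%s,target=/workdir\n' "$abs_dir"
}

# Example usage (assumes ~/wat3r_data holds your fastq files):
#   docker run -it --name WAT3R \
#       --mount "$(mount_spec ~/wat3r_data)" \
#       --workdir /workdir mainciburu/wat3r bash
```

Resolving the path up front means a mistyped folder name is reported before Docker is invoked at all.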

Preparing data

Required data

The required input files for WAT3R are two compressed fastq files. One must contain the TCR sequences, and the other the paired cell barcodes and UMIs.

When demultiplexing your raw data, make sure you get two separate fastq files. As an example, data from TREK-seq is sequenced on an Illumina MiSeq set to 28 bp for Read 1 (cell barcode + UMI) and 150 bp for Index 1 (TCR). To demultiplex runs, we use:

bcl2fastq --output-dir /path/to/mydir/ \
		  --use-bases-mask Y28,I150 \
		  --barcode-mismatches 0,0 \
		  --create-fastq-for-index-reads

We also use a SampleSheet with 150xN as the index sequence, so that useful reads end up in the Undetermined.fastq.gz files.

Below is an example of the files that we create and use as input for WAT3R.

Undetermined_S0_L001_R1_001.fastq.gz contains cell barcodes and UMIs

@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTACTATGTACGTTCAAACATGTTAAAC
+
#8BCCFGGGGGGGGGGGGGGGGGGGGGG

Undetermined_S0_L001_I1_001.fastq.gz contains TCR sequences

@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
+
#88ABFFGGFGFGGEFGGGFFGGGGFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGFFGGGFFFGGGGGGGGGGGGGG?FGGGGGGGGCFGGEGGGGGGGGGGGG#####99@DFGGGGF88EFFFGF####
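Before running the pipeline, it can be worth confirming that both files hold the same number of reads and that the read lengths match the sequencing setup (28 bp and 150 bp in the TREK-seq example). The following is a minimal sketch using standard tools; the function names are hypothetical, and the checks assume well-formed fastq files with four lines per record.

```shell
# Number of reads in a gzipped fastq file (assumes 4 lines per record).
read_count() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Distinct read lengths and how many reads have each,
# printed as one "length count" pair per line.
read_lengths() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n[length($0)]++ }
                         END { for (len in n) print len, n[len] }' | sort -n
}

# Example usage:
#   read_count Undetermined_S0_L001_R1_001.fastq.gz
#   read_count Undetermined_S0_L001_I1_001.fastq.gz
#   read_lengths Undetermined_S0_L001_R1_001.fastq.gz
```

For data like the example above, the two read counts should be equal, and the R1 file should report a single length of 28.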

Optional data

In addition to the fastq files, there are extra files that the user can optionally provide for specific steps of the pipeline.

A barcode list other than the default one included in the Docker container (the 10x 3' v3 list) can be supplied. The file must be in .txt.gz format, with one barcode per row and no row or column names.

An example of the correct formatting is as follows:

AAACCCAAGAAACACT
AAACCCAAGAAACCAT
AAACCCAAGAAACCCA
AAACCCAAGAAACCCG
AAACCCAAGAAACCTG
AAACCCAAGAAACGAA
AAACCCAAGAAACGTC
AAACCCAAGAAACTAC
AAACCCAAGAAACTCA
AAACCCAAGAAACTGC
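A custom list can be checked against this format before use. The sketch below counts lines that contain anything other than A, C, G, or T, which also catches a header line or a suffix such as -1. The function name is hypothetical, and this is only a formatting check, not a description of what WAT3R itself validates.

```shell
# Count lines in a gzipped barcode list that are not made up solely of A, C, G, T.
# count_bad_barcodes is a hypothetical helper; 0 means every line looks like a bare barcode.
count_bad_barcodes() {
    gzip -dc "$1" | awk '!/^[ACGT]+$/ { bad++ } END { print bad + 0 }'
}

# Example: compress a plain-text list (one barcode per line) and check it.
#   gzip -c my_barcodes.txt > my_barcodes.txt.gz
#   count_bad_barcodes my_barcodes.txt.gz
```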

A list of barcodes and annotations coming from a paired single-cell experiment can also be provided. It should be a .txt file in which the first column contains the barcodes and the second column the annotations.

The following lines show an example of the correct formatting:

CCGGTGATCCGTCACT B
CAGAGCCGTCTACACA Monocytes
GAGCTGCAGACGATAT Monocytes
TTTAGTCCACTCAAGT NK
CCGCAAGTCTGAGGTT Monocytes
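To sanity-check such a file, one option is to count malformed rows and tabulate the number of barcodes per annotation. This sketch assumes that columns are separated by whitespace and that annotation labels contain no spaces, neither of which is stated explicitly here; the function names are hypothetical.

```shell
# Count rows that do not have exactly two fields or whose
# first field is not a bare barcode (only A, C, G, T).
count_bad_annotation_rows() {
    awk 'NF != 2 || $1 !~ /^[ACGT]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Number of barcodes per annotation label, sorted by label.
cells_per_annotation() {
    awk 'NF == 2 { n[$2]++ } END { for (a in n) print a, n[a] }' "$1" | LC_ALL=C sort
}
```

For the five example rows above, count_bad_annotation_rows prints 0 and cells_per_annotation prints B 1, Monocytes 3, and NK 1 on separate lines.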

Running WAT3R

The pipeline is divided into two commands, wat3r and downstream. The first performs most of the analysis, while the second refines the results into their final form and combines them with optional paired single-cell RNA data.

The only requirements to run wat3r, as described above, are the two compressed fastq files.

wat3r -b /usr/local/testdata/BCseq_test.fastq.gz \
      -t /usr/local/testdata/TCRseq_test.fastq.gz \
      -d /workdir

downstream -u /workdir/wat3r/sample_igblast_db-pass.tsv \
           -c /workdir/wat3r/QC/BC_UMI_cluster_metrics.txt \
           -f /workdir/wat3r/wat3rMetrics.txt \
           -s /workdir/wat3r/stats.log \
           -n PB01 \
           -a /usr/local/testdata/PB01_clusters.txt
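Since downstream reads files written by wat3r, a quick check that they exist can save a failed run. The sketch below uses the output file names from the example above, relative to the wat3r results folder; the function name is hypothetical, and file names may differ for other runs.

```shell
# List wat3r outputs (as named in the example above) that are missing
# under a results folder. missing_wat3r_outputs is a hypothetical helper;
# it prints nothing when all four files exist.
missing_wat3r_outputs() {
    for f in sample_igblast_db-pass.tsv QC/BC_UMI_cluster_metrics.txt \
             wat3rMetrics.txt stats.log; do
        [ -f "$1/$f" ] || printf '%s\n' "$1/$f"
    done
    return 0
}

# Example usage:
#   missing_wat3r_outputs /workdir/wat3r
```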