Vignette

mainciburu edited this page Apr 1, 2022 · 8 revisions

Introduction

WAT3R is a pipeline for analyzing single-cell data enriched for the variable regions of T cell receptor (TCR) sequences. It was originally created to analyze data from the TREK-seq protocol, but it offers the flexibility needed to apply it to other single-cell TCR enrichment methods.

Installation

WAT3R is installed via Docker. It runs in a self-contained Docker container, which ensures reproducibility and includes every requirement for the pipeline, as well as some test data.

First, obtain a local Docker image for WAT3R. This can be done in two ways.

A) Pull the image from Docker Hub

A prebuilt image is available on Docker Hub. To pull it to your local computer, use:

docker pull mainciburu/wat3r

B) Build the image from source

All the files needed to build the WAT3R image are available on GitHub. First, clone the GitHub repository to your local machine. Then, change your working directory to the top-level folder of the repository, where the Dockerfile is located. Finally, build the image with:

docker build -t mainciburu/wat3r .

Once you have a local image of WAT3R, start a new interactive container to run your analysis.

docker run -it \
	--name WAT3R \
	--mount 'type=bind,source=<source>,target=/workdir' \
	--workdir '/workdir' \
	mainciburu/wat3r \
	bash

The --mount argument mounts a local folder of your choice into the container at /workdir. Replace <source> with the path to that folder.
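As a minimal sketch of filling in <source>, the hypothetical helper below (wat3r_run_cmd is not part of WAT3R) checks that the chosen folder exists, resolves it to an absolute path, and prints the resulting docker run command without executing it. It assumes an absolute path is the safe choice for a bind mount source; whether relative paths are accepted may depend on your Docker version.

```shell
# Hypothetical helper: print (not run) the docker run command for a local folder.
wat3r_run_cmd() {
    src=$1
    if [ ! -d "$src" ]; then
        echo "error: '$src' is not a directory" >&2
        return 1
    fi
    # Resolve the folder to an absolute path before using it as the mount source.
    abs=$(cd "$src" && pwd)
    printf "docker run -it --name WAT3R --mount 'type=bind,source=%s,target=/workdir' --workdir '/workdir' mainciburu/wat3r bash\n" "$abs"
}
```

For example, wat3r_run_cmd ./my_project prints a command that can be reviewed and then pasted into the terminal.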

Preparing data

Required data

The required input files for WAT3R are two compressed fastq files. One must contain the TCR sequences, and the other the paired cell barcodes and UMIs.

When demultiplexing your raw data, make sure you get two separate fastq files. As an example, data from TREK-seq is sequenced on an Illumina MiSeq set to 28 bp for Read 1 (cell barcode + UMI) and 150 bp for Index 1 (TCR). To demultiplex runs, we use:

bcl2fastq --output-dir /path/to/mydir/ \
          --use-bases-mask Y28,I150 \
          --barcode-mismatches 0,0 \
          --create-fastq-for-index-reads

We also use a SampleSheet with 150xN (150 N bases) as the index sequence, so that useful reads end up in the Undetermined.fastq.gz files.

Below is an example of the files that we create and use as input for WAT3R.

Undetermined_S0_L001_R1_001.fastq.gz contains cell barcodes and UMIs

@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTACTATGTACGTTCAAACATGTTAAAC
+
#8BCCFGGGGGGGGGGGGGGGGGGGGGG

Undetermined_S0_L001_I1_001.fastq.gz contains TCR sequences

@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
+
#88ABFFGGFGFGGEFGGGFFGGGGFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGFFGGGFFFGGGGGGGGGGGGGG?FGGGGGGGGCFGGEGGGGGGGGGGGG#####99@DFGGGGF88EFFFGF####
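Because the two files describe the same reads, a quick consistency check before running the pipeline can catch demultiplexing problems early. The sketch below is not part of WAT3R; the function names are hypothetical, and it assumes that both files should list the same read IDs (the first field of each header line) in the same order.

```shell
# Count reads in a gzip-compressed fastq file (4 lines per read).
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print the read ID (first whitespace-separated field of each header line).
read_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }'
}

# Succeed only if both files list the same read IDs in the same order.
same_read_ids() {
    a=$(read_ids "$1" | cksum)
    b=$(read_ids "$2" | cksum)
    [ "$a" = "$b" ]
}
```

For instance, same_read_ids Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L001_I1_001.fastq.gz returns a non-zero exit status if the headers disagree.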

Optional data

In addition to the fastq files, there are extra files that the user can optionally provide for specific steps of the pipeline.

A custom list of barcodes can be provided to replace the default one included in the Docker container (the 10x 3' v3 list). The file must be in .txt.gz format, with one barcode per row and no row or column names.

An example of the correct formatting is as follows:

AAACCCAAGAAACACT
AAACCCAAGAAACCAT
AAACCCAAGAAACCCA
AAACCCAAGAAACCCG
AAACCCAAGAAACCTG
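A custom list can be sanity-checked against these rules before it is passed to the pipeline. The following sketch is only illustrative: check_barcode_list is a hypothetical helper, and it additionally assumes that barcodes contain only A, C, G and T and all share the same length, which the rules above do not state explicitly.

```shell
# Hypothetical check for a gzip-compressed barcode list:
# one barcode per line, no header, no row names.
check_barcode_list() {
    gzip -dc "$1" | awk '
        $0 !~ /^[ACGT]+$/ { printf "line %d: unexpected content\n", NR; bad = 1 }
        NR == 1           { len = length($0) }
        length($0) != len { printf "line %d: length differs from line 1\n", NR; bad = 1 }
        END {
            if (NR == 0) { print "file is empty"; bad = 1 }
            exit bad + 0
        }
    '
}
```

A plain text list can be compressed with gzip my_barcodes.txt (a placeholder file name) and then checked with check_barcode_list my_barcodes.txt.gz; the function prints the offending lines and returns a non-zero status if any are found.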

A list of barcodes and annotations coming from a paired single-cell experiment can also be provided. It should be a .txt file in which the first column contains the barcodes and the second the annotations.

The following lines show an example of the correct formatting:

CCGGTGATCCGTCACT B
CAGAGCCGTCTACACA Monocytes
GAGCTGCAGACGATAT Monocytes
TTTAGTCCACTCAAGT NK
CCGCAAGTCTGAGGTT Monocytes
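Before handing an annotation file to the pipeline, it can help to confirm that every line has exactly two columns and to see how many barcodes carry each label. This is a sketch with hypothetical function names; it assumes whitespace-separated columns and single-word annotations, as in the example above.

```shell
# Report lines that do not have exactly two whitespace-separated fields.
check_annotations() {
    awk 'NF != 2 { printf "line %d: expected 2 fields, found %d\n", NR, NF; bad = 1 }
         END     { exit bad + 0 }' "$1"
}

# Count how many barcodes carry each annotation label.
count_annotations() {
    awk 'NF >= 2 { n[$2]++ } END { for (a in n) print a, n[a] }' "$1" | LC_ALL=C sort
}
```

Applied to the five example lines above, count_annotations reports one B cell, three Monocytes and one NK cell.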

Running WAT3R

The pipeline is divided into two commands, wat3r and downstream. The first performs most of the analysis, while the second refines the results to produce the final version and cross-references them with optional paired single-cell RNA data.

The only requirements to run wat3r, as described above, are the two compressed fastq files. A minimal example of the wat3r command, using the test data included in the Docker container, is as follows:

wat3r -b /usr/local/testdata/BCseq_test.fastq.gz \
      -t /usr/local/testdata/TCRseq_test.fastq.gz \
      -d /workdir

By default, wat3r assumes barcode and UMI lengths of 16 and 12 nucleotides, respectively, performs both barcode and UMI correction, and uses the 10x 3' v3 barcode list as the set of allowed barcodes. These and other parameters can be changed by the user. For a detailed description of the options included in wat3r, see the wiki page on options.
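To make the default lengths concrete, the sketch below splits the 28 bp Read 1 sequence from the earlier fastq example into a 16 nt barcode and a 12 nt UMI. It assumes the barcode comes first and the UMI follows immediately, as in the 10x read layout; this illustrates the lengths only and is not how wat3r itself parses reads.

```shell
# Read 1 sequence taken from the example fastq record shown earlier.
read1="NTACTATGTACGTTCAAACATGTTAAAC"
bc_len=16    # default barcode length
umi_len=12   # default UMI length

# Assumed layout: barcode first, UMI immediately after it.
barcode=$(printf '%s' "$read1" | cut -c1-"$bc_len")
umi=$(printf '%s' "$read1" | cut -c$((bc_len + 1))-$((bc_len + umi_len)))

echo "barcode=$barcode umi=$umi"
# prints: barcode=NTACTATGTACGTTCA umi=AACATGTTAAAC
```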

wat3r will create a set of directories where intermediate files as well as QC plots and tables are stored.

  • fastq_processed contains intermediate text files created to reformat the original fastq files, including lists of corrected barcodes and UMIs.

  • wat3r contains files created during the different steps of the pipeline, such as reformatted and filtered fastq files or log files.

    • wat3r/QC contains two QC plots (see the output plots wiki page) and the corresponding files to generate them.

The output created by wat3r is used by the second command, downstream, to generate the final results. To run downstream, the user must indicate the path and name of four files that are part of the wat3r output. By default, these files are always created in the same directories and given the same names.

  1. ./wat3r/sample_igblast_db-pass.tsv is the table of V(D)J alignments, which will be further filtered and fine-tuned by downstream.

  2. ./wat3r/QC/BC_UMI_cluster_metrics.txt is a table of metrics for each of the BC-UMI clusters, which downstream uses to filter results based on the most abundant cluster proportion and log ratio.

  3. ./wat3r/wat3rMetrics.txt is a small table with the number of reads passing the different wat3r filtering steps. downstream will perform two extra filters and plot the final percentage of reads passing filters.

  4. ./wat3r/stats.log contains error rates for consensus building. This information is added to the final results and plotted in downstream.
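Since downstream depends on all four files listed above, it can be convenient to confirm that they exist before launching it. The helper below is a hypothetical sketch, not part of WAT3R; it assumes the default file names and locations relative to the output directory passed to wat3r, and treats an empty file as missing.

```shell
# Hypothetical helper: print every expected wat3r output that is missing or empty.
missing_wat3r_outputs() {
    dir=$1
    for f in wat3r/sample_igblast_db-pass.tsv \
             wat3r/QC/BC_UMI_cluster_metrics.txt \
             wat3r/wat3rMetrics.txt \
             wat3r/stats.log; do
        [ -s "$dir/$f" ] || echo "$dir/$f"
    done
}
```

With the minimal example, missing_wat3r_outputs /workdir prints nothing when the wat3r run produced all four files.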

In addition, as explained in the previous section, the user can provide a list of annotations for downstream to cross-reference with the results.

Taking this into account, and continuing with the minimal example, the user can run downstream as shown below, using the test data annotations included in the Docker container:

downstream -u /workdir/wat3r/sample_igblast_db-pass.tsv \
           -c /workdir/wat3r/QC/BC_UMI_cluster_metrics.txt \
           -f /workdir/wat3r/wat3rMetrics.txt \
           -s /workdir/wat3r/stats.log \
           -a /usr/local/testdata/PB01_clusters.txt

Again, downstream accepts extra options, which are described in the wiki page on options.

downstream will create a directory, also called downstream, where extra intermediate files, results and plots are stored. You can find a detailed description of the two final result tables in the output tables wiki page and a description of the graphs in the output plots wiki page.
