Vignette
WAT3R is a pipeline for the analysis of single-cell data enriched for the variable regions of T cell receptor (TCR) sequences. It was originally created to analyze data from the TREK-seq protocol, but it offers the flexibility needed to apply it to other single-cell TCR enrichment methods.
Installation is done via Docker. WAT3R therefore runs inside a Docker container, which ensures reproducibility and includes every requirement for the pipeline, as well as some test data.
First, obtain a local Docker image for WAT3R. This can be done in two ways.
A prebuilt image is available on Docker Hub. To pull it to your local computer, use:
docker pull mainciburu/wat3r
Alternatively, all the files needed to build the WAT3R image are available on GitHub. First, clone the GitHub repository to your local machine. Then change your working directory to the repository's top-level folder, where the Dockerfile is located. Finally, build the image with:
docker build -t mainciburu/wat3r .
Once you have a local image of WAT3R, start a new interactive container to run your analysis.
docker run -it \
--name WAT3R \
--mount 'type=bind,source=<source>,target=/workdir' \
--workdir '/workdir' \
mainciburu/wat3r \
bash
The --mount argument will mount a local folder of your choice (replace <source> with its path) into the container at /workdir.
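Docker generally expects the source of a bind mount to be an existing directory given as an absolute path. The sketch below builds the --mount value from any existing folder; wat3r_mount_arg is a hypothetical helper name and is not part of WAT3R or Docker.

```shell
# Hypothetical helper (not part of WAT3R): build the --mount value for an
# existing local folder, resolving it to an absolute path first.
wat3r_mount_arg() {
    # cd runs in a subshell, so the caller's working directory is unchanged.
    src=$(cd "$1" 2>/dev/null && pwd) || {
        echo "ERROR: $1 is not an existing directory" >&2
        return 1
    }
    printf 'type=bind,source=%s,target=/workdir\n' "$src"
}
```

It could then be used as --mount "$(wat3r_mount_arg ./mydata)" in the docker run command above, where ./mydata stands for your own folder.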
The required input files for WAT3R are two compressed fastq files. One must contain the TCR sequences, and the other the paired cell barcodes and UMIs.
When demultiplexing your raw data, make sure you get two separate fastq files. As an example, data from TREK-seq is sequenced on an Illumina MiSeq set to 28 bp for Read 1 (cell barcode + UMI) and 150 bp for Index 1 (TCR). To demultiplex runs, we use
bcl2fastq --output-dir /path/to/mydir/ \
--use-bases-mask Y28,I150 \
--barcode-mismatches 0,0 \
--create-fastq-for-index-reads
We also use a SampleSheet with 150xN as the index sequence, so that useful reads end up in the Undetermined.fastq.gz files.
Here is an example of the kind of files that we create and use as input for WAT3R:
Undetermined_S0_L001_R1_001.fastq.gz contains cell barcodes and UMIs
@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTACTATGTACGTTCAAACATGTTAAAC
+
#8BCCFGGGGGGGGGGGGGGGGGGGGGG
Undetermined_S0_L001_I1_001.fastq.gz contains TCR sequences
@M00336:864:000000000-JMPYG:1:1101:9057:1000 1:N:0:NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
NTGCTCTTACAGTTACTGTGGTTCCGGCTCCAAAGCTGAGCTTGTAGTCCCCATCTCTCACAGCACAGAGGTAAGAGGCAGAGTCTTTCATCTGGAGCTCCTTCAAAAGGAGGTAACTGTACCNNNNNGACCGACTAAGGAATGAANNNN
+
#88ABFFGGFGFGGEFGGGFFGGGGFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGFFGGGFFFGGGGGGGGGGGGGG?FGGGGGGGGCFGGEGGGGGGGGGGGG#####99@DFGGGGF88EFFFGF####
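Since the two files must describe the same reads, it can be worth confirming that their read IDs match before starting the pipeline. The following is a minimal sketch, assuming standard four-line fastq records; fastq_ids and check_fastq_pairing are hypothetical helper names, not WAT3R commands.

```shell
# Hypothetical helpers (not part of WAT3R): check that two gzipped fastq files
# list the same read IDs in the same order.
fastq_ids() {
    # Print the read ID of every record: the header line up to the first
    # space, without the leading "@". Assumes four lines per record.
    gzip -cd "$1" | awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }'
}

check_fastq_pairing() {
    ids_a=$(mktemp) && ids_b=$(mktemp) || return 2
    fastq_ids "$1" > "$ids_a"
    fastq_ids "$2" > "$ids_b"
    if cmp -s "$ids_a" "$ids_b"; then
        status=0
        echo "OK: read IDs match"
    else
        status=1
        echo "ERROR: read IDs differ between $1 and $2" >&2
    fi
    rm -f "$ids_a" "$ids_b"
    return "$status"
}
```

For the files shown above, the call would be check_fastq_pairing Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L001_I1_001.fastq.gz.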
In addition to the fastq files, there are extra files that the user can optionally provide for specific steps of the pipeline.
A list of barcodes different from the default one included in the Docker container (the 10x 3' v3 list) can be provided. The file must be formatted as .txt.gz, with one barcode per row and no row or column names.
An example of the correct formatting is as follows:
AAACCCAAGAAACACT
AAACCCAAGAAACCAT
AAACCCAAGAAACCCA
AAACCCAAGAAACCCG
AAACCCAAGAAACCTG
AAACCCAAGAAACGAA
AAACCCAAGAAACGTC
AAACCCAAGAAACTAC
AAACCCAAGAAACTCA
AAACCCAAGAAACTGC
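A custom list can be checked against these formatting rules before it is passed to the pipeline. The sketch below assumes barcodes consist only of the uppercase letters A, C, G and T, which is an assumption based on the example rather than a documented WAT3R rule; check_barcode_list is a hypothetical helper name.

```shell
# Hypothetical helper (not part of WAT3R): check that a barcode list is
# gzip-compressed and holds exactly one barcode per line.
check_barcode_list() {
    if ! gzip -t "$1" 2>/dev/null; then
        echo "ERROR: $1 is not a readable gzip file" >&2
        return 1
    fi
    # Count lines that are not a single run of A/C/G/T: headers, row names,
    # extra columns, suffixes such as "-1", or blank lines.
    bad=$(gzip -cd "$1" | awk '!/^[ACGT]+$/ { n++ } END { print n + 0 }')
    if [ "$bad" -ne 0 ]; then
        echo "ERROR: $bad malformed line(s) in $1" >&2
        return 1
    fi
    echo "OK: $(gzip -cd "$1" | wc -l | tr -d ' ') barcodes"
}
```

A plain text list can be compressed with gzip -c my_barcodes.txt > my_barcodes.txt.gz, where my_barcodes.txt is a placeholder file name.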
A list of barcodes and annotations coming from a paired single-cell experiment can be provided too. It should be a .txt file in which the first column contains barcodes and the second contains annotations.
See the following lines for an example of the correct formatting:
CCGGTGATCCGTCACT B
CAGAGCCGTCTACACA Monocytes
GAGCTGCAGACGATAT Monocytes
TTTAGTCCACTCAAGT NK
CCGCAAGTCTGAGGTT Monocytes
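The annotation file can be checked in a similar way. This sketch assumes the two columns are separated by whitespace, that annotation labels contain no spaces, and that barcodes use only A, C, G and T; none of these is confirmed by the WAT3R documentation, so adjust the checks if your file differs. check_annotation_file is a hypothetical helper name.

```shell
# Hypothetical helper (not part of WAT3R): check that every line holds a
# barcode and one annotation, then print the number of cells per annotation.
check_annotation_file() {
    awk '
        NF != 2 || $1 !~ /^[ACGT]+$/ {
            # Report header rows, missing labels, or extra columns.
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
            next
        }
        { count[$2]++ }
        END {
            if (bad) exit 1
            # Output order of the labels is not guaranteed.
            for (label in count) print label, count[label]
        }
    ' "$1"
}
```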
The pipeline is divided into two commands, wat3r and downstream. The first performs most of the analysis, while the second polishes the results to generate the final version and overlaps them with optional paired single-cell RNA data.
The only requirements to run wat3r, as described above, are the two compressed fastq files.
wat3r -b /usr/local/testdata/BCseq_test.fastq.gz \
-t /usr/local/testdata/TCRseq_test.fastq.gz \
-d /workdir
downstream -u /workdir/wat3r/sample_igblast_db-pass.tsv \
-c /workdir/wat3r/QC/BC_UMI_cluster_metrics.txt \
-f /workdir/wat3r/wat3rMetrics.txt \
-s /workdir/wat3r/stats.log \
-n PB01 \
-a /usr/local/testdata/PB01_clusters.txt
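Because downstream reads files written by wat3r, it can help to confirm that they exist before running it. The sketch below checks the four wat3r outputs named in the example command; check_wat3r_outputs is a hypothetical helper name, and the file names, including the "sample" prefix, are copied from the example and may differ for your run.

```shell
# Hypothetical helper (not part of WAT3R): confirm that the wat3r outputs read
# by the downstream example are present and non-empty.
check_wat3r_outputs() {
    outdir=${1:-/workdir/wat3r}
    missing=0
    # File names are taken from the example command; the "sample" prefix is an
    # assumption and may differ for your run.
    for f in sample_igblast_db-pass.tsv QC/BC_UMI_cluster_metrics.txt \
             wat3rMetrics.txt stats.log; do
        if [ ! -s "$outdir/$f" ]; then
            echo "MISSING: $outdir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_wat3r_outputs with no argument checks /workdir/wat3r, the output folder used in the example.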