-
Notifications
You must be signed in to change notification settings - Fork 1
Barcoding tools
The Barcoding tools are a little different from the other tools in that the do not access the BAM file directly, but rather they use VCFs which are created from BAM files from other tools. Hence the process is as follows:
- For each sample, we run
samtools mpileup
(from samtools v1.3) to create a VCF file containing the allele read depths at each SNP in a master list (the SNPs we want in the barcode) - We then run the Barcoding tools to use the read counts in the VCFs to produce genotypes and then concatenate them in the desired order to form barcodes.
We need a list of SNPs that are concatenated to form the barcode. Indeed the list has to be in two different forms:
-
<SNPLIST_NAME>.tab
- a tab-separated table of barcoding positions. This has one header line, then one line per barcoding position, in the desired barcode order. The table must have at least two columns, named "Chr" and "Pos", containing the chromosome ID and the position of the barcode variation within the chromosome, respectively. Any additional columns are ignored. The following is an example:
Num Chr Pos Ref Nonref 1 Pf3D7_02_v3 376222 A G 2 Pf3D7_02_v3 470013 G A 3 Pf3D7_03_v3 656861 T G [...]
-
<SNPLIST_NAME>.bed
- a BED file of barcoding positions. This is needed bysamtools
and is encoded in the BED format: a space-separate table (no header line), one line per barcode variation, ordered by genomic position. Each line contains three fields: the chromosome ID, and the start and end positions of the barcode variation. For SNPs, the end position is the same as the position in the<SNPLIST_NAME>.tab
file (see above), while the start position is one position earlier. [NOTE: as soon as I get round to it, I'll make a tool to generate the BED file from the SNP list automatically]. The following is an example of BED file, corresponding to the SNP list file shown above:
Pf3D7_02_v3 376221 376222 Pf3D7_02_v3 470012 470013 Pf3D7_03_v3 656860 656861 [...]
We produce one VCF file per sample using samtools, but then the file is reheadered (to fix the sample name) using bcftools. The commands are as follows:
samtools mpileup -I -R -C50 -d 1000 -f <REF_FASTA_FILE> -l <BED_FILE> -v -t AD,DP \ -o <RAW_VCFFILE> <BAM_FILE> bcftools reheader -s <SAMPLE_NAMEFILE> -o <FINAL_VCFFILE> <RAW_VCFFILE>
with the following parameter:
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- BED_FILE The path to the BED file (see above).
- BAM_FILE The path to the BAM file containing the sample alignment.
- RAW_VCFFILE The path to the VCF file that will be output by mpileup.
-
FINAL_VCFFILE The path to the final, reheadered VCF file, e.g.
<SAMPLE_NAME>.vcf.gz
- SAMPLE_NAMEFILE The path to a text file containing a single line with the sample name, e.g. "PA0007-C", used for reheadering
The final VCF has to be placed in an output folder <OUT_DIR>/<SUB>
where <SUB>
is a string consisting of the first 4 letters of the sample name. For example, file PA0001-C.alleles.tab
will be written to folder <OUT_DIR>/PA00
. This simple file hashing scheme, which is particularly suited for the MalariaGEN naming scheme, prevents thousands of files accumulating in a single folder, which may cause indexing problems especially on Linux.
The following script can be used as a task template for creating the VCF file.
BAMLIST_FILE=$1 BED_FILE=$2 SNPLIST_FILE=$3 REF_FASTA_FILE=$4 OUT_DIR=$5 IN=`sed "$LSB_JOBINDEX q;d" $BAMLIST_FILE` read -a LINE <<< "$IN" SAMPLE=${LINE[0]} BAM_FILE=${LINE[1]} SUBDIR=$OUT_DIR/${SAMPLE:0:4} mkdir -p $SUBDIR RAW_VCFFILE=$SUBDIR/$SAMPLE.raw.vcf.gz FINAL_VCFFILE=$SUBDIR/$SAMPLE.vcf.gz SAMPLE_NAMEFILE=$SUBDIR/$SAMPLE.sampleList.txt echo $SAMPLE > $SAMPLE_NAMEFILE /nfs/users/nfs_o/om1/samtools-1.3/samtools mpileup -I -R -C50 -d 1000 -f $REF_FASTA_FILE -l $BED_FILE -v -t AD,DP -o $RAW_VCFFILE $BAM_FILE /nfs/users/nfs_o/om1/bcftools-1.3/bcftools reheader -s $SAMPLE_NAMEFILE -o $FINAL_VCFFILE $RAW_VCFFILE rm $SAMPLE_NAMEFILE rm $RAW_VCFFILE
Invoking the barcoding tool for a single sample:
java -Xms512m -Xmx2000m 'org.cggh.bam.barcode.BarcodeFromVcfAnalysis$SingleSample' \
<SAMPLE_NAME> <OUT_DIR> <SNPLIST_FILE>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
- SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
- OUT_DIR The path to the folder where the input VCFs are read from, and where output files will be written to.
- SNPLIST_FILE The path to the SNP file (see above)
Invoking the barcoding tool for multiple samples:
java -Xms512m -Xmx2000m 'org.cggh.bam.barcode.BarcodeFromVcfAnalysis$MultiSample' \
<SAMPLE_LIST_FILE> <OUT_DIR> <SNPLIST_FILE>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
-
SAMPLE_LIST_FILE The path of tab-separated text file, with one line (record) per sample to be processed. For compatibility with other tools, this uses the common format, where each record contains three fields:
- SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
- BAM_FILE Ignored
- CHR_MAP_NAME Ignored
- OUT_DIR The path to the folder where the output files will be written to.
- SNPLIST_FILE The path to the SNP file (see above)
The genotyping tools output tab-separated text files for each sample, placed in the same folder where the VCF file was found, i.e. <OUT_DIR>/<SUB>
where <SUB>
is a string consisting of the first 4 letters of the sample name. For example, file PA0001-C.genos.tab
will be written to folder <OUT_DIR>/PA00
.
-
<SAMPLE_NAME>.genos.tab
contains one line (record) for each barcoding SNP, with information on the genotype at that position in the present sample. Each record has the following fields:- Num A sequential record number (can be ignored)
- Sample The identifier of the sample, e.g. “PA0001-C”
- Chr The ID of the chromosome where the barcoding SNP is located
- Pos The position within the chromosome where the barcoding SNP is located
- Genotype The genotype at this position: a nucleotide, or X for missing, or N for het
- TopAllele If the genotype is not missing, this is the most common allele found at this position in this sample
- Call The genotype call: 0=missing, 1=reference allele, 2=non-reference allele, 3=het
- Nraf The frequency of the non-reference allele at this position in this sample
- Ref The number of reads supporting the reference allele
- Nref The number of reads supporting the non-reference allele
- Counts A human-readable string showing all read counts for all alleles
The Barcoding Aggregation Tool uses the files produced by the Barcoding Tool and aggregates them, producing sampleset-wide outputs. Invoking the result aggregation tool:
java -Xms512m -Xmx2000m 'org.cggh.bam.barcode.BarcodeFromVcfAnalysis$MergeResults' \
<SAMPLE_LIST_FILE> <OUT_DIR> <SNPLIST_FILE>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
-
SAMPLE_LIST_FILE The path of tab-separated text file, with one line (record) per sample to be processed. For compatibility with other tools, this uses the common format, where each record contains three fields:
- SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
- BAM_FILE Ignored
- CHR_MAP_NAME Ignored
- OUT_DIR The path to the folder where the input VCFs are read from, and where output files will be written to.
- SNPLIST_FILE The path to the SNP file (see above)
The genotype aggregation tool outputs several tab-separated text files in output folder <OUT_DIR>
:
-
AllSamples.barcodes.tab
This is the main output file, which reports the barcodes and their stats for each sample. Each line (record) shows the barcode data for a sample, with the following fields:- Num A sequential record number (can be ignored)
- Sample The identifier of the sample, e.g. “PA0001-C”
- MissingCount The number of barcoding SNPs with missing genotypes
- MissingProp The proportion of barcoding SNPs with missing genotypes
- HetCount The number of barcoding SNPs with heterozygous genotypes
- HetProp The proportion of non-missing barcoding SNPs with heterozygous genotypes
- MajorityBarcode A barcode constructed with majority alleles (no het calls); dash caharacters ('-') represent missing calls
- MajorityMeanFreq The mean across the barcode of the majority allele frequency at heterozygous positions
- Barcode The "official" barcode: X represent a missing call, N represents a heterozygous call
-
AllSamples.alleleCounts.tab
A table showing the read counts for each alleles found at each barcoding position. Each line (record) shows the data for a sample, with the following fields:- Num A sequential record number (can be ignored)
- Sample The identifier of the sample, e.g. “PA0001-C”
- One column per barcoding SNP, in the order they occur in the barcode, containing a human-readable string showing all read counts for all alleles found in the sample at that position
-
AllSamples.nrafs.tab
A table showing the proportion of non-reference reads found at each barcoding position. Each line (record) shows the data for a sample, with the following fields:- Num A sequential record number (can be ignored)
- Sample The identifier of the sample, e.g. “PA0001-C”
- One column per barcoding SNP, in the order they occur in the barcode, containing the non-reference allele frequency. A dash ('-') indicates a missing call.