Heteroallelic Genotyping tools
Invoking the genotyping tool for a single sample:
```
java -Xms512m -Xmx2000m 'org.cggh.bam.heteroallelic.HeteroallelicAnalysis$SingleSample' \
<CONFIG_FILE> <SAMPLE_NAME> <BAM_FILE> <CHR_MAP_NAME> \
<REF_FASTA_FILE> <CHR_MAP_FILE> <OUT_DIR>
```
Note that the qualified class name must be enclosed in single quotes; otherwise the shell treats the `$` as the start of a variable name and expands it. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Heteroallelic configuration file”
- SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
- BAM_FILE The path of the BAM file for the sample
- CHR_MAP_NAME The name of the chromosome name map to be used for this sample (e.g. “default”). See section “Chromosome Maps”
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- CHR_MAP_FILE The path to a file specifying the chromosome name maps available. See section “Chromosome Maps”.
- OUT_DIR The path to the folder where the output files will be written to.
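When the tool is launched from a script rather than typed at a prompt, the invocation above can be assembled as an argument list. The following is a minimal sketch, not part of the tool itself: the function names are hypothetical, and it assumes the tool's classes are already on the Java classpath (for example through the CLASSPATH environment variable), which this document does not specify.

```python
import subprocess

# Class name taken from the invocation above. No quoting is needed here because
# the argument list is handed to java directly, without going through a shell.
SINGLE_SAMPLE_CLASS = "org.cggh.bam.heteroallelic.HeteroallelicAnalysis$SingleSample"


def single_sample_command(config_file, sample_name, bam_file,
                          chr_map_name, ref_fasta_file, chr_map_file, out_dir):
    """Return the argument list for one single-sample genotyping run."""
    return [
        "java", "-Xms512m", "-Xmx2000m", SINGLE_SAMPLE_CLASS,
        config_file, sample_name, bam_file, chr_map_name,
        ref_fasta_file, chr_map_file, out_dir,
    ]


def run_single_sample(*args):
    """Run the tool (hypothetical helper; assumes the classpath is configured)."""
    subprocess.run(single_sample_command(*args), check=True)
```

Passing a list to `subprocess.run` avoids the shell entirely, so the `$` in the class name is never subject to variable expansion.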
Invoking the genotyping tool for multiple samples:
```
java -Xms512m -Xmx2000m 'org.cggh.bam.heteroallelic.HeteroallelicAnalysis$MultiSample' \
<CONFIG_FILE> <SAMPLE_LIST_FILE> \
<REF_FASTA_FILE> <CHR_MAP_FILE> <OUT_DIR>
```
Note that the qualified class name must be enclosed in single quotes. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Heteroallelic configuration file”
- SAMPLE_LIST_FILE The path of a tab-separated text file, with one line (record) per sample to be processed. Each record contains three fields:
  - SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
  - BAM_FILE The path of the BAM file for the sample
  - CHR_MAP_NAME The name of the chromosome name map to be used for this sample (e.g. “default”). See section “Chromosome Maps”
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- CHR_MAP_FILE The path to a file specifying the chromosome name maps available. See section “Chromosome Maps”.
- OUT_DIR The path to the folder where the output files will be written to.
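The sample list format described above (three tab-separated fields per line) is simple enough to generate programmatically. The sketch below is illustrative only: the function name is hypothetical, and it assumes the file has no header line, since the description mentions only one record per sample.

```python
def format_sample_list(samples):
    """Render (sample_name, bam_file, chr_map_name) tuples as tab-separated lines."""
    lines = []
    for record in samples:
        if len(record) != 3:
            raise ValueError(f"expected 3 fields, got {len(record)}: {record!r}")
        # A tab or newline inside a field would corrupt the record structure.
        if any("\t" in field or "\n" in field for field in record):
            raise ValueError(f"field contains a tab or newline: {record!r}")
        lines.append("\t".join(record))
    return "".join(line + "\n" for line in lines)
```

The returned text can be written to a file and passed as `<SAMPLE_LIST_FILE>`.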
The genotyping tools output several tab-separated text files for each sample. These files are placed in an output folder <OUT_DIR>/<SUB>, where <SUB> is a string consisting of the first 4 characters of the sample name. For example, file PA0001-C.alleles.tab will be written to folder <OUT_DIR>/PA00. This simple file hashing scheme, which is particularly suited to the MalariaGEN naming scheme, prevents thousands of files from accumulating in a single folder, which may cause indexing problems, especially on Linux.
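To locate a sample's output files from a downstream script, the folder scheme above can be reproduced in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the behaviour for sample names shorter than 4 characters (here, the whole name is used) is a guess rather than documented behaviour.

```python
from pathlib import PurePosixPath


def sample_output_dir(out_dir, sample_name):
    """Return <OUT_DIR>/<SUB>, where <SUB> is the first 4 characters of the sample name."""
    return PurePosixPath(out_dir) / sample_name[:4]


def sample_output_file(out_dir, sample_name, locus, kind):
    """Return the path of <SAMPLE_NAME>.<LOCUS>.<kind>.tab inside the sample's subfolder."""
    return sample_output_dir(out_dir, sample_name) / f"{sample_name}.{locus}.{kind}.tab"
```

For instance, `sample_output_dir("/results", "PA0001-C")` yields the path `/results/PA00`.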
- <SAMPLE_NAME>.<LOCUS>.mutations.tab contains one line (record) for each codon within the locus in which a nonsynonymous mutation is found in this sample. If there are multiple mutations, the file will contain multiple records for the locus, one per mutation. Each record has the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - Locus The name of the locus, as specified in the configuration file
  - Codon The number of the codon where the mutation is found
  - Call The call for this mutation ("MU" is a homozygous mutant, "HE" a heterozygous one)
  - Mutation The mutation observed (e.g. C580Y represents a C -> Y mutation at codon 580)
  - TotalReadCount The read coverage of this codon
  - MutantReadCount The number of codon reads supporting the mutation call
  - MutantReadProp The proportion of codon reads supporting the mutation call
- <SAMPLE_NAME>.<LOCUS>.calls.tab contains a single line (record) with the composite call resulting from aggregating all mutation calls within the locus found in this sample. The record has the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - Locus The name of the locus, as specified in the configuration file
  - Call The call for this sample ("WT" is wild-type, "MU" is a homozygous mutant, "HE" a heterozygous mutant, and "MI" is a missing call)
  - Mutation The mutation(s) observed, as a comma-separated list. Heterozygous mutations are marked by an asterisk
  - MissingCodonCallsProp The proportion of codons within the locus where a call could not be made (coverage too low)
  - MedianReadCount The median read count at codons across the locus
  - MeanReadCount The mean read count at codons across the locus
- <SAMPLE_NAME>.<LOCUS>.multipleMutants.tab contains one line (record) for each pair of mutated codons found within the same reads. The aim is to capture evidence of multiple nonsynonymous mutations in the same sample. If no such pairs are found, the file is not generated. Each record has the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - Locus The name of the locus, as specified in the configuration file
  - Alleles The mutation(s) observed, as a comma-separated list.
  - ReadCount The number of reads in which the pair is observed.
- <SAMPLE_NAME>.<LOCUS>.messages.tab contains one line (record) for each warning message generated during the genotyping of the sample. Two warnings are currently output: NONREF_ALLELE_FROM_SINGLE_STRAND_IN_HET, when a non-reference allele is disregarded because most reads supporting it are on a single strand; and MULTIALLELIC_CODON_WITHOUT_REF, when no call is issued because multiple alleles are observed without the reference. Each record has the following fields:
  - Num A sequential record number (can be ignored)
  - Message The warning message code
  - Allele The allele for which the warning is issued.
  - ReadCount The number of reads supporting the allele.
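The Mutation and read-count fields described above can be unpacked with a few small helpers when post-processing the output files. The following is a rough sketch, not the tool's own code. It assumes amino acids are written as single uppercase letters (as in C580Y), that the asterisk marking a heterozygous mutation is attached to the mutation label (its exact position is not specified here, so it is stripped wherever it occurs), and that MutantReadProp is MutantReadCount divided by TotalReadCount. All function names are hypothetical.

```python
import re

# Assumption: amino acids are single uppercase letters, as in "C580Y".
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as "C580Y" into (reference, codon number, mutant)."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    ref_aa, codon, alt_aa = match.groups()
    return ref_aa, int(codon), alt_aa


def split_locus_mutations(field):
    """Split the comma-separated Mutation field of a calls.tab record.

    Returns (label, is_het) pairs. The asterisk is removed wherever it
    appears in a token, since its exact position is an assumption here.
    """
    pairs = []
    for token in field.split(","):
        token = token.strip()
        if token:
            pairs.append((token.replace("*", ""), "*" in token))
    return pairs


def mutant_read_prop(mutant_read_count, total_read_count):
    """Recompute the proportion of reads supporting a mutation call."""
    if total_read_count == 0:
        return 0.0  # arbitrary choice for uncovered codons in this sketch
    return mutant_read_count / total_read_count
```

How a wild-type or missing call is represented in the Mutation field is not stated here, so callers may need to handle such placeholder values before calling `parse_mutation`.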
The Genotype Aggregation Tool uses the files produced by the Genotyping Tool and aggregates them, producing sampleset-wide outputs.
Invoking the result aggregation tool:
```
java -Xms512m -Xmx2000m 'org.cggh.bam.heteroallelic.HeteroallelicAnalysis$MergeResults' \
<CONFIG_FILE> <SAMPLE_LIST_FILE> \
<REF_FASTA_FILE> <CHR_MAP_FILE> <OUT_DIR>
```
Note that the qualified class name must be enclosed in single quotes. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Heteroallelic configuration file”
- SAMPLE_LIST_FILE The path of a tab-separated text file, with one line (record) per sample to be processed (note: only the first column is used here). Each record contains three fields:
  - SAMPLE_NAME The identifier of the sample, e.g. “PA0001-C”
  - BAM_FILE The path of the BAM file for the sample
  - CHR_MAP_NAME The name of the chromosome name map to be used for this sample (e.g. “default”). See section “Chromosome Maps”
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- CHR_MAP_FILE The path to a file specifying the chromosome name maps available. See section “Chromosome Maps”.
- OUT_DIR The path to the folder where the output files will be written to.
The genotype aggregation tool outputs several tab-separated text files in output folder <OUT_DIR>:
- AllSamples.<LOCUS>.calls.tab This is the main output file, containing the genotype calls for each sample. It is an aggregation of the <SAMPLE_NAME>.<LOCUS>.calls.tab files (see above), with one line (record) per sample.
- AllSamples.<LOCUS>.mutations.tab This is an aggregation of the <SAMPLE_NAME>.<LOCUS>.mutations.tab files (see above), with one line (record) per mutation in a sample.
- AllSamples.<LOCUS>.multipleMutants.tab This is an aggregation of the <SAMPLE_NAME>.<LOCUS>.multipleMutants.tab files (see above), with one line (record) per mutation pair in a sample.
- AllSamples.<LOCUS>.messages.tab This is an aggregation of the <SAMPLE_NAME>.<LOCUS>.messages.tab files (see above), with one line (record) per warning.
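A common first step with the aggregated calls file is to tally how many samples received each call. The sketch below assumes that the file begins with a header row whose column names match the field names listed earlier (in particular a column named Call); this is not confirmed here and should be checked against an actual output file. The function name is hypothetical.

```python
import csv
from collections import Counter


def count_calls(lines):
    """Count records per Call value (e.g. WT, MU, HE, MI) in a calls.tab file.

    `lines` is any iterable of text lines, such as an open file handle.
    Assumes a header row containing a "Call" column.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["Call"] for row in reader)
```

It could be used by opening an aggregated file such as `AllSamples.<LOCUS>.calls.tab` with `open(path, newline="")` and passing the handle to `count_calls`.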
The configuration file for the Heteroallelic genotyping tools is a file in Java .properties format (a text-based file of name/value pairs). The following properties are specified:
- heteroallelic.loci A comma-separated list of locus names that will be used in the analysis
- heteroallelic.locus.<LOCUS_NAME>.region The span of the locus, over which mapped reads will be analyzed, in the form <chr>:<startPos>-<endPos> (must be specified for each locus). IMPORTANT: the locus must contain a whole number of consecutive codons, i.e. its length must be a multiple of 3
- heteroallelic.locus.<LOCUS_NAME>.reverse A boolean; if TRUE, indicates that the gene is encoded on the reverse strand (must be specified for each locus)
- heteroallelic.locus.<LOCUS_NAME>.startCodon The number of the first codon in the locus
The following is an example of a configuration file that will genotype the BTB/POZ and propeller domains (codon 350 upwards) of the pfkelch13 gene.
```
heteroallelic.loci=K13_ArtR_Domains
heteroallelic.locus.K13_ArtR_Domains.region=Pf3D7_13_v3:1724820-1725950
heteroallelic.locus.K13_ArtR_Domains.startCodon=350
heteroallelic.locus.K13_ArtR_Domains.reverse=true
```
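Because the region must span a whole number of codons, it can be worth checking a configuration before running the tool. The sketch below is an illustration under stated assumptions, not the tool's own validation. It parses only plain `name=value` lines (a small subset of the .properties format) and treats both region endpoints as inclusive, which is consistent with the example above: 1725950 - 1724820 + 1 = 1131 bases, or 377 codons. The function names are hypothetical.

```python
import re

REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_properties(text):
    """Parse simple name=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def locus_codon_range(props, locus):
    """Return (first codon, last codon) for a locus, checking the region length."""
    prefix = f"heteroallelic.locus.{locus}."
    match = REGION_PATTERN.match(props[prefix + "region"])
    if match is None:
        raise ValueError(f"malformed region for locus {locus}")
    start, end = int(match["start"]), int(match["end"])
    length = end - start + 1  # assumes both endpoints are inclusive
    if length % 3 != 0:
        raise ValueError(f"locus {locus} spans {length} bases, not a multiple of 3")
    first_codon = int(props[prefix + "startCodon"])
    return first_codon, first_codon + length // 3 - 1
```

Applied to the example configuration, this gives codons 350 to 726 for K13_ArtR_Domains.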