Skip to content

Codon based Genotyping tools

Olivo Miotto edited this page Oct 1, 2024 · 7 revisions

These tools produce amino acid (or amino acid sequences) calls at specified targets, by inspecting read from BAM file alignments. The genotyping uses the alignment approach and is able to process unmapped reads. The configuration file specifies one or more loci, each containing one or more target to be genotyped. A locus is essentially a region from which read alignments are extracted for analysis. Targets must consist of exactly one or more codons. Each sample is analyzed individually, and then the results can be aggregated.

Single-sample Codon-based Genotyping Tool

This is a tool for genotyping a single sample. Invoke as follows:

 java -Xms512m -Xmx2000m 'org.cggh.bam.codon.CodonAnalysis$SingleSample' \
     <CONFIG_FILE> <BATCH_ID> <SAMPLE_ID> <BAM_FILE> <REF_FASTA_FILE> <OUT_DIR>

Note that the qualified class name HAS to be between single quotes. The parameters are as follows:

  • CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Codon-based Genotyping Configuration File
  • BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
  • SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
  • BAM_FILE The path of the BAM file containing the sample's read alignment
  • REF_FASTA_FILE The path to a FASTA file containing the reference sequences used for alignment.
  • OUT_DIR The path to the root folder where the output files will be written to.

Multi-samples Codon-based Genotyping Tool

This is a tool for genotyping multiple samples. Each sample is processed separately in a multithreaded job, and the results are then aggregated, producing the same outputs as the Genotype Aggregation Tool (see below). Invoke as follows:

java -Xms512m -Xmx2000m 'org.cggh.bam.codon.Codon$MultiSample' \
     <CONFIG_FILE> <SAMPLE_LIST_FILE> <REF_FASTA_FILE> <OUT_DIR>

Note that the qualified class name HAS to be between single quotes. The parameters are as follows:

  • CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Codon-based Genotyping Configuration File

  • SAMPLE_LIST_FILE The path of tab-separated text file, with one line (record) per sample to be processed. Each record contains three fields:

    • BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
    • SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
    • BAM_FILE The path of the BAM file containing the sample's read alignment
  • REF_FASTA_FILE The path to a FASTA file containing the reference sequences used for alignment.

  • OUT_DIR The path to the root folder where the output files will be written to.

Codon-based Genotyping Tools Outputs

The genotyping tools output three tab-separated text files for each sample. These files are placed in an output folder <OUT_DIR>/<BATCH_ID>.

  • <SAMPLE_NAME>.calls.tab contains one line (record) for each target, showing the final call for that target. Each record has the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • Locus The name of the target’s locus, as specified in the configuration file
    • Target The name of the target, as specified in the configuration file
    • Call A code to show the type of call: WT=wild-type (homozygous call for reference amino acid); MU=Mutation (homozygous call for non-reference amino acid); HE=heterozygous (multiple amino acids produced); MI=missing (no call could be made due to insufficient coverage)
    • Amino The amino acid call
    • AminoNref The amino acid call in "nonref" format (i.e. using "." instead of the reference amino acid)
    • Nt The codon (nucleotide) call
    • NtNref The codon (nucleotide) call in "nonref" format (i.e. using "." instead of the reference codon)
    • Counts Number of supporting reads for each codon found in the alignment
  • <SAMPLE_NAME>.alleles.tab contains one line (record) for each allele genotyped at each target. If at a given target multiple alleles are observed, the file will contain multiple records for the target, one per allele. Each record has the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • Locus The name of the target’s locus, as specified in the configuration file
    • Target The name of the target, as specified in the configuration file
    • Allele Allele (nucleotide sequence) found
    • Amino Amino acid translation of allele
    • Count Number of supporting reads for this allele
  • <SAMPLE_NAME>.locusCoverage.tab contains one line (record) for each target, detailing the coverage (number of reads used in the analysis) for the target. Each record has the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • Locus The name of the target’s locus, as specified in the configuration file
    • Target The name of the target, as specified in the configuration file
    • Aligned The number of reads that were aligned by anchor matching
    • Misaligned The number of reads where an anchor was matched, but were discarded from further analysisbecause of a high number of mismatches (configurable with property grc.alignment.maxReadMismatches) between the read sequence and the consensus sequence of the aligned reads.
    • Covering The number of aligned reads that fully covered the target
    • Calls The number of covering reads for which a codon was called
    • LowQuality The number of covering reads that failed to produce a codon, usually because of poor base quality

Codon-based Genotype Aggregation Tool

This is a tool for genotyping multiple samples. Each sample is processed separately in a multithreaded job, and the results are then aggregated, producing the same outputs as the Genotype Aggregation Tool (see below). Invoke as follows:

The Genotype Aggregation Tool uses the output results produced by the Single Sample Genotyping Tool and aggregates them, producing results for the whole sample set. The aggregation tool also performs some statistics across the whole sample set. Invoke as follows:

java -Xms512m -Xmx2000m 'org.cggh.bam.codon.CodonAnalysis$MergeResults' \
     <CONFIG_FILE> <SAMPLE_LIST_FILE> <REF_FASTA_FILE> <OUT_DIR>

Note that the qualified class name HAS to be between single quotes. The parameters are as follows:

  • CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Codon-based Genotyping Configuration File

  • SAMPLE_LIST_FILE The path of tab-separated text file, with one line (record) per sample to be processed. Each record contains three fields:

    • BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
    • SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
    • BAM_FILE The path of the BAM file containing the sample's read alignment
  • REF_FASTA_FILE The path to a FASTA file containing the reference sequences used for alignment.

  • OUT_DIR The path to the root folder where the output files will be written to.

Codon-based Genotype Aggregation Tool Outputs

The genotype aggregation tool outputs several tab-separated text files in output folder <OUT_DIR>:

  • AllCallsBySample.tab This is the main output file, which summarizes the genotype calls for each sample. Each line (record) shows the genotypes for a sample, with the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • One column per target, containing the amino acid called. Each column has a header showing the target name, following by the 3D7 allele in square bracket (e.g. "crt_326[N]"). In heterozygous calls, a comma-separated list of all the alleles calledis given.
  • AllCallsNrefBySample.tab Contains the same data as AllCallsBySample.tab, except that the reference amino acid is replaced by a dot (‘.’). As a result, this file is somewhat more informative at a glance if one wants to inspect the presence of non-reference alleles

  • AlleleSampleCount.<LOCUS>_<TARGET>.tab There is one of these files produced per target. It shows the counts of reads supporting each allele observed at this target, in each sample, with the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • One column per allele observed at this target in the sample set, containing the number of reads supporting the allele’s presence in the sample. The column header is the nucleotide sequence of the allele, followed by the amino acid sequence in square brackets.
  • AlleleStats.<LOCUS>_<TARGET>.tab There is one of these files produced per target. It shows some statistics for each of the observed alleles at this target in the sampleset. Each record represents an allele, with the following fields:

    • Num A sequential record number (can be ignored)
    • Allele The allele, represented as the nucleotide sequence of the allele, followed by the amino acid sequence in square brackets.
    • SampleCount The count of samples in which the allele was observed
    • MaxReads The highest read count seen for this allele in any one sample
    • MaxReadFraction The highest fraction of total reads seen for this allele in any one sample
  • CallsBySample.<LOCUS>_<TARGET>.tab There is one of these files produced per target. It shows details of how each sample was called- essentially, a more detailed form of AllCallsBySample.tab. Each record represents a sample, with the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • Call The sample call: WT (reference allele), MU (mutant allele), HE (heterozygous), MI (missing call)
    • Alleles A comma-separated list of the amino acid alleles
    • NtAlleles A comma-separated list of the codon/nucleotide alleles
    • AlleleReads A comma-separated list of the codon alleles, showing the amino acid and the the count of reads supporting the allele
  • LocusCoverage.<LOCUS>.tab There is one of these files per locus. It shows counts of the reads that were used in the analysis of that locus. Each record represents a sample, with the following fields:

    • Num A sequential record number (can be ignored)
    • Batch The identifier of the sample's batch
    • Sample The identifier of the sample, e.g. “PA0001-C”
    • Locus The name of the target’s locus, as specified in the configuration file
    • Target The name of the target, as specified in the configuration file
    • Aligned The number of reads that were aligned by anchor matching
    • Misaligned The number of reads where an anchor was matched, but a high number of point differences (currently >10) was found between the read sequence and the consensus sequence of the aligned reads.
    • Covering The number of aligned reads that fully covered the target
    • Calls The number of covering reads for which a codon was called
    • LowQuality The number of covering reads that failed to produce a codon, usually because of poor base quality

Codon-based Genotyping Configuration File

The configuration file for the GRC genotyping tools is a file in Java .properties format (a text-based file of name/value pairs). The following properties are specified:

  • codon.genotype.minCallReadCount (integer, default=5) The minimum number of total covering reads required to make a call for a target

  • codon.genotype.minAlleleReadCount (integer, default=2) The minimum number of covering reads required to make a call for an allele at a target (e.g. in a het call)

  • codon.genotype.minAlleleReadProp (double, default=0.10) The minimum proportion of the total covering reads required to make a call for an allele at a target (e.g. in a het call)

  • codon.genotype.minBaseQScore (integer, default=10) The minimum base quality score that must be present at each target position for a given read to be used to call the target's allele

  • codon.alignment.maxIndelSize (integer) The maximum size of insertions or deletions that will be tolerated in alignment reads; reads with larger indels will be disregarded

  • codon.alignment.maxReadMismatches (integer) The maximum number of mismatches between an alignment read and the alignment consensus; reads with a greater number of mismatches will be disregarded

  • codon.loci A comma-separated list of locus names that will be used in the analysis

  • codon.locus.<LOCUS_NAME>.region The span of the locus, over which mapped reads will be analyzed, in the form <chr>:<startPos>-<endPos> (must be specified for each locus)

  • codon.locus.<LOCUS_NAME>.targets A comma-separated list of targets to be genotyped. Each target is specified in the format <targetName>@<targetStartPos>-<targetEndPos>. The target must contain exactly an integer number of codons (a multiple of 3) which will be translated to amino acids

  • codon.locus.<LOCUS_NAME>.anchors A comma-separated list of the anchors (regex) that will be used to match the reads. Each anchor is specified in the format <anchorStartPos>@<regex> where <regex> is a regular expression that will be matched against the reads

  • codon.locus.<LOCUS_NAME>.analyzeUnmappedReads If set to 'true' the unmapped reads in the BAM will be searched for anchor sequences and remapped. Default (when property is missing) is 'false'. This has a performance impact, so it should only be set to 'true' at loci thought to be problematic for alignments.

The following is an example of a configuration file that will genotype the core (72-76) haplotype of pfcrt, and the arps10-127 mutation. Note that the arps10 anchors will match alternative alleles at some positions (“[CA]” will match either C or A), and skip testing some position (“.” is a wildcard)

codon.genotype.minCallReadCount=10 
codon.genotype.minAlleleReadCount=5
codon.genotype.minAlleleReadProp=0.1

codon.alignment.maxIndelSize=6
codon.alignment.maxReadMismatches=10

codon.loci=crt_core,arps10
# CRT
codon.locus.crt_core.region=Pf3D7_07_v3:403500-403800
codon.locus.crt_core.targets=crt_72-76@403612-403626
codon.locus.crt_core.anchors=403593@TATTATTTATTTAAGTGTA,403627@ATTTTTGCTAAAAGAAC
# ARPS10
codon.locus.arps10.region=Pf3D7_14_v3:2480900-2481200
codon.locus.arps10.targets=arps10_127@2481070-2481072
codon.locus.arps10.anchors=2481045@ATTTAC[CA]TTTTTGCGATCTCCCCAT...[GC],2481079@GACAGT[AC]G[AG]GA[GA]CAATTCGAAATAAAAC 
codon.locus.arps10.analyzeUnmappedReads=false