
boostDM cancer 2023

Pre-release
@koszulordie koszulordie released this 22 Apr 14:11

boostDM cancer v1.1.0-alpha

This is a boostDM version for driver mutation identification in tumors that is compatible with the output produced by intOGen v2023.05.31. It includes several improvements and new features, among them changes in model implementation, updated reference datasets and bug fixes. The new features of this version are described below.

Model implementation

Uniform cluster features across cohorts of same tumor type category

Although mutational cluster features (OncodriveCLUSTL, HotMAPS and smRegions) are generated on a cohort-by-cohort basis, the boostDM models are specified by tumor type, so an aggregation or consensus step is needed to annotate the mutations in the training and test sets.

In the previous release, mutations with the same genomic identity (same genomic coordinates and alternate allele) could be annotated with different mutational cluster feature values depending on the cohort in which they were originally reported. As a result, instances of the same mutation could carry different cluster feature values across samples of the same tumor type.

In the current version we have changed this criterion. Now if a mutation is mapped by a clustering feature in some cohort belonging to a given tumor type, all the instances of the mutation across the tumor type will get the same cluster feature value in the training and prediction corresponding to this tumor type.
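The new criterion can be sketched as follows. This is only an illustration under assumed data structures: the record layout (`mutation`, `tumor_type`, `cohort`, `in_cluster`) and the function name `harmonize_cluster_flags` are hypothetical and are not taken from the boostDM code base.

```python
def harmonize_cluster_flags(records):
    """Give every instance of a mutation within a tumor type the same cluster flag.

    `records` is a list of dicts with the hypothetical keys 'mutation',
    'tumor_type', 'cohort' and 'in_cluster' (bool, the cohort-level call).
    """
    # Collect the (tumor type, mutation) pairs flagged in at least one cohort
    flagged = {
        (r["tumor_type"], r["mutation"]) for r in records if r["in_cluster"]
    }
    # Return new records; the input list is left untouched
    return [
        dict(r, in_cluster=(r["tumor_type"], r["mutation"]) in flagged)
        for r in records
    ]


if __name__ == "__main__":
    example = [
        {"mutation": "m1", "tumor_type": "T1", "cohort": "c1", "in_cluster": True},
        {"mutation": "m1", "tumor_type": "T1", "cohort": "c2", "in_cluster": False},
        {"mutation": "m1", "tumor_type": "T2", "cohort": "c3", "in_cluster": False},
    ]
    for row in harmonize_cluster_flags(example):
        print(row["tumor_type"], row["cohort"], row["in_cluster"])
```

In this toy input, the instance reported in cohort `c2` inherits the flag from cohort `c1` because both cohorts belong to tumor type `T1`, while the instance in tumor type `T2` is unaffected.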

We have developed OpenVariant, a new Python package to parse the input files. It annotates all SNVs and indels from the mutation data files and transforms the data into a standardized format, following a YAML file prepared by the user.

Importantly, the previous bgparsers tool had a bug in the annotation of indels, leading to some indels not being processed. This bug has been corrected in OpenVariant. As a consequence, there are, overall, more mutations annotated per sample.

One single feature per cluster type taking three ordinal values

Each cluster method (OncodriveCLUSTL, HotMAPS and smRegions) is now associated with a single ordinal feature with three levels:

2 - the mutation is associated with the cluster in the same tumor type
1 - the mutation is not associated with the cluster in the same tumor type, but is associated with it in another tumor type
0 - the mutation is not associated with the cluster in any tumor type

Therefore we no longer keep the distinction between the dichotomous cat_1 and cat_2 cluster features.
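As a rough illustration of the three-level encoding, the sketch below assigns the ordinal value for one cluster method. The input structure (a dict mapping each tumor type to the set of mutations falling in a cluster) and the function name `ordinal_cluster_feature` are assumptions made for this example only.

```python
def ordinal_cluster_feature(mutation, tumor_type, clusters_by_tumor_type):
    """Return 2, 1 or 0 for one cluster method, following the levels above.

    `clusters_by_tumor_type` is a hypothetical dict: tumor type -> set of
    mutations that the method associates with a cluster in that tumor type.
    """
    # Level 2: clustered in the tumor type the model is built for
    if mutation in clusters_by_tumor_type.get(tumor_type, set()):
        return 2
    # Level 1: clustered only in some other tumor type
    for other_type, mutations in clusters_by_tumor_type.items():
        if other_type != tumor_type and mutation in mutations:
            return 1
    # Level 0: not clustered anywhere
    return 0


if __name__ == "__main__":
    clusters = {"T1": {"m1"}, "T2": {"m2"}}
    print(ordinal_cluster_feature("m1", "T1", clusters))  # 2
    print(ordinal_cluster_feature("m2", "T1", clusters))  # 1
    print(ordinal_cluster_feature("m3", "T1", clusters))  # 0
```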

Moreover, we have decided not to use the OncodriveCLUSTL_SCORE numerical feature: it yields little gain in predictive ability and adds complexity when interpreting the feature contributions at prediction.

Unweighted consensus of base classifiers

The forecast aggregator combining the 50 base classifiers per model is now unweighted, meaning that the base classifiers are no longer favoured or penalized based on their cross-validation performance. Particularly when the cross-validation test set remaining after removing repeated mutation instances is very small, the F-score50 of a single base classifier can be biased, and it is not sufficient to determine whether that base classifier should be promoted in the forecast aggregator.
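The difference between the two aggregation modes can be sketched as below. The release notes do not specify the exact aggregation formula, so the plain arithmetic mean used here is an assumption, and both function names are hypothetical.

```python
def unweighted_consensus(base_scores):
    """Plain average of the base classifier scores (assumed aggregation)."""
    return sum(base_scores) / len(base_scores)


def weighted_consensus(base_scores, weights):
    """Weighted average, shown only for contrast with the unweighted mode.

    `weights` would stand for a per-classifier performance measure such as
    the cross-validation F-score50 (an assumption made for illustration).
    """
    total = sum(weights)
    return sum(s * w for s, w in zip(base_scores, weights)) / total


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4]    # toy scores from three base classifiers
    fscores = [0.95, 0.9, 0.5]  # toy cross-validation performances
    print(unweighted_consensus(scores))
    print(weighted_consensus(scores, fscores))
```

In the unweighted mode every base classifier contributes equally, so a noisy performance estimate obtained on a very small test set cannot shift the consensus.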

Model selection criteria

Our model selection strategy is conducted in two steps:

1. Models are only trained if there are at least 30 mutations on average in the training splits. Note that the training splits are label-balanced sets comprising 70% of the total mutation set available for model training. The size of the mutation set available for training in turn depends on the dN/dS excess per consequence type inferred in each cohort.

2. We apply a composite rule that requires:

  • mean F-score50 >= 0.8
  • a minimum number of observed mutations for each tier defined by the discovery index

We have updated step 2 by changing the discovery index tiers and the minimum number of observed mutations for each tier. The new step 2 would read in Python code as follows:

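# Discovery index tiers, each paired with the minimum number of observed
# mutations required for a model falling in that tier
DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8

def meet_condition(fscore, discovery, n_muts):
    # The mean F-score50 must reach the threshold regardless of the tier
    if fscore >= FSCORE_THRESHOLD:
        # Accept as soon as one (discovery, n_muts) tier pair is satisfied
        for discovery_thresh, n_muts_thresh in zip(DISCOVERY_TIERS, MUTATION_TIERS):
            if (discovery >= discovery_thresh) and (n_muts >= n_muts_thresh):
                return True
    return False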

The rationale behind this change is that broader discovery index tiers yield more qualitatively distinct discovery classes. In particular, our undersampling study shows that it is very difficult to assert qualitative differences between discovery index values within the (0, 0.5) range, so subdividing this range is not well justified.
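One way to read the step 2 rule is as a minimum mutation count that relaxes as the discovery index grows. The sketch below restates the rule in that form using the tier values given above; the helper names `min_mutations_required` and `meet_condition_by_tier` are hypothetical and introduced only for this illustration.

```python
DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8


def min_mutations_required(discovery):
    """Smallest mutation count accepted for a given discovery index."""
    # Every tier whose discovery threshold is reached is eligible;
    # the most permissive eligible tier decides the requirement.
    eligible = [
        n_muts
        for thresh, n_muts in zip(DISCOVERY_TIERS, MUTATION_TIERS)
        if discovery >= thresh
    ]
    return min(eligible) if eligible else None


def meet_condition_by_tier(fscore, discovery, n_muts):
    """Same decision as the step 2 rule, written via the helper above."""
    required = min_mutations_required(discovery)
    return (
        fscore >= FSCORE_THRESHOLD
        and required is not None
        and n_muts >= required
    )


if __name__ == "__main__":
    for discovery in (0.2, 0.6, 0.8):
        print(discovery, min_mutations_required(discovery))
    # 0.2 -> 50, 0.6 -> 30, 0.8 -> 0
```

For example, a model with mean F-score50 of 0.85 and a discovery index of 0.6 is kept with 35 observed mutations but not with 20, whereas a model with a discovery index of 0.8 is kept regardless of the number of observed mutations.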

Pan-gene models are no longer calculated

We have discontinued the training of meta-gene models based on the dichotomous mode-of-action classification into oncogenes (Act) and tumor suppressor (LoF) genes. The pipeline now only renders models that are gene-specific. Consequently the usage of the oncogenic mode-of-action or Role of the gene as a feature has been discontinued, too.

Data updates

Variant Effect Predictor

The new release uses ENSEMBL Variant Effect Predictor (VEP) v.101 annotations instead of ENSEMBL VEP v.92.

Genomic regions comprised in the models

This release uses a new definition of region both at training and at prediction.

The new regions of interest include exons as well as intronic splicing regions mapping to the canonical transcripts defined by ENSEMBL v.101. Intronic splicing regions are defined as the intronic sites within 25 bp of the pre-mRNA exon-intron junctions determined by the canonical transcript.
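The definition of intronic splicing regions can be illustrated with a small interval computation. This is a sketch under stated assumptions: exons are given as sorted, non-overlapping, 1-based inclusive `(start, end)` pairs, "within 25 bp" is read as the 25 intronic bases adjacent to each junction, and the function name is hypothetical. The actual pipeline may handle coordinates differently.

```python
SPLICE_FLANK = 25  # bp, as stated in the release notes


def intronic_splice_regions(exons, flank=SPLICE_FLANK):
    """Return intronic intervals lying within `flank` bp of an exon-intron junction.

    `exons` is a sorted list of (start, end) tuples, 1-based inclusive
    (an assumed convention for this example).
    """
    regions = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron_start, intron_end = prev_end + 1, next_start - 1
        if intron_start > intron_end:
            continue  # adjacent exons, no intron in between
        left = (intron_start, min(intron_start + flank - 1, intron_end))
        right = (max(intron_end - flank + 1, intron_start), intron_end)
        if left[1] >= right[0] - 1:
            # Short intron: the two flanks touch or overlap, keep it whole
            regions.append((intron_start, intron_end))
        else:
            regions.extend([left, right])
    return regions


if __name__ == "__main__":
    toy_exons = [(100, 200), (301, 400), (431, 500)]
    print(intronic_splice_regions(toy_exons))
    # [(201, 225), (276, 300), (401, 430)]
```

The first intron (100 bp) contributes two 25 bp flanks, while the second intron (30 bp) is shorter than two flanks and is therefore covered entirely.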

Updated consequence type definitions

We have extended the set of sequence ontology (SO) terms that map to the consequence classification we employ in the pipeline, which consists of four consequence types: “missense”, “nonsense”, “splicing” and “synonymous”.

Nonsense mutations now include the SO terms:

stop_gained
stop_lost
start_lost

Splicing mutations now include the SO terms:

splice_donor_variant
splice_acceptor_variant
splice_region_variant
intron_variant

In the regions we analyze, intronic mutations can only be found in the intronic splicing regions described in the previous section.
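The two term lists above can be expressed as a lookup table. The sketch below covers only the SO terms named in these notes; the terms mapped to “missense” and “synonymous” are not listed here, so they are deliberately left out, and the names `CONSEQUENCE_BY_SO_TERM` and `classify_consequence` are hypothetical.

```python
# SO terms listed in these release notes, mapped to the pipeline's
# consequence types. Missense and synonymous terms are not enumerated
# in the notes and are therefore omitted from this illustration.
CONSEQUENCE_BY_SO_TERM = {
    "stop_gained": "nonsense",
    "stop_lost": "nonsense",
    "start_lost": "nonsense",
    "splice_donor_variant": "splicing",
    "splice_acceptor_variant": "splicing",
    "splice_region_variant": "splicing",
    "intron_variant": "splicing",
}


def classify_consequence(so_term):
    """Return the consequence type for an SO term, or None if not covered here."""
    return CONSEQUENCE_BY_SO_TERM.get(so_term)


if __name__ == "__main__":
    print(classify_consequence("start_lost"))      # nonsense
    print(classify_consequence("intron_variant"))  # splicing
```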

Technical updates and fixes

Fixed reference genome build incompatibility

Recently we found a misspecification in the code base that prevented the randomization of passenger mutations from working as intended. As described in the main documentation, our method resorts to trinucleotide context probabilities derived from the observed frequencies in the cohort mutational data. With these probabilities we can then draw random mutations to create the set of passenger mutations used as the negative set in the supervised learning step conducted with XGBoost.
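The randomization idea can be sketched in a few lines. This is a simplified illustration rather than the pipeline's implementation: the context labels, the candidate site structure, sampling with replacement and both function names are assumptions made for the example.

```python
import random
from collections import Counter


def context_probabilities(observed_contexts):
    """Relative frequency of each trinucleotide change seen in a cohort."""
    counts = Counter(observed_contexts)
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}


def draw_passengers(candidates, probabilities, k, seed=0):
    """Draw k candidate mutations, weighted by their context probability.

    `candidates` is a hypothetical list of (site_id, context) tuples for one
    gene. Sampling is with replacement (an assumption). random.choices
    raises ValueError if no candidate has a positive weight.
    """
    rng = random.Random(seed)
    weights = [probabilities.get(context, 0.0) for _, context in candidates]
    return rng.choices(candidates, weights=weights, k=k)


if __name__ == "__main__":
    observed = ["ACA>T", "ACA>T", "TCG>A", "GCC>T"]  # toy cohort data
    probs = context_probabilities(observed)
    sites = [("site1", "ACA>T"), ("site2", "TCG>A"), ("site3", "AAA>C")]
    print(draw_passengers(sites, probs, k=5))
```

The sketch also shows why the reference triplets matter: if a site is assigned the wrong trinucleotide context, it is drawn with the wrong weight, and the resulting negative set no longer follows the cohort's mutational profile.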

Because incompatible genome builds (hg38, hg19) were used in the step that converted genomic coordinates into reference triplets, the canonical transcript positions of some genes were not correctly matched with their respective reference trinucleotides. Consequently, for some genes the negative set of mutations was drawn following a mutational profile that in some cases might differ significantly from the empirical one informed by the cohort mutational data.

In the current version the hg38 genome build is consistently used throughout the entire pipeline.

IntOGen integration

IntOGen now produces a unified data environment, so that the boostDM pipeline can be run from the new IntOGen outputs with little preprocessing required.