Releases: bbglab/boostdm-pipeline
boostDM cancer 2024
BoostDM v2024.07.15-cancer
This boostDM release has been implemented with the output produced by IntOGen v2024.06.21. Several improvements and features have been included, among them changes in model implementation, reference datasets and bugfixes. Herein we describe the main technical changes, summarize the new output of the pipeline and provide a comparison with the previous release.
Model Implementation
Uniform cluster features across cohorts of same tumor type category
Although mutational cluster features (OncodriveCLUSTL, HotMAPS and smRegions) are generated on a cohort-by-cohort basis, boostDM models are specified by tumor-type, meaning that an aggregation or consensus step has to be taken in order to annotate the mutations in the training and test sets.
In the previous release, mutations with the same genomic identity (same genomic coordinates and alternate allele) would be annotated with different mutational cluster feature values depending on which cohort they were originally reported in, thereby yielding instances of mutations with the same identity but different cluster feature values across samples of the same tumor type.
In the current version we have changed this criterion. Now if a mutation is mapped by a clustering feature in some cohort belonging to a given tumor type, all the instances of the mutation across the tumor type will get the same cluster feature value in the training and prediction corresponding to this tumor type.
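The new criterion can be illustrated with a minimal sketch. It assumes a simplified record layout (the keys `ttype`, `cohort`, `chrom`, `pos`, `alt` and `in_cluster` are hypothetical, as is the function name) and is not the pipeline's actual implementation:

```python
from collections import defaultdict


def unify_cluster_flags(records):
    """Give every instance of a mutation the same cluster flag within a tumor type.

    `records` is a list of dicts with the hypothetical keys
    'ttype', 'cohort', 'chrom', 'pos', 'alt' and 'in_cluster' (bool).
    """
    # A mutation counts as clustered in a tumor type if at least one
    # cohort of that tumor type maps it to a cluster.
    clustered = defaultdict(bool)
    for r in records:
        key = (r["ttype"], r["chrom"], r["pos"], r["alt"])
        clustered[key] = clustered[key] or r["in_cluster"]

    # Return new records carrying the tumor-type-wide flag; the input is not modified.
    return [
        {**r, "in_cluster": clustered[(r["ttype"], r["chrom"], r["pos"], r["alt"])]}
        for r in records
    ]
```

With this sketch, a mutation flagged in one cohort of a tumor type is flagged in all cohorts of that tumor type, while the same mutation in another tumor type keeps its own value.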
Ordinal encoding of cluster features
Each cluster method (OncodriveCLUSTL, HotMAPS and smRegions) is now associated with a single ordinal feature with three levels:
2 - the mutation is associated with a cluster in the same tumor type
1 - the mutation is only associated with a cluster in another tumor type
0 - the mutation is not associated with a cluster in any case
Therefore we no longer keep the distinction between cat_1 and cat_2 binary-encoded cluster features. Moreover, we no longer use the OncodriveCLUSTL_SCORE numerical feature, because it brought little gain in predictive ability and explainability while adding complexity to the interpretation of feature contributions at prediction.
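As a hedged illustration of the three-level encoding, the sketch below assumes that for each mutation and cluster method we already know the set of tumor types in which the mutation maps to a cluster. The function name and its arguments are hypothetical:

```python
def encode_cluster_feature(ttype, cluster_ttypes):
    """Ordinal encoding of one cluster method for one mutation.

    ttype          -- tumor type of the model (training or prediction context)
    cluster_ttypes -- set of tumor types where the mutation maps to a cluster
    """
    if ttype in cluster_ttypes:
        return 2  # clustered in the same tumor type
    if cluster_ttypes:
        return 1  # clustered only in other tumor types
    return 0      # not clustered anywhere
```

One such value would be computed per cluster method, giving three ordinal features per mutation.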
Unweighted consensus of base classifiers
The forecast aggregator combining the 50 base classifiers per model is now unweighted, meaning that the base classifiers are no longer favored or penalized based on their cross-validation performance. Particularly in those cases where the cross-validation test set remaining after removing repeated mutation instances is very small, the performance score of a single base classifier can be biased, and is therefore a misleading basis for deciding whether the behavior of that base classifier should be promoted in the forecast aggregator.
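A minimal sketch of an unweighted consensus follows. It assumes that each base classifier outputs a score in [0, 1] and that the aggregation is a plain arithmetic mean; the exact aggregation function used by the pipeline is not specified here, so treat this as an assumption:

```python
def unweighted_consensus(base_scores):
    """Combine base classifier scores with equal weights (arithmetic mean)."""
    if not base_scores:
        raise ValueError("at least one base classifier score is required")
    # Every base classifier contributes equally, regardless of its
    # cross-validation performance.
    return sum(base_scores) / len(base_scores)
```

In a weighted scheme, each score would instead be multiplied by a performance-derived weight before summing, which is what this release drops.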
Model selection
Our model selection strategy is now conducted in two steps:
- Models are only trained if there are at least 30 mutations on average across the training splits of the 50 base classifiers. Note that the training splits are label-balanced sets comprising 70% of the total mutation training set, and that the size of the mutation set available for training in turn depends on the dN/dS excess per consequence type inferred in each cohort.
- We apply a composite rule that requires:
  - a mean cross-validation F-score50 across base classifiers >= 0.8
  - a minimum number of observed mutations for each tier defined by the discovery index
We have updated step 2 by changing the discovery index tiers and the minimum number of observed mutations for each tier. The new step 2 reads in Python code as follows:
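# Lower bounds of the discovery index tiers and, for each tier, the
# minimum number of observed mutations required.
DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8

def meet_condition(fscore, discovery, n_muts):
    # The mean F-score50 must pass the threshold and at least one
    # (discovery index, number of mutations) tier must be satisfied.
    if fscore >= FSCORE_THRESHOLD:
        for discovery_thresh, n_muts_thresh in zip(DISCOVERY_TIERS, MUTATION_TIERS):
            if (discovery >= discovery_thresh) and (n_muts >= n_muts_thresh):
                return True
    return False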
The rationale behind this change is that it is very difficult to assert qualitative differences between discovery index values within the (0, 0.5) range, making a case subdivision in this range not well justified. Therefore by providing broader discovery index tiers we make sure to provide more qualitatively distinct discovery classes.
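Step 1 of the selection strategy (the minimum training size) can be sketched in the same spirit. The constant and function names below are hypothetical, and the input is assumed to be the list of training split sizes, one per base classifier:

```python
from statistics import mean

# Assumed name for the threshold stated in step 1.
MIN_MEAN_TRAINING_SIZE = 30


def enough_training_mutations(split_sizes):
    """Return True if the training splits hold at least 30 mutations on average."""
    if not split_sizes:
        return False  # no splits available, so no model can be trained
    return mean(split_sizes) >= MIN_MEAN_TRAINING_SIZE
```

A gene and tumor type pair would only proceed to training, and later to the composite rule of step 2, if this check passes.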
Pan-gene models are no longer trained
We no longer train meta-gene models based on the dichotomous mode-of-action classification into oncogenes (Act) and tumor suppressor (LoF) genes. This is in part because such models end up being dominated by the few genes that contribute the most mutations to each Act or LoF class. The pipeline now only renders models that are gene-specific. Consequently the oncogenic mode-of-action or “role” of the gene is no longer used as a feature. As a result, the column “selected_model_gene” is no longer used in the in silico saturation mutagenesis output tables.
Prediction output format
The in silico saturation mutagenesis output is displayed with some formatting differences, with filenames following the pattern [gene].model.[ttype_model].features.[ttype_features].prediction.tsv.gz. The term ttype_model represents the tumor type used as training context for the specific model used to cast the prediction, that is, the set of cohorts where the signals of positive selection were calculated and from which the training mutations were taken. The term ttype_features is the tumor type corresponding to the features that are used to encode the input mutation as a vector of feature values for prediction, i.e., the tumor type of the query mutation. In this release, we provide in silico saturation mutagenesis outputs where the predictive model has been trained in the same tumor type context as the query mutation. Because the contexts of training and query are now explicit, the column selected_model_ttype is no longer part of the output tables.
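For downstream scripts, the filename pattern can be parsed with a regular expression. This is a sketch that assumes gene symbols and tumor type labels contain no dots; the helper name is hypothetical:

```python
import re

# [gene].model.[ttype_model].features.[ttype_features].prediction.tsv.gz
PREDICTION_FILENAME = re.compile(
    r"^(?P<gene>[^.]+)\.model\.(?P<ttype_model>[^.]+)"
    r"\.features\.(?P<ttype_features>[^.]+)\.prediction\.tsv\.gz$"
)


def parse_prediction_filename(filename):
    """Split a prediction filename into gene, ttype_model and ttype_features."""
    match = PREDICTION_FILENAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected prediction filename: {filename}")
    return match.groupdict()
```

For a made-up filename such as GENE.model.TTYPE.features.TTYPE.prediction.tsv.gz, the function returns the gene and the two tumor type labels, which are equal for all outputs provided in this release.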
Data Updates
Ensembl Variant Effect Predictor
The new release uses ENSEMBL Variant Effect Predictor (VEP) v.111 annotations instead of ENSEMBL VEP v.92.
Transcripts
Ensembl canonical transcripts are no longer the reference transcripts used to interpret relevant sites and the consequence type of the mutations. The current release is built on the MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts. We have switched to MANE Select transcripts to align with common standards used in clinical practice. This redefinition of the transcripts is in accordance with the current IntOGen release v2024.06.21, where it has an impact on the regions used in methods like OncodriveFML, OncodriveCLUSTL and smRegions, as well as on the computation of the mutational profiles and the construction of coding sequence regions for dNdScv.
Genomic regions of interest
In the current release, the genomic regions considered for training and prediction include the entire coding sequence according to MANE Select, including stop codons (as in the previous release), as well as the 5 noncoding base-pairs flanking each exonic coding sequence segment. The inclusion of the noncoding base-pairs is intended to exploit the signals of positive selection cast by noncoding splicing-affecting mutations. The mapping is done according to the MANE Select transcripts retrieved from VEP v.111.
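The region definition can be sketched as follows, assuming coding sequence segments are given as 1-based inclusive (start, end) pairs on one chromosome. Merging of segments that overlap after extension is an assumption of this sketch, as are the names used:

```python
FLANK = 5  # noncoding base-pairs added on each side of a coding segment


def regions_of_interest(cds_segments, flank=FLANK):
    """Extend each coding segment by `flank` bp and merge overlapping results."""
    extended = sorted((max(1, start - flank), end + flank) for start, end in cds_segments)
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent to the previous region: fuse them.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, segments (100, 200) and (300, 400) become (95, 205) and (295, 405).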
Updated consequence type definitions
We have extended the set of sequence ontology (SO) terms that map to the consequence classification we employ in the pipeline, which consists of four consequence types: “missense”, “nonsense”, “splicing” and “synonymous”.
- The term nonsense now includes the SO terms:
stop_gained
stop_lost
start_lost
- The term splicing now includes the SO terms:
splice_donor_variant
splice_acceptor_variant
splice_region_variant
splice_donor_5th_base_variant
splice_donor_region_variant
splice_polypyrimidine_tract_variant
intron_variant
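The mapping above can be written as a lookup table. In this sketch the nonsense and splicing entries come from the lists above, whereas the `missense_variant` and `synonymous_variant` entries are an assumption (they are standard SO terms, but the full sets used by the pipeline for those two classes are not listed here). The function name is hypothetical:

```python
SO_TO_CONSEQUENCE = {
    # nonsense (as listed above)
    "stop_gained": "nonsense",
    "stop_lost": "nonsense",
    "start_lost": "nonsense",
    # splicing (as listed above)
    "splice_donor_variant": "splicing",
    "splice_acceptor_variant": "splicing",
    "splice_region_variant": "splicing",
    "splice_donor_5th_base_variant": "splicing",
    "splice_donor_region_variant": "splicing",
    "splice_polypyrimidine_tract_variant": "splicing",
    "intron_variant": "splicing",
    # assumed entries for the remaining two classes
    "missense_variant": "missense",
    "synonymous_variant": "synonymous",
}


def consequence_type(so_term):
    """Map a single SO term to a pipeline consequence type, or None if unmapped."""
    return SO_TO_CONSEQUENCE.get(so_term)
```

How the pipeline resolves annotations that carry several SO terms at once is not covered by this sketch.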
Pfam domains
We have updated the version of Pfam to v35.0.
Oncotree
As tumor type ontology for hierarchical model implementation across tumor types we now use the oncotree version boostdm2023 (see Downloads) which is based on the 2021 version of the MSKCC oncotree.
Bug Fixing and technical updates
Fixed reference genome incompatibility
Recently we found a misspecification in the codebase that prevented the randomization of passenger mutations as intended. As described in the main documentation, our method resorts to trinucleotide context probabilities derived from the observed frequencies in the cohort mutational data. With these probabilities we can then draw random mutations to create the set of passenger mutations used as a negative set in the supervised learning step.
Because of the unintended usage of a reference genome inconsistent with the rest of the pipeline in the step that converted genomic coordinates into reference triplets prior to randomization, there was a mismatch between sites and reference triplets. Consequently, the nondriver training set was drawn in a trinucleotide-agnostic way.
In the current version the hg38 genome build is consistently used throughout the entire pipeline.
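To make the role of the reference triplets concrete, here is a minimal sketch of drawing passenger mutations according to trinucleotide context probabilities. The data layout, the function name and the choice of sampling with replacement are assumptions of the sketch, not a description of the pipeline code:

```python
import random


def draw_passengers(sites, context_probs, n, seed=None):
    """Draw n random mutations weighted by trinucleotide context probability.

    sites         -- list of (position, reference_triplet, alt) candidates
    context_probs -- dict mapping (reference_triplet, alt) to a probability
    """
    rng = random.Random(seed)
    # Each candidate is weighted by the probability of its context; if the
    # triplets do not match the sites, these weights become meaningless.
    weights = [context_probs.get((triplet, alt), 0.0) for _, triplet, alt in sites]
    if sum(weights) <= 0:
        raise ValueError("no candidate site has a positive context probability")
    return rng.choices(sites, weights=weights, k=n)
```

The sketch shows why the bug mattered: the weights are looked up through the reference triplet of each site, so triplets taken from an inconsistent reference genome detach the sampling from the intended mutational profile.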
IntOGen integration
The output of IntOGen now generates a unified data environment for the boostDM pipeline to be run from the new outputs ...
boostDM clonal hematopoiesis 2024
BoostDM-CH is a method to score all possible single base substitutions in clonal hematopoiesis (CH) driver genes for their potential to drive CH. The method's rationale and analysis results thereof are described in this study:
Identification of Clonal Hematopoiesis Driver Mutations through In Silico Saturation Mutagenesis
Santiago Demajo, Joan Enric Ramis-Zaldivar, Ferran Muiños, Miguel L Grau, Maria Andrianova, Nuria Lopez-Bigas, Abel Gonzalez-Perez
URL: https://doi.org/10.1158/2159-8290.CD-23-1416
The method is based on two previous projects: IntOGen-CH and boostDM.
The IntOGen-CH pipeline provides a compendium of CH driver genes and mutational features, obtained by running the IntOGen pipeline on a catalog of high-VAF somatic mutations called in normal blood from tumor samples by reverse variant calling, as described in this study:
Discovering the drivers of clonal hematopoiesis
Oriol Pich, Iker Reyes-Salazar, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://doi.org/10.1038/s41467-022-31878-0
boostDM is a methodology to annotate mutations in cancer genes for their potential to drive tumorigenesis, which has been described in another study:
In silico saturation mutagenesis of cancer genes
Ferran Muiños, Francisco Martinez-Jimenez, Oriol Pich, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41586-021-03771-1
boostDM cancer 2023
boostDM cancer v1.1.0-alpha
This is a boostDM version for driver mutation identification in tumors that is compatible with the output produced by intOGen v2023.05.31. Several improvements and new features have been included, including changes in model implementation, reference datasets and bugfixes. Herein we describe the new features of this version.
Model implementation
Uniform cluster features across cohorts of same tumor type category
Although mutational cluster features (OncodriveCLUSTL, HotMAPS and smRegions) are generated on a cohort-by-cohort basis, the boostDM models are specified by tumor-type, meaning that an aggregation or consensus step has to be taken in order to annotate the mutations in the training and test sets.
In the previous release, mutations with the same genomic identity (same genomic coordinates and alternate allele) would be annotated with different mutational cluster feature values depending on which cohort they were originally reported in, thereby yielding instances of mutations with the same identity but different cluster feature values across samples of the same tumor type.
In the current version we have changed this criterion. Now if a mutation is mapped by a clustering feature in some cohort belonging to a given tumor type, all the instances of the mutation across the tumor type will get the same cluster feature value in the training and prediction corresponding to this tumor type.
We have developed OpenVariant, a new Python package to parse the input files. This new method annotates all SNVs and indels from the mutation data files and transforms the data into a standardized format, by reading a YAML file prepared by the user.
Importantly, the previous bgparsers tool had a bug in the annotation of indels, leading to some indels not being processed. This bug has been corrected in OpenVariant. As a consequence, there are, overall, more mutations annotated per sample.
One single feature per cluster type taking three ordinal values
Each cluster method (OncodriveCLUSTL, HotMAPS and smRegions) is now associated with a single ordinal feature with three levels:
2 - the mutation is associated with a cluster in the same tumor type
1 - the mutation is not associated with a cluster in the same tumor type, but is associated with one in another tumor type
0 - the mutation is not associated with a cluster in any case
Therefore we no longer keep the distinction between cat_1 and cat_2 dichotomous cluster features.
Moreover, we have decided not to use the OncodriveCLUSTL_SCORE numerical feature, because it brought little gain in predictive ability while adding complexity to the interpretation of feature contributions at prediction.
Unweighted consensus of base classifiers
The forecast aggregator combining the 50 base classifiers per model is now unweighted, meaning that the base classifiers are no longer favoured or penalized based on their cross-validation performance. Particularly in those cases where the cross-validation test set remaining after removing repeated mutation instances is very small, the F-score50 of a single base classifier can be biased and is not sufficient to determine whether the behaviour of that base model should be promoted in the forecast aggregator.
Model selection criteria
Our model selection strategy is conducted in two steps:
Models are only trained if there are at least 30 mutations on average in the training splits. Note that the training splits are label-balanced sets comprising 70% of the total mutation set available for model training. The size of the mutation set available for training does in turn depend on the dN/dS excess per consequence type inferred at each cohort.
We apply a composite rule that requires:
- mean F-score50 >= 0.8
- minimum number of observed mutations per each tier defined by the discovery index
We have updated step 2 by changing the discovery index tiers and the minimum number of observed mutations for each tier. The new step 2 would read in Python code as follows:
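# Lower bounds of the discovery index tiers and, for each tier, the
# minimum number of observed mutations required.
DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8

def meet_condition(fscore, discovery, n_muts):
    # The mean F-score50 must pass the threshold and at least one
    # (discovery index, number of mutations) tier must be satisfied.
    if fscore >= FSCORE_THRESHOLD:
        for discovery_thresh, n_muts_thresh in zip(DISCOVERY_TIERS, MUTATION_TIERS):
            if (discovery >= discovery_thresh) and (n_muts >= n_muts_thresh):
                return True
    return False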
The rationale behind this change is that in providing broader discovery index tiers we are making sure that we provide more qualitatively distinct discovery classes. In particular, as per our undersampling study it is very difficult to assert qualitative differences between discovery index values within the (0, 0.5) range, making a case subdivision in this range not well justified.
Pan-gene models are no longer calculated
We have discontinued the training of meta-gene models based on the dichotomous mode-of-action classification into oncogenes (Act) and tumor suppressor (LoF) genes. The pipeline now only renders models that are gene-specific. Consequently the usage of the oncogenic mode-of-action or Role of the gene as a feature has been discontinued, too.
Data updates
Variant Effect Predictor
The new release uses ENSEMBL Variant Effect Predictor (VEP) v.101 annotations instead of ENSEMBL VEP v.92.
Genomic regions comprised in the models
This release uses a new definition of region both at training and at prediction.
The new regions of interest include exons as well as intronic splicing regions, both mapped to the canonical transcripts of ENSEMBL v.101. Intronic splicing regions are defined as the intronic sites within 25 bp of the pre-mRNA exon-intron junctions determined by the canonical transcript.
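The definition can be illustrated with a small membership test. It assumes exons of a single transcript given as 1-based inclusive (start, end) pairs, and treats only the gaps between consecutive exons as introns; the function name and return labels are hypothetical:

```python
SPLICE_WINDOW = 25  # maximum intronic distance from an exon-intron junction


def classify_site(pos, exons, window=SPLICE_WINDOW):
    """Return 'exon', 'intronic_splicing' or None for a genomic position."""
    exons = sorted(exons)
    for start, end in exons:
        if start <= pos <= end:
            return "exon"
    # Introns are the gaps between consecutive exons of the transcript.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < pos < right_start:
            near_junction = (pos - left_end <= window) or (right_start - pos <= window)
            return "intronic_splicing" if near_junction else None
    return None  # outside the transcript body
```

With exons (100, 200) and (400, 500), position 225 is the last intronic site included next to the first exon, while position 226 falls outside the regions of interest.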
Updated consequence type definitions
We have extended the set of sequence ontology (SO) terms that map to the consequence classification we employ in the pipeline, which consists of four consequence types: “missense”, “nonsense”, “splicing” and “synonymous”.
Nonsense mutations now include the SO terms:
stop_gained
stop_lost
start_lost
Splicing mutations now include the SO terms:
splice_donor_variant
splice_acceptor_variant
splice_region_variant
intron_variant
In the regions we analyze, intronic mutations can only be found in the intronic splicing regions described in the previous section.
Technical updates and fixes
Fixed reference genome build incompatibility
Recently we found a misspecification in the code base that prevented the randomization of passenger mutations as intended. As described in the main documentation, our method resorts to trinucleotide context probabilities derived from the observed frequencies in the cohort mutational data. With these probabilities we can then draw random mutations to create the set of passenger mutations used as the negative set in the supervised learning step conducted with XGBoost.
Because incompatible genome builds (hg38, hg19) were used in the step that converted genomic coordinates into reference triplets, the canonical transcript positions of some genes were not matched correctly to their respective reference trinucleotides. Consequently, for some genes the negative set of mutations was drawn following a mutational profile which in some cases might differ significantly from the empirical one informed by the cohort mutational data.
In the current version the hg38 genome build is consistently used throughout the entire pipeline.
IntOGen integration
The output of IntOGen now generates a unified data environment for the boostDM pipeline to be run from the new outputs of IntOGen with little preprocessing required.