Automated cell type annotation and exploration of single cell signalling dynamics using mass cytometry and machine learning
Mass cytometry by time-of-flight (CyTOF) is an emerging technology allowing for in-depth characterisation of cellular heterogeneity in cancer and other diseases. However, computational identification of cell populations from CyTOF data and utilisation of single cell data for biomarker discovery face several technical limitations, and although some computational approaches are available, high-dimensional analyses of single cell data remain quite demanding. Here, we deploy a bioinformatics framework that tackles two fundamental problems in CyTOF analyses, namely: a) automated annotation of cell populations guided by a reference dataset, and b) systematic utilisation of single cell data for more effective patient stratification. By applying this framework to several publicly available datasets, we demonstrate that the Scaffold map approach surpasses state-of-the-art supervised and semi-supervised approaches for automated cell type annotation. Additionally, a case study focusing on a cohort of leukemia patients reported salient interactions between signalling proteins that are sufficient to predict short-term survival at time of diagnosis using the XGBoost algorithm.
Here we provide all code and datasets required to reproduce the analysis presented in the associated publication:
Title: Automated cell type annotation and exploration of single cell signalling dynamics using mass cytometry
Journal: The paper is under major revision at the journal iScience
Published: pre-print available at bioRxiv https://doi.org/10.1101/2022.08.13.503587
The framework is presented as a set of R Markdown files, each implementing one part of the analysis:
- Part1.Rmd : benchmarking different supervised and semi-supervised approaches for cell type annotation
- Part2.Rmd : using the Scaffold map approach to phenotype data from the leukemia cohort and healthy controls, and facilitating comparisons using statistical approaches
- Part3_DREMI.Rmd : an in-house implementation of DREMI to generate features for ML modelling
- Part3_ML_modelling.Rmd : implementing different classification techniques to predict patient survival with ML
- Part4.Rmd : feature importance using the XGBoost algorithm, focusing on the leukemia case study cohort
Important note for users
Please change all paths in the R Markdown files to match your own file system. Because we are not allowed to share patients' personal information, some code lines read external files that contain personal information and are not provided. Please skip these parts and move on to the next subsections, where we provide already processed, non-identifiable data. We provide links to all of our analysed data and results, which can be downloaded in the rda format. For more specific information, please contact the authors.
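One possible way to handle the path changes is to define a single base directory and build every path from it. The sketch below uses base R only. The folder `data_dir`, the helper `load_rda`, and the file name `processed_data.rda` are hypothetical names chosen for illustration and do not come from the repository.

```r
# Hypothetical base directory: replace with the folder that holds your downloads.
data_dir <- file.path("~", "cytof_analysis", "data")

# Hypothetical helper: load an .rda file into its own environment and return
# the stored objects as a named list, so existing variables are not overwritten.
load_rda <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  env <- new.env()
  loaded_names <- load(path, envir = env)
  mget(loaded_names, envir = env)
}

# Example call (placeholder file name, not a file shipped with the repository):
# objects <- load_rda(file.path(data_dir, "processed_data.rda"))
# names(objects)
```

Returning the objects as a list makes it easy to check what an rda file contains before using it in the later analysis steps.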
Below we provide the sources of the datasets used in the study:
- Reference dataset from Triana et al.: file named Healthy.rds from https://doi.org/10.6084/m9.figshare.13397651.v4
- Datasets named AML_benchmark and BMMC_benchmark: the datasets are described in this publication https://doi.org/10.1002/cyto.a.23738 and can be downloaded from http://flowrepository.org/id/FR-FCM-ZYTT
- Dataset named PANORAMA_benchmark: we refer to the complete dataset called Samusik_all_SE from the R package HDCytoData, available at https://rdrr.io/github/lmweber/HDCytoData/
- Dataset used for the leukemia case study: the data are described in the publication titled "Early response evaluation by single cell signaling profiling in acute myeloid leukemia", doi: 10.1038/s41467-022-35624-4. Please refer to the following repository to download the raw fcs files: http://flowrepository.org/id/RvFr0LLv9McDJ89jgK50G4lwnfDFRTrcMelxYgnSIcE2Cymrpf2qh2NaWybtWDNH
- All other datasets and results from our re-analysis are available at: https://zenodo.org/records/10984478
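As an illustration of how two of the sources above might be read into R, here is a minimal sketch. It assumes that Healthy.rds has already been downloaded manually into a local folder and that the Bioconductor package HDCytoData is installed. The helper names `read_reference` and `load_panorama` are hypothetical, and the class of the object stored in Healthy.rds is not stated here, so inspect it after reading.

```r
# Hypothetical helper: read the reference dataset from a local folder.
# Assumes Healthy.rds was downloaded beforehand from the figshare link above.
read_reference <- function(data_dir) {
  path <- file.path(data_dir, "Healthy.rds")
  if (!file.exists(path)) {
    stop("Healthy.rds not found in: ", data_dir, call. = FALSE)
  }
  readRDS(path)
}

# Hypothetical helper: fetch the Samusik_all dataset as a SummarizedExperiment.
# Assumes HDCytoData is installed; the call downloads data on first use.
load_panorama <- function() {
  if (!requireNamespace("HDCytoData", quietly = TRUE)) {
    stop("Please install the HDCytoData package from Bioconductor first.",
         call. = FALSE)
  }
  HDCytoData::Samusik_all_SE()
}

# Example calls (the folder name is a placeholder):
# reference <- read_reference("path/to/downloads")
# class(reference)
# panorama <- load_panorama()
```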
16-Apr-2024 : Revised methodology
16-Mar-2022 : Beta version 1
Comments and bug reports are welcome. Please email Dimitrios Kleftogiannis ([email protected]).
We are also interested to know how you have used our source code, including any improvements that you have implemented.
You are free to modify, extend or distribute our source code, as long as our copyright notice remains unchanged and included in its entirety.
This project is licensed under the MIT License.
Copyright 2022 Department of Informatics, University of Bergen (UiB) and the Centre of Cancer Biomarkers (CCBIO), Norway
You may only use the source code in this repository in compliance with the license provided in this repository. For more details, please refer to the file named "LICENSE.md".