There is significant interest in targeting disease-causing proteins with small molecule inhibitors to restore healthy cellular states. The ability to accurately predict the binding affinity of small molecules to a protein target in silico enables the rapid identification of candidate inhibitors and facilitates the optimization of on-target potency. In this work, we present T-ALPHA, a novel deep learning model that enhances protein-ligand binding affinity prediction by integrating multimodal feature representations within a hierarchical transformer framework to capture information critical to accurately predicting binding affinity. T-ALPHA outperforms all existing models reported in the literature on multiple benchmarks designed to evaluate protein-ligand binding affinity scoring functions. Remarkably, T-ALPHA maintains state-of-the-art performance when utilizing predicted structures rather than crystal structures, a powerful capability in real-world drug discovery applications where experimentally determined structures are often unavailable or incomplete. Additionally, we present an uncertainty-aware self-learning method for protein-specific alignment that does not require additional experimental data, and demonstrate that it improves T-ALPHA’s ability to rank compounds by binding affinity to biologically significant targets such as the SARS-CoV-2 main protease and the epidermal growth factor receptor. To facilitate implementation of T-ALPHA and reproducibility of all results presented in this paper, we have made all of our software available at this repository.
For full details of T-ALPHA, please refer to the preprint on bioRxiv.
- Installation
- Accessing Data Files
- Running the Scripts
- Training the Model
- Performing Inference
- Monte Carlo Dropout and Semi-Supervised Fine-Tuning
- On-the-Fly Inference Notebook
- Contact
git clone https://github.com/gregory-kyro/T-ALPHA.git
cd T-ALPHA
python -m venv venv
source venv/bin/activate # On Linux/macOS
venv\Scripts\activate # On Windows (run this instead of the line above)
pip install -r requirements.txt
The pre-trained model parameters and test datasets are hosted on Zenodo for easy access.
- Model Parameters: model_parameters.tar.gz
- Test Files: test_files.tar.gz
After downloading, place both archives in the data/ directory:
T-ALPHA/
├── data/
│ ├── model_parameters.tar.gz
│ ├── test_files.tar.gz
Then extract both archives:
tar -xvzf test_files.tar.gz
tar -xvzf model_parameters.tar.gz
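Where the `tar` command is unavailable (for example on some Windows setups), the same extraction can be sketched in Python with the standard-library `tarfile` module. This is a minimal sketch, not part of the repository: the helper name `extract_archives` is hypothetical, and it assumes the archives sit in `data/` as shown in the layout above.

```python
import tarfile
from pathlib import Path


def extract_archives(data_dir, names=("test_files.tar.gz", "model_parameters.tar.gz")):
    """Extract each named archive found in data_dir into data_dir.

    Hypothetical helper; returns the names of the archives that were extracted.
    """
    data_dir = Path(data_dir)
    extracted = []
    for name in names:
        archive = data_dir / name
        if not archive.is_file():
            continue  # skip archives that have not been downloaded yet
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=data_dir)
        extracted.append(name)
    return extracted


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    print(extract_archives("data"))
```

Only extract archives obtained from a source you trust, since `extractall` writes whatever paths the archive contains.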
This repository provides three primary scripts to train the model, perform inference, and run Monte Carlo Dropout for uncertainty estimation.
To train the model from scratch, run:
python scripts/train.py --train_set <path_to_train_hdf> \
--val_set <path_to_val_hdf> \
--checkpoint_dir <path_to_save_checkpoints> \
--save_dir <path_to_logs> \
--save_name <experiment_name>
To evaluate the model on test data and compute key metrics, run:
python scripts/inference.py --ckpt_path <path_to_checkpoint> \
--test_set_path <path_to_test_hdf>
This will produce:
- A CSV file with predictions
- A scatter plot of predictions vs. true values
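The predictions CSV can also be re-scored independently. The sketch below computes Pearson r, RMSE, and MAE, which are commonly reported for binding affinity scoring functions. The column names `true` and `pred` and the helper name `affinity_metrics` are assumptions made for illustration; check the header actually written by `scripts/inference.py` and adjust accordingly.

```python
import csv
import math


def affinity_metrics(csv_path, true_col="true", pred_col="pred"):
    """Compute Pearson r, RMSE, and MAE from a predictions CSV.

    The column names are assumptions, not the script's documented output format.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no rows found in " + str(csv_path))

    y = [float(row[true_col]) for row in rows]
    p = [float(row[pred_col]) for row in rows]
    n = len(y)

    mean_y = sum(y) / n
    mean_p = sum(p) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, p))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in p)

    # Pearson r is undefined when either column is constant.
    if var_y == 0 or var_p == 0:
        pearson = float("nan")
    else:
        pearson = cov / math.sqrt(var_y * var_p)

    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / n)
    mae = sum(abs(a - b) for a, b in zip(y, p)) / n
    return {"pearson_r": pearson, "rmse": rmse, "mae": mae}
```

For example, `affinity_metrics("predictions.csv")` would return a dictionary with the three metrics, where `predictions.csv` is a placeholder for the file produced by the inference script.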
To perform uncertainty estimation with Monte Carlo Dropout, run:
python scripts/mc_dropout.py --checkpoint <path_to_checkpoint.ckpt> \
--test_set <path_to_test_dataset> \
--output_results <path_to_save_results.csv> \
--output_weighted <path_to_save_weighted_results.csv>
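For readers unfamiliar with the technique, Monte Carlo Dropout keeps dropout active at inference time, runs several stochastic forward passes on the same input, and uses the spread of the outputs as an uncertainty estimate. The sketch below illustrates the idea with a toy linear model in pure Python. It is not the implementation in `scripts/mc_dropout.py`; the function names, the model, and the default values are all hypothetical.

```python
import random
import statistics


def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic pass: randomly zero inputs and rescale the survivors."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += (x / keep) * w  # inverted-dropout rescaling
    return total


def mc_dropout_predict(features, weights, drop_prob=0.2, n_passes=100, seed=0):
    """Return (mean, standard deviation) over n_passes stochastic passes.

    The toy linear model stands in for a real network purely for illustration.
    """
    rng = random.Random(seed)
    samples = [
        dropout_forward(features, weights, drop_prob, rng)
        for _ in range(n_passes)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    mean, std = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print("prediction:", mean, "uncertainty:", std)
```

In PyTorch, the analogous step is typically to leave the `torch.nn.Dropout` modules in training mode while the rest of the model is in evaluation mode. A larger standard deviation indicates a less certain prediction.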
The T-ALPHA Inference Notebook enables real-time protein-ligand binding affinity prediction. It supports:
- Sequence-Based Inputs: Protein sequence and ligand SMILES.
- Structural Inputs: Protein PDB and ligand SDF files.
- PDB ID-Based Inputs: Fetch protein sequence automatically using a PDB ID.
With minimal setup, you can quickly obtain pKd predictions from the notebook.
For questions, issues, or collaborations, feel free to reach out:
- Name: Gregory W. Kyro
- Email: [email protected]