FlashFold: a command-line tool for faster protein and protein complex structure prediction

Introduction

Proteins are vital to cellular functions and their tertiary structure is key to understanding their biological roles. FlashFold predicts the structure of proteins and complexes from amino acid sequences, using AlphaFold2 models with a focus on speed. It also provides a table of quality metrics for the predicted structures.

License: FlashFold is licensed under the MIT license
Language: Python3 ( > 3.9 )
OS: Linux, macOS
OS-level Dependencies:
- LocalColabFold
- HMMER Suite

Installation

FlashFold can be installed on Linux and macOS.

🚨 Important: If you are using macOS, please note that the structure prediction is 5-10 times slower compared to Linux with a GPU.
This is due to the absence of Nvidia GPU/CUDA drivers on macOS.

If you plan to use a GPU, check the following settings before installation:

CUDA 12.1 or later (version 12.4 is recommended) and cuDNN 9 are required.

You can check the CUDA version using the following command:
```
nvcc --version
```
DO N🚫T use nvidia-smi to check the version. ❌
✔️ See NVIDIA CUDA Installation Guide for Linux if you haven't installed it.
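
The CUDA check above can also be scripted. The following is a minimal sketch, not part of FlashFold: it assumes `nvcc --version` prints a line containing `release X.Y`, and the helper names `parse_cuda_release` and `cuda_is_supported` are hypothetical. It reads the toolkit version from `nvcc` because `nvidia-smi` reports the highest CUDA version the driver supports, not the installed toolkit.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_CUDA = (12, 1)  # minimum CUDA version stated in this README


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the 'release X.Y' token, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_is_supported(nvcc_output: str) -> bool:
    """True if the reported toolkit release is at least MIN_CUDA."""
    release = parse_cuda_release(nvcc_output)
    return release is not None and release >= MIN_CUDA


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH")
    else:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        print(parse_cuda_release(result.stdout),
              "supported:", cuda_is_supported(result.stdout))
```

Tuples are compared element by element, so `(12, 4) >= (12, 1)` holds while `(11, 8) >= (12, 1)` does not.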

GNU compiler version **9.0 or later** is required.

You can check the GNU compiler version using the following command:
```
gcc --version
```
💡 If the version is 8.5.0 or older (e.g. on CentOS 7 or Rocky/AlmaLinux 8), install a newer version and add it to your PATH.
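
The compiler check can be automated in the same spirit. This is a hedged sketch, not FlashFold code: it assumes the first line of `gcc --version` carries the version number outside any parenthesised vendor string, and `parse_gcc_version` and `gcc_is_supported` are hypothetical helper names. On macOS, `gcc` is often an alias for Apple clang, in which case the number parsed is the clang version.

```python
import re
from typing import Optional, Tuple

MIN_GCC = (9, 0)  # minimum GNU compiler version stated in this README


def parse_gcc_version(first_line: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the first line of `gcc --version` output."""
    # Drop vendor strings such as "(GCC)" or "(Red Hat 8.5.0-4)" so that
    # version-like text inside parentheses is not picked up by mistake.
    cleaned = re.sub(r"\([^)]*\)", " ", first_line)
    match = re.search(r"\b(\d+)\.(\d+)(?:\.\d+)?\b", cleaned)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def gcc_is_supported(first_line: str) -> bool:
    """True if the parsed version is at least MIN_GCC."""
    version = parse_gcc_version(first_line)
    return version is not None and version >= MIN_GCC


if __name__ == "__main__":
    old = "gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)"
    new = "gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
    for line in (old, new):
        print(parse_gcc_version(line), "supported:", gcc_is_supported(line))
```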

📌 FlashFold can be installed using the following steps:

✔ Step 1: Install Conda (Skip this step if conda is already installed)

Conda is a package manager that helps to install and manage dependencies. It can be downloaded and installed from:

✔ Step 2: Clone the git repository

git clone https://github.com/chayan7/flashfold.git
cd flashfold

✔ Step 3: Install dependencies under conda environment

FlashFold internally uses LocalColabFold (local version of ColabFold) for structure prediction. The installation instructions for LocalColabFold can be found here.

To streamline the installation process for both Linux and macOS users, FlashFold provides a convenient installation script that sets up the required dependencies within a conda environment named flashfold.

bash install.sh              # Install dependencies
conda activate flashfold     # Activate the environment

✔ Step 4: Install the package

poetry install

✔ Step 5: Run the tests

poetry run pytest

or,

pytest

Workflow

FlashFold uses amino acid sequences to predict the structure of proteins and protein complexes. In order to achieve this, FlashFold uses the following steps:

Sequence Alignment: FlashFold uses jackhmmer to generate a multiple sequence alignment (MSA) for the input sequence. FlashFold reduces the MSA generation time significantly by using a compact database.
Structure Prediction: The MSA is then formatted and used as an input for colabfold_batch to predict the structure.
Model Refinement (optional): Based on user input, the predicted structure is refined using OpenMM and OpenStructure.
Quality Metrics: FlashFold provides a table of quality metrics for the predicted structures. For protein complexes, it uses the Predicted DockQ version 2 (pDockQ2) script to calculate the quality of each interface.

Application

Database

To predict the structure of proteins and protein complexes, FlashFold requires a sequence database. The database is searched for sequences homologous to the input sequence in order to generate a multiple sequence alignment (MSA). FlashFold provides the following options:
Download in-built database

FlashFold provides three in-built databases, which can be downloaded using the following command:
```
flashfold download_db -i /path/to/database.json -o /path/to/downloaded_db/
```
The database.json file can be found here. To skip a database, remove its name and download link from the JSON file.
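
Editing the JSON file by hand works, but the same trimming can be scripted. The sketch below is illustrative only and rests on an assumption: that database.json is a flat mapping from database name to download link. The real schema is not shown in this README, so check the file before relying on this. The names `db_a` and `db_b`, the example.org links, and the helper `keep_databases` are all hypothetical.

```python
import json
from typing import Dict, Iterable


def keep_databases(config: Dict[str, str], wanted: Iterable[str]) -> Dict[str, str]:
    """Return a copy of config containing only the wanted database names.

    Assumes config maps database name -> download link (an assumption,
    not the documented schema of database.json).
    """
    wanted_set = set(wanted)
    unknown = wanted_set - set(config)
    if unknown:
        raise KeyError(f"not present in config: {sorted(unknown)}")
    return {name: link for name, link in config.items() if name in wanted_set}


if __name__ == "__main__":
    # Hypothetical content standing in for database.json.
    config = {
        "db_a": "https://example.org/db_a.tar.gz",
        "db_b": "https://example.org/db_b.tar.gz",
    }
    trimmed = keep_databases(config, ["db_b"])
    print(json.dumps(trimmed, indent=2))
```

Writing the trimmed mapping to a new file and passing that file to `-i` leaves the original database.json untouched.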
Create custom database

FlashFold allows users to create a custom database using the `create_db` subcommand. In this case, the input should be assembled genome data in [GenBank](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/file-formats/annotation-files/about-ncbi-gbff/) format.
To download genome data from NCBI, FlashFold provides a convenient subcommand, `ncbi_data`, which can be used as follows:
- For example, to download all GenBank files of Pseudomonas aeruginosa from NCBI RefSeq, use the following command:
```
flashfold ncbi_data  -n "Pseudomonas aeruginosa" -f gbff -s refseq -o /path/to/genbank_file_dir/ 
```
- Alternatively, users can download the GenBank files of particular genomes of interest from NCBI using assembly accession numbers as input (see example).
```
flashfold ncbi_data  -i /path/to/assembly_accessions.txt -f gbff -o /path/to/genbank_file_dir/
```
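
Before passing an accession list to `ncbi_data`, it can help to validate it. The sketch below is an illustration, not FlashFold code. It relies on the NCBI assembly accession pattern (`GCA_` or `GCF_`, nine digits, a dot, and a version number) and assumes, without confirmation from this README, that the input file holds one accession per line. The helper name `clean_accessions` and the accessions shown are for illustration.

```python
import re
from typing import Iterable, List

# NCBI assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq),
# nine digits, then ".<version>".
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def clean_accessions(raw_lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blanks and duplicates, and reject malformed entries."""
    seen = set()
    cleaned = []
    for raw in raw_lines:
        accession = raw.strip()
        if not accession:
            continue  # ignore empty lines
        if ACCESSION_PATTERN.match(accession) is None:
            raise ValueError(f"not an assembly accession: {accession!r}")
        if accession not in seen:
            seen.add(accession)
            cleaned.append(accession)
    return cleaned


if __name__ == "__main__":
    lines = ["GCF_000006765.1", " GCF_000006765.1 ", "", "GCA_000014625.1"]
    # Assumed file layout: one accession per line.
    print("\n".join(clean_accessions(lines)))
```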
Once the genbank files are downloaded, the custom database can be created using the create_db subcommand as follows:
```
flashfold create_db -p /path/to/genbank_file_dir/ -o /path/to/custom_db/
```
Extend database

FlashFold allows users to update or extend an existing database with new information.
- To extend or update database_1 with the information from database_2, use the `extend_db` subcommand:
```
flashfold extend_db -m /path/to/database_1 -n /path/to/database_2
```
Note that only database_1 is updated with the new information from database_2.
- It is also possible to extend the current database directly with a new collection of GenBank files, using the `extend_db` subcommand:
```
flashfold extend_db -m /path/to/database_to_be_extended -g /path/to/genbank_file_dir/
```
Protein structure prediction

FlashFold provides a subcommand fold to predict the structure of proteins and protein complexes. See details below:
Input file preparation
- FlashFold takes amino acid sequences in FASTA format as input. It can also take multiple FASTA files as input when --batch is set. Input files should follow these guidelines:
  - Keep the file name short and readable, and avoid special characters in it.
  - When --batch is set, the file name is used as the name of a directory that stores the results under the user-provided output directory. Any special character in the file name other than "." or "_" is replaced with "_".
  - The file extension should be .fasta.
- Additionally, FlashFold can take an A3M file as input. The A3M file should preferably be generated by FlashFold itself using the --only_msa option, but a user-customised A3M file can also be used. As with FASTA input, the --batch option also applies to A3M input.
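
To preview how the file-name rule above might affect a given name, here is a minimal sketch. It is not FlashFold's implementation: it assumes that "special characters" means anything other than ASCII letters, digits, "." and "_", and the helper name `sanitize_file_name` is hypothetical.

```python
import re


def sanitize_file_name(name: str) -> str:
    """Replace every character except ASCII letters, digits, '.' and '_' with '_'.

    The exact character set FlashFold treats as special is an assumption here.
    """
    return re.sub(r"[^A-Za-z0-9._]", "_", name)


if __name__ == "__main__":
    for name in ("query_1.fasta", "my-protein (v2).fasta"):
        print(name, "->", sanitize_file_name(name))
```

Under this assumption, `query_1.fasta` is left unchanged, while hyphens, spaces, and parentheses are each replaced by an underscore.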
A few examples of FASTA input are shown below:

Monomer
```
>seq_1
FHWDREGQADDSSSCWLRVASGWAGRNYGAIAIPRVGMEVLVTFLEGDPDQPLVTGCLFH
REHPVPYELPGHKTRSVFKSLSSPGGGGYNELRIEDRKGQEQIFVHAQR
```
Protein complex
- Homo-dimer
```
>seq_1
FHWDREGQADDSSSCWLRVASGWAGRNYGAIAIPRVGMEVLVTFLEGDPDQPLVTGCLFH
REHPVPYELPGHKTRSVFKSLSSPGGGGYNELRIEDRKGQEQIFVHAQR
>seq_1
FHWDREGQADDSSSCWLRVASGWAGRNYGAIAIPRVGMEVLVTFLEGDPDQPLVTGCLFH
REHPVPYELPGHKTRSVFKSLSSPGGGGYNELRIEDRKGQEQIFVHAQR
```
- Hetero-dimer
```
>seq_1
FHWDREGQADDSSSCWLRVASGWAGRNYGAIAIPRVGMEVLVTFLEGDPDQPLVTGCLFH
REHPVPYELPGHKTRSVFKSLSSPGGGGYNELRIEDRKGQEQIFVHAQR
>seq_2
MTSWTLVTLVLLIILAAIRPEQLQVVAYKLVLVTLGAVAGYWIDRSLFPYVARPHECSAN
LVVVGAWLRRGLIVLACILGLTLGL
```
- Hetero-trimer
```
>seq_1
FHWDREGQADDSSSCWLRVASGWAGRNYGAIAIPRVGMEVLVTFLEGDPDQPLVTGCLFH
REHPVPYELPGHKTRSVFKSLSSPGGGGYNELRIEDRKGQEQIFVHAQR
>seq_2
MTSWTLVTLVLLIILAAIRPEQLQVVAYKLVLVTLGAVAGYWIDRSLFPYVARPHECSAN
LVVVGAWLRRGLIVLACILGLTLGL
>seq_3
MAFQADRFLWFNSSSGQTVAPVSIVGGQMFINTAMIQDGSITNAKIGNVIQSTALGANGE
PLWKLDKAGSLTMNSATSGGFMRQTAEAVKVYDANLVLRVQIGNLDA
```
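
The examples above differ only in how many records the file holds and whether the sequences repeat. The sketch below shows one way to inspect a FASTA file along those lines before submitting it. It is an illustration under stated assumptions, not FlashFold's own detection logic; `parse_fasta` and `describe_assembly` are hypothetical helpers, and the short sequences are toy data.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def describe_assembly(records: List[Tuple[str, str]]) -> str:
    """Label input as monomer, homo-N-mer (all chains identical) or hetero-N-mer."""
    count = len(records)
    if count == 0:
        raise ValueError("no sequences found")
    if count == 1:
        return "monomer"
    distinct = {sequence for _, sequence in records}
    kind = "homo" if len(distinct) == 1 else "hetero"
    return f"{kind}-{count}-mer"


if __name__ == "__main__":
    homo_dimer = ">seq_1\nFHWD\nREGQ\n>seq_1\nFHWD\nREGQ\n"
    print(describe_assembly(parse_fasta(homo_dimer)))
```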
Commands

FlashFold offers the subcommand `fold` to predict the structure of proteins and protein complexes. FlashFold uses different algorithms and models for monomer and multimer prediction. However, the user does not need to specify which to use, because FlashFold detects this automatically from the input sequences.
A few examples are shown below:

Beginner
```
flashfold fold -q /path/to/query.fasta -d /path/to/database/ -o /path/to/output/ -t number_of_threads
```
Moderate
```
flashfold fold -q /path/to/query.fasta -d /path/to/database/ -o /path/to/output/ -t number_of_threads --only_msa   
```
Advanced
```
flashfold fold -q /path/to/query.a3m -o /path/to/output/  
```
Expert
```
flashfold fold -q /path/to/dir/multiple_fasta_files --batch -d /path/to/database/ -o /path/to/output/ -t number_of_threads
```
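
When `fold` is called from a pipeline, the command lines above can be assembled programmatically. The following sketch builds the argument list using only the flags shown in this README (`-q`, `-d`, `-o`, `-t`, `--only_msa`, `--batch`). The wrapper `build_fold_command` is hypothetical and is not part of FlashFold, and the paths are placeholders.

```python
from typing import List, Optional


def build_fold_command(
    query: str,
    output: str,
    database: Optional[str] = None,
    threads: Optional[int] = None,
    only_msa: bool = False,
    batch: bool = False,
) -> List[str]:
    """Assemble a `flashfold fold` argument list from the flags shown above."""
    command = ["flashfold", "fold", "-q", query]
    if batch:
        command.append("--batch")
    if database is not None:
        command += ["-d", database]
    command += ["-o", output]
    if threads is not None:
        command += ["-t", str(threads)]
    if only_msa:
        command.append("--only_msa")
    return command


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(command, check=True) to execute it.
    print(" ".join(build_fold_command(
        "/path/to/query.fasta", "/path/to/output/",
        database="/path/to/database/", threads=8,
    )))
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems when paths contain spaces.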
Summary report generation

FlashFold includes a subcommand summary designed to generate a comprehensive summary report of the predicted structures.
Command
```
flashfold summary -d /path/to/flashfold_output/ -o /path/to/generate/report/
```
Output

Example of the summary report generated by FlashFold is shown below:

Acknowledgements

FlashFold utilizes and/or references the following separate libraries and packages:

Citation

If you use FlashFold in your research, please cite:

Saha, CK., Roghanian, M., Häussler, S., Guy, L. FlashFold: a standalone command-line tool for accelerated protein structure prediction.
bioRxiv. doi: https://doi.org/10.1101/2025.01.23.634437
