Please use the repository tfold-release for the TFold package! (The present repository contains the notebooks that were used in the data analysis for the paper; they are not needed to run TFold.)
#AlphaFold-based pipeline for prediction of peptide-MHC structures.
Please cite as:
Victor Mikhaylov and Arnold J. Levine, "Accurate modeling of peptide-MHC structures with AlphaFold", to appear.
#Download and install
- Download AlphaFold and its parameters. (This pipeline was tested with AlphaFold 2.1.0.) No need to download PDB and the protein databases.
- Clone this repository:
git clone https://github.com/v-mikhaylov/tfold-release.git
Enter the `tfold-release` folder.
- Install the dependencies. With conda, you should be able to create an environment that works for both the TFold pipeline and AlphaFold:
conda env create --file tfold-env.yml
conda activate tfold-env
pip install --upgrade jax==0.2.24 jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
(This environment for running AlphaFold outside of Docker is due to https://github.com/kalininalab/alphafold_non_docker.)
- Download the data file `data.tar.gz` with templates and other information from Zenodo:
https://zenodo.org/record/7700748#.ZAV0sy-B23x
and unpack it into the `tfold-release` folder. This will create a folder `data`.
- Set the paths to a couple of folders in `tfold/config.py` and `tfold_patch/tfold_config.py`.
- That should be it.
#Model pMHCs
- Prepare an input file. An example can be found in `data/examples/sample.csv`. It should be a `.csv` file with a header and with columns `pep` and `MHC allele` or `MHC sequence`.
- The format for MHC alleles is `SpeciesId-Locus*Allele` for class I and `SpeciesId-LocusA*AlleleA/LocusB*AlleleB` for class II. Some examples: `HLA-A*02:01`, `H2-K*d`, `HLA-DRA*01:01/DRB4*01:144`, `H2-IEA*d/IEB*k`.
- For class II, the MHC sequence should contain alpha-chain and beta-chain sequences separated by '/'.
- For more details and options, please see `details.ipynb`.
- Activate the conda environment:
conda activate tfold-env
- Choose an output folder `$working_dir` and run the script as follows:
model_pmhcs.sh $input_file $working_dir [-d YYYY-MM-DD]
Here `[-d YYYY-MM-DD]` is an optional cutoff on the allowed template dates.
- The models will be saved in `$working_dir/outputs`, with a separate folder for each pMHC. There will also be a `summary.csv` file in `$working_dir` with information about the best models (by predicted score).
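The input format described above can also be produced programmatically. The following is a minimal Python sketch and is not part of TFold: the helper names `allele_class` and `write_input_csv` are hypothetical, and the regular expressions are only a loose syntactic check of the two allele formats listed above, not the validation the pipeline itself performs. The header names `pep` and `MHC allele` are taken from the description above; compare with `data/examples/sample.csv` before relying on them.

```python
import csv
import re

# Loose syntactic patterns for the two allele formats described above.
# These are illustrative assumptions, not the checks TFold itself performs.
_FIELD = r"[A-Za-z0-9]+\*[A-Za-z0-9:]+"  # Locus*Allele
CLASS_I = re.compile(rf"^[A-Za-z0-9]+-{_FIELD}$")
CLASS_II = re.compile(rf"^[A-Za-z0-9]+-{_FIELD}/{_FIELD}$")


def allele_class(allele):
    """Return "I" or "II" if the string matches one of the formats, else None."""
    if CLASS_II.match(allele):
        return "II"
    if CLASS_I.match(allele):
        return "I"
    return None


def write_input_csv(path, rows):
    """Write (peptide, allele) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["pep", "MHC allele"])
        for pep, allele in rows:
            # Reject strings that do not look like either allele format.
            if allele_class(allele) is None:
                raise ValueError(f"unrecognized allele format: {allele!r}")
            writer.writerow([pep, allele])
```

For example, `write_input_csv("my_input.csv", [("NLVPMVATV", "HLA-A*02:01")])` writes a one-row input file (the file name and peptide here are purely illustrative), which can then be passed to `model_pmhcs.sh` as `$input_file`.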
#Details
The notebook `details.ipynb` contains some additional details on the pipeline that can be useful, e.g., for splitting the jobs over multiple GPUs. It also contains a description of our cleaned pMHC and TCR structure database and associated tools.
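As a generic illustration of splitting work across GPUs, one simple option is to divide the input file into several smaller input files and run the pipeline on each one separately. The Python sketch below does only the splitting. It is an assumption-laden example, not the procedure described in `details.ipynb`: the helpers `split_rows` and `split_csv` and the `chunk_N.csv` naming are hypothetical, and how each run is pinned to a particular GPU is left to your setup.

```python
import csv


def split_rows(rows, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each receive one additional row.
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def split_csv(input_path, n_chunks, prefix="chunk"):
    """Write one CSV per chunk, repeating the original header; return the paths."""
    with open(input_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    paths = []
    for i, chunk in enumerate(split_rows(rows, n_chunks)):
        path = f"{prefix}_{i}.csv"
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

Each resulting file can be used as a separate `$input_file`, ideally with its own `$working_dir`. If there are fewer rows than chunks, some of the files will contain only the header.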