brineylab/preferential-masking-paper

DOI · Made with Jupyter · License: MIT

Focused learning by antibody language models using preferential masking of non-templated regions

While existing antibody language models (AbLMs) excel at predicting germline residues, they often struggle with mutated and non-templated residues, which concentrate in the complementarity-determining regions (CDRs) and are crucial for determining antigen-binding specificity. Many of these models are trained using a masked language modeling (MLM) objective with uniform masking probabilities; however, antibody recombination is modular in nature, creating relatively distinct regions of high and low complexity (non-templated and templated, respectively). We sought to determine whether and to what extent AbLMs can improve when trained using an alternative masking strategy based on this observation.

We developed a variation on MLM called Preferential Masking, which alters masking probabilities to amplify training signals from the CDR3. We pre-trained two AbLMs using either uniform or preferential masking and observed that the latter improves pre-training efficiency and residue prediction accuracy in the highly variable CDR3. Preferential masking also improves antibody classification by native chain pairing and binding specificity, suggesting improved CDR3 understanding and indicating that non-random, learnable patterns help govern antibody chain pairing. We further show that specificity classification is largely informed by residues in the CDRs, demonstrating that AbLMs learn meaningful patterns that align with immunological understanding.
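For intuition, here is a minimal sketch of what preferential masking could look like: positions inside an annotated CDR3 span are masked with an elevated probability relative to the uniform baseline. The probability values and the `cdr3_span` annotation are illustrative assumptions, not the settings used in the paper; see `AbLM_pretraining.py` for the actual implementation.

```python
import torch

def preferential_mask(seq_len, cdr3_span, base_p=0.15, cdr3_p=0.50):
    """Sample a boolean MLM mask that up-weights CDR3 positions.

    cdr3_span: (start, end) residue indices of the CDR3 (end exclusive).
    base_p and cdr3_p are illustrative values, not the paper's settings.
    """
    probs = torch.full((seq_len,), base_p)
    start, end = cdr3_span
    probs[start:end] = cdr3_p  # amplify the training signal from the CDR3
    return torch.rand(seq_len) < probs

# example: a 120-residue heavy chain with the CDR3 spanning positions 95-110
mask = preferential_mask(120, (95, 110))
```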

The Python scripts and Jupyter Notebooks in this repository contain all code necessary to re-train these AbLMs from scratch and replicate our downstream analyses.

pre-training

Base models can be trained from scratch by running AbLM_pretraining.py with an associated train-config.yaml, as described in the repository's pre-training documentation.

Weights for the pre-trained model checkpoints used in the paper can also be downloaded from Zenodo.
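If the downloaded checkpoints follow the standard Hugging Face layout (an assumption; check the unpacked Zenodo archive), they could be loaded along these lines, where `./ablm-preferential` is a hypothetical local path to an unpacked checkpoint directory:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# hypothetical path to a checkpoint downloaded and unpacked from Zenodo
checkpoint_dir = "./ablm-preferential"

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)
model.eval()  # inference mode for downstream analyses
```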

how should I cite this?

The Preferential Masking paper has been published as a preprint on bioRxiv and can be cited as:

Ng, K., & Briney, B. (2024). Focused learning by antibody language models using preferential masking of non-templated regions (p. 2024.10.23.619908). bioRxiv. https://doi.org/10.1101/2024.10.23.619908

The current version of the datasets used for pre-training and classifier head fine-tuning (v2024.10.31) can be cited as:

Ng, K., & Briney, B. (2024). Focused learning by antibody language models using preferential masking of non-templated regions (v2024.10.31) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14019655
