LC-PLM is a frontier long-context protein language model based on an alternative protein LM architecture, BiMamba-S, built on selective structured state-space models. It is pretrained on UniRef50/90 with a masked language modeling (MLM) objective. For detailed information on the model architecture, training data, and evaluation performance, please refer to the accompanying paper.
You can use LC-PLM to extract embeddings for amino acid residues and protein sequences. It can also be fine-tuned to predict residue- or protein-level properties.
To get started, install the dependencies:

pip install transformers mamba-ssm==2.2.2
We use Git Large File Storage (LFS) to version the model weights. You can obtain the pretrained model and its related files by cloning this repository (make sure Git LFS is installed, e.g., via `git lfs install`):
git clone https://github.com/amazon-science/LC-PLM.git
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("./LC-PLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# Input a protein sequence:
# fun fact: this is [Mambalgin-1](https://www.uniprot.org/uniprotkb/P0DKR6/entry) from Black mamba
sequence = "MKTLLLTLLVVTIVCLDLGYSLKCYQHGKVVTCHRDMKFCYHNTGMPFRNLKLILQGCSSSCSETENNKCCSTDRCNK"
# Tokenize the sequence:
inputs = tokenizer(sequence, return_tensors="pt")
# Inference with LC-PLM on GPU
device = torch.device("cuda:0")
model = model.to(device)
inputs = {key: val.to(device) for key, val in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Retrieve the embeddings
last_hidden_state = outputs.hidden_states[-1]
print(last_hidden_state.shape) # [batch_size, sequence_length, hidden_dim]
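To get a single embedding per protein sequence (rather than per residue), a common approach is to average the residue embeddings. The snippet below is a minimal sketch of mean pooling over the last hidden state; note that special tokens added by the tokenizer are included in the average here, and you may prefer to exclude them.

# Sequence-level embedding via mean pooling over residue embeddings
# (special tokens added by the tokenizer are included in this simple sketch)
mask = inputs["attention_mask"].unsqueeze(-1)      # [batch_size, sequence_length, 1]
summed = (last_hidden_state * mask).sum(dim=1)     # [batch_size, hidden_dim]
protein_embedding = summed / mask.sum(dim=1)       # average over non-padded positions
print(protein_embedding.shape)  # [batch_size, hidden_dim]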
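For fine-tuning on residue- or protein-level properties, a lightweight prediction head can be attached on top of the backbone's hidden states. The class below is a hypothetical sketch, continuing from the code above: the head, the number of labels, and the choice of per-residue prediction are illustrative and not part of the released code.

import torch.nn as nn

# Hypothetical per-residue classification head on top of the LC-PLM backbone
class ResidueClassifier(nn.Module):
    def __init__(self, backbone, hidden_dim, num_labels):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, **inputs):
        outputs = self.backbone(**inputs, output_hidden_states=True)
        residue_states = outputs.hidden_states[-1]  # [batch_size, sequence_length, hidden_dim]
        return self.classifier(residue_states)      # [batch_size, sequence_length, num_labels]

# Example usage: hidden_dim can be read off the embeddings computed above;
# num_labels=3 is an arbitrary placeholder for a downstream task.
clf = ResidueClassifier(model, hidden_dim=last_hidden_state.shape[-1], num_labels=3).to(device)
logits = clf(**inputs)

For protein-level properties, the same idea applies with the head placed on a pooled sequence embedding instead of the per-residue states.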
If you find LC-PLM useful, please cite:

@misc{wang2024longcontextproteinlanguagemodel,
title={Long-context Protein Language Model},
author={Yingheng Wang and Zichen Wang and Gil Sadeh and Luca Zancato and Alessandro Achille and George Karypis and Huzefa Rangwala},
year={2024},
eprint={2411.08909},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
url={https://arxiv.org/abs/2411.08909},
}
See CONTRIBUTING for more information.
This project is licensed under the CC-BY-NC-4.0 License.