Prepare OpenProteinSet #28

Open
NZ99 opened this issue Aug 25, 2023 · 6 comments
@NZ99
Contributor

NZ99 commented Aug 25, 2023

Download and prepare OpenProteinSet on the cluster, while deleting the old version on S3.

Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth.

https://registry.opendata.aws/openfold/

@NZ99 NZ99 converted this from a draft issue Aug 25, 2023
@NZ99 NZ99 assigned NZ99 and unassigned NZ99 Aug 25, 2023
@csjackson0 csjackson0 removed their assignment Aug 25, 2023
@cmvcordova

Has this been looked into? I could take a look at it if someone could help me sanity check it.

@NZ99
Contributor Author

NZ99 commented Sep 7, 2023

I have not yet. Want to collaborate on it, @cmvcordova? I can start pulling the latest version onto the cluster (there is a fairly old one already, but there is no point in using it if that hurts reproducibility), though I'm not 100% clear on what kind of preprocessing is needed.

@cmvcordova

Let's do it! We can probably ping the rest of the team in the Discord channel as we progress, to make sure we're on the right track.

@cmvcordova

Quick update:

We're currently having trouble downloading the dataset on the ingress node. The zipped files total approximately 3.3 TB, which exceeds the per-user limit. After contacting the StabilityAI team, we're changing our approach: we'll download directly to S3 using the spark cluster node instead.
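
For reference, a minimal sketch of what a bucket-to-bucket copy could look like, so that the 3.3 TB never has to land on a user's local disk. It only builds and prints an `aws s3 sync` command. The source bucket name (`s3://openfold/`) is an assumption based on the registry entry linked above, and the destination prefix is hypothetical; neither has been confirmed in this thread. The helper names are made up for illustration.

```python
import shlex
import subprocess

# Assumption: public bucket behind the AWS Open Data registry entry.
SOURCE = "s3://openfold/"
# Assumption: hypothetical destination prefix, not a confirmed path.
DEST = "s3://openbioml/openproteinset/"


def build_sync_command(source: str, dest: str, dry_run: bool = True) -> list[str]:
    """Build the argv for an S3-to-S3 `aws s3 sync`."""
    cmd = ["aws", "s3", "sync", source, dest]
    if dry_run:
        # --dryrun lists the operations without performing them.
        cmd.append("--dryrun")
    return cmd


def run_sync(source: str, dest: str, dry_run: bool = True) -> int:
    """Print and run the sync; requires the AWS CLI and write credentials."""
    cmd = build_sync_command(source, dest, dry_run)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True).returncode
```

Calling `run_sync(SOURCE, DEST)` performs a dry run first; passing `dry_run=False` would start the actual copy. The command relies on the caller's own credentials for the write to the destination bucket.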

@pascalnotin pascalnotin moved this from Todo to In Progress in project-lm-scaling Sep 21, 2023
@pascalnotin
Collaborator

@NZ99 @cmvcordova, I believe this is now completed, based on the latest conversation with Niccolo. Could you please confirm?

@cmvcordova

Confirming OPS is on the cluster and accessible through s3://openbioml/
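
For anyone who wants to sanity check the copy, here is a rough sketch that compares two listings in the format printed by `aws s3 ls --recursive` (date, time, size in bytes, key). The function names, the destination prefix, and the example keys used with it are hypothetical. For a dataset this large the listings will be long, so this is meant as a starting point rather than a finished tool.

```python
def parse_listing(text: str, prefix: str = "") -> dict[str, int]:
    """Map each object key (with `prefix` stripped) to its size in bytes."""
    sizes: dict[str, int] = {}
    for line in text.splitlines():
        # Expected columns: date, time, size, key (the key may contain spaces).
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        _date, _time, size, key = parts
        if prefix and key.startswith(prefix):
            key = key[len(prefix):]
        sizes[key] = int(size)
    return sizes


def diff_listings(src: dict[str, int], dst: dict[str, int]) -> tuple[list[str], list[str]]:
    """Return (keys missing from dst, keys whose sizes differ)."""
    missing = sorted(k for k in src if k not in dst)
    mismatched = sorted(k for k in src if k in dst and src[k] != dst[k])
    return missing, mismatched
```

Two empty lists from `diff_listings` mean every source key exists at the destination with the same size. This does not verify content checksums.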

Projects
Status: In Progress

4 participants