Create cross-validation split on input dataset #18

Open
pascalnotin opened this issue Aug 17, 2023 · 9 comments
Comments

@pascalnotin
Collaborator

pascalnotin commented Aug 17, 2023

  1. Cluster input data at 30% similarity (e.g., Uniref30 from colabfoldDB)

  2. Split sequences from ProteinGym into two subgroups:

     • Group 1: 10/87 sequences fully removed from training data clusters [@pascalnotin to select the subset of 10]
     • Group 2: 77/87 sequences in training data

  3. Assign sequences in Group 1 to test set, and Group 2 to training set

  4. Assign all other clusters at 30% similarity from input data to train-val-test [val and test get 1M clusters each; rest goes to training]
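
The steps above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name `split_clusters` and its parameters are hypothetical, and it assumes that the 1M validation and 1M test clusters are drawn at random from the clusters that contain no ProteinGym sequence, in addition to the Group 1 clusters that are forced into the test set.

```python
import random


def split_clusters(cluster_ids, test_cluster_ids, train_cluster_ids,
                   n_val, n_test, seed=0):
    """Assign every cluster ID to 'train', 'val' or 'test'.

    test_cluster_ids:  clusters holding Group 1 sequences (forced to test)
    train_cluster_ids: clusters holding Group 2 sequences (forced to train)
    n_val, n_test:     number of remaining clusters drawn for val and test
    """
    forced_test = set(test_cluster_ids)
    forced_train = set(train_cluster_ids)
    overlap = forced_test & forced_train
    if overlap:
        raise ValueError(f"clusters in both groups: {sorted(overlap)}")

    assignment = {c: "test" for c in forced_test}
    assignment.update({c: "train" for c in forced_train})

    # Sort before shuffling so the result depends only on the seed.
    remaining = sorted(set(cluster_ids) - forced_test - forced_train)
    if n_val + n_test > len(remaining):
        raise ValueError("not enough clusters left for val and test")
    random.Random(seed).shuffle(remaining)

    for i, cluster in enumerate(remaining):
        if i < n_val:
            assignment[cluster] = "val"
        elif i < n_val + n_test:
            assignment[cluster] = "test"
        else:
            assignment[cluster] = "train"
    return assignment
```

With the numbers from the description, `n_val` and `n_test` would both be 1,000,000. Splitting at the cluster level rather than the sequence level is what keeps sequences above 30% similarity from landing on both sides of the split.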

@pascalnotin pascalnotin converted this from a draft issue Aug 17, 2023
@pascalnotin
Collaborator Author

@jamaliki -- does that look good to you?

@jamaliki
Collaborator

Yes, this looks good to me @pascalnotin. We can use colabfolddb so that if we want to use uniref100 later, we can do so easily.

@pascalnotin
Collaborator Author

@jamaliki - I updated the description based on our latest discussion. Lmk your thoughts!

@jamaliki
Collaborator

jamaliki commented Sep 1, 2023

Makes sense @pascalnotin !

@csjackson0
Contributor

I can help on this if needed! @pascalnotin is there a Group1/Group2 split made up for ProteinGym?

@pascalnotin
Collaborator Author

Thanks @csjackson0 ! I will create a split and share here over the weekend

@pascalnotin
Collaborator Author

@csjackson0 -- apologies for the slight delay here. I think it will be best if the implementation of this issue takes as input two csv files: 'PG_test_sequences.csv' and 'PG_train_sequences.csv', where each file contains one reference sequence per line.

We will be releasing a much larger version of the ProteinGym benchmark in a few weeks, so this will help ensure we can re-run the corresponding script when this is out by just updating the two csv files as needed.

For the time being, let's use the following:

  • All sequences can be found in this mapping file: https://github.com/OATML-Markslab/ProteinGym/blob/main/ProteinGym_reference_file_substitutions.csv
  • The 10 sequences in the 'PG_test_sequences.csv' file should be as follows:
    A4D664_9INFA (Soh et al., 2019)
    A4GRB6_PSEAI (Chen et al., 2020)
    AMIE_PSEAE (Wrenbeck et al., 2017)
    CALM1_HUMAN (Weile et al., 2017)
    DLG4_RAT (McLaughlin Jr et al., 2012)
    GAL4_YEAST (Kitzman et al., 2015)
    PA_I34A1 (Wu et al., 2015)
    Q2N0S5_9HIV1 (Haddox et al., 2018)
    SPG1_STRSG (Olson et al., 2014)
    TPOR_HUMAN (Bridgford et al., 2020)
  • All other sequences would go to 'PG_train_sequences.csv'
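
One way the two files could be produced from the mapping file is sketched below. Several things here are assumptions rather than facts from this thread: the column names `UniProt_ID` and `target_seq`, the reading of "one reference sequence per line" as the amino-acid sequence itself, and the helper names. The demo rows use made-up sequences and a hypothetical training ID.

```python
import csv
import io


def split_reference_rows(rows, test_ids, id_col="UniProt_ID", seq_col="target_seq"):
    """Split mapping-file rows into (test, train) dicts of ID -> sequence.

    Several assays can share one protein, so only the first sequence
    seen for each ID is kept.
    """
    test, train = {}, {}
    for row in rows:
        bucket = test if row[id_col] in test_ids else train
        bucket.setdefault(row[id_col], row[seq_col])
    return test, train


def write_one_per_line(path, sequences):
    """Write one sequence per line, with no header."""
    with open(path, "w") as fh:
        for seq in sequences:
            fh.write(seq + "\n")


# Tiny in-memory stand-in for the mapping file (sequences are made up).
demo = io.StringIO(
    "UniProt_ID,target_seq\n"
    "CALM1_HUMAN,MADQLTEE\n"
    "GAL4_YEAST,MKLLSSIE\n"
    "SOME_TRAIN,MSTNPKPQ\n"
    "CALM1_HUMAN,MADQLTEE\n"
)
test_seqs, train_seqs = split_reference_rows(
    csv.DictReader(demo), {"CALM1_HUMAN", "GAL4_YEAST"}
)
```

On the real file, `csv.DictReader(open(path, newline=""))` would replace the in-memory demo, and `write_one_per_line("PG_test_sequences.csv", test_seqs.values())` would write the test file. Keeping the test IDs as an argument means only the input list changes when the larger benchmark is released.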

@pascalnotin pascalnotin moved this from Todo to In Progress in project-lm-scaling Sep 21, 2023
@csjackson0
Contributor

@pascalnotin Thanks for sending the sequences. Excited about the new ProteinGym benchmark!

I was thinking of also adding the Uniref30 cluster file "uniref30_2302.tsv" to the input of the script.
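
A minimal sketch of how the script might read that file, assuming it is a two-column, tab-separated file with the cluster representative in the first column and a cluster member in the second. The layout is an assumption and should be checked against the actual file, and the function name is hypothetical. Because the file is very large, the sketch streams it line by line and keeps only the entries for the requested IDs.

```python
def find_cluster_representatives(lines, wanted_ids):
    """Return {member_id: representative_id} for the requested members.

    lines:      iterable of "representative<TAB>member" strings (e.g. an open file)
    wanted_ids: member IDs to look up; they must use the same ID
                namespace as the cluster file
    """
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        representative, member = fields[0], fields[1]
        if member in wanted:
            found[member] = representative
    return found


# Made-up IDs, for illustration only.
demo_lines = [
    "REP_A\tREP_A\n",
    "REP_A\tMEMBER_1\n",
    "REP_B\tREP_B\n",
    "REP_B\tMEMBER_2\n",
]
lookup = find_cluster_representatives(demo_lines, {"MEMBER_1", "REP_B", "MISSING"})
```

IDs that never appear in the file are simply absent from the result, which makes it easy to list the ProteinGym sequences that still need a fallback lookup.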

@pascalnotin
Collaborator Author

Hi @csjackson0 -- sounds good to use this file as an example to structure the code. All WT sequences from ProteinGym are in Uniref100, but I'm not sure whether they will all be chosen as Uniref30 cluster representatives, so we may have to test for approximate matches in case certain cluster representatives differ a bit from the WT seqs used in PG.
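
As a rough illustration of what an approximate-match test could look like, the sketch below tries an exact match first and falls back to a similarity ratio from Python's standard `difflib`. The function names and the 0.9 threshold are hypothetical, and `difflib` is only a stand-in: its ratio is not a sequence-identity score from a proper alignment, and it is far too slow for database-scale search, where a dedicated alignment tool would be the realistic choice.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1]: 2 * matched characters / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_match(query, candidates, min_ratio=0.9):
    """Find the candidate sequence closest to `query`.

    candidates: dict of candidate ID -> sequence
    Returns (candidate_id, ratio), or None if nothing reaches min_ratio.
    """
    best_id, best_ratio = None, 0.0
    for candidate_id, seq in candidates.items():
        if seq == query:
            return candidate_id, 1.0  # exact match, no need to look further
        ratio = similarity(query, seq)
        if ratio > best_ratio:
            best_id, best_ratio = candidate_id, ratio
    if best_id is not None and best_ratio >= min_ratio:
        return best_id, best_ratio
    return None


# Made-up sequences: REP_1 differs from the query by one substitution.
representatives = {"REP_1": "MADQLTAEQI", "REP_2": "GSHMKKPLWW"}
hit = best_match("MADQLTEEQI", representatives)
```

Any ProteinGym WT sequence for which `best_match` returns `None` would need manual inspection before it is assigned to a cluster.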
