Create cross-validation split on input dataset #18

pascalnotin · 2023-08-17T15:18:07Z

Cluster input data at 30% similarity (e.g., Uniref30 from colabfoldDB)
Split sequences from ProteinGym into two subgroups:

Group 1: 10/87 sequences fully removed from training data clusters [@pascalnotin to select the subset of 10]
Group 2: 77/87 sequences in training data

Assign sequences in Group 1 to test set, and Group 2 to training set
Assign all other clusters at 30% similarity from input data to train-val-test [val and test get 1M clusters each; rest goes to training]
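A minimal sketch of the assignment described above, assuming each sequence has already been mapped to a 30%-similarity cluster ID. The function name `split_clusters`, its parameters, and the seeded shuffle are illustrative assumptions, not part of the agreed implementation:

```python
import random


def split_clusters(cluster_ids, pg_test_clusters, pg_train_clusters,
                   n_val, n_test, seed=0):
    """Assign every cluster ID to "train", "val", or "test".

    Clusters holding Group 1 ProteinGym sequences go to test, clusters
    holding Group 2 sequences go to train, and the remaining clusters
    are shuffled and divided: n_val to val, n_test to test, rest to train.
    """
    pg_test = set(pg_test_clusters)
    pg_train = set(pg_train_clusters)
    overlap = pg_test & pg_train
    if overlap:
        # A cluster cannot be held out and trained on at the same time.
        raise ValueError(f"clusters in both groups: {sorted(overlap)}")

    # Sort before shuffling so the result depends only on the seed.
    remaining = sorted(set(cluster_ids) - pg_test - pg_train)
    if n_val + n_test > len(remaining):
        raise ValueError("not enough clusters for the requested val/test sizes")
    random.Random(seed).shuffle(remaining)

    assignment = {c: "test" for c in pg_test}
    assignment.update({c: "train" for c in pg_train})
    for i, c in enumerate(remaining):
        if i < n_val:
            assignment[c] = "val"
        elif i < n_val + n_test:
            assignment[c] = "test"
        else:
            assignment[c] = "train"
    return assignment
```

With the numbers from the description, `n_val` and `n_test` would each be 1,000,000. Splitting at the cluster level rather than the sequence level keeps similar sequences from ending up on both sides of the split.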

pascalnotin · 2023-08-17T15:18:35Z

@jamaliki -- does that look good to you?

jamaliki · 2023-08-17T15:47:24Z

Yes, this looks good to me @pascalnotin. We can use colabfolddb so that if we want to use uniref100 later, we can do so easily.

pascalnotin · 2023-09-01T05:24:12Z

@jamaliki - I updated the description based on our latest discussion. Lmk your thoughts!

jamaliki · 2023-09-01T05:41:47Z

Makes sense @pascalnotin !

csjackson0 · 2023-09-14T22:47:28Z

I can help on this if needed! @pascalnotin is there a Group1/Group2 split made up for ProteinGym?

pascalnotin · 2023-09-15T09:27:10Z

Thanks @csjackson0 ! I will create a split and share here over the weekend

pascalnotin · 2023-09-21T15:03:47Z

@csjackson0 -- apologies for the slight delay here. I think it will be best if the implementation of this issue takes as input two csv files: 'PG_test_sequences.csv' and 'PG_train_sequences.csv', where each file contains one reference sequence per line.

We will be releasing a much larger version of the ProteinGym benchmark in a few weeks, so this will help ensure we can re-run the corresponding script when this is out by just updating the two csv files as needed.

For the time being, let's use the following:

All sequences can be found in this mapping file: https://github.com/OATML-Markslab/ProteinGym/blob/main/ProteinGym_reference_file_substitutions.csv
The 10 sequences in the 'PG_test_sequences.csv' file should be as follows:
A4D664_9INFA (Soh et al., 2019)
A4GRB6_PSEAI (Chen et al., 2020)
AMIE_PSEAE (Wrenbeck et al., 2017)
CALM1_HUMAN (Weile et al., 2017)
DLG4_RAT (McLaughlin Jr et al., 2012)
GAL4_YEAST (Kitzman et al., 2015)
PA_I34A1 (Wu et al., 2015)
Q2N0S5_9HIV1 (Haddox et al., 2018)
SPG1_STRSG (Olson et al., 2014)
TPOR_HUMAN (Bridgford et al., 2020)
All other sequences would go to 'PG_train_sequences.csv'
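One way the two input files could be produced, as a rough sketch. It assumes the identifiers have already been read from the mapping file (which may list the same protein more than once, hence the de-duplication) and that "one reference sequence per line" means one identifier per line. The helper names `partition_ids` and `write_one_per_line` are hypothetical:

```python
def partition_ids(all_ids, test_ids):
    """Split identifiers into (train, test), keeping first-seen order.

    Duplicates in all_ids are dropped; every test ID must be present.
    """
    test_set = set(test_ids)
    missing = test_set - set(all_ids)
    if missing:
        raise ValueError(f"test IDs not found in mapping: {sorted(missing)}")

    seen = set()
    train, test = [], []
    for uid in all_ids:
        if uid in seen:
            continue
        seen.add(uid)
        (test if uid in test_set else train).append(uid)
    return train, test


def write_one_per_line(path, ids):
    """Write one identifier per line, with no header row."""
    with open(path, "w") as fh:
        for uid in ids:
            fh.write(f"{uid}\n")
```

Usage would look like `train, test = partition_ids(ids_from_mapping, ten_test_ids)`, followed by `write_one_per_line("PG_test_sequences.csv", test)` and `write_one_per_line("PG_train_sequences.csv", train)`. When the larger benchmark is released, only the inputs change.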

csjackson0 · 2023-09-22T15:27:26Z

@pascalnotin Thanks for sending the sequences. Excited about the new ProteinGym benchmark!

I was thinking of also adding the Uniref30 cluster file "uniref30_2302.tsv" to the input of the script.
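For reference, a small sketch of reading such a cluster file. It assumes the two-column, tab-separated layout of an MMseqs2-style cluster TSV (representative ID, then member ID, one pair per line); that layout should be checked against the actual "uniref30_2302.tsv" before relying on it. `read_cluster_tsv` is a hypothetical helper name:

```python
from collections import defaultdict


def read_cluster_tsv(lines):
    """Group member IDs by cluster representative.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumed format per line: "<representative>\t<member>".
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)
```

Because the function accepts any iterable of lines, it can be fed an open file (`with open(path) as fh: read_cluster_tsv(fh)`). For a file with many millions of rows, holding every member in memory may not be practical, and a streaming or on-disk approach would be needed instead.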

pascalnotin · 2023-09-24T02:09:31Z

Hi @csjackson0 -- sounds good to use this file as an example to structure the code. All WT sequences from ProteinGym are in Uniref100, but not sure whether they will all be chosen as Uniref30 cluster representatives, so we may have to test for approximate match in case certain cluster representatives differ a bit from the WT seqs used in PG.
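A rough sketch of the exact-then-approximate lookup mentioned above. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score; a real pipeline would more likely use an alignment tool, and this score is not the same thing as alignment-based sequence identity. The function name, the `representatives` dict (ID to sequence), and the 0.9 threshold are all assumptions for illustration:

```python
from difflib import SequenceMatcher


def find_cluster_representative(wt_seq, representatives, min_ratio=0.9):
    """Return (rep_id, score) for the best-matching representative, or None.

    An exact match is tried first; otherwise the representative with the
    highest SequenceMatcher ratio is returned if it reaches min_ratio.
    """
    for rep_id, rep_seq in representatives.items():
        if rep_seq == wt_seq:
            return rep_id, 1.0

    best_id, best_ratio = None, 0.0
    for rep_id, rep_seq in representatives.items():
        # autojunk=False disables the heuristic that ignores frequent
        # characters, which would otherwise distort scores on long sequences.
        ratio = SequenceMatcher(None, wt_seq, rep_seq, autojunk=False).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = rep_id, ratio

    if best_id is not None and best_ratio >= min_ratio:
        return best_id, best_ratio
    return None
```

Comparing one WT sequence against every representative this way is quadratic in sequence length and linear in the number of representatives, so it is only practical on a pre-filtered candidate set (for example, representatives from the same UniRef100 entry's cluster).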

pascalnotin added this to project-lm-scaling Aug 17, 2023

pascalnotin converted this from a draft issue Aug 17, 2023

othertea mentioned this issue Sep 11, 2023

feat: add validation split, wandb logging, multigpu compatibility #49

Merged

pascalnotin moved this from Todo to In Progress in project-lm-scaling Sep 21, 2023
