Create cross-validation split on input dataset #18
Comments
@jamaliki -- does that look good to you?
Yes, this looks good to me @pascalnotin. We can use colabfolddb so that if we want to use uniref100 later, we can do so easily.
@jamaliki - I updated the description based on our latest discussion. Lmk your thoughts!
Makes sense @pascalnotin!
I can help on this if needed! @pascalnotin is there a Group1/Group2 split made up for ProteinGym?
Thanks @csjackson0! I will create a split and share it here over the weekend.
@csjackson0 -- apologies for the slight delay here. I think it will be best if the implementation of this issue takes as input two csv files: 'PG_test_sequences.csv' and 'PG_train_sequences.csv', where each file contains one reference sequence per line. We will be releasing a much larger version of the ProteinGym benchmark in a few weeks, so this will help ensure we can re-run the corresponding script when this is out by just updating the two csv files as needed. For the time being, let's use the following:
@pascalnotin Thanks for sending the sequences. Excited about the new ProteinGym benchmark! I was thinking of also adding the Uniref30 cluster file "uniref30_2302.tsv" to the input of the script.
Hi @csjackson0 -- sounds good to use this file as an example to structure the code. All WT sequences from ProteinGym are in Uniref100, but not sure whether they will all be chosen as Uniref30 cluster representatives, so we may have to test for approximate match in case certain cluster representatives differ a bit from the WT seqs used in PG.
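A minimal sketch of the approximate-match check discussed above, in case it helps structure the code. Everything here is an assumption for illustration: the function name `best_match`, the `min_ratio` threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are not from this thread. It is a pairwise comparison, so it only makes sense on a small candidate set; at Uniref30 scale a dedicated sequence search tool would be needed.

```python
from difflib import SequenceMatcher


def best_match(wt_seq, representatives, min_ratio=0.9):
    """Find the cluster representative closest to a ProteinGym WT sequence.

    `representatives` maps a representative id to its sequence (hypothetical
    in-memory format). Returns (rep_id, ratio), or (None, best_ratio) when no
    representative reaches `min_ratio`.
    """
    # Fast path: an exact match needs no similarity computation.
    for rep_id, rep_seq in representatives.items():
        if rep_seq == wt_seq:
            return rep_id, 1.0

    # Fallback: keep the representative with the highest similarity ratio.
    best_id, best_ratio = None, 0.0
    for rep_id, rep_seq in representatives.items():
        # autojunk=False disables difflib's heuristic that ignores very
        # frequent elements, which would distort ratios on long sequences.
        ratio = SequenceMatcher(None, wt_seq, rep_seq, autojunk=False).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = rep_id, ratio

    if best_ratio >= min_ratio:
        return best_id, best_ratio
    return None, best_ratio
```

Sequences that come back as `(None, ratio)` would need a manual look before deciding which cluster they belong to.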
Cluster input data at 30% similarity (e.g., Uniref30 from colabfoldDB)
Split sequences from ProteinGym into two subgroups:
Assign sequences in Group 1 to test set, and Group 2 to training set
Assign all other clusters at 30% similarity from the input data to train/val/test [val and test get 1M clusters each; the rest go to training]
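
The assignment steps above could be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_splits`, its parameters, and the choice of a seeded random shuffle are hypothetical, and it assumes the ProteinGym sequences have already been mapped to cluster ids. It also assumes the 1M val and test clusters are drawn from the non-ProteinGym clusters, so the Group 1 clusters are added on top of the test quota.

```python
import random


def assign_splits(cluster_ids, test_pg_clusters, train_pg_clusters,
                  n_val, n_test, seed=0):
    """Map each cluster id to 'train', 'val', or 'test'.

    Clusters holding Group 1 ProteinGym sequences are pinned to test, clusters
    holding Group 2 sequences are pinned to train, and all remaining clusters
    are shuffled and split by count.
    """
    test_pg = set(test_pg_clusters)
    train_pg = set(train_pg_clusters)
    overlap = test_pg & train_pg
    if overlap:
        # A cluster in both groups would leak test data into training.
        raise ValueError(f"clusters assigned to both groups: {sorted(overlap)}")

    assignment = {c: "test" for c in test_pg}
    assignment.update({c: "train" for c in train_pg})

    # Sort before shuffling so the result depends only on the seed,
    # not on the iteration order of the input.
    remaining = sorted(set(cluster_ids) - test_pg - train_pg)
    if n_val + n_test > len(remaining):
        raise ValueError("not enough clusters for the requested val/test sizes")
    random.Random(seed).shuffle(remaining)

    for i, cluster in enumerate(remaining):
        if i < n_val:
            assignment[cluster] = "val"
        elif i < n_val + n_test:
            assignment[cluster] = "test"
        else:
            assignment[cluster] = "train"
    return assignment
```

For the split described here, `n_val` and `n_test` would both be 1,000,000. Splitting at the cluster level, rather than the sequence level, is what keeps sequences above 30% similarity from landing on both sides of the train/test boundary.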