Combine and filter NAGL2 Optimization Datasets Part 1 + Part 2 #416
Conversation
…ers, despite me limiting it to 5
QCSubmit Validation Report
QC Specification Report
QCSubmit version information
Thanks for doing this, Lexie! I was working on something similar last week, but by combining the `dataset.json.bz2` files directly. This approach is a little more complicated, but it has the obvious benefit of being able to filter the records, so I think this looks good. Also glad to see the hash check! Hopefully QCArchive will detect this as complete right after submission.
Lifecycle - QCSubmit Submission Report : SUCCESS
Response from public QCArchive:
QCSubmit version information
Current status - Error Cycling. Consider manually moving this.
Lifecycle - Error Cycling Report
All errored tasks will be restarted.
| specification | COMPLETE | RUNNING | WAITING | ERROR | CANCELLED | INVALID | DELETED |
|---|---|---|---|---|---|---|---|
| default | 128281 | 0 | 0 | 0 | 0 | 0 | 0 |
OptimizationRecord
Error Tracebacks:
Tracebacks

QCSubmit version information

| package | version |
|---|---|
| openff.qcsubmit | 0.54.0 |
| openff.toolkit | 0.16.6 |
| basis_set_exchange | 0.10 |
| qcelemental | 0.28.0 |
| rdkit | 2024.09.3 |
Current status - Archived/Complete. Consider manually moving this.
Lifecycle - Archived/Complete
Dataset Complete!
I have combined the NAGL part 1 and part 2 datasets, and filtered out problematic records (crashes, geometric rearrangements, and stereochemistry changes). We may also want to filter by conformer RMSD, but I thought we could save that for the single-point dataset creation if we decide to do it; let me know what you think.
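The record filtering described above can be sketched as follows. This is a hypothetical, self-contained illustration using plain dictionaries, not the actual openff-qcsubmit result-collection API; the field names (`status`, `connectivity_changed`, `stereo_changed`) are made up for the example.

```python
# Hypothetical sketch of dropping problematic optimization records before
# combining datasets. In the real workflow this is done with
# openff-qcsubmit result collections and filters; the record structure
# here is illustrative only.

def filter_records(records):
    """Keep only complete records whose connectivity and stereochemistry
    survived the optimization."""
    kept = []
    for rec in records:
        if rec["status"] != "COMPLETE":      # drop crashed / errored records
            continue
        if rec["connectivity_changed"]:      # drop geometric rearrangements
            continue
        if rec["stereo_changed"]:            # drop stereochemistry changes
            continue
        kept.append(rec)
    return kept

records = [
    {"id": 1, "status": "COMPLETE", "connectivity_changed": False, "stereo_changed": False},
    {"id": 2, "status": "ERROR", "connectivity_changed": False, "stereo_changed": False},
    {"id": 3, "status": "COMPLETE", "connectivity_changed": True, "stereo_changed": False},
]

print([r["id"] for r in filter_records(records)])  # → [1]
```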
To combine the two optimization datasets and have QCA recognize that they had already been computed, I had to add the initial molecule to the dataset, not the optimized one. QCA should then recognize the hash and skip the computation; I confirmed that the hashes were the same between the separate and combined datasets.
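The hash check mentioned above can be sketched like this. The `molecule_hash` helper is a stand-in I wrote for illustration; QCArchive computes its own hash over the validated molecule fields, but the idea is the same: if the combined dataset carries the same initial-molecule hashes as the separate datasets, the server can match the existing records instead of recomputing them.

```python
import hashlib
import json

def molecule_hash(molecule):
    """Illustrative stand-in for QCArchive's molecule hash: a digest over
    the fields that identify an initial molecule (symbols + geometry)."""
    payload = json.dumps(
        {"symbols": molecule["symbols"], "geometry": molecule["geometry"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# The same initial molecule, as it appears in a separate dataset and in
# the combined dataset: identical fields must give identical hashes.
separate = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 0.0, 0.757, 0.587, 0.0, -0.757, 0.587],
}
combined = dict(separate)

print(molecule_hash(separate) == molecule_hash(combined))  # → True
```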
Also, I called it `v4.0`, but I'm not sure if it should be called `v4.1` since it has been filtered. It would be the first version of the combined dataset, which is why I picked `v4.0`, but I'm not sure how this fits into the naming convention. Happy to change it.

It was non-trivial to combine these two datasets into one, partly due to it being an optimization dataset and partly due to the large size of the dataset, so I am even more in favor of @ntBre's suggestion of tagging individual records than I was before.
New Submission Checklist

- [ ] `README.md` describing the dataset (see here for examples)
- [ ] `dataset*.json`; may feature a compression extension, such as `.bz2`
- [ ] `README.md`