Combine and filter NAGL2 Optimization Datasets Part 1 + Part 2 #416

Merged
merged 14 commits into from
Dec 9, 2024

Conversation

@amcisaac amcisaac commented Dec 9, 2024

I have combined the NAGL part 1 and part 2 datasets and filtered out problematic records (crashes, geometric rearrangements, and stereochemistry issues). We may also want to filter on conformer RMSD, but I thought we could save that for the single point dataset creation, if we decide to do it at all. Let me know what you think.

To combine the two optimization datasets and have QCArchive (QCA) recognize that the records had already been computed, I had to add the initial molecule of each record to the dataset, not the optimized one. QCA should then recognize the hash and skip the computation. I confirmed that the hashes are the same between the separate and combined datasets.
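To illustrate why the initial molecule matters, here is a minimal sketch of content-hash deduplication. It is a toy model only: `molecule_hash`, `needs_computation`, the water geometries, and the `computed` index are all hypothetical names and data made up for illustration, and the hashing scheme is not the one QCArchive actually uses.

```python
import hashlib
import json


def molecule_hash(symbols, geometry, decimals=8):
    """Toy content hash of a molecule (NOT QCArchive's real hashing scheme)."""
    payload = {
        "symbols": list(symbols),
        # Round coordinates so tiny floating-point noise does not change the hash.
        "geometry": [round(float(x), decimals) for x in geometry],
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


def needs_computation(symbols, geometry, index):
    """Return True if no existing record was started from this exact molecule."""
    return molecule_hash(symbols, geometry) not in index


# Hypothetical record: an optimization is keyed by its *initial* geometry.
symbols = ["O", "H", "H"]
water_initial = [0.0, 0.0, 0.0, 0.0, 0.0, 1.8, 1.7, 0.0, -0.5]
water_final = [0.0, 0.0, 0.1, 0.0, 0.0, 1.9, 1.75, 0.0, -0.45]

# Hypothetical server-side index of optimizations that already ran.
computed = {molecule_hash(symbols, water_initial): "existing-record"}

# Resubmitting the initial molecule matches the stored hash, so nothing is recomputed.
resubmit_initial = needs_computation(symbols, water_initial, computed)
# Submitting the optimized molecule produces a different hash and a new calculation.
resubmit_final = needs_computation(symbols, water_final, computed)

print(resubmit_initial, resubmit_final)
```

In this toy model, only the entry built from the initial geometry is recognized as already computed, which mirrors the behavior described above.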

Also, I called it v4.0, but I'm not sure if it should be called v4.1 since it has been filtered? It would be the first version of the combined dataset, which is why I picked v4.0, but I'm not sure how this fits into the naming convention. Happy to change it.

It was non-trivial to combine these two datasets into one, partly due to it being an optimization dataset, and partly because of the large size of the dataset, so I am even more in favor of @ntBre's suggestion of tagging individual records than I was before.

New Submission Checklist

  • Created a new folder in the submissions directory containing the dataset
  • Added README.md describing the dataset (see here for examples)
  • All files used to produce the dataset are included with a description
  • Dataset follows the QCSubmit schema defined for Datasets, OptimizationDatasets and TorsionDriveDatasets
  • Dataset filename matches pattern dataset*.json; may feature a compression extension, such as .bz2
  • A PDF depicting the molecules is attached. For torsiondrives, this should include highlighting of the central bond, which can be done automatically using qcsubmit.
  • QCSubmit validation passed
  • Made a new dataset entry in the mapping table in repository README.md
  • Ready to submit!
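
The filename rule in the checklist can be checked with a short script. This is a sketch under stated assumptions, not the repository's actual validation code: `is_valid_dataset_filename` is a hypothetical helper, and the checklist only names `.bz2` explicitly, so the other compression suffixes below are assumptions.

```python
from fnmatch import fnmatchcase

# Assumption: the checklist only names ".bz2"; ".gz" and ".xz" are illustrative guesses.
COMPRESSION_SUFFIXES = (".bz2", ".gz", ".xz")


def is_valid_dataset_filename(name):
    """Check a filename against the pattern dataset*.json, ignoring one compression suffix."""
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # fnmatchcase keeps the match case-sensitive on every platform.
    return fnmatchcase(name, "dataset*.json")


for candidate in ["dataset.json.bz2", "dataset_part1.json", "compute.json"]:
    print(candidate, is_valid_dataset_filename(candidate))
```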

@openff-dangerbot

QCSubmit Validation Report

submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/dataset.json.bz2
Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type OptimizationDataset
Elements H, N, Br, I, Si, B, O, C, P, F, Cl, S
Valid Cmiles 🔥
Connected Dihedrals 🔥
No Linear Torsions 🔥
No Molecular Complexes 🔥
Valid Constraints 🔥
Complete Metadata 🔥

QC Specification Report

submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/dataset.json.bz2/default
Specification Name default
Method B3LYP-D3BJ
Basis DZVP
Wavefunction Protocol none
Implicit Solvent
Keywords {}
Validated 🔥
Valid SCF Properties 🔥
Full Basis Coverage 🔥
QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.6
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@amcisaac amcisaac requested review from lilyminium and ntBre December 9, 2024 07:35

@ntBre ntBre left a comment

Thanks for doing this, Lexie! I was working on something similar last week but by combining the dataset.json.bz2 files directly. This is a little more complicated but has the obvious benefit of being able to filter the records, so I think this looks good. Also glad to see the hashes check! Hopefully QCArchive will detect this as complete right after submission.

@amcisaac amcisaac merged commit 0e6e6da into master Dec 9, 2024
5 checks passed
@openff-dangerbot

Lifecycle - QCSubmit Submission Report : SUCCESS

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-09 21:44 UTC

Response from public QCArchive:

None

QCSubmit version information
version
openff.qcsubmit 0.53.0
openff.toolkit 0.16.4
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.1

@openff-dangerbot

Current status - Error Cycling

Consider manually moving this.

@openff-dangerbot

Lifecycle - Error Cycling Report

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-09 22:24 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 128281 0 0 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.6
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot

Current status - Archived/Complete

Consider manually moving this.

@openff-dangerbot

Lifecycle - Archived/Complete

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-09 22:24 UTC

Dataset Complete!

@openff-dangerbot

Lifecycle - Error Cycling Report

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-10 12:08 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 128281 0 0 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.6
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot

Current status - Archived/Complete

Consider manually moving this.

@openff-dangerbot

Lifecycle - Archived/Complete

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-10 12:08 UTC

Dataset Complete!

@openff-dangerbot

Lifecycle - Error Cycling Report

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-11 12:07 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

OptimizationRecord current status

specification COMPLETE RUNNING WAITING ERROR CANCELLED INVALID DELETED
default 128281 0 0 0 0 0 0

OptimizationRecord Error Tracebacks:

Tracebacks

QCSubmit version information
version
openff.qcsubmit 0.54.0
openff.toolkit 0.16.6
basis_set_exchange 0.10
qcelemental 0.28.0
rdkit 2024.09.3

@openff-dangerbot

Current status - Archived/Complete

Consider manually moving this.

@openff-dangerbot

Lifecycle - Archived/Complete

Dataset Name OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type optimization
UTC Datetime 2024-12-11 12:07 UTC

Dataset Complete!

Projects
Status: End of Life