Combine and filter NAGL2 Optimization Datasets Part 1 + Part 2 #416

amcisaac · 2024-12-09T07:27:08Z

I have combined the NAGL2 part 1 and part 2 datasets and filtered out problematic records (crashes, geometric rearrangements, and stereochemistry issues). We may also want to filter by conformer RMSD, but I thought we could save that for the single-point dataset creation, if we decide to do it at all. Let me know what you think.
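As a rough illustration of the filtering step, here is a minimal sketch. It assumes each record has already been summarized into a status string plus two boolean flags. The `Record` class, its field names, and the `keep` helper are hypothetical and are not part of QCSubmit or QCPortal; the scripts included in the submission folder do the actual filtering.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical per-record summary; field names are assumptions."""
    name: str
    status: str                  # e.g. "COMPLETE" or "ERROR"
    connectivity_changed: bool   # geometry rearranged during optimization
    stereo_changed: bool         # stereochemistry differs from the input


def keep(record: Record) -> bool:
    """Keep only records that finished and preserved the input molecule."""
    return (
        record.status == "COMPLETE"
        and not record.connectivity_changed
        and not record.stereo_changed
    )


# Toy records covering each of the three rejection reasons.
records = [
    Record("mol-a", "COMPLETE", False, False),
    Record("mol-b", "ERROR", False, False),     # crashed
    Record("mol-c", "COMPLETE", True, False),   # rearranged
    Record("mol-d", "COMPLETE", False, True),   # stereochemistry issue
]

filtered = [r for r in records if keep(r)]
kept_names = [r.name for r in filtered]
```

Keeping the three criteria in one predicate makes it straightforward to add a conformer RMSD check later without touching the rest of the pipeline.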

To combine the two optimization datasets and have QCA recognize that the records had already been computed, I had to add the initial molecules to the dataset, not the optimized ones. QCA should then recognize the hashes and skip the computation. I confirmed that the hashes were the same between the separate and combined datasets.
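The idea behind the hash check can be sketched as follows. This is only an illustration: `molecule_hash` is a hypothetical stand-in that hashes element symbols and rounded coordinates, and it is not the hashing scheme QCArchive actually uses. The entry layout (`symbols`, `initial_geometry`, `final_geometry`) is likewise assumed for the example.

```python
import hashlib
import json


def molecule_hash(symbols, geometry, decimals=8):
    """Hash a molecule from its symbols and flattened coordinates.

    Hypothetical stand-in for a server-side molecule hash; the rounding
    precision is an assumption made for this sketch.
    """
    payload = {
        "symbols": list(symbols),
        "geometry": [round(float(x), decimals) for x in geometry],
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def combine_entries(*datasets):
    """Merge entry lists, keyed by the hash of each initial molecule."""
    combined = {}
    for entries in datasets:
        for entry in entries:
            key = molecule_hash(entry["symbols"], entry["initial_geometry"])
            combined.setdefault(key, entry)  # first entry wins on collision
    return combined


# Toy entries: the optimized geometry differs from the initial one, so only
# the initial geometry reproduces the hash used at original submission.
part1 = [{
    "symbols": ["H", "H"],
    "initial_geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 1.4],
    "final_geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 1.39],
}]
part2 = [{
    "symbols": ["O", "H"],
    "initial_geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 1.8],
    "final_geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 1.83],
}]

combined = combine_entries(part1, part2)
separate_hashes = {
    molecule_hash(e["symbols"], e["initial_geometry"]) for e in part1 + part2
}
hashes_match = set(combined) == separate_hashes
```

Submitting the optimized geometry instead would produce a different hash, so the server would treat the entry as a new calculation rather than matching the existing record.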

Also, I called it v4.0, but I'm not sure if it should be called v4.1 since it has been filtered? It would be the first version of the combined dataset, which is why I picked v4.0, but I'm not sure how this fits into the naming convention. Happy to change it.

It was non-trivial to combine these two datasets into one, partly due to it being an optimization dataset, and partly because of the large size of the dataset, so I am even more in favor of @ntBre's suggestion of tagging individual records than I was before.

New Submission Checklist

Created a new folder in the submissions directory containing the dataset
Added README.md describing the dataset (see here for examples)
All files used to produce the dataset are included with a description
Dataset follows the QCSubmit schema defined for Datasets, OptimizationDatasets and TorsionDriveDatasets
Dataset filename matches pattern dataset*.json; may feature a compression extension, such as .bz2
A PDF depicting the molecules is attached; for torsiondrives, this should highlight the central bond, which can be done automatically using qcsubmit.
QCSubmit validation passed
Made a new dataset entry in the mapping table in repository README.md
Ready to submit!

openff-dangerbot · 2024-12-09T07:32:58Z

QCSubmit Validation Report

	submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/dataset.json.bz2
Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	OptimizationDataset
Elements	H, N, Br, I, Si, B, O, C, P, F, Cl, S
Valid Cmiles	🔥
Connected Dihedrals	🔥
No Linear Torsions	🔥
No Molecular Complexes	🔥
Valid Constraints	🔥
Complete Metadata	🔥

QC Specification Report

	submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/dataset.json.bz2/default
Specification Name	default
Method	B3LYP-D3BJ
Basis	DZVP
Wavefunction Protocol	none
Implicit Solvent
Keywords	{}
Validated	🔥
Valid SCF Properties	🔥
Full Basis Coverage	🔥

QCSubmit version information

	version
openff.qcsubmit	0.54.0
openff.toolkit	0.16.6
basis_set_exchange	0.10
qcelemental	0.28.0
rdkit	2024.09.3

ntBre

Thanks for doing this, Lexie! I was working on something similar last week but by combining the dataset.json.bz2 files directly. This is a little more complicated but has the obvious benefit of being able to filter the records, so I think this looks good. Also glad to see the hashes check! Hopefully QCArchive will detect this as complete right after submission.

openff-dangerbot · 2024-12-09T21:44:48Z

Lifecycle - QCSubmit Submission Report : SUCCESS


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-09 21:44 UTC

Response from public QCArchive:

None

QCSubmit version information

	version
openff.qcsubmit	0.53.0
openff.toolkit	0.16.4
basis_set_exchange	0.10
qcelemental	0.28.0
rdkit	2024.09.1

openff-dangerbot · 2024-12-09T21:44:50Z

Current status - Error Cycling

Consider manually moving this.

openff-dangerbot · 2024-12-09T22:24:48Z

Lifecycle - Error Cycling Report


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-09 22:24 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

`OptimizationRecord` current status

specification	COMPLETE	RUNNING	WAITING	ERROR	CANCELLED	INVALID	DELETED
default	128281	0	0	0	0	0	0

`OptimizationRecord` Error Tracebacks:

QCSubmit version information

	version
openff.qcsubmit	0.54.0
openff.toolkit	0.16.6
basis_set_exchange	0.10
qcelemental	0.28.0
rdkit	2024.09.3

openff-dangerbot · 2024-12-09T22:24:50Z

Current status - Archived/Complete

Consider manually moving this.

openff-dangerbot · 2024-12-09T22:25:06Z

Lifecycle - Archived/Complete


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-09 22:24 UTC

Dataset Complete!

openff-dangerbot · 2024-12-10T12:08:19Z

Lifecycle - Error Cycling Report


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-10 12:08 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

`OptimizationRecord` current status

specification	COMPLETE	RUNNING	WAITING	ERROR	CANCELLED	INVALID	DELETED
default	128281	0	0	0	0	0	0

`OptimizationRecord` Error Tracebacks:

QCSubmit version information

	version
openff.qcsubmit	0.54.0
openff.toolkit	0.16.6
basis_set_exchange	0.10
qcelemental	0.28.0
rdkit	2024.09.3

openff-dangerbot · 2024-12-10T12:08:21Z

Current status - Archived/Complete

Consider manually moving this.

openff-dangerbot · 2024-12-10T12:08:37Z

Lifecycle - Archived/Complete


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-10 12:08 UTC

Dataset Complete!

openff-dangerbot · 2024-12-11T12:07:51Z

Lifecycle - Error Cycling Report


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-11 12:07 UTC

All errored tasks will be restarted.
Errored states prior to restart reported below.

`OptimizationRecord` current status

specification	COMPLETE	RUNNING	WAITING	ERROR	CANCELLED	INVALID	DELETED
default	128281	0	0	0	0	0	0

`OptimizationRecord` Error Tracebacks:

QCSubmit version information

	version
openff.qcsubmit	0.54.0
openff.toolkit	0.16.6
basis_set_exchange	0.10
qcelemental	0.28.0
rdkit	2024.09.3

openff-dangerbot · 2024-12-11T12:07:53Z

Current status - Archived/Complete

Consider manually moving this.

openff-dangerbot · 2024-12-11T12:08:10Z

Lifecycle - Archived/Complete


Dataset Name	OpenFF NAGL2 Training Optimization Dataset v4.0
Dataset Type	optimization
UTC Datetime	2024-12-11 12:07 UTC

Dataset Complete!

amcisaac added 14 commits November 27, 2024 11:16

adding files

7dec641

renamed 4.0 and updated repo readme

869c5bc

removing logs

b34af83

adding filtered datasets

2df3108

Adding filtered dataset

90d7899

Modifying the README for part 1 --somehow one molecule got 10 conformers, despite me limiting it to 5

ae1cdc0

modifying README

9583c3e

Update date

a668dde

changing date on submission directory

ca75f3b

fixing names

36c3d23

updating date on directory

fabb9e0

updating date on overall readme

e27b2b9

fixing date in URL in dataset metadata

894710b

removing debug notebook

89e5e5a

amcisaac added the tracking label Dec 9, 2024

amcisaac requested review from lilyminium and ntBre December 9, 2024 07:35

ntBre approved these changes Dec 9, 2024

amcisaac added the compute-nagl2-small label Dec 9, 2024

amcisaac merged commit 0e6e6da into master Dec 9, 2024
5 checks passed

ntBre removed the compute-nagl2-small label Dec 11, 2024

openff-dangerbot added scientific-review end-of-life and removed scientific-review labels Dec 12, 2024

amcisaac deleted the combine-nagl2-opt branch January 8, 2025 23:44
