openforcefield · amcisaac · Dec 9, 2024 · Nov 27, 2024
diff --git a/README.md b/README.md
@@ -297,6 +297,8 @@ These are currently used to find a minimum energy conformation of a molecule.
 | `OpenFF Lipid Optimization Training Supplement v1.0` | [2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0) | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |
 | `OpenFF NAGL2 Training Optimization Dataset Part 1 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0) | Optimization dataset for NAGL2 training, part 1 | Cl, O, C, P, I, Br, B, S, N, F, H, Si | |
 | `OpenFF NAGL2 Training Optimization Dataset Part 2 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0) | Optimization dataset for NAGL2 training, part 2 | Si, B, O, I, S, Cl, N, H, C, P, F, Br | |
+| `OpenFF NAGL2 Training Optimization Dataset v4.0` | [2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0) | Optimization dataset for NAGL2 training, combined and filtered | Si, B, O, I, S, Cl, N, H, C, P, F, Br | |
+
 
 # TorsionDrive Datasets
 These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

diff --git a/...ons/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0/README.md b/...ons/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0/README.md
@@ -24,7 +24,7 @@ using the OpenEye backend of the OpenFF toolkit
 * Name: OpenFF NAGL2 Training Optimization Dataset Part 1 v4.0
 * Number of unique molecules: 55134
 * Number of conformers: 131198
-* Number of conformers (min, mean, max): 1.00, 2.38, 5.00
+* Number of conformers (min, mean, max): 1.00, 2.38, 10.00
 * Molecular weight (min, mean, max): 32.12, 158.53, 299.97
 * Charges: -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0
 * Dataset submitter: Alexandra McIsaac
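
The mean reported in this hunk matches the total conformer count divided by the number of unique molecules. The snippet below is a minimal check of that arithmetic, assuming this is how the summary line is derived; the helper name `conformer_mean` is hypothetical and is not part of QCSubmit.

```python
def conformer_mean(n_conformers: int, n_molecules: int) -> float:
    """Mean conformers per molecule, assuming mean = total conformers / unique molecules."""
    if n_molecules <= 0:
        raise ValueError("n_molecules must be positive")
    return n_conformers / n_molecules


# Counts copied from the Part 1 README lines in this hunk.
part1_mean = conformer_mean(131198, 55134)
print(f"{part1_mean:.2f}")  # prints 2.38
```

The check covers only the mean. The changed value in this hunk is the maximum (5.00 to 10.00), which can only be confirmed from per-molecule conformer counts.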

diff --git a/submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/README.md b/submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0/README.md
@@ -0,0 +1,64 @@
+# OpenFF NAGL2 Training Optimization Dataset v4.0
+
+## Description
+A dataset containing molecules from the [`MLPepper RECAP Optimized Fragments v1.0`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0)
+and [`MLPepper RECAP Optimized Fragments v1.0 Add Iodines`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0) datasets,
+with new conformers generated and optimized at the OpenFF default level of theory (B3LYP-D3BJ/DZVP).
+The dataset is intended to be used for calculating single point energies and properties,
+which will then be used to train our second-generation graph neural network charge model (NAGL2).
+This is the final dataset, with [part 1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0) and [part 2](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0) combined, and with errored records, geometric rearrangements, and stereochemistry issues filtered out.
+
+For each molecule, a set of up to 5 conformers was generated by:
+
+  * generating a set of up to 1000 conformers with an RMS cutoff of 0.1 Å
+    using the OpenEye backend of the OpenFF toolkit
+
+  * applying ELF conformer selection (max 5 conformers) using OpenEye
+
+
+## General information
+* Date: 2024-12-09
+* Class: OpenFF Optimization Dataset
+* Purpose: Conformer optimization
+* Name: OpenFF NAGL2 Training Optimization Dataset v4.0
+* Number of unique molecules: 54422
+* Number of conformers: 128281
+* Number of conformers (min, mean, max): 1.00, 2.36, 10.00
+* Molecular weight (min, mean, max): 32.12, 163.93, 701.59
+* Charges: -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0
+* Dataset submitter: Alexandra McIsaac
+* Dataset generator: Alexandra McIsaac
+
+## QCSubmit generation pipeline
+* `filter-and-combine-ds.py` was used to combine part 1 and part 2, and filter out problematic records
+* `filter-and-combine-ds.sh` was used to run `filter-and-combine-ds.py` on HPC3
+* `generate-combined-dataset.py` was used to create the combined and filtered dataset for submission to QCArchive
+* `fix-date.py` was used to update the date in the GitHub URL for the final dataset
+
+## QCSubmit Manifest
+* `dataset.json.bz2`: compressed dataset ready for submission
+* `dataset.pdf`: Visualization of dataset molecules
+* `dataset.smi`: SMILES strings for dataset molecules
+* `filter-and-combine-ds.py`: Script used to combine part 1 and part 2, and filter out problematic records
+* `filter-and-combine-ds.sh`: Script used to run `filter-and-combine-ds.py` on HPC3
+* `filtered_and_combined_nagl2_opt.json`: Output of `filter-and-combine-ds.py` and input to `generate-combined-dataset.py`
+* `generate-combined-dataset.py`: Script describing dataset generation and submission
+* `generate-combined-datsaset.out`: Log file of `generate-combined-dataset.py`
+* `fix-date.py`: Script used to update the date in the dataset metadata
+* `input-environment.yaml`: Environment file used to create Python environment for the notebook
+* `input-environment-full.yaml`: Fully-resolved environment used to execute the notebook
+
+## Metadata
+* Elements: {I, N, B, Si, H, Br, C, O, Cl, F, S, P}
+* Spec: default
+  * basis: DZVP
+  * implicit_solvent: None
+  * keywords: {}
+  * maxiter: 200
+  * method: B3LYP-D3BJ
+  * program: psi4
+  * SCF properties:
+    * dipole
+    * quadrupole
+    * wiberg_lowdin_indices
+    * mayer_indices
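
The combine-and-filter step listed under "QCSubmit generation pipeline" can be pictured with plain Python data. This is only a sketch under stated assumptions: the record fields (`id`, `status`, `connectivity_changed`, `stereo_ok`) and the function names `keep` and `combine_and_filter` are hypothetical stand-ins. The actual `filter-and-combine-ds.py` script operates on QCArchive records and may work differently.

```python
from typing import Dict, Iterable, List

# Hypothetical record layout for illustration only; real QCArchive
# optimization records expose different fields.
Record = Dict[str, object]


def keep(record: Record) -> bool:
    """Return True if a record passes the three filters named in the README."""
    return (
        record["status"] == "complete"          # drop errored records
        and not record["connectivity_changed"]  # drop geometric rearrangements
        and bool(record["stereo_ok"])           # drop stereochemistry issues
    )


def combine_and_filter(*parts: Iterable[Record]) -> List[Record]:
    """Concatenate the dataset parts in order and keep only passing records."""
    return [record for part in parts for record in part if keep(record)]


# Toy stand-ins for part 1 and part 2.
part1 = [
    {"id": "a", "status": "complete", "connectivity_changed": False, "stereo_ok": True},
    {"id": "b", "status": "error", "connectivity_changed": False, "stereo_ok": True},
]
part2 = [
    {"id": "c", "status": "complete", "connectivity_changed": True, "stereo_ok": True},
    {"id": "d", "status": "complete", "connectivity_changed": False, "stereo_ok": True},
    {"id": "e", "status": "complete", "connectivity_changed": False, "stereo_ok": False},
]

final = combine_and_filter(part1, part2)
print([record["id"] for record in final])  # prints ['a', 'd']
```

Keeping the three checks in one predicate makes it easy to count how many records each criterion removes, which is useful when reporting why the combined dataset is smaller than the sum of its parts.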