Merge pull request #410 from openforcefield/nagl2-training-opt-p2

Nagl2 training opt p2
openforcefield · Nov 21, 2024 · e54e981
2 parents 4cf9b44 + 420966b
commit e54e981

Showing 9 changed files with 2,690 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -296,6 +296,7 @@ These are currently used to find a minimum energy conformation of a molecule.
 | `OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0` | [2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0) | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
 | `OpenFF Lipid Optimization Training Supplement v1.0` | [2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0) | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |
 | `OpenFF NAGL2 Training Optimization Dataset Part 1 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0) | Optimization dataset for NAGL2 training, part 1 | Cl, O, C, P, I, Br, B, S, N, F, H, Si | |
+| `OpenFF NAGL2 Training Optimization Dataset Part 2 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0) | Optimization dataset for NAGL2 training, part 2 | Si, B, O, I, S, Cl, N, H, C, P, F, Br | |
 
 # TorsionDrive Datasets
 These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

diff --git a/...ons/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/README.md b/...ons/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/README.md
@@ -0,0 +1,57 @@
+# OpenFF NAGL2 Training Optimization Dataset Part 2 v4.0
+
+## Description
+A dataset containing molecules from the [`MLPepper RECAP Optimized Fragments v1.0`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0)
+and [`MLPepper RECAP Optimized Fragments v1.0 Add Iodines`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0) datasets,
+with additional conformers, optimized at the OpenFF default level of theory (B3LYP-D3BJ/DZVP).
+The dataset is intended to be used for calculating single point energies and properties,
+which will then be used to train our second-generation graph neural network charge model (NAGL2).
+This is part 2, for molecules with molecular weight greater than 300 Da.
+
+
+For each molecule, a set of up to 5 conformers was generated by:
+
+  * generating a set of up to 1000 conformers with an RMS cutoff of 0.1 Å
+    using the OpenEye backend of the OpenFF toolkit
+
+  * applying ELF conformer selection (max 5 conformers) using OpenEye
+
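The two conformer-generation steps above can be sketched with the OpenFF toolkit roughly as follows. This is a minimal sketch, not the code from `generate-dataset-part2.ipynb`: the function name `generate_elf_conformers` and the module-level constants are hypothetical, the input is assumed to be a plain SMILES string rather than CMILES, and the ELF `percentage` is left at the toolkit default because the README does not state it. Running the function requires the OpenFF toolkit and an OpenEye license.

```python
# Settings taken from the README text above.
MAX_INITIAL_CONFORMERS = 1000  # "up to 1000 conformers"
RMS_CUTOFF_ANGSTROM = 0.1      # "RMS cutoff of 0.1 Å"
MAX_ELF_CONFORMERS = 5         # "max 5 conformers"


def generate_elf_conformers(smiles):
    """Return an OpenFF Molecule holding at most MAX_ELF_CONFORMERS conformers.

    Hypothetical helper: the real notebook may differ in inputs and options.
    """
    # Imported lazily so this module can be loaded without OpenFF/OpenEye.
    from openff.toolkit import Molecule
    from openff.toolkit.utils import OpenEyeToolkitWrapper
    from openff.units import unit

    openeye = OpenEyeToolkitWrapper()
    molecule = Molecule.from_smiles(smiles)

    # Step 1: generate a large initial pool, pruned by the RMS cutoff.
    molecule.generate_conformers(
        n_conformers=MAX_INITIAL_CONFORMERS,
        rms_cutoff=RMS_CUTOFF_ANGSTROM * unit.angstrom,
        toolkit_registry=openeye,
    )

    # Step 2: ELF selection keeps at most `limit` conformers from the pool.
    molecule.apply_elf_conformer_selection(
        limit=MAX_ELF_CONFORMERS,
        toolkit_registry=openeye,
    )
    return molecule
```

Calling `generate_elf_conformers("CCO")`, for example, would leave the selected conformers in `molecule.conformers`.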
+## General information
+* Date: 2024-11-19
+* Class: OpenFF Optimization Dataset
+* Purpose: Conformer optimization
+* Name: OpenFF NAGL2 Training Optimization Dataset Part 2 v4.0
+* Number of unique molecules: 1197
+* Number of conformers: 2323
+* Number of conformers per molecule (min, mean, max): 1.00, 1.94, 5.00
+* Molecular weight in Da (min, mean, max): 300.08, 377.82, 701.59
+* Charges: -4.0 -2.0 -1.0 0.0 1.0 2.0
+* Dataset submitter: Alexandra McIsaac
+* Dataset generator: Alexandra McIsaac
+
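As a quick consistency check on the figures above, the mean number of conformers per molecule should equal the total conformer count divided by the number of unique molecules. The short sketch below uses only the numbers listed in this section; the variable names are illustrative.

```python
# Values copied from the "General information" list.
n_molecules = 1197
n_conformers = 2323
max_per_molecule = 5

# Mean conformers per molecule, to compare with the reported 1.94.
mean_conformers = n_conformers / n_molecules
print(f"mean conformers per molecule: {mean_conformers:.2f}")

# With 1 to 5 conformers per molecule, the total must lie in this range.
assert n_molecules <= n_conformers <= max_per_molecule * n_molecules
```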
+## QCSubmit generation pipeline
+* `generate-dataset-part2.ipynb` was used to generate conformers from CMILES and create the dataset.
+
+## QCSubmit Manifest
+* `dataset_part2.json.bz2`: Compressed dataset ready for submission
+* `dataset_part2.pdf`: Visualization of dataset molecules
+* `dataset_part2.smi`: SMILES strings for dataset molecules
+* `generate-dataset-part2.ipynb`: Notebook describing dataset generation and submission
+* `input-environment.yaml`: Environment file used to create Python environment for the notebook
+* `input-environment-full.yaml`: Fully-resolved environment used to execute the notebook
+* `mlpepper.json.bz2`: bzip2-compressed copy of the MLPepper dataset that can be read in for quicker conformer generation
+
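The `.json.bz2` files in the manifest are bzip2-compressed JSON, so they can be inspected with the Python standard library alone. The sketch below is an assumption-laden illustration: `load_bz2_json` is a hypothetical helper, and the demo writes a tiny stand-in file because the internal layout of `dataset_part2.json.bz2` is not described here. QCSubmit provides its own dataset loaders, which are not shown.

```python
import bz2
import json
import tempfile
from pathlib import Path


def load_bz2_json(path):
    """Decompress a bzip2 file and parse its contents as JSON."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demo with a stand-in file; the key below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.bz2"
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump({"example_key": "example_value"}, handle)
    demo = load_bz2_json(demo_path)

print(demo["example_key"])
```

For the real files, `load_bz2_json("dataset_part2.json.bz2")` would return the parsed JSON, whose top-level keys can then be listed to explore the structure.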
+## Metadata
+* Elements: {Si, B, O, I, S, Cl, N, H, C, P, F, Br}
+* Spec: default
+  * basis: DZVP
+  * implicit_solvent: None
+  * keywords: {}
+  * maxiter: 200
+  * method: B3LYP-D3BJ
+  * program: psi4
+  * SCF properties:
+    * dipole
+    * quadrupole
+    * wiberg_lowdin_indices
+    * mayer_indices
diff --git a/.../2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/dataset_part2.json.bz2 b/.../2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/dataset_part2.json.bz2
diff --git a/...sions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/dataset_part2.pdf b/...sions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0/dataset_part2.pdf