Combine and filter NAGL2 Optimization Datasets Part 1 + Part 2 #416

Merged: 14 commits, Dec 9, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -297,6 +297,8 @@ These are currently used to find a minimum energy conformation of a molecule.
| `OpenFF Lipid Optimization Training Supplement v1.0` | [2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0) | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |
| `OpenFF NAGL2 Training Optimization Dataset Part 1 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0) | Optimization dataset for NAGL2 training, part 1 | Cl, O, C, P, I, Br, B, S, N, F, H, Si | |
| `OpenFF NAGL2 Training Optimization Dataset Part 2 v4.0` | [2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0) | Optimization dataset for NAGL2 training, part 2 | Si, B, O, I, S, Cl, N, H, C, P, F, Br | |
| `OpenFF NAGL2 Training Optimization Dataset v4.0` | [2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-12-09-OpenFF-NAGL2-Training-Optimization-Dataset-v4.0) | Optimization dataset for NAGL2 training, combined and filtered | Si, B, O, I, S, Cl, N, H, C, P, F, Br | |


# TorsionDrive Datasets
These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.
@@ -24,7 +24,7 @@ using the OpenEye backend of the OpenFF toolkit
* Name: OpenFF NAGL2 Training Optimization Dataset Part 1 v4.0
* Number of unique molecules: 55134
* Number of conformers: 131198
-* Number of conformers (min, mean, max): 1.00, 2.38, 5.00
+* Number of conformers (min, mean, max): 1.00, 2.38, 10.00
* Molecular weight (min, mean, max): 32.12, 158.53, 299.97
* Charges: -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0
* Dataset submitter: Alexandra McIsaac
@@ -0,0 +1,64 @@
# OpenFF NAGL2 Training Optimization Dataset v4.0

## Description
A dataset containing molecules from the [`MLPepper RECAP Optimized Fragments v1.0`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0)
and [`MLPepper RECAP Optimized Fragments v1.0 Add Iodines`](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0) datasets,
with new conformers generated and then optimized at the OpenFF default level of theory (B3LYP-D3BJ/DZVP).
The dataset is intended to be used for calculating single point energies and properties,
which will then be used to train our second-generation graph neural network charge model (NAGL2).
This is the final dataset, with [part 1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-1-v4.0) and [part 2](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-19-OpenFF-NAGL2-Training-Optimization-Dataset-Part-2-v4.0) combined, and with errored records, geometric rearrangements, and stereochemistry issues filtered out.

For each molecule, a set of up to 5 conformers was generated by:

* generating a set of up to 1000 conformers with an RMS cutoff of 0.1 Å
using the OpenEye backend of the OpenFF toolkit

* applying ELF conformer selection (max 5 conformers) using OpenEye
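
The RMS-cutoff idea in the first step can be pictured with a toy, pure-Python sketch. It does not call OpenEye or the OpenFF toolkit and does not reproduce ELF selection: `rms_distance` and `prune_by_rms` are hypothetical helpers that assume the conformers are already aligned and share an atom order, and the coordinates are made up for illustration.

```python
import math

Coords = list[tuple[float, float, float]]


def rms_distance(a: Coords, b: Coords) -> float:
    """RMS distance between two conformers with identical atom ordering.

    Assumption: the conformers are already aligned; a real toolkit also
    handles alignment and symmetry, which this sketch ignores.
    """
    if len(a) != len(b):
        raise ValueError("conformers must have the same number of atoms")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def prune_by_rms(conformers: list[Coords], cutoff: float, max_conformers: int) -> list[Coords]:
    """Greedily keep conformers that are farther than `cutoff` from every kept one."""
    kept: list[Coords] = []
    for conformer in conformers:
        if len(kept) >= max_conformers:
            break
        if all(rms_distance(conformer, other) > cutoff for other in kept):
            kept.append(conformer)
    return kept


# Three toy two-atom "conformers" (coordinates in Å); the second is a near-duplicate.
candidates = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.05, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
unique = prune_by_rms(candidates, cutoff=0.1, max_conformers=1000)
print(len(unique))
```

The near-duplicate falls within 0.1 Å of the first conformer and is dropped, so two conformers remain.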


## General information
* Date: 2024-12-09
* Class: OpenFF Optimization Dataset
* Purpose: Conformer optimization
* Name: OpenFF NAGL2 Training Optimization Dataset v4.0
* Number of unique molecules: 54422
* Number of conformers: 128281
* Number of conformers (min, mean, max): 1.00, 2.36, 10.00
* Molecular weight (min, mean, max): 32.12, 163.93, 701.59
* Charges: -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0
* Dataset submitter: Alexandra McIsaac
* Dataset generator: Alexandra McIsaac
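
As a rough illustration of how the conformer statistics above can be derived, the sketch below computes (min, mean, max) from a per-molecule conformer count and checks that the reported mean agrees with the reported totals. The `conformer_stats` helper and the toy SMILES keys are assumptions for illustration, not part of the submission scripts.

```python
from statistics import mean


def conformer_stats(counts: dict[str, int]) -> tuple[float, float, float]:
    """Return (min, mean, max) of conformers per molecule."""
    values = list(counts.values())
    return float(min(values)), float(mean(values)), float(max(values))


# Toy input: molecule identifier -> number of conformers (made-up values).
toy_counts = {"CCO": 1, "CCN": 2, "c1ccccc1O": 10}
low, avg, high = conformer_stats(toy_counts)
print(f"{low:.2f}, {avg:.2f}, {high:.2f}")

# Consistency check using the totals listed in this README:
# 128281 conformers over 54422 unique molecules.
reported_mean = round(128281 / 54422, 2)
print(reported_mean)
```

The second check reproduces the listed mean of 2.36 from the two totals.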

## QCSubmit generation pipeline
* `filter-and-combine-ds.py` was used to combine part 1 and part 2 and to filter out problematic records
* `filter-and-combine-ds.sh` was used to run `filter-and-combine-ds.py` on HPC3
* `generate-combined-dataset.py` was used to create the combined and filtered dataset for submission to QCArchive
* `fix-date.py` was used to update the date in the GitHub URL for the final dataset
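
The combine-and-filter step can be sketched as follows. This is not the contents of `filter-and-combine-ds.py`: the `Record` class, its fields, and `combine_and_filter` are hypothetical stand-ins that only mirror the three filter criteria named in the description (errored records, geometric rearrangements, and stereochemistry issues).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one optimization record."""

    name: str
    status: str        # assumed values: "complete" or "error"
    rearranged: bool   # True if connectivity changed during optimization
    stereo_ok: bool    # True if stereochemistry is intact


def combine_and_filter(*parts: list[Record]) -> list[Record]:
    """Concatenate the parts, then drop records failing any of the three checks."""
    combined = [record for part in parts for record in part]
    return [
        record
        for record in combined
        if record.status == "complete" and not record.rearranged and record.stereo_ok
    ]


# Made-up records standing in for part 1 and part 2.
part_1 = [
    Record("mol-a", "complete", rearranged=False, stereo_ok=True),
    Record("mol-b", "error", rearranged=False, stereo_ok=True),
]
part_2 = [
    Record("mol-c", "complete", rearranged=True, stereo_ok=True),
    Record("mol-d", "complete", rearranged=False, stereo_ok=False),
    Record("mol-e", "complete", rearranged=False, stereo_ok=True),
]
kept = combine_and_filter(part_1, part_2)
print([record.name for record in kept])
```

Only `mol-a` and `mol-e` pass all three checks in this toy input.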

## QCSubmit Manifest
* `dataset.json.bz2`: compressed dataset ready for submission
* `dataset.pdf`: Visualization of dataset molecules
* `dataset.smi`: SMILES strings for dataset molecules
* `filter-and-combine-ds.py`: Script used to combine part 1 and part 2, and filter out problematic records
* `filter-and-combine-ds.sh`: Script used to run `filter-and-combine-ds.py` on HPC3
* `filtered_and_combined_nagl2_opt.json`: Output of `filter-and-combine-ds.py` and input to `generate-combined-dataset.py`
* `generate-combined-dataset.py`: Script describing dataset generation and submission
* `generate-combined-datsaset.out`: Log file of `generate-combined-dataset.py`.
* `fix-date.py`: Script used to update date in dataset metadata
* `input-environment.yaml`: Environment file used to create Python environment for the notebook
* `input-environment-full.yaml`: Fully-resolved environment used to execute the notebook.
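
A bzip2-compressed JSON file such as `dataset.json.bz2` can be read with the Python standard library alone. The sketch below round-trips a throwaway file so it runs anywhere; `load_bz2_json` is a hypothetical helper, and the keys in the demo payload are placeholders rather than the real dataset schema.

```python
import bz2
import json
import tempfile
from pathlib import Path


def load_bz2_json(path: Path):
    """Decompress a .json.bz2 file and parse its contents as JSON."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round trip on a temporary file; the payload keys are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.bz2"
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as handle:
        json.dump({"name": "demo", "entries": 2}, handle)
    loaded = load_bz2_json(demo_path)

print(sorted(loaded))
```

Pointing `load_bz2_json` at `Path("dataset.json.bz2")` would be the analogous call for the real file, assuming it is present in the working directory.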

## Metadata
* Elements: {I, N, B, Si, H, Br, C, O, Cl, F, S, P}
* Spec: default
  * basis: DZVP
  * implicit_solvent: None
  * keywords: {}
  * maxiter: 200
  * method: B3LYP-D3BJ
  * program: psi4
  * SCF properties:
    * dipole
    * quadrupole
    * wiberg_lowdin_indices
    * mayer_indices
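
The element list above is written in a different order from the one in the table row added to the top-level README. A small sketch like the following can confirm that the two lists describe the same set; `parse_elements` is a hypothetical helper, and both strings are copied from this change.

```python
def parse_elements(text: str) -> frozenset[str]:
    """Parse a comma-separated element list, with or without surrounding braces."""
    tokens = text.strip("{} ").split(",")
    return frozenset(token.strip() for token in tokens if token.strip())


readme_row = "Si, B, O, I, S, Cl, N, H, C, P, F, Br"
metadata = "{I, N, B, Si, H, Br, C, O, Cl, F, S, P}"

same_elements = parse_elements(readme_row) == parse_elements(metadata)
print(same_elements, len(parse_elements(metadata)))
```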