Coordinate running calculations #11
CCing @pavankum @dotsdl @SimonBoothroyd @trevorgokey to see who might have available bandwidth to submit these to QCFractal. |
I can help with preparing the submissions; @jthorton already made a template, so I can start from that. Just to clarify, these are all optimization datasets, right? Or single point calculations? |
These are single point calculations. For each one we want to compute the energy and forces, as well as the quantities listed in #7 (comment). Also the orbital coefficients and eigenvalues, if the storage requirements aren't prohibitive. |
Maybe we can compute some with GPUGRID as well.
|
Is there anything we can do to get this going? @pavankum you said you were waiting on a new release of QCFractal. Is that release expected soon? I notice they seem to do infrequent releases. The last one was in June, and the one before that was November of last year. If we're just waiting for the next regularly scheduled release, this project could be stuck on hold for months. |
I already pinged Ben Pritchard from MolSSI ten days ago on the OpenFF QCFractal channel and offered help, and he said he would push to make a release last week. Maybe @jchodera can inquire about the current status. |
Great, thanks! |
MolSSI just had a major advisory board meeting in Washington, DC, which concluded yesterday. |
@pavankum : It looks like @bennybp released QCFractal 0.15.7 and QCPortal 0.15.7, and the QCArchive server was upgraded yesterday, so you may be clear to proceed now. When submitting these conformer datasets, can we also submit a standard |
@jchodera Sure, will make the submission PRs today. |
@peastman I got the ball rolling on submissions and made the PRs to the qca-dataset-submission repo (PubChem sets - 243, 245, 246, 247, 248, 249; DES370K - 244; solvated amino acids - 239; dipeptides - 251), and will push to compute once they get reviewed by David Dotson/Josh Horton. I used SPICE as a placeholder name; is there a consensus on the naming convention? For example, a subset of PubChem molecules (2501-5000) is named "SPICE PubChem Set 2 Single Points Dataset" - does this look okay? @jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets? |
@pavankum: Thank you so much!
I think this means you officially get to name this dataset since you did the submission work. :)
Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden. @peastman : I notice we skipped a dataset of monomers extracted from DES370K---could we add those in with higher priority than the dimers? All you have to do is extract the unique set of monomers and conformations. Also, I notice that many of those PubChem sets look completely nuts. I'll prioritize trying to get approval to redistribute the other datasets we discussed under nonrestrictive licenses. |
@jchodera Thank you for the feedback, will add that to the submission list |
Congratulations on your excellent choice of name!
There are definitely some odd ones in there. This is partly due to how I ordered the molecules: it tries to choose ones that are maximally different from anything that has come before, so if a molecule is really unusual, it gets put very early in the dataset. If you only look at the first few pages of the first set, you'll get the impression this collection is full of things that don't look much like drugs, but it quickly settles down into much more ordinary ones.
Is there any reason to think they'll be useful? Back when I was trying to train models on DES370K and nothing else, I found that training just on dimers was difficult and adding in a few monomer conformations helped it to learn. But in this case it already has tons of data for single molecules. A few hundred extra molecules isn't likely to make much difference. |
Hey all, we are currently executing SPICE PubChem Set 1 Single Points Dataset v1.1. Based on the growth rate of storage usage on QCArchive, at about 5MiB per calculation with wavefunction stored, we will go beyond QCArchive's storage capacity if we proceed in this way with the other 5 PubChem sets. Is it known now whether or not wavefunctions (orbitals and eigenvalues) will be needed for the downstream use case of these datasets? If this is not known, can we begin using set 1 for that downstream case to arrive at a decision? If wavefunctions are not needed at all, we can switch off wavefunction storage for the remaining 5 and proceed immediately. If wavefunctions are or may be required, we will need at least a 5TiB storage expansion of some kind on QCArchive. |
Wavefunctions aren't needed for any of the applications I'm interested in. I think the argument for saving them was that it could save time if we later decided we wanted to compute additional quantities, or redo the computation with a more accurate method (#7 (comment)). But if it causes storage problems, I don't think it's necessary. |
Thanks for clarifying, @peastman! @dotsdl : Since the PubChem Set 1 is only 10% done after 4-5 days of compute, maybe it makes sense to purge the dataset and start over without wavefunctions? 5 MiB x 11K calculations is 55 GiB of data that is probably unnecessary, and the dataset would eventually consume 595 GiB for no reason. |
@dotsdl : One other quick question: Are the molecules sorted in order of increasing size? It may make sense to do so if you are regenerating the datasets, since this would allow the highest throughput initially, enabling us to catch other issues earlier (rather than later) in dataset generation. |
@pavankum : It's a great question! I don't doubt that information derived from wavefunction/orbital data would be valuable in training advanced machine learning potentials, but I don't believe any of the architectures we are considering now would make use of this information. @jeherr: This is a really interesting idea---something to think about! |
Not from the training data, if that's what you mean. Their model begins with a semi-empirical calculation, and the outputs of it become the input to the model. So in that sense, it likely does involve orbital information (I haven't looked at the details to see exactly what values they use). But the dataset just has energies, no orbitals or even forces. That's what they fit to. |
Thank you very much for the clarifications!! |
Thanks all! I just spoke to @bennybp, and I propose we proceed as follows:
Please let me know if you object to any of this. |
If that works for you, go for it! |
Where can I find the completed datasets? |
Looks like even the SPICE PubChem Set 1 Single Points Dataset v1.1 has a long way to go. The data is all available in real time through the QCPortal API---see the example usage here, though I think this dataset is a new Dataset collection rather than an OptimizationDataset. The QCPortal API is not very performant for bulk downloads yet---@dotsdl has been working with @bennybp on speeding this up, and MolSSI is recruiting a new postdoc while OpenFF is hiring a contractor (we're still searching) to make improvements to this infrastructure. Another goal is to make the data available via monolithic HDF5 files on the machine learning datasets dashboard---this currently has to be prepared by hand. |
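In the meantime, a minimal sketch of browsing the live data through QCPortal, assuming QCPortal ~0.15 and the dataset names used in this thread:

from qcportal import FractalClient

# Connect to the public QCArchive server (anonymous read-only access).
client = FractalClient()
# Single-point SPICE submissions are stored as "Dataset" collections.
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
# Each row of list_records() describes one method/basis/program specification.
print(ds.list_records())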
What about openforcefield/qca-dataset-submission#254? It claims to be complete. I just want to look it over to make sure all the data content and organization looks right. |
I launched a MyBinder notebook using the OptimizationDataset example and tweaked it a bit: Browsing the SPICE datasets notebook. Takeaways:
It's possible these issues are caused by the mybinder image being out of date (it has QCPortal v0.15.7), but we're going to need some help from @pavankum @bennybp here. EDIT: It looks like this is the latest QCPortal version available on conda-forge (~1 month old). |
See openforcefield/openff-qcsubmit#196 for @jthorton's implementation of HDF5 export. This is the better route in my opinion for getting what we need here. @jchodera, @jeherr can you enumerate the data elements you need included in this kind of export? |
We need a generic exporter that can include arbitrary information. An important part of this dataset is that we're including not just forces and energies, but also other useful quantities like MBIS multipoles, bond orders, etc. See #7. Rather than hardcoding particular fields, we should be able to store everything contained in the records. We also need to be sure the molecule IDs from the original input files are included in a clear way. Those are all meaningful identifiers, such as PubChem substance IDs. They're present in the QCArchive data, but only in an obfuscated form. When you call |
I decided to try writing my own exporter as a proof of concept. It appears to me that the data available through FractalClient is missing a lot of information. Elements? SMILES strings? Atom positions? How do I retrieve those? |
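For what it's worth, a minimal sketch of pulling those fields out of a single record, assuming QCPortal ~0.15 (the elements, positions, and mapped SMILES all hang off the QCElemental Molecule attached to each record):

from qcportal import FractalClient

client = FractalClient()
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'],
                      program=spec['program'], keywords=spec['keywords'])

# Each record points at a QCElemental Molecule holding the structure.
mol = recs.iloc[0].record.get_molecule()
print(mol.symbols)           # element symbols
print(mol.geometry)          # positions, in bohr
# The mapped SMILES supplied at submission time is kept in the molecule's extras.
print(mol.extras['canonical_isomeric_explicit_hydrogen_mapped_smiles'])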
This would give the molecule details.
I think @jthorton made a sample exporter in qcsubmit here and he can update it accordingly to meet your needs, and can also map back to the original hdf5 you created for the molecules. |
Thanks! |
What units are all the quantities in? Including positions, energies, forces, charges, multipole moments, etc. |
All are in atomic units. |
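For anyone converting out of atomic units downstream, a quick sketch using the standard conversion factors (the variable names here are placeholders, not fields from the records):

import numpy as np

HARTREE_TO_KCAL_PER_MOL = 627.509  # 1 hartree in kcal/mol
BOHR_TO_ANGSTROM = 0.529177        # 1 bohr in angstrom

# Placeholder values standing in for quantities read from a record.
energy_hartree = -76.4
positions_bohr = np.zeros((5, 3))
gradient_hartree_per_bohr = np.zeros((5, 3))

energy_kcal = energy_hartree * HARTREE_TO_KCAL_PER_MOL
positions_angstrom = positions_bohr * BOHR_TO_ANGSTROM
# Forces are the negative gradient.
forces_kcal_per_angstrom = -gradient_hartree_per_bohr * HARTREE_TO_KCAL_PER_MOL / BOHR_TO_ANGSTROM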
Here is a proof-of-concept script for building the HDF5 file. Does it look like I'm doing everything correctly? Am I selecting the correct fields?

from qcportal import FractalClient
from collections import defaultdict
import numpy as np
import h5py

client = FractalClient()
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
recs_by_name = defaultdict(list)
for i in range(len(recs)):
    rec = recs.iloc[i]
    index = recs.index[i]
    name = index[:index.rfind('-')]
    recs_by_name[name].append(rec.record)
outputfile = h5py.File('output.hdf5', 'w')
for name in recs_by_name:
    group = outputfile.create_group(name)
    group_recs = recs_by_name[name]
    molecules = [r.get_molecule() for r in group_recs]
    qcvars = [r.extras['qcvars'] for r in group_recs]
    group.create_dataset('smiles', data=[molecules[0].extras['canonical_isomeric_explicit_hydrogen_mapped_smiles']], dtype=h5py.string_dtype())
    group.create_dataset("atomic_numbers", data=molecules[0].atomic_numbers, dtype=np.int16)
    conformations = group.create_dataset('conformations', data=np.array([m.geometry for m in molecules]), dtype=np.float32)
    conformations.attrs['units'] = 'bohr'
    energies = group.create_dataset('energies', data=np.array([vars['DFT TOTAL ENERGY'] for vars in qcvars]), dtype=np.float32)
    energies.attrs['units'] = 'hartree'
    gradients = group.create_dataset('gradients', data=np.array([vars['DFT TOTAL GRADIENT'] for vars in qcvars]), dtype=np.float32)
    gradients.attrs['units'] = 'hartree/bohr'
    mbis_charges = group.create_dataset('mbis_charges', data=np.array([vars['MBIS CHARGES'] for vars in qcvars]), dtype=np.float32)
    mbis_charges.attrs['units'] = 'e'
    mbis_dipoles = group.create_dataset('mbis_dipoles', data=np.array([vars['MBIS DIPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_dipoles.attrs['units'] = 'e*bohr'
    mbis_quadrupoles = group.create_dataset('mbis_quadrupoles', data=np.array([vars['MBIS QUADRUPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_quadrupoles.attrs['units'] = 'e*bohr^2'
    mbis_octupoles = group.create_dataset('mbis_octupoles', data=np.array([vars['MBIS OCTUPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_octupoles.attrs['units'] = 'e*bohr^3'
    scf_dipoles = group.create_dataset('scf_dipoles', data=np.array([vars['SCF DIPOLE'] for vars in qcvars]), dtype=np.float32)
    scf_dipoles.attrs['units'] = 'e*bohr'
    scf_quadrupoles = group.create_dataset('scf_quadrupoles', data=np.array([vars['SCF QUADRUPOLE'] for vars in qcvars]), dtype=np.float32)
    scf_quadrupoles.attrs['units'] = 'e*bohr^2'
    group.create_dataset('wiberg_lowdin_indices', data=np.array([vars['WIBERG LOWDIN INDICES'] for vars in qcvars]), dtype=np.float32)
    group.create_dataset('mayer_indices', data=np.array([vars['MAYER INDICES'] for vars in qcvars]), dtype=np.float32)

It's truly absurd how slow it is. The solvated amino acids dataset is tiny: just 26 molecules with 50 conformations each. It takes over 12 minutes to run. To process everything at that rate would take about a week. We may also want to reconsider what data to include in the HDF5 file. It comes out to about 100 MB for this tiny dataset. The full SPICE v1 dataset will probably add up to around 50 GB. Is that too big? The largest fields are the bond indices (size is O(n^2) in the number of atoms) and the MBIS octupoles (27 elements per atom per conformation). |
One caution about the energy and gradient: the functional 'wb97m' and the dispersion correction 'd3bj' are calculated separately, and we need to add them to get the final energy as well as the gradient. Their fields are |
According to the Psi4 documentation, |
@peastman one possible reason for the slowness is the individual calls to get_molecule() in molecules = [r.get_molecule() for r in group_recs]. I have found it is much faster to batch all queries to the server in batches the size of the query limit, which is 1000. So I would collect together all of the molecule ids you wish to query and then do it in large batches, as is done in qcsubmit.

The example I put together was supposed to just unblock fitting, but if you need more information in the hdf5 then we can add it based on your proof of concept. I think it might be nice to have a flexible interface that defaults to extracting common information like smiles, elements, return_energy, return_gradient, return_hessian etc., and also allows users to pass an optional dictionary of qcvars they wish to include in the file along with the units they are in. So combining your example with the proposed interface in qcsubmit would be something like

qcvars = {"MBIS CHARGES": "e", "SCF DIPOLE": "e*bohr"}
dataset.to_hdf5(filename="my_dataset.hdf5", qcvars=qcvars)

With this kind of use, users can extract all of the information they want to use in their training sets, which lets us avoid having to distribute these large files that just duplicate the information in QCFractal and in many cases may be redundant, since users often only want a subsection of the data. Instead, we just release each version of the SPICE dataset as a collection of record ids in QCFractal; since collections are mutable and can grow after release as calculations finish or more are added by mistake, this is currently the safest way to build a fixed dataset. This is something we have done with the fitting of our Sage force field, and example datasets are here. These datasets are JSON serialisations of QCSubmit result objects and can easily be loaded up and converted to hdf5 as needed:

from openff.qcsubmit.results import BasicResultCollection

dataset = BasicResultCollection.parse_file("dataset.json")
# get the raw records to use them directly
records_and_molecules = dataset.to_records()
# or build a hdf5 file for later use
qcvars = {"MBIS CHARGES": "e", "SCF DIPOLE": "e*bohr"}
dataset.to_hdf5(filename="my_dataset.hdf5", qcvars=qcvars)
|
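To make the batching suggestion concrete, a rough sketch of fetching molecules in bulk rather than one record at a time, assuming QCPortal ~0.15 (fetch_molecules is an illustrative helper, not part of QCSubmit):

from qcportal import FractalClient

def fetch_molecules(client, records, batch_size=1000):
    # Each result record stores the id of its molecule.
    mol_ids = [r.molecule for r in records]
    by_id = {}
    for start in range(0, len(mol_ids), batch_size):
        batch = mol_ids[start:start + batch_size]
        # One server round trip per batch instead of one per record.
        for mol in client.query_molecules(id=batch):
            by_id[mol.id] = mol
    # Return the molecules in the same order as the records.
    return [by_id[i] for i in mol_ids]

# usage (with recs from the proof-of-concept script above):
# molecules = fetch_molecules(client, [recs.iloc[i].record for i in range(len(recs))])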
50 GB is ok in my view.
|
@peastman Ahh yeah, never mind, I completely forgot that we're not splitting for these calculations. No need to do any extra modifications. |
I replied over at #21. We're getting pretty far off topic for this issue! |
All my workers have started reporting errors: "Acquisition of new tasks was not successful." And some of them are stopping early with this message:
|
@peastman I think there is a problem with an intermediate server. I am looking into it. |
Progress seems to have slowed way down. Over the last two weeks, it has averaged less than 3500 calculations per day. In the most recent 24 hour period, it only did 837. |
Yeah, sorry about that. For the last month there's been slow throughput because of a drop in node availability.
David is also pushing to get some extra compute from Max-Planck clusters, courtesy of Bert de Groot's lab, so we may see some improvement in throughput this week. |
Nearly all the jobs I had running yesterday exited with errors:
Does something need to be fixed on the server? |
I think the server is being a bit overloaded at the moment. Let's wait a bit and see if it clears up. Some stuff is getting through, but with delays. |
We're almost there! PubChem set 6 has now completed its initial pass through the data. We'll need a little more time for error cycling, and then we can mark it as complete and release the dataset! |
And we are done! The release is at https://github.com/openmm/spice-dataset/releases/tag/1.0, including an HDF5 file with the most commonly used data fields. Congratulations everyone! I'm now training a model on the finished dataset. I should have a complete first draft of the paper ready to review in a few days. |
@peastman: For posterity (or potentially for the paper), I totaled up the number of core-hours consumed, and came up with 4,057,659 core-hours. Here's the script I used to pull this information from QCPortal. It ran overnight.

import qcportal as ptl
client = ptl.FractalClient()
client
import yaml
with open('config.yaml') as input:
    config = yaml.safe_load(input.read())
import numpy as np
wall_time = 0.0
for subset in config['subsets']:
    # Download the next subset.
    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    for row in ds.list_records().iloc:
        spec = row.to_dict()
        if spec['method'] == 'wb97m-d3bj':
            recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
            break
    nrecs = len(recs)
    for i in range(nrecs):
        rec = recs.iloc[i].record
        wall_time += rec.provenance.wall_time
print(f'{wall_time / 60 / 60} core-hours used')
|
Thanks, that's useful information to have. It means an average of just under four core-hours per conformation, which sounds about right. Are you sure |
Here's an example of the
What I'm reporting here ( |
Whoops, I realize I neglected to multiply by nthreads. We may be off by a factor of 10x. I'll rerun the corrected script. |
Update: I had run the correct script and the measure should be accurate. I had just pasted the wrong script here.

import qcportal as ptl
client = ptl.FractalClient()
client
import yaml
with open('config.yaml') as input:
    config = yaml.safe_load(input.read())
import numpy as np
wall_time = 0.0
for subset in config['subsets']:
    # Download the next subset.
    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    for row in ds.list_records().iloc:
        spec = row.to_dict()
        if spec['method'] == 'wb97m-d3bj':
            recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
            break
    nrecs = len(recs)
    for i in range(nrecs):
        rec = recs.iloc[i].record
        try:
            wall_time += rec.provenance.wall_time * rec.provenance.nthreads
        except AttributeError as e:
            pass
        if i % 100 == 0:
            print(f'{wall_time / 60 / 60} core-hours used')
    print(f'{wall_time / 60 / 60} core-hours used')
print(f'{wall_time / 60 / 60} core-hours used')

Apologies for the confusion. |
That's probably the best estimate we can get, but keep in mind it's only an estimate. Wall clock time can be influenced by a lot of things. For example, jobs that ended up failing may have caused other jobs running at the same time to take longer to complete, so they're still influencing the results. |
Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?
The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).