Coordinate running calculations #11
CCing @pavankum @dotsdl @SimonBoothroyd @trevorgokey to see who might have available bandwidth to submit these to QCFractal. |
I can help with preparing the submissions; @jthorton already made a template, so I can start from that. Just to clarify, these are all optimization datasets, right? Or single point calculations? |
These are single point calculations. For each one we want to compute the energy and forces, as well as the quantities listed in #7 (comment). Also the orbital coefficients and eigenvalues, if the storage requirements aren't prohibitive. |
Maybe we can compute some with GPUGRID as well.
|
Is there anything we can do to get this going? @pavankum you said you were waiting on a new release of QCFractal. Is that release expected soon? I notice they seem to do infrequent releases. The last one was in June, and the one before that was November of last year. If we're just waiting for the next regularly scheduled release, this project could be stuck on hold for months. |
I already pinged Ben Pritchard from MolSSI ten days ago on the OpenFF QCFractal channel and offered help, and he said he would push to make a release last week. Maybe @jchodera can inquire about the current status. |
Great, thanks! |
MolSSI just had a major advisory board meeting in Washington, DC, which concluded yesterday. |
@pavankum : It looks like @bennybp released QCFractal 0.15.7 and QCPortal 0.15.7, and the QCArchive server was upgraded yesterday, so you may be clear to proceed now. When submitting these conformer datasets, can we also submit a standard |
@jchodera Sure, will make the submission PRs today. |
@peastman I got the ball rolling on submissions and made the PRs to the qca-dataset-submission repo (PubChem sets - 243, 245, 246, 247, 248, 249; DES370K - 244; solvated amino acids - 239; dipeptides - 251), and will push to compute once they get reviewed by David Dotson/Josh Horton. I used SPICE as a placeholder name; is there a consensus on the naming convention? For example, a subset of PubChem molecules (2501-5000) is named "SPICE PubChem Set 2 Single Points Dataset" - does this look okay? @jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets? |
@pavankum: Thank you so much!
I think this means you officially get to name this dataset since you did the submission work. :)
Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden. @peastman : I notice we skipped a dataset of monomers extracted from DES370K---could we add those in with higher priority than the dimers? All you have to do is extract the unique set of monomers and conformations. Also, I notice that many of those PubChem sets look completely nuts. I'll prioritize trying to get approval to redistribute the other datasets we discussed under nonrestrictive licenses. |
@jchodera Thank you for the feedback, will add that to the submission list |
Congratulations on your excellent choice of name!
There are definitely some odd ones in there. This is partly due to how I ordered the molecules: it tries to choose ones that are maximally different from anything that has come before, so if a molecule is really unusual, it gets put very early in the dataset. If you only look at the first few pages of the first set, you'll get the impression this collection is full of things that don't look much like drugs, but it quickly settles down into much more ordinary ones.
Is there any reason to think they'll be useful? Back when I was trying to train models on DES370K and nothing else, I found that training just on dimers was difficult and adding in a few monomer conformations helped it to learn. But in this case it already has tons of data for single molecules. A few hundred extra molecules isn't likely to make much difference. |
Hey all, we are currently executing SPICE PubChem Set 1 Single Points Dataset v1.1. Based on the growth rate of storage usage on QCArchive, at about 5MiB per calculation with wavefunction stored, we will go beyond QCArchive's storage capacity if we proceed in this way with the other 5 PubChem sets. Is it known now whether or not wavefunctions (orbitals and eigenvalues) will be needed for the downstream use case of these datasets? If this is not known, can we begin using set 1 for that downstream case to arrive at a decision? If wavefunctions are not needed at all, we can switch off wavefunction storage for the remaining 5 and proceed immediately. If wavefunctions are or may be required, we will need at least a 5TiB storage expansion of some kind on QCArchive. |
Wavefunctions aren't needed for any of the applications I'm interested in. I think the argument for saving them was that it could save time if we later decided we wanted to compute additional quantities, or redo the computation with a more accurate method (#7 (comment)). But if it causes storage problems, I don't think it's necessary. |
Thanks for clarifying, @peastman! @dotsdl : Since the PubChem Set 1 is only 10% done after 4-5 days of compute, maybe it makes sense to purge the dataset and start over without wavefunctions? 5 MiB x 11K calculations is 55 GiB of data that is probably unnecessary, and the dataset would eventually consume 595 GiB for no reason. |
@dotsdl : One other quick question: Are the molecules sorted in order of increasing size? It may make sense to do so if you are regenerating the datasets, since this would allow the highest throughput initially, enabling us to catch other issues earlier (rather than later) in dataset generation. |
@pavankum : It's a great question! I don't doubt that information derived from wavefunction/orbital data would be valuable in training advanced machine learning potentials, but I don't believe any of the architectures we are considering now would make use of this information. @jeherr: This is a really interesting idea---something to think about! |
Not from the training data, if that's what you mean. Their model begins with a semi-empirical calculation, and the outputs of it become the input to the model. So in that sense, it likely does involve orbital information (I haven't looked at the details to see exactly what values they use). But the dataset just has energies, no orbitals or even forces. That's what they fit to. |
Thank you very much for the clarifications!! |
Thanks all! I just spoke to @bennybp, and I propose we proceed as follows:
Please let me know if you object to any of this. |
If that works for you, go for it! |
Where can I find the completed datasets? |
Looks like even the SPICE PubChem Set 1 Single Points Dataset v1.1 has a long way to go. The data is all available in real time through the QCPortal API---see the example usage here, though I think this dataset is a new Dataset collection rather than an OptimizationDataset. The QCPortal API is not very performant for bulk downloads yet---@dotsdl has been working with @bennybp on speeding this up, and MolSSI is recruiting a new postdoc while OpenFF is hiring a contractor (we're still searching) to make improvements to this infrastructure. Another goal is to make the data available via monolithic HDF5 files on the machine learning datasets dashboard---this currently has to be prepared by hand. |
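In the meantime, a minimal sketch of browsing the live data through QCPortal, assuming QCPortal ~0.15 and the dataset names used in this thread:

from qcportal import FractalClient

# Connect to the public QCArchive server (anonymous read-only access).
client = FractalClient()
# Single-point SPICE submissions are stored as "Dataset" collections.
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
# Each row of list_records() describes one method/basis/program specification.
print(ds.list_records())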
What about openforcefield/qca-dataset-submission#254? It claims to be complete. I just want to look it over to make sure all the data content and organization looks right. |
I launched a MyBinder notebook using the OptimizationDataset example and tweaked it a bit: Browsing the SPICE datasets notebook. Takeaways:
It's possible these issues are caused by the mybinder image being out of date (it has QCPortal v0.15.7), but we're going to need some help from @pavankum @bennybp here. EDIT: It looks like this is the latest QCPortal version available on conda-forge (~1 month old). |
See openforcefield/openff-qcsubmit#196 for @jthorton's implementation of HDF5 export. This is the better route in my opinion for getting what we need here. @jchodera, @jeherr can you enumerate the data elements you need included in this kind of export? |
We need a generic exporter that can include arbitrary information. An important part of this dataset is that we're including not just forces and energies, but also other useful quantities like MBIS multipoles, bond orders, etc. See #7. Rather than hardcoding particular fields, we should be able to store everything contained in the records. We also need to be sure the molecule IDs from the original input files are included in a clear way. Those are all meaningful identifiers, such as PubChem substance IDs. They're present in the QCArchive data, but only in an obfuscated form. When you call |
I decided to try writing my own exporter as a proof of concept. It appears to me that the data available through FractalClient is missing a lot of information. Elements? SMILES strings? Atom positions? How do I retrieve those? |
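For what it's worth, a minimal sketch of pulling those fields out of a single record, assuming QCPortal ~0.15 (the elements, positions, and mapped SMILES all hang off the QCElemental Molecule attached to each record):

from qcportal import FractalClient

client = FractalClient()
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'],
                      program=spec['program'], keywords=spec['keywords'])

# Each record points at a QCElemental Molecule holding the structure.
mol = recs.iloc[0].record.get_molecule()
print(mol.symbols)           # element symbols
print(mol.geometry)          # positions, in bohr
# The mapped SMILES supplied at submission time is kept in the molecule's extras.
print(mol.extras['canonical_isomeric_explicit_hydrogen_mapped_smiles'])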
This would give the molecule details.
I think @jthorton made a sample exporter in qcsubmit here and he can update it accordingly to meet your needs, and can also map back to the original hdf5 you created for the molecules. |
Thanks! |
What units are all the quantities in? Including positions, energies, forces, charges, multipole moments, etc. |
All are in atomic units. |
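For anyone converting out of atomic units downstream, a quick sketch using the standard conversion factors (the variable names here are placeholders, not fields from the records):

import numpy as np

HARTREE_TO_KCAL_PER_MOL = 627.509  # 1 hartree in kcal/mol
BOHR_TO_ANGSTROM = 0.529177        # 1 bohr in angstrom

# Placeholder values standing in for quantities read from a record.
energy_hartree = -76.4
positions_bohr = np.zeros((5, 3))
gradient_hartree_per_bohr = np.zeros((5, 3))

energy_kcal = energy_hartree * HARTREE_TO_KCAL_PER_MOL
positions_angstrom = positions_bohr * BOHR_TO_ANGSTROM
# Forces are the negative gradient.
forces_kcal_per_angstrom = -gradient_hartree_per_bohr * HARTREE_TO_KCAL_PER_MOL / BOHR_TO_ANGSTROM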
Here is a proof-of-concept script for building the HDF5 file. Does it look like I'm doing everything correctly? Am I selecting the correct fields?

from qcportal import FractalClient
from collections import defaultdict
import numpy as np
import h5py

client = FractalClient()
ds = client.get_collection("Dataset", "SPICE Solvated Amino Acids Single Points Dataset v1.1")
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
recs_by_name = defaultdict(list)
for i in range(len(recs)):
    rec = recs.iloc[i]
    index = recs.index[i]
    name = index[:index.rfind('-')]
    recs_by_name[name].append(rec.record)
outputfile = h5py.File('output.hdf5', 'w')
for name in recs_by_name:
    group = outputfile.create_group(name)
    group_recs = recs_by_name[name]
    molecules = [r.get_molecule() for r in group_recs]
    qcvars = [r.extras['qcvars'] for r in group_recs]
    group.create_dataset('smiles', data=[molecules[0].extras['canonical_isomeric_explicit_hydrogen_mapped_smiles']], dtype=h5py.string_dtype())
    group.create_dataset("atomic_numbers", data=molecules[0].atomic_numbers, dtype=np.int16)
    conformations = group.create_dataset('conformations', data=np.array([m.geometry for m in molecules]), dtype=np.float32)
    conformations.attrs['units'] = 'bohr'
    energies = group.create_dataset('energies', data=np.array([vars['DFT TOTAL ENERGY'] for vars in qcvars]), dtype=np.float32)
    energies.attrs['units'] = 'hartree'
    gradients = group.create_dataset('gradients', data=np.array([vars['DFT TOTAL GRADIENT'] for vars in qcvars]), dtype=np.float32)
    gradients.attrs['units'] = 'hartree/bohr'
    mbis_charges = group.create_dataset('mbis_charges', data=np.array([vars['MBIS CHARGES'] for vars in qcvars]), dtype=np.float32)
    mbis_charges.attrs['units'] = 'e'
    mbis_dipoles = group.create_dataset('mbis_dipoles', data=np.array([vars['MBIS DIPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_dipoles.attrs['units'] = 'e*bohr'
    mbis_quadrupoles = group.create_dataset('mbis_quadrupoles', data=np.array([vars['MBIS QUADRUPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_quadrupoles.attrs['units'] = 'e*bohr^2'
    mbis_octupoles = group.create_dataset('mbis_octupoles', data=np.array([vars['MBIS OCTUPOLES'] for vars in qcvars]), dtype=np.float32)
    mbis_octupoles.attrs['units'] = 'e*bohr^3'
    scf_dipoles = group.create_dataset('scf_dipoles', data=np.array([vars['SCF DIPOLE'] for vars in qcvars]), dtype=np.float32)
    scf_dipoles.attrs['units'] = 'e*bohr'
    scf_quadrupoles = group.create_dataset('scf_quadrupoles', data=np.array([vars['SCF QUADRUPOLE'] for vars in qcvars]), dtype=np.float32)
    scf_quadrupoles.attrs['units'] = 'e*bohr^2'
    group.create_dataset('wiberg_lowdin_indices', data=np.array([vars['WIBERG LOWDIN INDICES'] for vars in qcvars]), dtype=np.float32)
    group.create_dataset('mayer_indices', data=np.array([vars['MAYER INDICES'] for vars in qcvars]), dtype=np.float32)

It's truly absurd how slow it is. The solvated amino acids dataset is tiny: just 26 molecules with 50 conformations each. It takes over 12 minutes to run. To process everything at that rate would take about a week. We may also want to reconsider what data to include in the HDF5 file. It comes out to about 100 MB for this tiny dataset. The full SPICE v1 dataset will probably add up to around 50 GB. Is that too big? The largest fields are the bond indices (size is O(n^2) in the number of atoms) and the MBIS octupoles (27 elements per atom per conformation). |
One caution about the energy and gradient: the functional 'wb97m' and the dispersion correction 'd3bj' are calculated separately, and we need to add them to get the final energy as well as the gradient. Their fields are |
According to the Psi4 documentation, |
@peastman one possible reason for the slowness is the individual calls to get_molecule() in molecules = [r.get_molecule() for r in group_recs]. I have found it is much faster to batch all queries to the server in batches the size of the query limit, which is 1000. So I would collect together all of the molecule ids you wish to query and then do it in large batches, as is done in qcsubmit.

The example I put together was supposed to just unblock fitting, but if you need more information in the hdf5 then we can add it based on your proof of concept. I think it might be nice to have a flexible interface that defaults to extracting common information like smiles, elements, return_energy, return_gradient, return_hessian etc., and also allows users to pass an optional dictionary of qcvars they wish to include in the file along with the units they are in. So combining your example with the proposed interface in qcsubmit would be something like

qcvars = {"MBIS CHARGES": "e", "SCF DIPOLE": "e*bohr"}
dataset.to_hdf5(filename="my_dataset.hdf5", qcvars=qcvars)

With this kind of use, users can extract all of the information they want to use in their training sets, which lets us avoid having to distribute these large files that just duplicate the information in QCFractal and in many cases may be redundant, since users often only want a subsection of the data. Instead, we just release each version of the SPICE dataset as a collection of record ids in QCFractal; since collections are mutable and can grow after release as calculations finish or more are added by mistake, this is currently the safest way to build a fixed dataset. This is something we have done with the fitting of our Sage force field, and example datasets are here. These datasets are JSON serialisations of QCSubmit result objects and can easily be loaded up and converted to hdf5 as needed:

from openff.qcsubmit.results import BasicResultCollection

dataset = BasicResultCollection.parse_file("dataset.json")
# get the raw records to use them directly
records_and_molecules = dataset.to_records()
# or build a hdf5 file for later use
qcvars = {"MBIS CHARGES": "e", "SCF DIPOLE": "e*bohr"}
dataset.to_hdf5(filename="my_dataset.hdf5", qcvars=qcvars)
|
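To make the batching suggestion concrete, a rough sketch of fetching molecules in bulk rather than one record at a time, assuming QCPortal ~0.15 (fetch_molecules is an illustrative helper, not part of QCSubmit):

from qcportal import FractalClient

def fetch_molecules(client, records, batch_size=1000):
    # Each result record stores the id of its molecule.
    mol_ids = [r.molecule for r in records]
    by_id = {}
    for start in range(0, len(mol_ids), batch_size):
        batch = mol_ids[start:start + batch_size]
        # One server round trip per batch instead of one per record.
        for mol in client.query_molecules(id=batch):
            by_id[mol.id] = mol
    # Return the molecules in the same order as the records.
    return [by_id[i] for i in mol_ids]

# usage (with recs from the proof-of-concept script above):
# molecules = fetch_molecules(client, [recs.iloc[i].record for i in range(len(recs))])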
50 GB is ok in my view.
|
@peastman Ahh yeah, never mind, I completely forgot that we're not splitting for these calculations. No need to do any extra modifications. |
I replied over at #21. We're getting pretty far off topic for this issue! |
All my workers have started reporting errors: "Acquisition of new tasks was not successful." And some of them are stopping early with this message:
|
@peastman I think there is a problem with an intermediate server. I am looking into it. |
Progress seems to have slowed way down. Over the last two weeks, it has averaged less than 3500 calculations per day. In the most recent 24 hour period, it only did 837. |
Yeah, sorry about that. For the last month there's been slow throughput because of a drop in node availability.
David is also pushing to get some extra compute from Max-Planck clusters, courtesy of Bert de Groot's lab, so we may see some improvement in throughput this week. |
Nearly all the jobs I had running yesterday exited with errors:
Does something need to be fixed on the server? |
I think the server is being a bit overloaded at the moment. Let's wait a bit and see if it clears up. Some stuff is getting through, but with delays. |
We're almost there! PubChem set 6 has now completed its initial pass through the data. We'll need a little more time for error cycling, and then we can mark it as complete and release the dataset! |
And we are done! The release is at https://github.com/openmm/spice-dataset/releases/tag/1.0, including an HDF5 file with the most commonly used data fields. Congratulations everyone! I'm now training a model on the finished dataset. I should have a complete first draft of the paper ready to review in a few days. |
@peastman: For posterity (or potentially for the paper), I totaled up the number of core-hours consumed, and came up with 4,057,659 core-hours. Here's the script I used to pull this information from QCPortal. It ran overnight.

import qcportal as ptl
client = ptl.FractalClient()
client
import yaml
with open('config.yaml') as input:
    config = yaml.safe_load(input.read())
import numpy as np
wall_time = 0.0
for subset in config['subsets']:
    # Download the next subset.
    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    for row in ds.list_records().iloc:
        spec = row.to_dict()
        if spec['method'] == 'wb97m-d3bj':
            recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
            break
    nrecs = len(recs)
    for i in range(nrecs):
        rec = recs.iloc[i].record
        wall_time += rec.provenance.wall_time
print(f'{wall_time / 60 / 60} core-hours used')
|
Thanks, that's useful information to have. It means an average of just under four core-hours per conformation, which sounds about right. Are you sure |
Here's an example of the
What I'm reporting here ( |
Whoops, I realize I neglected to multiply by nthreads. We may be off by a factor of 10x. I'll rerun the corrected script. |
Update: I had run the correct script and the measure should be accurate. I had just pasted the wrong script here.

import qcportal as ptl
client = ptl.FractalClient()
client
import yaml
with open('config.yaml') as input:
    config = yaml.safe_load(input.read())
import numpy as np
wall_time = 0.0
for subset in config['subsets']:
    # Download the next subset.
    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    for row in ds.list_records().iloc:
        spec = row.to_dict()
        if spec['method'] == 'wb97m-d3bj':
            recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
            break
    nrecs = len(recs)
    for i in range(nrecs):
        rec = recs.iloc[i].record
        try:
            wall_time += rec.provenance.wall_time * rec.provenance.nthreads
        except AttributeError as e:
            pass
        if i % 100 == 0:
            print(f'{wall_time / 60 / 60} core-hours used')
    print(f'{wall_time / 60 / 60} core-hours used')
print(f'{wall_time / 60 / 60} core-hours used')

Apologies for the confusion. |
That's probably the best estimate we can get, but keep in mind it's only an estimate. Wall clock time can be influenced by a lot of things. For example, jobs that ended up failing may have caused other jobs running at the same time to take longer to complete, so they're still influencing the results. |
Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?
The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).