Property calculation toolkit from the Open Forcefield Consortium.
This toolkit provides an API for storing, manipulating, and computing measured physical properties from simulation data.
Physical property measurements are measured properties of a substance that provide some information about the physical parameters that define the interactions within the substance. A physical property is defined by a combination of:
- A
Mixture
specifying the substance that the measurement was performed on - A
ThermodynamicState
specifying the thermodynamic conditions under which the measurement was performed - A
PhysicalProperty
is the physical property that was measured - A
MeasurementMethod
specifying the kind of measurement that was performed
An example of each:
Mixture
: a 0.8 mole fraction mixture of ethanol and waterThermodynamicState
: 298 kelvin, 1 atmospherePhysicalProperty
: mass densityMeasurementMethod
: vibrating tube method
We generally use the concept of a liquid or gas Mixture
, which is a subclass of Substance
.
A simple liquid has only one component:
liquid = Mixture()
liquid.addComponent('water')
A binary mixture has two components:
binary_mixture = Mixture()
binary_mixture.addComponent('water', mole_fraction=0.2)
binary_mixture.addComponent('methanol') # assumed to be rest of mixture if no mole_fraction specified
A ternary mixture has three components:
ternary_mixture = Mixture()
ternary_mixture.addComponent('ethanol', mole_fraction=0.2)
ternary_mixture.addComponent('methanol', mole_fraction=0.2)
ternary_mixture.addComponent('water')
The infinite dilution of one solute within a solvent or mixture is also specified as a Mixture
, where the solute has zero mole fraction:
infinite_dilution = Mixture()
infinite_dilution.addComponent('phenol', impurity=True) # infinite dilution; one copy only of the impurity
infinite_dilution.addComponent('water')
For testing against synthetic data, we also make use of isolatedMolecule
objects (also subclasses of Substance
) that represent a single molecule:
phenol = IsolatedMolecule(iupac='phenol')
ethane = IsolatedMolecule(smiles='CC')
You can iterate over the components in a mixture:
for component in mixture.components:
print (component.iupac_name, component.mole_fraction)
retrieve a component by name:
component = mixture.components['ethanol']
or get the number of components in a mixture:
ncomponents = mixture.ncomponents
or check if a component is an impurity:
if component.impurity == True:
...
A ThermodynamicState
specifies a combination of thermodynamic parameters (e.g. temperature, pressure) at which a measurement is performed.
from simtk import unit
thermodynamic_state = ThermodynamicState(pressure=500*unit.kilopascals, temperature=298.15*unit.kelvin)
We use the simtk.unit
unit system from OpenMM for units (though we may later migrate to pint
for portability).
A MeasurementMethod
subclass has information specific to the particular method used to measure a property (such as experimental uncertainty guidance).
Some examples:
FlowCalorimetry
forHeatCapacity
orExcessMolarEnthalpy
VibratingTubeMethod
forMassDensity
A MeasuredPhysicalProperty
is a combination of Substance
, ThermodynamicState
, and a unit-bearing measured property value
and uncertainty
:
# Define mixture
mixture = Mixture()
mixture.addComponent('water', mole_fraction=0.2)
mixture.addComponent('methanol')
# Define thermodynamic state
thermodynamic_state = ThermodynamicState(pressure=500*unit.kilopascals, temperature=298.15*unit.kelvin)
# Define measurement
measurement = ExcessMolarEnthalpy(substance, thermodynamic_state, value=83.3863244*unit.kilojoules_per_mole, uncertainty=0.1220794866*unit.kilojoules_per_mole)
The various properties are all subclasses of MeasuredPhysicalProperty
and generally follow the <ePropName/>
ThermoML tag names.
Some examples of MeasuredPhysicalProperty
:
MassDensity
- mass densityExcessMolarEnthalpy
- excess partial apparent molar enthalpyHeatCapacity
- molar heat capacity at constant pressure
There are also some examples of measured physical properties we use exclusively for testing against synthetic data:
MeanPotentialEnergy
- mean potential energy of a systemBondMoment
- then
th moment (meanE[x]
if n==1, or central momentE[(x-E[x])^n]
if n >= 2) of a specified bond of an isolated moleculeAngleMoment
- then
th moment (meanE[x]
if n==1, or central momentE[(x-E[x])^n]
if n >= 2) of a specified angle of an isolated moleculeTorsionMoment
- then
th circular moment (E[e^{i*n*phi}]
) of a specified torsion of an isolated molecule
For example:
molecule = IsolatedMolecule(smiles=smiles)
thermodynamic_state = ThermodynamicState(temperature=300*unit.kelvin)
mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.12*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')
bond_variance = BondMoment(molecule, thermodynamic_state, value=0.05*unit.angstroms**2, uncertainty=0.02*unit.angstroms**2, moment=2, smirks='[#6:1]-[#6:2]')
angle_average = AngleMoment(molecule, thermodynamic_state, value=20*unit.degrees, uncertainty=0.05*unit.degrees, moment=1, smirks='[#6:1]-[#6:2]-[#6:3]')
torsion_moment = TorsionMoment(molecule, thermodynamic_state, value=(0.5, 0.3), uncertainty=(0.05, 0.03), moment=1, smirks='[#6:1]-[#6:2]-[#6:3]-[#6:4]')
QUESTIONS
- Is it useful to average over any bonds that match the SMARTS strings? How would you even construct those SMARTS strings algorithmically for the molecules in the set, to avoid matching anything else?
- Note that we need the first noncentral moment (average), so I had an exception where we don't compute the central moments for
moment == 1
- Do we really want the variance (second central moment) instead of the standard deviation?
A roadmap of physical properties to be implemented is available. Please raise an issue if your physical property of interest is not listed!
Each MeasuredPhysicalProperty
has several properties:
.substance
- theMixture
for which the measurement was made.thermodynamic_state
- theThermodynamicState
at which the measurement was made.measurement_method
- theMeasurementMethod
used to measure the physical property.value
- the unit-bearing measurement value.uncertainty
- the standard uncertainty of the measurement.reference
- the literature reference (if available) for the measurement.DOI
- the literature reference DOI (if available) for the measurement
The value, uncertainty, reference, and DOI do not necessarily need to be defined for a dataset in order for property calculations to be performed.
A PhysicalPropertyDataset
is a collection of MeasuredPhysicalProperty
objects that are related in some way.
dataset = PhysicalPropertyDataset([measurement1, measurement2])
The dataset is iterable:
dataset = PhysicalPropertyDataset([measurement1, measurement2])
for measurement in dataset:
print measurement.value
and has accessors to retrieve DOIs and references associated with measurements in the dataset:
# Print the DOIs associated with this dataset
print(dataset.DOIs)
# Print the references associated with this dataset
print(dataset.references)
For convenience, you can retrieve the dataset as a pandas DataFrame
:
dataset.to_pandas()
A ThermoMLDataset
object represents a physical property dataset stored in the IUPAC-standard ThermoML for specifying thermodynamic properties in XML format.
ThermoMLDataset
is a subclass of PhysicalPropertyDataset
, and provides the same API interface (in addition to some ThermoML-specfic methods).
Direct access to the NIST ThermoML Archive is supported for obtaining physical property measurements in this format directly from the NIST TRC repository.
For example, to retrieve the ThermoML dataset that accompanies this paper, we can simply use the DOI 10.1016/j.jct.2005.03.012
as a key for creating a PhysicalPropertyDataset
subclassed object from the ThermoML Archive:
dataset = ThermoMLDataset(keys='10.1016/j.jct.2005.03.012')
You can also specify multiple ThermoML Archive keys to create a dataset from multiple ThermoML files:
thermoml_keys = ['10.1021/acs.jced.5b00365', '10.1021/acs.jced.5b00474']
dataset = ThermoMLDataset(keys=thermoml_keys)
It is also possible to specify ThermoML datasets housed at other locations, such as
dataset = ThermoMLDataset(url='http://openforcefieldgroup.org/thermoml-datasets')
or
dataset = ThermoMLDataset(url='file:///Users/choderaj/thermoml')
or
dataset = ThermoMLDataset(keys=['10.1021/acs.jced.5b00365', '10.1021/acs.jced.5b00474'], url='http://openforcefieldgroup.org/thermoml-datasets')
or from ThermoML and a different URL:
dataset = ThermoMLDataset(keys=thermoml_keys)
dataset.retrieve(keys=local_keys, url='http://openforcefieldgroup.org/thermoml-datasets')
You can see which DOIs contribute to the current ThermoMLDataset
with the convenience functions:
print(dataset.DOIs)
NIST has compiled a JSON frame of corrections to uncertainties.
These can be used to update or correct data uncertainties and discard outliers using applyNISTUncertainties()
:
# Modify uncertainties according to NIST evaluation
dataset.applyNISTUncertainties(nist_uncertainties, adjust_uncertainties=True, discard_outliers=True)
TODO:
- We should merge any other useful parts parts of the ThermoPyL API in here.
In future, we will add interfaces to other online datasets, such as
- BindingDB for retrieving host-guest binding affinity datasets.
The PropertyEstimator
class creates objects that handle property estimation of all of the properties in a dataset, given a set or sets of parameters.
The implementation will isolate the user from whatever backend (local machine, HPC cluster, XSEDE resources, Amazon EC2) is being used to compute the properties, as well as whether new simulations are being launched and analyzed or existing simulation data is being reweighted.
Different backends will take different optional arguments, but here is an example that will launch and use 10 worker processes on a cluster:
estimator = PropertyEstimator(nworkers=10) # NOTE: multiple backends will be supported in the future
computed_properties = estimator.computeProperties(dataset, parameter_sets)
Here, dataset
is a PhysicalPropertyDataset
or subclass, and parameter_sets
is a list containing ForceField
objects used to parameterize the physical systems in the dataset.
This can be a single parameter set or multiple (usually closely related) parameter sets.
PropertyEstimator.computeProperties(...)
returns a list of ComputedPhysicalProperty
objects that provide access to several pieces of information:
property.value
- the computed property value, with appropriate unitsproperty.uncertainty
- the statistical uncertainty in the computed propertyproperty.parameters
- a reference to the parameter set used to compute this propertyproperty.property
- a reference to the correspondingMeasuredPhysicalProperty
this property was computed for
TODO:
- How should we instruct
computeProperties()
to provide gradients (or components of gradients)?
This API can be extended in the future to provide access to the simulation data used to estimate the property, such as
# Attach to my compute and storage resources
estimator = PropertyEstimator(...)
# Estimate some properties
computed_properties = estimator.computeProperties(dataset, parameters)
# Get statistics about simulation data that went into each property
for property in computed_properties:
# Get statistics about simulation data that was reweighted to estimate this property
for simulation in property.simulations:
print('The simulation was %.3f ns long' % (simulation.length / unit.nanoseconds))
print('The simulation was run at %.1f K and %.1f atm' % (simulation.thermodynamic_state.temperature / unit.kelvin, simulation.thermodynamic_state.pressure / unit.atmospheres))
# Get the ParameterSet that was used for this simulation
parameters = simulation.parameters
# what else do you want...?
In future, we will want to use a parallel key/value database like cassandra to store simulations, along with a distributed task management system like celery with redis.
In this example, datasets are retrieved from the ThermoML and filtered to retain certain properties. The corresponding properties for a given parameter set filename are then computed for a SMIRFF parameter set and printed.
# Define the input datasets from ThermoML
thermoml_keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(thermoml_keys)
# Filter the dataset to include only molar heat capacities measured between 280-350 K
dataset.filter(ePropName='Excess molar enthalpy (molar enthalpy of mixing), kJ/mol') # filter to retain only this property name
dataset.filter(VariableType='eTemperature', min=280*unit.kelvin, max=350*kelvin) # retain only measurements with `eTemperature` in specified range
# Load an initial parameter set
parameter_set = [ SMIRFFParameterSet('smarty-initial.xml') ]
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=10) # NOTE: multiple backends will be supported in the future
computed_properties = estimator.computeProperties(dataset, parameter_set)
# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
property_unit = measured.value.unit
print('%24s : experiment %8.3f (%.3f) | calculated %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))
TBD