Dataset API #321
The basic hierarchy is that a dataset contains records, and each record contains properties. The base pydantic model for properties has 5 components that need to be defined; name, value, and units are pretty self-explanatory.
This base model itself validates a few of these fields.
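A minimal sketch of what this could look like, assuming pydantic v2 and pint for unit handling. Only name, value, and units are confirmed above; the remaining two fields (property_type, classification) and the specific validators are illustrative guesses, not the actual implementation:

```python
import numpy as np
import pint
from pydantic import BaseModel, ConfigDict, field_validator

ureg = pint.UnitRegistry()  # pint is an assumption; any units package works


class RecordProperty(BaseModel):
    """Sketch of the base property model; the two fields beyond
    name/value/units are hypothetical placeholders."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    name: str
    value: np.ndarray
    units: str
    property_type: str = "unknown"      # assumed component
    classification: str = "per_system"  # assumed component

    @field_validator("value", mode="before")
    @classmethod
    def _coerce_value(cls, v):
        # Plausible base-level validation: accept lists, coerce to numpy.
        return np.asarray(v)

    @field_validator("units")
    @classmethod
    def _check_units(cls, v: str) -> str:
        # Plausible base-level validation: units string must be parseable.
        ureg(v)
        return v
```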
Individual properties are children of this base class; in general, a user will only need to set the value, units, and name. E.g., to define atomic positions, the child class already has appropriate defaults defined. Additional model validators are included that further address the expected shape of the numpy array for value: specifically, Positions must be a 3d array, and shape[2] has to be 3. Note that while the name field is set to "positions" by default, it can be any string; this is especially important because Records may have multiple computed values corresponding to the same type of property (for example, Energies), and a unique name is required for each property. Several property models are already defined.
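A hedged sketch of such a child class, continuing from the base-model sketch above (the actual defaults and validator structure may differ):

```python
class Positions(RecordProperty):
    """Sketch: the user typically only supplies value and units; name
    defaults to "positions" but can be overridden with any string."""

    name: str = "positions"
    classification: str = "per_atom"  # assumed default

    @field_validator("value")
    @classmethod
    def _check_shape(cls, v: np.ndarray) -> np.ndarray:
        # Positions must be a 3d array of shape (n_configs, n_atoms, 3),
        # with x, y, z in the last dimension.
        if v.ndim != 3 or v.shape[2] != 3:
            raise ValueError(
                f"Positions must have shape (n_configs, n_atoms, 3); got {v.shape}"
            )
        return v


# e.g., the user only needs to provide value and units:
pos = Positions(value=np.zeros((1, 5, 3)), units="nanometer")
```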
Note that additional properties can be added via the RecordProperty class; this just requires more input from the user and will not have validation specific to the property itself (e.g., the more specific shape of Positions, where we know the property contains x, y, z values rather than some arbitrary-length vector). Final validation occurs at the level of the dataset, e.g., ensuring that at minimum atomic numbers, energy, and positions are provided; it also calls the validation functions in each record. This can also be used in a mode that allows appending of properties. In such cases, adding energies (with the same name) will append the numpy arrays (after first ensuring compatibility of shape and converting to consistent units).
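The append mode might behave roughly like the hypothetical helper below (where this logic lives, on the record, the dataset, or the property class, is not specified here; the pint-based conversion carries over from the sketches above):

```python
def append_property(existing: RecordProperty, new: RecordProperty) -> None:
    """Sketch: concatenate same-named properties along the config axis,
    after a shape check and conversion to consistent units."""
    if existing.name != new.name:
        raise ValueError("Can only append properties with the same name.")
    # All dimensions other than the one appended along must match.
    if existing.value.shape[1:] != new.value.shape[1:]:
        raise ValueError(
            f"Incompatible shapes: {existing.value.shape} vs {new.value.shape}"
        )
    # Convert incoming values to the units of the existing property.
    converted = ureg.Quantity(new.value, new.units).to(existing.units).magnitude
    existing.value = np.concatenate([existing.value, converted], axis=0)
```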
In discussions with @MarshallYan, the idea came up of revamping the curation workflow to make it easier to add a dataset. While parsing other file formats is always going to be an ad hoc operation, we can probably ditch the base class in favor of an API that populates a pydantic class.
To do this, we could likely have a few "base" pydantic classes. In general, every entry is going to require the same basic info as the properties above (name, value, and units).
However, we would likely want different validation for each dataset property (per_atom vs. per_system). We could also subclass very specific cases, such as "geometry", which would have shape (n_configs, n_atoms, 3) and units compatible with distance (both of which we can validate); "energy", "forces", "atomic_numbers", "charges", "total_charge", etc. can all be validated in the same way. This could help avoid someone giving the wrong dimensions (an especially easy mistake when dealing with a system with only a single config).
We can still use the generic per_atom and per_system classes for types that don't fit within the defined ones. Fields in a Metadata class wouldn't require units and would need flexible value types (str, float, int, numpy.array). The classes themselves (whether they inherit from per_atom or per_system, or whether we just set value_shape_definition in each type of class) will also contain the info about how to tag this data for ease of reading in, without needing to set it in an array as is currently done.
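A hedged sketch of what these classes could look like; the class names (PerAtomProperty, Geometry, Metadata), the classification tag, and the validator details are all illustrative assumptions:

```python
from typing import Union

import numpy as np
import pint
from pydantic import BaseModel, ConfigDict, field_validator

ureg = pint.UnitRegistry()


class PerAtomProperty(BaseModel):
    """Sketch: per-atom data, expected shape (n_configs, n_atoms, m)."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    name: str
    value: np.ndarray
    units: str
    classification: str = "per_atom"  # tag used when reading the data in

    @field_validator("value")
    @classmethod
    def _check_ndim(cls, v: np.ndarray) -> np.ndarray:
        if v.ndim != 3:
            raise ValueError(f"per_atom values must be 3d; got ndim={v.ndim}")
        return v


class Geometry(PerAtomProperty):
    """Sketch of a very specific subclass: shape (n_configs, n_atoms, 3)
    and distance-compatible units, both of which are validated."""

    name: str = "geometry"

    @field_validator("value")
    @classmethod
    def _check_xyz(cls, v: np.ndarray) -> np.ndarray:
        if v.ndim != 3 or v.shape[2] != 3:
            raise ValueError(f"geometry needs x, y, z last; got shape {v.shape}")
        return v

    @field_validator("units")
    @classmethod
    def _check_distance(cls, v: str) -> str:
        if ureg(v).dimensionality != ureg.nanometer.dimensionality:
            raise ValueError(f"geometry units must be a distance; got '{v}'")
        return v


class Metadata(BaseModel):
    """Sketch: metadata needs no units and allows flexible value types."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    name: str
    value: Union[str, float, int, np.ndarray]
```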
In terms of assembling a system, we could simply add a property to a dictionary (I think we would wrap this in a class to make the interface easier), e.g., a class like the sketch below.
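For example, a hypothetical sketch (all names are illustrative; this reuses the append_property helper and the Positions class sketched earlier):

```python
class Record:
    """Sketch: wraps the property dictionary with a friendlier interface."""

    def __init__(self, name: str):
        self.name = name
        self.properties: dict = {}

    def add_property(self, prop, append: bool = False) -> None:
        if prop.name in self.properties:
            if not append:
                raise ValueError(f"Property '{prop.name}' already exists.")
            # In append mode, same-named arrays are concatenated
            # (see the append_property sketch above).
            append_property(self.properties[prop.name], prop)
        else:
            self.properties[prop.name] = prop

    def validate(self) -> None:
        # Hook for the dataset to call during final validation, e.g.,
        # requiring atomic numbers, energy, and positions at a minimum.
        required = {"atomic_numbers", "energy", "positions"}
        missing = required - set(self.properties)
        if missing:
            raise ValueError(f"Record '{self.name}' is missing: {sorted(missing)}")


# Properties can then be added piecemeal as different files are parsed:
record = Record(name="mol_0")
record.add_property(Positions(value=np.zeros((1, 5, 3)), units="nanometer"))
```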
This would allow asynchronous assembly of a dataset, which is often necessary since data is spread among different files, and would give us tools to validate more easily during construction.