Dataset API #321

Open · Tracked by #334
chrisiacovella opened this issue Nov 20, 2024 · 1 comment
chrisiacovella commented Nov 20, 2024

In discussions with @MarshallYan, the idea came up of revamping the curation code to make it easier to add a dataset. While parsing other file formats will remain a somewhat ad hoc operation, we can probably ditch the base class in favor of an API that populates pydantic classes.

To do this we could likely have a few "base" pydantic classes. In general, every entry is going to require the following info:

class DatasetProperty(BaseModel):
    # note: pydantic needs arbitrary types enabled for numpy/unit values
    name: str
    value: numpy.ndarray
    unit: unit.Quantity
    # "per_atom" or "per_system"; we can either infer this from
    # the class type or set it manually
    value_shape_definition: str

However, we would likely want different validation for each dataset property (per_atom vs. per_system). We could also subclass very specific cases, such as "geometry", which would have shape (n_configs, n_atoms, 3) and units compatible with distance (both of which we can validate); "energy", "forces", "atomic_numbers", "charges", "total_charge", etc. can all get the same treatment. This could help avoid someone providing the wrong dimensions (an especially easy mistake when dealing with a system with only a single config). A sketch of such a subclass follows.
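As a rough illustration of the kind of per-property validation this enables, here is a sketch of an "energy" subclass built on the DatasetProperty sketch above, using pydantic v2 and pint-style units (the class name, expected shape, and validator body are illustrative assumptions, not a final design):

from openff.units import unit
from pydantic import model_validator

class Energy(DatasetProperty):
    name: str = "energy"
    value_shape_definition: str = "per_system"

    @model_validator(mode="after")
    def _check_shape_and_units(self):
        # a per_system energy is expected to have shape (n_configs, 1)
        if self.value.ndim != 2 or self.value.shape[1] != 1:
            raise ValueError(f"expected shape (n_configs, 1), got {self.value.shape}")
        # units must be dimensionally compatible with an energy (e.g., kJ/mol)
        if self.unit.dimensionality != (unit.kilojoule / unit.mole).dimensionality:
            raise ValueError(f"{self.unit} is not compatible with energy units")
        return self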

We can still use the generic per_atom and per_system classes for types that don't fit within the defined ones. Fields in a Metadata class wouldn't require a unit and would need flexible value types (str, float, int, numpy.ndarray); see the sketch after this paragraph. The classes themselves (whether they inherit from per_atom or per_system, or whether we just set value_shape_definition in each type of class) will also contain the info about how to tag this data for ease of reading in, without needing to set it in an array as is currently done.
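For example, a generic metadata field might look something like this (a minimal sketch; the Union of value types follows the description above):

from typing import Union
import numpy
from pydantic import BaseModel, ConfigDict

class Metadata(BaseModel):
    # numpy arrays are not native pydantic types
    model_config = ConfigDict(arbitrary_types_allowed=True)

    name: str
    # no unit field; metadata values can take several types
    value: Union[str, float, int, numpy.ndarray]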

In terms of assembling a system, we could simply add a property to a dictionary (I'd wrap this in a class to make the interface easier). E.g., a class like:

class DataSet:
    def __init__(self, dataset_name: str):
        self.dataset_name = dataset_name
        self.records = {}

    # add a record to the DataSet by giving it a unique name
    def register_record(self, record_name: str):
        # add some code to see if a record already exists and alert if duplicate
        self.records[record_name] = {}

    # add a property to a record; just a wrapper to make it easier
    # to access the underlying dictionary
    def register_property(self, record_name: str, property: DatasetProperty):
        self.records[record_name][property.name] = property

    def save_hdf5_file(self, filename: str):
        # call a validate function that loops over the properties in each
        # record, checking that things like the number of configs are
        # consistent across all entries
        self.validate_input()

        # convert units to our target output units
        self.convert_units()

        # call the hdf5 saving function
        self.save_hdf5(filename)

    # add some accessor functions to grab individual properties for manipulation

This would allow asynchronous assembly of a dataset, which is often necessary as data is spread among different files, and would give us tools to validate more easily during construction. A usage sketch follows.
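Usage might then look roughly like this (a sketch built on the DataSet and DatasetProperty classes above, so it assumes those stubs are filled in; record names and shapes are made up, and the unit field is treated loosely as a pint-style unit for readability):

import numpy as np
from openff.units import unit

ds = DataSet(dataset_name="my_new_dataset")
ds.register_record("mol_00001")

# properties can be registered as they are parsed, possibly from different files
ds.register_property(
    "mol_00001",
    DatasetProperty(
        name="geometry",
        value=np.zeros((10, 5, 3)),  # (n_configs, n_atoms, 3)
        unit=unit.nanometer,
        value_shape_definition="per_atom",
    ),
)
ds.register_property(
    "mol_00001",
    DatasetProperty(
        name="energy",
        value=np.zeros((10, 1)),  # (n_configs, 1)
        unit=unit.kilojoule / unit.mole,
        value_shape_definition="per_system",
    ),
)

# validate, convert units, and write out (requires the stubbed methods above)
ds.save_hdf5_file("my_new_dataset.hdf5")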

chrisiacovella mentioned this issue Jan 17, 2025

chrisiacovella (Member Author) commented:
The basic hierarchy is that a dataset (SourceDataset) is composed of records (Records), and records contain properties (e.g., Positions, Energies, etc.).

The base pydantic model for properties has 5 components that need to be defined:

class RecordProperty(CurateBase):
    name: str
    value: NdArray
    units: unit.Unit
    classification: PropertyClassification
    property_type: Union[PropertyType, str]

Name, value, and units are pretty self-explanatory.

  • classification refers mainly to "per_atom" and "per_system" (other allowed keywords are "atomic_numbers" and "meta_data").
  • property_type is just a tag that tells us what "type" of unit a property carries (e.g., length, energy, force, etc.). This is in general not something a user will change, as it is automatically set in the child class that defines a specific type of property. Using these keywords seemed easier than the pint approach (for example, force being defined as [energy]/[length]). The global unit system uses the same keywords to map a property_type to a specific unit; a sketch of these enums and the mapping follows this list.
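A rough sketch of what those pieces might look like (the enum members listed and the global_unit_system name are assumptions inferred from the description above, not the actual source):

from enum import Enum
from openff.units import unit

class PropertyClassification(str, Enum):
    per_atom = "per_atom"
    per_system = "per_system"
    atomic_numbers = "atomic_numbers"
    meta_data = "meta_data"

class PropertyType(str, Enum):
    length = "length"
    energy = "energy"
    force = "force"
    charge = "charge"

# the global unit system maps each property_type keyword to a target unit
global_unit_system = {
    PropertyType.length: unit.nanometer,
    PropertyType.energy: unit.kilojoule / unit.mole,
    PropertyType.force: unit.kilojoule / (unit.mole * unit.nanometer),
    PropertyType.charge: unit.elementary_charge,
}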

This base model performs a few validations:

  • The shape of the numpy array given for value. For atomic_numbers we know that this should be 2d and that the last index should always be 1 (expected shape [n_atoms, 1]); a per_system property needs at least 2 dimensions ([n_configs, -1]), and a per_atom property at least 3 dimensions ([n_configs, n_atoms, -1]). Since at this stage we don't necessarily know how many atoms or configs a given property should have, we can't validate those yet.
  • Units compatibility: it uses property_type to look up the units in the global unit system and then checks that they are compatible. A rough sketch of both checks follows this list.
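In pydantic v2 terms, those two checks might look roughly like this (a sketch, not the actual source; it assumes the fields shown above plus the enums and global_unit_system mapping sketched earlier):

from pydantic import model_validator

class RecordProperty(CurateBase):
    # ... fields as defined above ...

    @model_validator(mode="after")
    def _validate_shape_and_units(self):
        # shape checks: only what can be known without the rest of the record
        if self.classification == PropertyClassification.atomic_numbers:
            if self.value.ndim != 2 or self.value.shape[1] != 1:
                raise ValueError(f"atomic_numbers expects [n_atoms, 1], got {self.value.shape}")
        elif self.classification == PropertyClassification.per_system:
            if self.value.ndim < 2:
                raise ValueError("per_system values need at least 2 dimensions: [n_configs, -1]")
        elif self.classification == PropertyClassification.per_atom:
            if self.value.ndim < 3:
                raise ValueError("per_atom values need at least 3 dimensions: [n_configs, n_atoms, -1]")

        # units: must be compatible with the global unit for this property_type
        expected = global_unit_system[self.property_type]
        if self.units.dimensionality != expected.dimensionality:
            raise ValueError(f"{self.units} is not compatible with {expected}")
        return self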

Individual properties are children of this base class; in general, a user will only need to set the value, units, and name. E.g., to define atomic positions, the child class already has classification and property_type set to the appropriate values:

class Positions(RecordProperty):
    name: str = "positions"
    value: NdArray
    units: unit.Unit
    classification: PropertyClassification = PropertyClassification.per_atom
    property_type: PropertyType = PropertyType.length

While not shown here, additional model validators are included that further address the expected shape of the numpy array for value; specifically, we know that Positions must be a 3d array and that shape[2] has to be 3 (a sketch of such a validator follows). Note that while the name field is set to "positions" by default, it can be any string; this is especially important because Records may have multiple computed values corresponding to the same type of property (for example, Energies), and a unique name is required for each property.
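For Positions, that extra validator might look something like this (again a sketch, not the actual source):

from pydantic import model_validator

class Positions(RecordProperty):
    # ... fields as shown above ...

    @model_validator(mode="after")
    def _validate_positions_shape(self):
        # positions must be 3d, with x, y, z in the last dimension
        if self.value.ndim != 3 or self.value.shape[2] != 3:
            raise ValueError(f"Positions expects [n_configs, n_atoms, 3], got {self.value.shape}")
        return self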

Current property models defined are:

  • AtomicNumbers
  • Positions
  • Energies
  • Forces
  • PartialCharges
  • TotalCharge
  • SpinMultiplicities
  • DipoleMoment
  • DipoleMomentScalar
  • QuadrupoleMoment
  • OctupoleMoment
  • MetaData

Note that additional properties can be added via the RecordProperty class directly; this just requires more input from the user and will not have validation specific to the property itself (e.g., the more specific shape of Positions, where we know the property contains x, y, z values, not some arbitrary-length vector).
When adding a property to a record, the record checks that a property with the same value of name has not already been entered (this name is used as the key for storing properties in the record and also for indexing within the generated hdf5 files). This prevents properties from being accidentally overwritten; a sketch of this guard follows. Records also provide validation functions that can be called directly to ensure all properties have the same n_configs and n_atoms.
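The duplicate-name guard might look roughly like this (the Record internals and method name here are assumptions based on the description, not the actual source):

class Record:
    def __init__(self, name: str):
        self.name = name
        self.properties = {}

    def add_property(self, prop: RecordProperty):
        # the property name doubles as the dict key and as the hdf5 key,
        # so refuse to silently overwrite an existing property
        if prop.name in self.properties:
            raise ValueError(f"Property {prop.name!r} already exists in record {self.name!r}.")
        self.properties[prop.name] = prop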

Final validation occurs at the level of the dataset, e.g., ensuring that, at minimum, atomic numbers, energies, and positions are provided; it also calls the validation functions in each record.

This can also be used in a mode that allows appending of properties. In that case, adding, say, energies with the same name will append the numpy arrays (after first ensuring shape compatibility and converting to consistent units); a sketch of that logic follows.
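A sketch of that append logic, assuming the Record layout from the previous sketch (the function name and the concatenation axis are illustrative):

import numpy as np

def append_property(record, prop):
    existing = record.properties.get(prop.name)
    if existing is None:
        record.properties[prop.name] = prop
        return
    # convert the incoming values to the units already stored
    converted = (prop.value * prop.units).to(existing.units).magnitude
    # configs are stacked along axis 0, so all other dimensions must match
    if converted.shape[1:] != existing.value.shape[1:]:
        raise ValueError("incompatible shapes; cannot append")
    existing.value = np.concatenate([existing.value, converted], axis=0)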
