You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm thinking of switching to a different design pattern to make the implementation of new featurizers, especially based on fragments, graphs, and building blocks, easier.
I started introducing more and more conditionals on the input types for featurizers, be it graphs/molecules/structures which make the code really complicated -- especially if we also want to cache as much as possible.
All of this would be much easier if the inputs for the featurizers would always be MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor and they typically have (memoized) methods for computing graph/fragments/hashes/...
If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, the featurizers could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e. this logic is only implemented in one place).
In the long run, this might also make the implementation of the datasets easier as, again, the logic is only on the structure object and not partially re-implemented on the datasets.
Advantages
Fewer conditionals for making the control flow in featurizers dependent on the input type
Less logic for caching --- the input objects can cache it
Common design pattern with mofchecker and moffragmentor. Could, perhaps, be factored out in its very thin own library to just provide the API (to be seen)
Multiple featurizer would also no longer need to handle so much logic about primitive structures
We could relatively easily serialize fragments and features in the same workflow without computational overhead
Featurizers such as RACs would no longer need to have their own nearest-neighbor search thing
There is no need anymore for a MOF base featurizer.
Disadvantages
We would need a very thin wrapper for interoperability with matminer, i.e., we would basically always extract the structure attribute and then pass this to pymatgen
It would be a breaking change for the current API: need to change all docs, tutorials and the examples in the paper (however, doing it now might be a good time as there is not that much use yet)
BU featurizer would need to follow a different pattern and would still, in some way, need annotation on what type the featurizers extract
BU featurizer seems still difficult to couple with MultipleFeaturizer with mixed types. Probably, the recommended usage pattern would be to use a MultipleFeaturizer around BUFeaturizers if needed and the BU featurizer always only operates on one type which we could then easily extract and pass to the featurizer. Caching of the fragments is no problem as it is handled by the MOF object
many featurizers do not only work on MOFs. In this sense the naming (also of the library) is misleading
The featurizers would have the following signatures
classGraphFeaturizer:
deffeaturize(self, mof):
structure_graph=mof.structure_graphreturnself._featurize(structure_graph)
classStructureFeaturizer:
deffeaturize(self, mof):
structure=mof.structurereturnself._featurize(structure)
MatminerFeaturizer=StructureFeaturizerclassSiteFeaturizer:
deffeaturize(self, mof, i):
structure=mof.structurereturnself._featurize(structure, structuregraph, i)
classBUFeaturizer:
def__init__(self, featurizer):
ifisinstance(featurizer, MultiFeaturizer):
raiseValueError("featurizer must be a single featurizer")
deffeaturize(self, mof):
fragments=mof.fragments
In the last case, for the BUFeaturizer, the featurizer can inspect if it needs to get the molecules/structures/graphs from the fragments and then call the _featurize method of the featurizer passed in the constructor.
The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.
Also, if I'd do something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]) it should work without problems as all featurizers always accept MOF objects.
also, testing the implementation now, another benefit seems to be that I no longer need to think about the primitive option. From files, this is the default. However, if users already have pymatgen.core.Structure objects we assume that they know what they do
I'm thinking of switching to a different design pattern to make the implementation of new featurizers, especially based on fragments, graphs, and building blocks, easier.
I started introducing more and more conditionals on the input types for featurizers, be it graphs/molecules/structures which make the code really complicated -- especially if we also want to cache as much as possible.
All of this would be much easier if the inputs for the featurizers would always be
MOF
objects. I already started implementingMOF
objects formofchecker
andmoffragmentor
and they typically have (memoized) methods for computing graph/fragments/hashes/...If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, the featurizers could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e. this logic is only implemented in one place).
In the long run, this might also make the implementation of the datasets easier as, again, the logic is only on the structure object and not partially re-implemented on the datasets.
Advantages
mofchecker
andmoffragmentor
. Could, perhaps, be factored out in its very thin own library to just provide the API (to be seen)MOFBBs
objectslru_cache
anymoreRACs
would no longer need to have their own nearest-neighbor search thingDisadvantages
matminer
, i.e., we would basically always extract the structure attribute and then pass this to pymatgenMultipleFeaturizer
with mixed types. Probably, the recommended usage pattern would be to use a MultipleFeaturizer around BUFeaturizers if needed and the BU featurizer always only operates on one type which we could then easily extract and pass to the featurizer. Caching of the fragments is no problem as it is handled by the MOF objectUsage example
Implementation idea
The featurizers would have the following signatures
In the last case, for the
BUFeaturizer
, the featurizer can inspect if it needs to get the molecules/structures/graphs from the fragments and then call the_featurize
method of the featurizer passed in the constructor.The
MOFMultipleFeaturizer
should probably always loop over structures forfeaturize_many
.Also, if I'd do something like
featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()])
it should work without problems as all featurizers always acceptMOF
objects.The text was updated successfully, but these errors were encountered: