complex compositions take very long to featurize #126

Open
Pepe-Marquez opened this issue Jan 10, 2023 · 4 comments

Comments

@Pepe-Marquez

I would like to run modnet on a dataset that contains compositions with very complex stoichiometries. One example would be C100H3815Br21I279N2185Pb100.

The following code reproduces the problem:

import pandas as pd
from modnet.models import MODNetModel
from modnet.preprocessing import MODData
from pymatgen.core import Composition

data = {'composition': ['Cu2ZnSnSe4', 'Cu2ZnSnS4', 'CsPbI3', 'CH3NH3PbI3', 'C100H3815Br21I279N2185Pb100' ],
        'target': [1.0, 1.5, 1.78, 1.6, 1.63]}
df_simple = pd.DataFrame(data)
df_simple["composition"] = df_simple["composition"].map(Composition)

data = MODData(
    materials=df_simple["composition"], # you can provide composition objects to MODData
    targets=df_simple["target"], # you can provide target values to MODData
    target_names=["target"],
)

data.featurize()

Am I doing something wrong here? Would there be a workaround to get these complex compositions running smoother through the featurizer?

Thanks!

@ml-evs
Collaborator

ml-evs commented Jan 10, 2023

Hi @Pepe-Marquez, my guess is that the pymatgen/matminer oxidation state solver is choking on that complex composition. By default, it allows every "site" of a particular species (not strictly sites in this case, but it is the same thing in practice) to have a different oxidation state compatible with its species, so it scales very poorly with the number of "sites".
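To give a rough sense of that scaling, here is a minimal sketch that counts how many distinct oxidation-state assignments exist for one element, assuming the assignments are enumerated as unordered combinations (multisets) of allowed states over its "sites". The function name `n_assignments` and the figure of 8 allowed states are illustrative assumptions, not values taken from pymatgen or matminer.

```python
from math import comb


def n_assignments(n_sites: int, n_states: int) -> int:
    """Count multisets of size n_sites drawn from n_states allowed oxidation states."""
    return comb(n_sites + n_states - 1, n_sites)


# Assumed (hypothetical) number of allowed oxidation states for one element.
N_STATES = 8

# Site counts: small formula units versus the N2185 term from the issue.
for n_sites in (1, 4, 100, 2185):
    print(f"{n_sites:>5} sites -> {n_assignments(n_sites, N_STATES):.3e} assignments")
```

Under these assumptions the count grows polynomially with a high power of the site count, which is why a formula with thousands of atoms of one element can take far longer than a reduced formula.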

You can customize the featurization pipeline to circumvent this. We have a specific workaround for structure featurizers, but not for composition only. I have prepared a hack below that disables the one featurizer that uses oxidation states... we are looking to optimise this process in the upcoming release, so keep an eye out!

import pandas as pd
from modnet.models import MODNetModel
from modnet.preprocessing import MODData
from modnet.featurizers.presets import CompositionOnlyMatminer2023Featurizer
from pymatgen.core import Composition

featurizer = CompositionOnlyMatminer2023Featurizer()
featurizer.composition_featurizers = [f for f in featurizer.composition_featurizers if f.__class__.__name__ != "IonProperty"]

data = {'composition': ['Cu2ZnSnSe4', 'Cu2ZnSnS4', 'CsPbI3', 'CH3NH3PbI3', 'C100H3815Br21I279N2185Pb100' ],
        'target': [1.0, 1.5, 1.78, 1.6, 1.63]}
df_simple = pd.DataFrame(data)
df_simple["composition"] = df_simple["composition"].map(Composition)

data = MODData(
    materials=df_simple["composition"], # you can provide composition objects to MODData
    targets=df_simple["target"], # you can provide target values to MODData
    target_names=["target"],
    featurizer=featurizer,
)

data.featurize()

This now runs in about 10 seconds on my laptop.
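If more than one featurizer needs to be disabled, the same filtering step can be wrapped in a small helper. This is only a sketch: `drop_featurizers` is a hypothetical name, and the stand-in classes below exist solely so the snippet runs without modnet or matminer installed. It assumes the preset exposes a `composition_featurizers` list, as in the snippet above.

```python
def drop_featurizers(preset, names):
    """Remove, in place, composition featurizers whose class name is in `names`."""
    preset.composition_featurizers = [
        f for f in preset.composition_featurizers
        if f.__class__.__name__ not in names
    ]
    return preset


# Stand-in classes (not the real matminer/modnet objects), for illustration only.
class IonProperty:
    pass


class ElementProperty:
    pass


class DummyPreset:
    def __init__(self):
        self.composition_featurizers = [IonProperty(), ElementProperty()]


preset = drop_featurizers(DummyPreset(), {"IonProperty"})
remaining = [f.__class__.__name__ for f in preset.composition_featurizers]
print(remaining)
```

With the real preset, the call would presumably be `drop_featurizers(CompositionOnlyMatminer2023Featurizer(), {"IonProperty"})` before passing the result to `MODData`.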

@ml-evs
Collaborator

ml-evs commented Jan 10, 2023

Some more background at #46 (that I had completely forgotten about)

@Pepe-Marquez Pepe-Marquez changed the title complex compositions takes very long to featurize complex compositions take very long to featurize Jan 10, 2023
@Pepe-Marquez
Author

This fixed the error for me. Thanks for the help! Happy to close if you think it's ready

@ml-evs
Collaborator

ml-evs commented Feb 27, 2023

> This fixed the error for me. Thanks for the help! Happy to close if you think it's ready

Awesome, thanks for letting us know. I think I'll actually keep it open until we fix it in the default preset.
