
Encode molecules starting from their SMILES #26

Open
marcosbodio opened this issue Sep 9, 2024 · 7 comments

@marcosbodio

Hello, I would like to know if it is possible to use GraphMVP to encode molecules starting from their SMILES. I have read this issue, but it does not help much. I would be really grateful if you could provide some explanation, and ideally an example. Thank you!

@chao1224
Owner

Hi @marcosbodio,

Thank you for your question.

  • SMILES is a string representation of a molecule's topology, and it is not exactly the same as the 2D graph.
  • So if you want to reuse the implementation in the current repo as-is, the answer is no.
  • However, if we expand GraphMVP from 2D-3D to topology-geometry, then the answer is yes. What you need to do is replace the 2D graph + 2D GNN (GIN in our paper) with SMILES + BERT (or any other sequence encoder); see the sketch after this list.
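
A minimal sketch of that topology-geometry variant, assuming a generic BERT-style SMILES encoder loaded via Hugging Face transformers (the checkpoint name below is only a placeholder, not something shipped with this repo):

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder name: substitute any BERT-style encoder pretrained on SMILES.
tokenizer = AutoTokenizer.from_pretrained('some-smiles-bert')
encoder = AutoModel.from_pretrained('some-smiles-bert')
encoder.eval()

smiles = 'Cn1cnc(c1)C(=O)c1ccc(CN2[C@H](Cc3ccccn3)C(=O)Nc3cc(Cl)ccc3C2=O)cc1'
tokens = tokenizer(smiles, return_tensors='pt')
with torch.no_grad():
    hidden = encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)
embedding = hidden.mean(dim=1)  # mean-pool token states into one molecule vector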

@marcosbodio
Author

Hi @chao1224, thank you for your answer. I see that Table 5 in your paper lists results on DTA tasks with Davis and KIBA. These datasets contain SMILES of molecules, so how did you use GraphMVP (or GraphMVP-G, GraphMVP-C) on them? It would be very useful to see the code, because that would clarify the proper way of using your model starting from the SMILES of a molecule.

@chao1224
Owner

Hi @marcosbodio,

Sure, you can check this Python script; specifically, this line assigns which dataset to use.

@marcosbodio
Author

Hi @chao1224, I have looked at the script that you linked above, and I think it is for fine-tuning your model, which I would prefer to avoid.

I was hoping to use a checkpoint of your model, for example output/3D_hybrid_02_masking/GEOM_3D_nmol50000_nconf5_nupper1000/CL_1_VAE_1/6_51_10_0.1/0.3_EBM_dot_prod_0.1_normalize_l2_detach_target_2_100_0/pretraining_model.pth from GraphMVP_simple_features_for_classification.zip (shared here).

I wonder if I could do something like this:

import torch
from rdkit import Chem
from rdkit.Chem.rdDistGeom import EmbedMolecule

from src_classification.GEOM_dataset_preparation import mol_to_graph_data_obj_simple_3D

smiles = 'Cn1cnc(c1)C(=O)c1ccc(CN2[C@H](Cc3ccccn3)C(=O)Nc3cc(Cl)ccc3C2=O)cc1'
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)  # explicit hydrogens give a more sensible 3D embedding
assert EmbedMolecule(mol=mol) != -1  # EmbedMolecule returns -1 if embedding fails
data = mol_to_graph_data_obj_simple_3D(mol)

and then feed data to the model loaded from the checkpoint to compute an embedding of the SMILES. What do you think?

@chao1224
Owner

chao1224 commented Sep 20, 2024

Hi @marcosbodio,

Yes, I think this is right if you want to use the 3D representation.

  1. When we create the checkpoints, we save the following modules (code):
            saver_dict = {
                'model': molecule_model_2D.state_dict(),
                'model_3D': molecule_model_3D.state_dict(),
                'AE_2D_3D_model': AE_2D_3D_model.state_dict(),
                'AE_3D_2D_model': AE_3D_2D_model.state_dict(),
            }
  2. What you wrote above can be fed into model_3D.
  3. If you only want to use the 2D checkpoint, which is model above, then you can follow this pseudocode:
from rdkit import Chem
# mol_to_graph_data_obj_simple comes from this repo; see the function linked below.

smiles = 'Cn1cnc(c1)C(=O)c1ccc(CN2[C@H](Cc3ccccn3)C(=O)Nc3cc(Cl)ccc3C2=O)cc1'
mol = Chem.MolFromSmiles(smiles)

data = mol_to_graph_data_obj_simple(mol)  # 2D graph only, no conformer needed

where mol_to_graph_data_obj_simple is defined in this function.
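
Here is a hedged sketch of the 2D loading step, assuming the checkpoint follows the saver_dict layout above and that the repo's GIN encoder is exposed as a GNN class with the usual pretraining hyperparameters (5 layers, emb_dim 300); the module path, class name, and arguments below are assumptions to check against the actual code:

import torch
from torch_geometric.data import Batch
from torch_geometric.nn import global_mean_pool

from src_classification.models import GNN  # assumed module path and class name

model_2D = GNN(num_layer=5, emb_dim=300, JK='last', drop_ratio=0.0, gnn_type='gin')
checkpoint = torch.load('pretraining_model.pth', map_location='cpu')
model_2D.load_state_dict(checkpoint['model'])  # 'model' holds the 2D GNN weights
model_2D.eval()

batch = Batch.from_data_list([data])  # wrap the single molecule graph in a batch
with torch.no_grad():
    node_repr = model_2D(batch.x, batch.edge_index, batch.edge_attr)
embedding = global_mean_pool(node_repr, batch.batch)  # one vector per molecule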

@marcosbodio
Author

Hi @chao1224,

I have tried to load one of your model checkpoints, but I do not see model_3D. Here is what I did:

model_path = 'output/3D_hybrid_02_masking/GEOM_3D_nmol50000_nconf5_nupper1000/CL_1_VAE_1/6_51_10_0.1/0.3_EBM_dot_prod_0.1_normalize_l2_detach_target_2_100_0/pretraining_model.pth'
model = torch.load(f=model_path, map_location=torch.device('cpu'))
print(model.keys())  # expecting saver_dict keys like 'model', 'model_3D', ...
print('model_3D' in model)

where model_path comes from your file GraphMVP_simple_features_for_classification.zip (shared here).

The previous code prints the following:

odict_keys(['x_embedding1.weight', 'x_embedding2.weight', 'gnns.0.mlp.0.weight', 'gnns.0.mlp.0.bias', 'gnns.0.mlp.2.weight', 'gnns.0.mlp.2.bias', 'gnns.0.edge_embedding1.weight', 'gnns.0.edge_embedding2.weight', 'gnns.1.mlp.0.weight', 'gnns.1.mlp.0.bias', 'gnns.1.mlp.2.weight', 'gnns.1.mlp.2.bias', 'gnns.1.edge_embedding1.weight', 'gnns.1.edge_embedding2.weight', 'gnns.2.mlp.0.weight', 'gnns.2.mlp.0.bias', 'gnns.2.mlp.2.weight', 'gnns.2.mlp.2.bias', 'gnns.2.edge_embedding1.weight', 'gnns.2.edge_embedding2.weight', 'gnns.3.mlp.0.weight', 'gnns.3.mlp.0.bias', 'gnns.3.mlp.2.weight', 'gnns.3.mlp.2.bias', 'gnns.3.edge_embedding1.weight', 'gnns.3.edge_embedding2.weight', 'gnns.4.mlp.0.weight', 'gnns.4.mlp.0.bias', 'gnns.4.mlp.2.weight', 'gnns.4.mlp.2.bias', 'gnns.4.edge_embedding1.weight', 'gnns.4.edge_embedding2.weight', 'batch_norms.0.weight', 'batch_norms.0.bias', 'batch_norms.0.running_mean', 'batch_norms.0.running_var', 'batch_norms.0.num_batches_tracked', 'batch_norms.1.weight', 'batch_norms.1.bias', 'batch_norms.1.running_mean', 'batch_norms.1.running_var', 'batch_norms.1.num_batches_tracked', 'batch_norms.2.weight', 'batch_norms.2.bias', 'batch_norms.2.running_mean', 'batch_norms.2.running_var', 'batch_norms.2.num_batches_tracked', 'batch_norms.3.weight', 'batch_norms.3.bias', 'batch_norms.3.running_mean', 'batch_norms.3.running_var', 'batch_norms.3.num_batches_tracked', 'batch_norms.4.weight', 'batch_norms.4.bias', 'batch_norms.4.running_mean', 'batch_norms.4.running_var', 'batch_norms.4.num_batches_tracked'])
False

Am I loading the wrong checkpoint?
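
Or, since the printed keys look like a bare 5-layer GIN state dict (there is no 'model' or 'model_3D' wrapper), could I load it directly into the 2D encoder? A sketch, assuming the same hypothetical GNN class as above:

model_2D = GNN(num_layer=5, emb_dim=300, JK='last', drop_ratio=0.0, gnn_type='gin')
model_2D.load_state_dict(model)  # `model` here is the flat state dict loaded above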

@chao1224
Owner

Hi @marcosbodio,

I need to double-check the checkpoint files when I get time. Meanwhile, you should be able to use this checkpoint, which is one of the SOTA PaiNN pretraining methods (paper link).
