Atom mapping in the USPTO dataset was done using Indigo over 6 years ago (https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873), and better atom-mapping tools have since been created, e.g. rxnmapper (https://onlinelibrary.wiley.com/doi/10.1002/minf.202100138). Even though rxnmapper may be better than Indigo, the benchmarking study linked above may overstate how much better it is, because the benchmarking dataset was specifically curated to include very difficult reactions. Both tools are likely to perform very well on 'easy' reactions, so on a more realistic dataset that contains both easy and hard reactions, their mapping performance will likely be more similar.
With a better atom mapping, it may also be possible to expand the scope of reactant detection in a reaction string, e.g. by finding unmapped atoms in the product, looking for matching atoms among the agents, and then moving those agents to the reactants.
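The idea of promoting agents to reactants can be sketched roughly as follows. This is a minimal illustration, not what the project does: it assumes the reaction string has the form `reactants>agents>products`, it uses a simplified regex tokenizer rather than a real SMILES parser, and it applies a deliberately crude rule (an agent is moved if it shares any element with an unmapped product atom). The names `atoms` and `promote_agents` are hypothetical.

```python
import re

# One SMILES atom: a bracket atom, or an organic-subset atom.
# Two-letter symbols come first so "Cl" is not read as "C".
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Bracket atom: optional isotope, element symbol, other properties, optional ":map".
BRACKET_RE = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^:\]]*(?::(\d+))?\]")


def atoms(smiles):
    """Yield (element, map_number_or_None) for each atom found in a SMILES string."""
    for token in ATOM_RE.findall(smiles):
        if token.startswith("["):
            match = BRACKET_RE.fullmatch(token)
            if match is None:
                continue  # skip anything this simplified tokenizer cannot read
            element, map_num = match.group(1), match.group(2)
            yield element.capitalize(), int(map_num) if map_num else None
        else:
            # Atoms outside brackets cannot carry an atom-map number.
            yield token.capitalize(), None


def promote_agents(rxn_smiles):
    """Move agents that could supply unmapped product atoms into the reactants."""
    reactants, agents, products = rxn_smiles.split(">")
    # Elements of product atoms that the existing mapping does not account for.
    unmapped = {el for el, map_num in atoms(products) if map_num is None}
    kept, moved = [], []
    for agent in filter(None, agents.split(".")):
        agent_elements = {el for el, _ in atoms(agent)}
        # Crude assumption: any shared element is enough to promote the agent.
        (moved if agent_elements & unmapped else kept).append(agent)
    new_reactants = ".".join(filter(None, [reactants] + moved))
    return f"{new_reactants}>{'.'.join(kept)}>{products}"
```

For example, `promote_agents("[CH3:1][Br:2]>[C-]#N.[Na+].O>[CH3:1]C#N")` moves the cyanide from the agents to the reactants and leaves `[Na+]` and `O` as agents. A rule this loose would also promote a solvent such as water whenever the product contains an unmapped oxygen, so a real implementation would need a stricter matching criterion.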
Running rxnmapper is computationally heavy and would take many hours on a few million reactions. Since the gain is likely to be marginal and we want to keep the programme relatively lightweight, we have decided to keep the original mapping in the ORD dataset (Indigo in the case of USPTO data).
Here's an example of where the atom mapping fails:
Br[CH2:2][C:3]1[CH:4]=[CH:5][C:6]2[O:15][C:10]3=[N:11][CH:12]=[CH:13][CH:14]=[C:9]3C:8[C:7]=2[CH:17]=1.[CH3:18]N:19C=O.[C-]#N.[Na+]>O>C:18#[N:19]
We would expect the triple-bonded N in the product to come from the triple-bonded N in the reactant ([C-]#N). There is nothing we can do about this: we are at the mercy of the existing atom mapping in ORD.
From: uspto-grants-1976_01.parquet ("ord-cc0d0a952867484fa3eb43ab33c5c8dd") index 412