Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine ions to salt #73

Open
dswigh opened this issue Apr 19, 2023 · 1 comment
Open

Combine ions to salt #73

dswigh opened this issue Apr 19, 2023 · 1 comment
Labels
enhancement New feature or request wontfix This will not be worked on

Comments

@dswigh
Copy link
Collaborator

dswigh commented Apr 19, 2023

Combining ions to their respective salts will mean that there will be fewer components per reaction, which may be helpful for ML model prediction.

NB: "[Na+].[OH-]" becomes "[Na]O" because non-organic atoms must have square brackets around them for the string to be a valid smiles string.

@dswigh dswigh added enhancement New feature or request wontfix This will not be worked on labels Apr 21, 2023
@dswigh
Copy link
Collaborator Author

dswigh commented Apr 21, 2023

This turned out to be more complicated than initially expected.

We were planning to simply look at the lists of reactants, agents, and products, and in case there's a single cation and a single anion, merge them to the corresponding salt. After thinking more carefully, I realized the following issues with this:

  • Case 1: There's only one cation/anion (e.g. only [OH-] is listed, with no cation). This can't be merged, so the dataset would both have hydroxide ions and also salts
  • Case 2: There are multiple different anions and/or cations. How do we match up the ions?
  • Case 3: The anion is a reactant, while the cation is an agent/product. How do we match up the ions in this case?

I think it's quite likely that when there's multiple (formal) charges (especially within the same molecule) that adding such a function could lead to unexpected and unintended consequences. As an example, consider the following 2 reactions:

  1. CC1N=CC2C(C=1)=C(N+=O)C=CC=2.[Cl:15][C:16]1[CH:25]=[CH:24]C:23=[C:22]2[C:17]=1[CH:18]=[CH:19][N:20]=[CH:21]2.Cl.CC1N=CC2C(C=1)=C(N+=O)C=CC=2.[IH:44]>>[IH:44].[Cl:15][C:16]1[CH:25]=[CH:24]C:23=[C:22]2[C:17]=1[CH:18]=[CH:19][N:20]=[CH:21]2
  2. S:1([O-:4])(=[O:3])=[O:2].[NH4+:6].[NH4+]>O>S:1(=[O:2])([OH:5])[O-:4].[NH4+:6].S:1([O-:4])(=[O:3])=[O:2].[NH4+:6].[NH4+:6]

In addition to these reasons, consider the following:

  1. When we put trust_labelling = False, we take list(set()) of the molecules lists, which (purposely) breaks stoichiometry in favour of removing duplicates, however, by merging ions to salts we would need to selectively reintroduce stoichiometry (to match formal charges) - which sounds like a bad idea.
  2. It's tempting to add manual merging rules (e.g. when you see [Na+] and [OH-] in the same reaction, merge them to [Na]O, OR when you see [Pd] and C in the same reaction, merge to [Pd]C, but adding a rule-based system should only be done with extreme caution, since there are so many edge cases in chemistry - keep in mind we're already using a rule-based system for solvent identification, and the set of industrially relevant solvents must be much smaller than the set of molecules that can carry a formal charge

dswigh added a commit that referenced this issue Apr 21, 2023
Joearrowsmith added a commit that referenced this issue Apr 28, 2023
* implemented tetrakis_synonyms

* merged cat and ligand to one

* corrected acetal smiles typo

* make black

* set temp=0 if (ice present) AND (temp is None)

* implemented remove_none_and_empty_str

* make black

* assert '' not in data

* added test case of rxn with no product

* make black

* make black

* updated rxn participation logic

* remove reactions with no reactants or no products

* replace len(df) with (df.shape[0])

* checkpoint for ipynb

* checkpoint 2 for ipynb

* raw update to solvents

* del merge ions issue #73

* added catch in case rxn str is not mapped

* make black

* added temp fix for solvents.csv

* fixed mypy bugs

* fixed check_valid_order confusing col with row

* added source to some of solvents

* make black

* update test data

* updated clean handling of unresolved names

* make black

* import types into cleaner

* make black

* expanded make pytest

* expanded pytest

* changed _0 to _000

* updated tests: unresolved name handling

* make black

* import optional

* added todo

* Set num_reag = -1 will result in no filtering

* updated ipynb

* updaed todo text

* changed column headers

* separate synonyms into different columns

* remove placeholder O SMILES string

* comparison to yehia's solvents

* added "" around 2 cas to avoid excel date

* removed solvent_name_4 column

* moved the source column

* drop duplicates

* removed duplicate names in solvent_name_2 column

* sorting

* removed duplicates between different sources

* strip spaces

* make lower case

* removal of duplicate names

* updating solvents lists

* resolved all smiles conflicts in solvents.csv

* checked that all smiles are resolvable

* make black

* saved SMILES as canon SMILES in solvents.csv

* regenerated test data, solvents csv col name

* testing for duplicates

* deleted unnecessary solvents csv files

* removed ICE_PRESENT type

* added back . between metals and organic molecules

* Back to SMILES with . between metal (except Pd)

* added procedure for how solvents.csv was generated

* is_digit->is_number.Now del floats. name_list incl

* removed the .ipynb files and converted to .py, moved to notebooks section

* black

* mypy

* inspecting the large ORD file

* Reconsiling mac vs ubuntu (#85)

* adding reconcile changes

* removing none and nan for string to <empty>

* regen test data

* temp disable tests

* undoing

* regen test data

* updated doc string

* removed remapping of none to nan

* fixing the ubuntu bug

* don't include the large cleaned USPTO file

* reactivate test

* update doc string: why not incl cleaned USPTO

* add black and mypy

* make edit

---------

Co-authored-by: Daniel Wigh <[email protected]>

---------

Co-authored-by: joearrowsmith <[email protected]>
Co-authored-by: Joe Arrowsmith <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

1 participant