Add LIME explanations (#78)

* Initial pass at getting descriptors
* Removed redundant imports
* Added a linear model with WLS, not sure if correct
* Modifying RF notebook for LIME implementation, a long way to go..
* Added weighted t-statistic surrogate lime
* Lime with counterfactuals, we need a regression example
* LIME with CFs
* Plot tstats
* Added plotting code but doesn't work yet
* Added plotting code that doesn't work yet
* Added MACCS keys description, need to return MACCSfps in explain_lime
* fixed MACCS key indexing, not assigning to examples yet
* Refactored and added new svg code
* Wrote new svg code
* Added test for different sizes
* Added SVG support to CFS
* Refactored descriptors
* Basic example working
* Reworked svg code to be more robust
* Removed debugging code
* Added tuples to svg tests
* Actual descriptor plots in _descriptor_layout
* Features calculated as difference from base, plotting successful :)
* Added two descriptor types, MACCS key description complete
* Solubility model trained, shorter MACCS descriptions
* tstat significance
* Working notebooks
* MACCSkeys read accurately
* Updated MACCS descriptions and notebook plots
* Fixed the sign for beta
* removed descriptor plotting
* ignoring temp images
* LIME with updated weights
* Sigmoid weights for lime
* Using SVD with pseudo-inverse for xinv in WLS
* method working - needs cleaning up!
* MACCS key annotations
* Added doc string for LIME functions
* Will clean up on local computer
* Remove nonzero weights and add plot descriptors function
* Fixed scipy import
* progress in notebook cleaning
* Cleaned up notebooks
* svg=False for _mol_images
* Ran precommit
* remove extra files
* Replaced mordred descriptors with rdkit descriptors
* fized return types and reading of atom_pols
* fixed data file imports, using pickled file for maccs svgs
* Added import-libresources to setup
* add files to be installed?
* Fixed how package data is installed - I think
* changed as per setuptools instructions - might work?
* CI has a problem with everything!
* fixed most mypy issues
* fixed modt mypy issues - forgot this file
* Fixed ALLL mypy issues!
* Added description and made argument names consistent
* Added substructure descriptors corresponding to the instance
* mypy strikes!
* Removed highlights from morganfp bits
* fix classic plots
* Fixed selfies encoder error
* renamed the notebook dirs and added LIME nbs to github actions
* Fixed rdkit argument and added tikhonov reg
* Changed kernel name to be consistent with Geemi's notebooks
* Use a smaller frac of data
* CI doesn't like me :|
* Make sure frags don't show up
* Silly errors
* I made sure everything runs.. CI, be nice!
* Update changelog and version bump
* Fixed svg not showing for ECFP fps
* Renamed descriptor method
* Added nbs to toc and added printed note for MACCS desc
* training alphabet sometimes gave an error
* return beta conditionally in lime_explain
* Updated doc string and instructions on accessing t-statistics
* Made test cover more examples
* argument is return_beta
* Added ecfp example to readme
* Added ecfp example to readme
* Changed demo image to contain ecfp instead of MACCS
* Removed Bertz Ct and fixed parity plots
* Changed code for RF to include heteroatoms
* Rdkit MACCS fps has a dummy key at index 0
* Grammatical corrections

Co-authored-by: Heta Gandhi <[email protected]>
Co-authored-by: Andrew White <[email protected]>
ur-whitelab · May 6, 2022 · 1661e50
1 parent 98690b0
Showing 31 changed files with 3,252 additions and 19 deletions.
diff --git a/.github/workflows/paper.yml b/.github/workflows/paper.yml
@@ -24,8 +24,13 @@ jobs:
     - name: Install
       run: |
         pip install .
-    - name: Install paper depends
+    - name: Install paper1 depends
       run: |
-        pip install -r paper/requirements.txt
-    - name: Run paper experiments
-      run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper/*.ipynb" --to notebook --output-dir='temp' --clear-output
+        pip install -r paper1_CFs/requirements.txt
+    - name: Run paper1 experiments
+      run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper1_CFs/*.ipynb" --to notebook --output-dir='temp' --clear-output
+    - name: Install paper2 depends
+      run: |
+        pip install -r paper2_LIME/requirements.txt
+    - name: Run paper2 experiments
+      run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper2_LIME/*.ipynb" --to notebook --output-dir='temp' --clear-output
diff --git a/.gitignore b/.gitignore
@@ -3,7 +3,7 @@ __pycache__/
 *.py[cod]
 *$py.class
 # pickles
-*.pb
+# *.pb
 
 # C extensions
 *.so

diff --git a/README.md b/README.md
@@ -22,10 +22,18 @@ A counterfactual can explain a prediction by showing what would have to change i
 
 In addition to having a changed prediction, a molecular counterfactual must be as similar as possible to its base molecule. Here is an example of a molecular counterfactual:
 
-<img alt="counterfactual demo" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/counterfactual.png" width="400">
+<img alt="counterfactual demo" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/counterfactual.png" width="400">
 
 The counterfactual shows that if the carboxylic acid were an ester, the molecule would be active. It is up to the user to translate this set of structures into a meaningful sentence.
 
+## Descriptor Attribution
+This package also implements Model Agnostic Descriptor Attribution for molecules using LIME.
+Descriptor attributions can explain a prediction by computing QSARs for molecular structure properties independent of features used for model predictions. Here is an example of descriptor attribution:
+
+<img alt="descriptor demo" src="paper2_LIME/descriptor.png" width="800">
+
+The descriptor t-statistics show which chemical properties or substructures influence property prediction for the pictured molecule. LIME is a perturbation-based method, and the descriptor attributions depend on the perturbed chemical space created around the molecule of interest.
+
 ## Usage
 
 Let's assume you have a deep learning model `my_model(s)` that takes in one SMILES string and outputs a predicted binary class. We first expand chemical space around the prediction of interest
@@ -46,7 +54,7 @@ cfs = exmol.cf_explain(samples)
 exmol.plot_cf(cfs)
 ```
 
-<img alt="set of counterfactuals" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/rf-simple.png" width="500">
+<img alt="set of counterfactuals" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/rf-simple.png" width="500">
 
 We can also plot the space around the counterfactual. This is computed via PCA of the affinity matrix -- the similarity (Tanimoto of ECFP4) with the base molecule.
 Due to how similarity is calculated, the base is going to be the farthest from all other molecules. Thus your base should fall on the left (or right) extreme of your plot.
@@ -55,7 +63,7 @@ Due to how similarity is calculated, the base is going to be the farthest from a
 cfs = exmol.cf_explain(samples)
 exmol.plot_space(samples, cfs)
 ```
-<img alt="chemical space" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/rf-space.png" width="600">
+<img alt="chemical space" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/rf-space.png" width="600">
 
 Each counterfactual is a Python `dataclass` with information allowing it to be used in your own analysis:
 
@@ -78,6 +86,27 @@ print(cfs[1])
 }
 ```
 
+We can use the same chemical space to get descriptor attributions for the molecule. Along with `samples`, we also need to supply the `descriptor_type` to get attributions. You can select from `Classic` RDKit descriptors, `MACCS` fingerprint descriptors, or `ECFP` substructure descriptors. If you'd like to use regression coefficients for analysis, specify `return_beta=True`. The descriptor t-statistics are stored in the `descriptors.tstats` attribute of the base molecule and can be accessed using `space_tstats = space[0].descriptors.tstats`.
+
+```py
+beta = exmol.lime_explain(samples, descriptor_type='ECFP', return_beta=True)
+exmol.plot_descriptors(samples, descriptor_type='ECFP')
+```
+<img alt="ecfp descriptors" src="paper2_LIME/ECFP.svg" width="400">
+
+You can also plot the chemical space colored by fit to see how well the regression fits the original model. To plot by fit, regression coefficients `beta` need to be passed in as an argument.
+
+```py
+exmol.plot_utils.plot_space_by_fit(
+    samples,
+    [samples[0]],
+    beta=beta,
+    mol_size=(300, 250),
+    figure_kwargs={'figsize': (7,5)},
+)
+```
+<img alt="chemical space by fit" src="paper2_LIME/space_by_fit.png" width="500">
+
 ## Further Examples
 
 You can find more examples by looking at the exact code used to generate all figures from our paper [in the docs webpage](https://ur-whitelab.github.io/exmol/toc.html).

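Reader's note on the README changes above: the descriptor t-statistics can be illustrated with a minimal weighted least squares sketch. This is not exmol's implementation. The helper name `weighted_tstats`, the toy descriptor matrix, and the uniform random weights are all assumptions made for illustration. The commit message mentions an SVD-based pseudo-inverse for the WLS fit, so the sketch uses `np.linalg.pinv` in the same spirit.

```python
import numpy as np


def weighted_tstats(X, y, w):
    """Weighted least squares fit returning (beta, tstats).

    Hypothetical helper for illustration only; not part of exmol.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = X.shape
    # Scaling rows by sqrt(w) turns WLS into ordinary least squares
    sw = np.sqrt(w)
    Xw = X * sw[:, None]
    yw = y * sw
    # The pseudo-inverse (computed via SVD) tolerates collinear descriptor columns
    xtx_inv = np.linalg.pinv(Xw.T @ Xw)
    beta = xtx_inv @ Xw.T @ yw
    resid = yw - Xw @ beta
    dof = max(n - p, 1)
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    # Leave the t-statistic at zero where the standard error is zero
    tstats = np.divide(beta, se, out=np.zeros_like(beta), where=se > 0)
    return beta, tstats


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # toy descriptor values (assumed)
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # only the first descriptor matters
w = rng.uniform(0.1, 1.0, size=200)              # toy similarity weights (assumed)
beta, tstats = weighted_tstats(X, y, w)
print(beta, tstats)
```

In this toy setup the first descriptor should receive by far the largest absolute t-statistic, which is the kind of ranking the descriptor plots in the README are meant to convey.
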
diff --git a/docs/source/changelog.rst b/docs/source/changelog.rst
@@ -1,5 +1,10 @@
 Change Log
 ==========
+v2.0.0 (2022-4-22)
+-------------------
+* Added surrogate model explanation method
+* Added support for attributing ECFP, MACCS fingerprints, rdkit descriptors and plotting them
+* Example notebooks for new method
 
 
 v1.1.0 (2022-5-2)

diff --git a/docs/source/toc.rst b/docs/source/toc.rst
@@ -8,7 +8,9 @@ exmol
     index.md
     changelog.rst
     api.rst
-    paper/Schematic.ipynb
-    paper/RF.ipynb
-    paper/GNN.ipynb
-    paper/Solubility-RNN.ipynb
+    paper1_CFs/Schematic.ipynb
+    paper1_CFs/RF.ipynb
+    paper1_CFs/GNN.ipynb
+    paper1_CFs/Solubility-RNN.ipynb
+    paper2_LIME/Solubility-RNN.ipynb
+    paper2_LIME/RF-LIME.ipynb
diff --git a/exmol/data.py b/exmol/data.py
@@ -3,6 +3,20 @@
 import numpy as np  # type: ignore
 
 
+@dataclass
+class Descriptors:
+    """Molecular descriptors"""
+
+    #: Descriptor type
+    descriptor_type: str
+    #: Descriptor values
+    descriptors: tuple
+    #: Descriptor name
+    descriptor_names: tuple
+    #: t_stats for each molecule
+    tstats: tuple = ()
+
+
 @dataclass
 class Example:
     """Example of a molecule"""
@@ -24,7 +38,9 @@ class Example:
     #: Index of cluster, can be -1 for no cluster
     cluster: int = 0
     #: Label for this example
-    label: Optional[str] = None
+    label: str = None  # type: ignore
+    #: Descriptors for this example
+    descriptors: Descriptors = None  # type: ignore
 
     # to make it look nicer
     def __str__(self):