Commit

Add LIME explanations (#78)
* Initial pass at getting descriptors

* Removed redundant imports

* Added a linear model with WLS, not sure if correct

* Modifying RF notebook for LIME implementation, a long way to go..

* Added weighted t-statistic surrogate lime

* Lime with counterfactuals, we need a regression example

* LIME with CFs

* Plot tstats

* Added plotting code but doesn't work yet

* Added plotting code that doesn't work yet

* Added MACCS keys description, need to return MACCSfps in explain_lime

* fixed MACCS key indexing, not assigning to examples yet

* Refactored and added new svg code

* Wrote new svg code

* Added test for different sizes

* Added SVG support to CFS

* Refactored descriptors

* Basic example working

* Reworked svg code to be more robust

* Removed debugging code

* Added tuples to svg tests

* Actual descriptor plots in _descriptor_layout

* Features calculated as difference from base, plotting successful :)

* Added two descriptor types, MACCS key description complete

* Solubility model trained, shorter MACCS descriptions

* tstat significance

* Working notebooks

* MACCSkeys read accurately

* Updated MACCS descriptions and notebook plots

* Fixed the sign for beta

* removed descriptor plotting

* ignoring temp images

* LIME with updated weights

* Sigmoid weights for lime

* Using SVD with pseudo-inverse for xinv in WLS

* method working - needs cleaning up!

* MACCS key annotations

* Added doc string for LIME functions

* Will clean up on local computer

* Remove nonzero weights and add plot descriptors function

* Fixed scipy import

* progress in notebook cleaning

* Cleaned up notebooks

* svg=False for _mol_images

* Ran precommit

* remove extra files

* Replaced mordred descriptors with rdkit descriptors

* fixed return types and reading of atom_pols

* fixed data file imports, using pickled file for maccs svgs

* Added import-libresources to setup

* add files to be installed?

* Fixed how package data is installed - I think

* changed as per setuptools instructions - might work?

* CI has a problem with everything!

* fixed most mypy issues

* fixed most mypy issues - forgot this file

* Fixed ALLL mypy issues!

* Added description and made argument names consistent

* Added substructure descriptors corresponding to the instance

* mypy strikes!

* Removed highlights from morganfp bits

* fix classic plots

* Fixed selfies encoder error

* renamed the notebook dirs and added LIME nbs to github actions

* Fixed rdkit argument and added tikhonov reg

* Changed kernel name to be consistent with Geemi's notebooks

* Use a smaller frac of data

* CI doesn't like me :|

* Make sure frags don't show up

* Silly errors

* I made sure everything runs.. CI, be nice!

* Update changelog and version bump

* Fixed svg not showing for ECFP fps

* Renamed descriptor method

* Added nbs to toc and added printed note for MACCS desc

* training alphabet sometimes gave an error

* return beta conditionally in lime_explain

* Updated doc string and instructions on accessing t-statistics

* Made test cover more examples

* argument is return_beta

* Added ecfp example to readme

* Added ecfp example to readme

* Changed demo image to contain ecfp instead of MACCS

* Removed Bertz Ct and fixed parity plots

* Changed code for RF to include heteroatoms

* Rdkit MACCS fps has a dummy key at index 0

* Grammatical corrections

Co-authored-by: Heta Gandhi <[email protected]>
Co-authored-by: Andrew White <[email protected]>
3 people authored May 6, 2022
1 parent 98690b0 commit 1661e50
Showing 31 changed files with 3,252 additions and 19 deletions.
13 changes: 9 additions & 4 deletions .github/workflows/paper.yml
@@ -24,8 +24,13 @@ jobs:
- name: Install
run: |
pip install .
- name: Install paper depends
- name: Install paper1 depends
run: |
pip install -r paper/requirements.txt
- name: Run paper experiments
run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper/*.ipynb" --to notebook --output-dir='temp' --clear-output
pip install -r paper1_CFs/requirements.txt
- name: Run paper1 experiments
run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper1_CFs/*.ipynb" --to notebook --output-dir='temp' --clear-output
- name: Install paper2 depends
run: |
pip install -r paper2_LIME/requirements.txt
- name: Run paper2 experiments
run: jupyter nbconvert --ExecutePreprocessor.timeout=-1 --execute "paper2_LIME/*.ipynb" --to notebook --output-dir='temp' --clear-output
2 changes: 1 addition & 1 deletion .gitignore
@@ -3,7 +3,7 @@ __pycache__/
*.py[cod]
*$py.class
# pickles
*.pb
# *.pb

# C extensions
*.so
35 changes: 32 additions & 3 deletions README.md
@@ -22,10 +22,18 @@ A counterfactual can explain a prediction by showing what would have to change i
In addition to having a changed prediction, a molecular counterfactual must be as similar as possible to its base molecule. Here is an example of a molecular counterfactual:

<img alt="counterfactual demo" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/counterfactual.png" width="400">
<img alt="counterfactual demo" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/counterfactual.png" width="400">

The counterfactual shows that if the carboxylic acid were an ester, the molecule would be active. It is up to the user to translate this set of structures into a meaningful sentence.

## Descriptor Attribution
This package also implements model-agnostic descriptor attribution for molecules using LIME (local interpretable model-agnostic explanations).
Descriptor attributions explain a prediction by computing QSARs for molecular structure properties, independent of the features the model itself uses for its predictions. Here is an example of descriptor attribution:

<img alt="descriptor demo" src="paper2_LIME/descriptor.png" width="800">

The descriptor t-statistics show which chemical properties or substructures influence the property prediction for the pictured molecule. LIME is a perturbation-based method, so the descriptor attributions depend on the perturbed chemical space created around the molecule of interest.
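
The commit messages above mention a weighted least squares (WLS) surrogate, an SVD-based pseudo-inverse, and Tikhonov regularization. The following is a minimal NumPy sketch of how descriptor t-statistics could be obtained from such a fit. It is not the package's implementation: the function name `weighted_tstats`, its arguments, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def weighted_tstats(X, y, w, ridge=1e-8):
    """Fit y ~ X @ beta by weighted least squares; return (beta, t-statistics).

    X: (n_samples, n_features) descriptor matrix (hypothetical input)
    y: (n_samples,) model outputs for the perturbed molecules
    w: (n_samples,) non-negative sample weights, e.g. similarity to the base
    ridge: small Tikhonov term that keeps the normal matrix well conditioned
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = X.shape
    # Weighted normal equations: (X^T W X + ridge * I) beta = X^T W y
    xtwx = X.T @ (w[:, None] * X) + ridge * np.eye(p)
    xtwx_inv = np.linalg.pinv(xtwx)  # pseudo-inverse computed via SVD
    beta = xtwx_inv @ (X.T @ (w * y))
    # Weighted residual variance with n - p degrees of freedom
    resid = y - X @ beta
    sigma2 = np.sum(w * resid**2) / max(n - p, 1)
    # Standard error of each coefficient, then t = beta / se
    se = np.sqrt(sigma2 * np.diag(xtwx_inv))
    return beta, beta / se


# Toy data (assumed): the third descriptor has no real effect on y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)
w = rng.uniform(0.1, 1.0, size=200)
beta, tstats = weighted_tstats(X, y, w)
```

In this toy setup the first two descriptors should receive t-statistics of large magnitude and the third a small one, which is the kind of ranking a descriptor attribution plot conveys.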

## Usage

Let's assume you have a deep learning model `my_model(s)` that takes in one SMILES string and outputs a predicted binary class. We first expand chemical space around the prediction of interest
@@ -46,7 +54,7 @@ cfs = exmol.cf_explain(samples)
exmol.plot_cf(cfs)
```

<img alt="set of counterfactuals" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/rf-simple.png" width="500">
<img alt="set of counterfactuals" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/rf-simple.png" width="500">

We can also plot the space around the counterfactual. This is computed via PCA of the affinity matrix -- the similarity (Tanimoto of ECFP4) with the base molecule.
Due to how similarity is calculated, the base is going to be the farthest from all other molecules. Thus your base should fall on the left (or right) extreme of your plot.
@@ -55,7 +63,7 @@ Due to how similarity is calculated, the base is going to be the farthest from a
cfs = exmol.cf_explain(samples)
exmol.plot_space(samples, cfs)
```
<img alt="chemical space" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper/svg_figs/rf-space.png" width="600">
<img alt="chemical space" src="https://raw.githubusercontent.com/ur-whitelab/exmol/main/paper1_CFs/svg_figs/rf-space.png" width="600">
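
The projection described above (PCA of a Tanimoto similarity matrix) can be sketched in a few lines of NumPy. This is an illustrative sketch only: random bit vectors stand in for real ECFP4 fingerprints, and the helper names `tanimoto_matrix` and `pca_2d` are hypothetical, not part of exmol.

```python
import numpy as np


def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for the rows of a binary fingerprint array."""
    fps = np.asarray(fps, dtype=float)
    inter = fps @ fps.T                      # |A & B| for every pair
    counts = fps.sum(axis=1)                 # |A| for every fingerprint
    union = counts[:, None] + counts[None, :] - inter
    # Two empty fingerprints have union 0; define their similarity as 1
    safe_union = np.where(union > 0, union, 1.0)
    return np.where(union > 0, inter / safe_union, 1.0)


def pca_2d(matrix):
    """Project the rows of a matrix onto its first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Right singular vectors of the centered matrix are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(1)
fps = rng.integers(0, 2, size=(20, 64))      # stand-in for ECFP4 bit vectors
affinity = tanimoto_matrix(fps)              # (20, 20) similarity matrix
coords = pca_2d(affinity)                    # (20, 2) plot coordinates
```

Each row of `coords` gives one molecule's position in the 2D plot.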

Each counterfactual is a Python `dataclass` with information allowing it to be used in your own analysis:

@@ -78,6 +86,27 @@ print(cfs[1])
}
```

We can use the same chemical space to get descriptor attributions for the molecule. Along with `samples`, we also need to supply the `descriptor_type`. You can select from `Classic` RDKit descriptors, `MACCS` fingerprint descriptors, or `ECFP` substructure descriptors. If you'd like to use the regression coefficients for analysis, specify `return_beta=True`. The descriptor t-statistics are stored in the `descriptors.tstats` attribute of the base molecule and can be accessed using `space_tstats = samples[0].descriptors.tstats`.

```py
beta = exmol.lime_explain(samples, descriptor_type='ECFP', return_beta=True)
exmol.plot_descriptors(samples, descriptor_type='ECFP')
```
<img alt="ecfp descriptors" src="paper2_LIME/ECFP.svg" width="400">

You can also plot the chemical space colored by fit to see how well the regression fits the original model. To plot by fit, regression coefficients `beta` need to be passed in as an argument.

```py
exmol.plot_utils.plot_space_by_fit(
samples,
[samples[0]],
beta=beta,
mol_size=(300, 250),
figure_kwargs={'figsize': (7,5)},
)
```
<img alt="chemical space by fit" src="paper2_LIME/space_by_fit.png" width="500">
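
One generic way to judge how well a linear surrogate reproduces the original model is to compare the model's outputs with the surrogate's predictions `X @ beta`. The sketch below shows that comparison on made-up numbers. How `plot_space_by_fit` computes its coloring is not shown in this diff, so the helper `surrogate_residuals` and the arrays here are assumptions for illustration.

```python
import numpy as np


def surrogate_residuals(X, y, beta):
    """Per-sample residuals of the linear surrogate X @ beta against y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return y - X @ beta


# Made-up descriptor matrix, coefficients, and model outputs
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta = np.array([0.5, -0.25])
y = np.array([0.5, -0.25, 0.4])
resid = surrogate_residuals(X, y, beta)
```

Samples with residuals near zero are well described by the surrogate, while large residuals mark regions of chemical space where the linear fit is less trustworthy.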

## Further Examples

You can find more examples by looking at the exact code used to generate all figures from our paper [in the docs webpage](https://ur-whitelab.github.io/exmol/toc.html).
5 changes: 5 additions & 0 deletions docs/source/changelog.rst
@@ -1,5 +1,10 @@
Change Log
==========
v2.0.0 (2022-4-22)
-------------------
* Added surrogate model explanation method
* Added support for attributing ECFP, MACCS fingerprints, rdkit descriptors and plotting them
* Example notebooks for new method


v1.1.0 (2022-5-2)
10 changes: 6 additions & 4 deletions docs/source/toc.rst
@@ -8,7 +8,9 @@ exmol
index.md
changelog.rst
api.rst
paper/Schematic.ipynb
paper/RF.ipynb
paper/GNN.ipynb
paper/Solubility-RNN.ipynb
paper1_CFs/Schematic.ipynb
paper1_CFs/RF.ipynb
paper1_CFs/GNN.ipynb
paper1_CFs/Solubility-RNN.ipynb
paper2_LIME/Solubility-RNN.ipynb
paper2_LIME/RF-LIME.ipynb
18 changes: 17 additions & 1 deletion exmol/data.py
@@ -3,6 +3,20 @@
import numpy as np # type: ignore


@dataclass
class Descriptors:
"""Molecular descriptors"""

#: Descriptor type
descriptor_type: str
#: Descriptor values
descriptors: tuple
#: Descriptor names
descriptor_names: tuple
#: t_stats for each molecule
tstats: tuple = ()


@dataclass
class Example:
"""Example of a molecule"""
Expand All @@ -24,7 +38,9 @@ class Example:
#: Index of cluster, can be -1 for no cluster
cluster: int = 0
#: Label for this example
label: Optional[str] = None
label: str = None # type: ignore
#: Descriptors for this example
descriptors: Descriptors = None # type: ignore

# to make it look nicer
def __str__(self):
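
As a quick illustration of the new `Descriptors` dataclass, the sketch below re-declares the fields shown in this diff and builds one instance. The field values are hypothetical and chosen only to show the shape of the data.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Molecular descriptors (mirrors the fields added in this diff)."""

    descriptor_type: str
    descriptors: tuple
    descriptor_names: tuple
    tstats: tuple = ()


# Hypothetical values, for illustration only
d = Descriptors(
    descriptor_type="MACCS",
    descriptors=(1, 0, 1),
    descriptor_names=("key A", "key B", "key C"),
)
# tstats defaults to an empty tuple when none are supplied
has_tstats = len(d.tstats) > 0
```

Because `tstats` has a default, a `Descriptors` object can be created before any explanation has been computed, and code that reads it can check for an empty tuple first.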