Publish code (#144)

* Update README
* Add license
sustainable-processes · Jun 14, 2023 · ca1eefc
1 parent 64e5f1b
Showing 17 changed files with 727 additions and 7,803 deletions.
diff --git a/.gitignore b/.gitignore
@@ -124,7 +124,6 @@ venv.bak/
 
 runs/
 data/
-!reaction_utils-main/rxnutils/data/
 
 # Spyder project settings
 .spyderproject

diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,20 @@
+Copyright (c) Daniel Wigh and others
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,89 +1,214 @@
 # ORDerly
 
-Cleaning and extraction of data from ORD
+🧪 Cleaning chemical reaction data 🧪
 
-The scripts herein will extract and clean data from ORD with various manual steps relying on chemical domain knowledge. This results in an open-source dataset containing a mapped reaction, reactants, products, solvents, reagents, catalysts, and yields in a pandas DataFrame structure that should also be easily usable by people with little knowledge of chemistry.
+## Quick Install
 
-# Usage
+```pip install orderly```
 
-### 1. Install
+🤔 What is this?
+-----------------
 
-#### I Download the ORD data
+Machine learning has the potential to provide tremendous value to chemistry. However, large amounts of clean, high-quality data are needed to train models.
 
-We want to download the ORD data locally, this can be done through any of the following methods:
+ORDerly cleans chemical reaction data from the growing [Open Reaction Database (ORD)](https://docs.open-reaction-database.org/en/latest/).
 
-- Follow the instructions at: https://github.com/open-reaction-database/ord-data, we specifically care about the folders in ```ord-data/data/```
-- Docker install with linux (run in terminal):
-    ```
-    make linux_download_ord
-    ``` 
-- Docker install with mac (run in terminal):
-    ```
-    make root_download_ord
-    make sudo_chown
-    ```
+Use ORDerly to:
+- Extract and clean your own dataset from ORD.
+- Access the [ORDerly condition prediction benchmark dataset](https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467).
+- Reproduce results from our paper, including training an ML model to predict reaction conditions.
 
-#### II Install OS dependencies
+<img src="images/abstract_fig.png" alt="Abstract Figure" width="300">
+
+
+<!-- Section on extracting and cleaning a dataset-->
+
+📖 Extract and clean a dataset
+------------------------------
 
-You might need some environment dependencies. If running locally, these will need to be installed manually. However, if running using docker, the dependencies will be managed in the build script.
+### Download data from ORD
+
+```orderly download```
+This will create a folder called ```/data/ord/``` in your current directory and download the data into ```ord/```.
+
+Alternatively, you can also follow the instructions on the [official website](https://github.com/open-reaction-database/ord-data) to download the data in ```ord-data/data/```.
+
+### Extract data from the ORD files
+
+```orderly extract```
+
+To run ORDerly on your own data, specify the input and output paths:
+
+```orderly extract --input_path="/data/ord/" --output_path="/data/orderly/"```
+
+This will generate a parquet file for each ORD file.
+
+### Clean the data
+
+```orderly clean```
+
+This will produce train and test parquet files, along with a .json file showing the arguments used and a .log file showing the operations run.
+
+
+<!-- Section on downloading the benchmark -->
+🚀 Download the condition prediction benchmark dataset
+--------------------------------------------------------
+
+Reaction condition prediction is the problem of predicting the things "above the arrow" in chemical reactions.
+
+<!-- Include image of a reactions -->
+
+You can either download the [ORDerly condition prediction benchmark dataset](https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467) directly, or use the following code to download it (without installing ORDerly). Make sure to install the needed dependencies first.
+
+
+```pip install requests fastparquet pandas```
+
+<details>
+<summary>Toggle to see code to download benchmark</summary>
+
+```python
+import pathlib
+import zipfile
+
+import pandas as pd
+import requests
+
+
+def download_benchmark(
+    benchmark_zip_file="orderly_benchmark.zip",
+    benchmark_directory="orderly_benchmark/",
+    version=2,
+):
+    figshare_url = (
+        f"https://figshare.com/ndownloader/articles/23298467/versions/{version}"
+    )
+    print(f"Downloading benchmark from {figshare_url} to {benchmark_zip_file}")
+    r = requests.get(figshare_url, allow_redirects=True)
+    with open(benchmark_zip_file, "wb") as f:
+        f.write(r.content)
+
+    print("Unzipping benchmark")
+    benchmark_directory = pathlib.Path(benchmark_directory)
+    benchmark_directory.mkdir(parents=True, exist_ok=True)
+    with zipfile.ZipFile(benchmark_zip_file, "r") as zip_ref:
+        zip_ref.extractall(benchmark_directory)
+
+
+download_benchmark()
+train_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_train.parquet")
+test_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_test.parquet")
+```
+</details>
+
+
+## Reproduce results from the paper
+TODO
 
-- Linux: You will likely have some missing dependencies; these can be installed via apt, for example:
+<!-- ###  Training a condition prediction algorithm with this data -->
 
+<!-- ### Requirements
+Python dependencies can be installed via ```poetry``` from within the ```orderly/condition_prediction``` folder:
+
+- run in terminal: ```poetry install```
+
+## Reproducing results 
+
+
+
+@Kobi see inspiration below:
+## Train
+
+To train the model(s) in the paper, run this command:
+
+```train
+python train.py --input-data <path_to_data> --alpha 10 --beta 20
 ```
-sudo apt-get update
-sudo apt-get install libpq-dev gcc -y
+
+>📋  Describe how to train the models, with example commands on how to train the models in your paper, including the full training procedure and appropriate hyperparameters.
+
+## Evaluation
+
+To evaluate my model on ImageNet, run:
+
+```eval
+python eval.py --model-file mymodel.pth --benchmark imagenet
 ```
 
-#### III Install Python dependencies
+>📋  Describe how to evaluate the trained models on benchmarks reported in the paper, give commands that produce the results (section below).
 
-To install the dependencies this can be done via ```poetry``` or you can run the environment through docker.
+## Pre-trained Models
 
-- For poetry (run in terminal):
-    Python dependencies: ```poetry install```
-- For docker (run in terminal):
-    ```bash
-    build_orderly
-    run_orderly
-    ```
-    You can validate the install works by running
-    ```bash
-    build_orderly_extras
-    run_orderly_pytest
-    ```
+You can download pretrained models here:
 
+- [My awesome model](https://drive.google.com/mymodel.pth) trained on ImageNet using parameters x,y,z. 
 
-### 2. Run extraction
+>📋  Give a link to where/how the pretrained models can be downloaded and how they were trained (if applicable).  Alternatively you can have an additional column in your results table with a link to the models.
+@Kobi see inspiration above -->
 
-We can run extraction using: ```poetry run python -m orderly.extract```. Using ```poetry run python -m orderly.extract --help``` will explain the arguments. Certain args must be set such as data paths.
+Reproducing results from paper
+------------------------------
 
-### 3. Run cleaning
+To reproduce the results from the paper, please clone the repository and use poetry to install the requirements (see above). Towards the bottom of the makefile, you will find an 8-step list for generating all the datasets and reproducing all results presented in the paper.
+
+### Results
+
+We run the condition prediction model on four different datasets, and find that trusting the labelling of the ORD data leads to overly confident test accuracy. We conclude that applying chemical logic to the reaction string is necessary to get a high-quality dataset, and that the best strategy for dealing with rare molecules is to delete reactions where they appear.
+
+Top-3 exact match combination accuracy (\%), reported as frequency-informed guess // model prediction // AIB\%:
 
-We can run cleaning using: ```poetry run python -m orderly.clean```. Using ```poetry run python -m orderly.clean --help``` will explain the arguments. Certain args must be set such as data paths.
+| Dataset            | A (labeling; rare->"other")   | B (labeling; rare->delete rxn) | C (reaction string; rare->"other") | D (reaction string; rare->delete rxn) |
+|--------------------|--------------------------------|---------------------------------|------------------------------------|--------------------------------------|
+| Solvents           | 47 // 58 // 21%                | 50 // 61 // 22%                 | 23 // 42 // 26%                    | 24 // 45 // 28%                      |
+| Agents             | 54 // 70 // 35%                | 58 // 72 // 32%                 | 19 // 39 // 25%                    | 21 // 42 // 27%                      |
+| Solvents & Agents  | 31 // 44 // 19%                | 33 // 47 // 21%                 | 4 // 21 // 18%                     | 5 // 24 // 21%                       |
 
-# ML models trained on ORDerly
+Here AIB\% is the Average Improvement of the model over the Baseline (i.e. a frequency-informed guess), where $A_m$ is the accuracy of the model and $A_b$ is the accuracy of the baseline:
+$`AIB = (A_m - A_b) / (1 - A_b)`$
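Note on the AIB formula: as a quick sanity check, the definition can be evaluated against the table entries. The sketch below is not part of the commit; the function name `average_improvement_over_baseline` is hypothetical, and it assumes accuracies are passed as fractions in [0, 1). Because the table reports rounded accuracies, recomputing AIB from them may differ from the listed value by about a percentage point for some cells.

```python
def average_improvement_over_baseline(model_acc: float, baseline_acc: float) -> float:
    """Compute AIB = (A_m - A_b) / (1 - A_b) for accuracies given as fractions."""
    if not 0.0 <= baseline_acc < 1.0:
        # A baseline of exactly 1 leaves no room for improvement (division by zero).
        raise ValueError("baseline accuracy must be in [0, 1)")
    return (model_acc - baseline_acc) / (1.0 - baseline_acc)


# Solvents, dataset A in the table: baseline 47%, model 58%.
aib = average_improvement_over_baseline(0.58, 0.47)
print(f"AIB = {aib:.0%}")  # prints "AIB = 21%"
```

Normalising by `1 - A_b` expresses the gain as a share of the headroom left above the baseline, which makes datasets with very different baseline accuracies comparable.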
 
-We plan to show the usefulness of ORDerly by training ML models from the literature on ORDerly for standardised prediction tasks. Prediction tasks include:
-- Yield prediction
-    - https://chemrxiv.org/engage/chemrxiv/article-details/6150143118be8575b030ad43
-- Retrosynthesis
-- Forward prediction
-- Condition prediction
 
-We may be able to use https://deepchem.io/models
 
+Full API documentation
+------------------------
 
-## Appendix
+## Extraction
+There are two different ways to extract data from ORD files: trusting the labelling, or using the reaction string (selected with the ```trust_labelling``` boolean). Below are the arguments that can be passed to the extraction script; change them as appropriate:
 
-### Solvents
+``` orderly extract --name_contains_substring="uspto" --trust_labelling=False --output_path="data/orderly/uspto_no_trust" --consider_molecule_names=False```
 
-In data/solvents.csv you'll find a list of solvents which we use to label solvents (to avoid relying on the labelling in ORD). This list was created from the intersection of solvents coming from three different sources. The following procedure was followed for the construction of solvents.csv:
+## Cleaning
+There are also a number of customisable steps for the cleaning:
+
+```orderly clean --output_path="data/orderly/datasets_$(dataset_version)/orderly_no_trust_no_map.parquet" --ord_extraction_path="data/orderly/uspto_no_trust/extracted_ords" --molecules_to_remove_path="data/orderly/uspto_no_trust/all_molecule_names.csv" --min_frequency_of_occurrence=100 --map_rare_molecules_to_other=False --set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn=True --remove_rxn_with_unresolved_names=False --set_unresolved_names_to_none=False --num_product=1 --num_reactant=2 --num_solv=2 --num_agent=3 --num_cat=0 --num_reag=0 --consistent_yield=True --scramble=True --train_test_split_fraction=0.9```
+
+
+## Issues?
+Submit an [issue](https://github.com/sustainable-processes/ORDerly/issues) or send an email to [email protected].
+
+## Citing
+
+If you find this project useful, we encourage you to
+
+* Star this repository :star: 
+<!-- * Cite our [paper](https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202000051).
+```
+@article{Felton2021,
+author = "Kobi Felton and Jan Rittig and Alexei Lapkin",
+title = "{Summit: Benchmarking Machine Learning Methods for Reaction Optimisation}",
+year = "2021",
+month = "2",
+url = "https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202000051",
+journal = "Chemistry Methods"
+} 
+```-->
+
+
+
+
+<!-- ### 2. Run extraction
+
+We can run extraction using: ```poetry run python -m orderly.extract```. Using ```poetry run python -m orderly.extract --help``` will explain the arguments. Certain args must be set such as data paths.
+
+### 3. Run cleaning
 
-1. Data curation: We compiled a list of solvent names from the following 3 sources. Unfortunately they did not include SMILES strings.
- - https://doi.org/10.1039/C9SC01844A
- - https://www.acs.org/greenchemistry/research-innovation/tools-for-green-chemistry/solvent-selection-tool.html
- - https://github.com/sustainable-processes/summit/blob/main/data/ucb_pharma_approved_list.csv
+We can run cleaning using: ```poetry run python -m orderly.clean```. Using ```poetry run python -m orderly.clean --help``` will explain the arguments. Certain args must be set such as data paths. -->
 
-2. Filtering: Make all solvent names lower case, strip spaces, find and remove duplicate names. (Before: 458+272+115=845 rows in total. After removing duplicates: 615)
-3. Name resolution: The dataframe has 4 columns of identifiers: 3 for (English) solvent names, and 1 for CAS numbers. We ran [Pura](https://github.com/sustainable-processes/pura) with <services=[PubChem(autocomplete=True), Opsin(), CIR(),]> and <agreement=2> separately on each of the three solvent name columns, and <services=[CAS()]>, <agreement=1> to resolve the CAS numbers.
-4. Agreement: We now had up to 4 SMILES strings for each solvent, and the SMILES string was trusted when all of them were in agreement (either the other items were the same SMILES or empty). There were ~40 rows with disagreement, and these were resolved manually (by cross-checking name and CAS with PubChem/Wikipedia).
-5. The final dataset consists of the following columns: 'solvent_name_1', 'solvent_name_2', 'solvent_name_3', 'cas_number', 'chemical_formula', 'smiles', 'source'.
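Note on the removed solvents appendix: the filtering and agreement steps (2 and 4) in the list above can be sketched roughly as follows. This is an illustration, not the code used to build solvents.csv. The helper names `normalise_names` and `agreed_smiles` are hypothetical, "strip spaces" is assumed to mean trimming surrounding whitespace, and SMILES strings are compared as plain text (a real pipeline would probably canonicalise them first).

```python
from __future__ import annotations

from typing import Iterable, Optional


def normalise_names(names: Iterable[str]) -> list[str]:
    """Lower-case names, trim surrounding whitespace, and drop duplicates (first-seen order kept)."""
    seen: set[str] = set()
    unique: list[str] = []
    for name in names:
        key = name.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def agreed_smiles(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the SMILES if every non-empty candidate is identical, else None.

    None means either no candidate resolved or the candidates disagree,
    i.e. the row would need manual review.
    """
    non_empty = {c for c in candidates if c}
    if len(non_empty) == 1:
        return next(iter(non_empty))
    return None


print(normalise_names([" Water", "water ", "Ethanol"]))  # ['water', 'ethanol']
print(agreed_smiles(["CCO", None, "CCO", ""]))            # CCO
print(agreed_smiles(["CCO", "OCC", None]))                # None
```

The last call shows why text comparison is only an approximation: `CCO` and `OCC` both describe ethanol, so without canonicalisation they would be flagged as a disagreement.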
diff --git a/images/abstract_fig.png b/images/abstract_fig.png
diff --git a/images/abstract_fig_transparent.png b/images/abstract_fig_transparent.png
diff --git a/notebooks/ORDerly_args_decisions.ipynb b/notebooks/ORDerly_args_decisions.ipynb