This repo contains a workflow for classifying chemical compounds as active or inactive based on experimental outputs and a set of previously computed descriptors. Classification is performed with scikit-learn's DecisionTreeClassifier and yields models that can be both interpretable and predictive.
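As a rough illustration of the idea (not the notebook's actual code), the sketch below labels synthetic compounds as active when a made-up yield exceeds a cutoff and fits a one-split DecisionTreeClassifier to them. The data, the 50.0 threshold, and the feature labels x1 and x2 are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic descriptors: 200 hypothetical compounds, 2 features in [0, 1).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))

# Synthetic experimental output that depends only on the first feature.
yields = 100.0 * X[:, 0]

# Hypothetical activity cutoff: compounds at or above it count as active (1).
threshold = 50.0
y = (yields >= threshold).astype(int)

# A depth-1 tree learns a single feature threshold, which keeps the
# resulting rule easy to read.
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(X, y)

# Text rendering of the learned split and the accuracy on the training data.
rules = export_text(clf, feature_names=["x1", "x2"])
train_accuracy = clf.score(X, y)
print(rules)
print(f"training accuracy: {train_accuracy:.2f}")
```

Because the synthetic label depends on a single descriptor, one split is enough to separate the classes here; real data will generally need more care.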
A conda environment is provided in the multithreshold_env.yml file. To create the environment, run
conda env create --file=multithreshold_env.yml --name=multithreshold
replacing multithreshold with any name you prefer, then set the environment as the kernel for the Jupyter notebook.
If you'd prefer to create your own environment, here is a list of known dependencies:
- Python 3.9.16
- NumPy 1.23.5
- pandas 1.5.3
- Matplotlib 3.7.0
- ipykernel 6.15.0
- scikit-learn 1.2.1
- openpyxl 3.0.10
- ipympl 0.9.3
An environment with these packages can be created using the command
conda create -n multithreshold -c conda-forge python=3.9.16 numpy=1.23.5 pandas=1.5.3 matplotlib=3.7.0 ipykernel=6.15.0 scikit-learn=1.2.1 openpyxl=3.0.10 ipympl=0.9.3
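If you build the environment by hand, a quick check along the following lines can confirm that the installed versions match the list above. This is only a sketch: the helper name check_versions is hypothetical and is not part of this repo. It relies on the standard-library importlib.metadata module.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the dependency list in this README.
EXPECTED = {
    "numpy": "1.23.5",
    "pandas": "1.5.3",
    "matplotlib": "3.7.0",
    "ipykernel": "6.15.0",
    "scikit-learn": "1.2.1",
    "openpyxl": "3.0.10",
    "ipympl": "0.9.3",
}


def check_versions(expected):
    """Return {package: (expected, installed)} for every mismatch.

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    for name, (wanted, installed) in problems.items():
        print(f"{name}: expected {wanted}, found {installed}")
    if not problems:
        print("All listed packages match.")
```

Other versions may well work; a mismatch here only means the environment differs from the one listed above.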
The full workflow can be run via the Multi-Threshold Analysis.ipynb notebook. Supporting functions and classes can be found in the hotspot_utils.py and hotspot_classes.py files, respectively. Input data should be formatted as shown in the 'Suzuki Yields and Parameters' sheet of the 'InputData/Multi-Threshold Analysis Data.xlsx' file: a column of compound/experiment identifiers, followed by all output columns, then all feature columns. The first row should contain x# labels for each feature, and the second row should contain the feature names.
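To make the expected layout concrete, here is a minimal sketch that splits a sheet in this format into identifiers, outputs, and features. It is not the parsing code used by the notebook. The function name split_layout, the tiny example table, and the assumption that the first-row cells above the identifier and output columns are empty are all illustrative.

```python
import re

import pandas as pd


def split_layout(raw):
    """Split a header-less sheet into (ids, outputs, features, feature_names).

    Assumes row 0 holds x# labels above the feature columns only, row 1
    holds a name for every column, and the data starts at row 2.
    """
    label_row = raw.iloc[0]
    name_row = raw.iloc[1]
    body = raw.iloc[2:].reset_index(drop=True)

    # Feature columns are the ones whose first-row cell looks like x1, x2, ...
    feature_cols = [
        c for c, v in zip(raw.columns, label_row)
        if re.fullmatch(r"x\d+", str(v))
    ]
    other_cols = [c for c in raw.columns if c not in feature_cols]
    id_col, output_cols = other_cols[0], other_cols[1:]

    ids = body[id_col].rename(name_row[id_col])
    outputs = body[output_cols].astype(float)
    outputs.columns = [name_row[c] for c in output_cols]
    features = body[feature_cols].astype(float)
    features.columns = [label_row[c] for c in feature_cols]
    feature_names = {label_row[c]: name_row[c] for c in feature_cols}
    return ids, outputs, features, feature_names


# Tiny stand-in for a sheet read without headers, for example with
# pd.read_excel(path, sheet_name=..., header=None). All values are made up.
raw = pd.DataFrame([
    [None, None, "x1", "x2"],
    ["Compound", "Yield", "Feature A", "Feature B"],
    ["c1", 12.0, 0.1, 1.5],
    ["c2", 87.0, 0.9, 2.5],
])

ids, outputs, features, feature_names = split_layout(raw)
print(features)
print(feature_names)
```

Keeping the x# labels as column names and the descriptive names in a separate mapping makes it easy to report a learned threshold on, say, x2 alongside its human-readable name.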