Merge pull request #26 from JacksonBurns/v1-1/jcheminf_revisions
Version 1.1 🎉
JacksonBurns authored Jan 16, 2025
2 parents 1a14915 + 01ac026 commit 3089521
Showing 17 changed files with 227 additions and 142 deletions.
13 changes: 9 additions & 4 deletions README.md
@@ -12,8 +12,8 @@
</p>

# Announcements
-## alphaXiv
-The `fastprop` paper is freely available online at [arxiv.org/abs/2404.02058](https://arxiv.org/abs/2404.02058) and we are conducting open source peer review on [alphaXiv](https://alphaxiv.org/abs/2404.02058) - comments are appreciated!
+## alphaXiv Paper
+The companion academic paper describing `fastprop` is freely available online at [alphaXiv](https://www.alphaxiv.org/abs/2404.02058).
The source for the paper is stored in this repository under the `paper` directory.

## Initial Release :tada:
@@ -22,7 +22,7 @@ Please try `fastprop` on your datasets and let us know what you think.
Feature requests and bug reports are **very** appreciated!

# Installing `fastprop`
-`fastprop` supports Mac, Windows, and Linux on Python versions 3.8 to 3.12.
+`fastprop` supports Mac, Windows, and Linux on Python versions 3.8 or newer.
Installing from `pip` is the best way to get `fastprop`, but if you need to check out a specific GitHub branch or want to contribute to `fastprop`, a source installation is recommended.
Pending interest from users, a `conda` package will be added.
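
As a sketch, the usual commands would look like the following (the README above names `pip` as the preferred route; the repository URL is inferred from the commit author, so verify it before use):

```bash
# install the released package from PyPI
pip install fastprop

# or, for a source installation (e.g. to contribute or use a specific branch)
git clone https://github.com/JacksonBurns/fastprop.git
cd fastprop
pip install -e .
```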

@@ -69,14 +69,15 @@ There are four distinct steps in `fastprop` that define its framework:
_or_
- Load precomputed descriptors: filepath where descriptors have already been cached, either manually or by `fastprop`
2. Preprocessing
- standardize: call `rdkit`'s `rdMolStandardize.Cleanup` function on the input molecules before calculating descriptors (`False` by default)
- _not configurable_: `fastprop` will always rescale input features, set invariant and missing features to zero, and impute missing values with the per-feature mean (see the illustrative sketch after this list)
3. Training
- Number of Repeats: How many times to split/train/test on the dataset (increments random seed by 1 each time).

_and_
- Number of FNN layers: how many repeated fully connected layers of the hidden size below (default 2)
- Hidden Size: number of neurons per FNN layer (default 1800)
-- Clamp Input: Enable/Disable input clamp to +/-3 to aid in extrapolation (default False).
+- Clamp Input: Enable/Disable input clamp to +/-3 (winsorization) to aid in extrapolation (default False).

_or_
- Hyperparameter optimization: runs hyperparameter optimization to identify the optimal number of layers and hidden size
@@ -86,6 +87,7 @@ There are four distinct steps in `fastprop` that define its framework:
- Learning rate
- Batch size
- Problem type (one of: regression, binary, multiclass (start labels from 0), multilabel)
- Training, Validation, and Testing fraction (set testing to zero to use all data for training and validation)
4. Prediction
- Input SMILES: either a single SMILES or file of SMILES strings on individual lines
- Output format: filepath to write the results to (defaults to stdout if not given)
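
To make the fixed preprocessing of step 2 concrete, here is a minimal numpy/scikit-learn sketch of the behavior described above — zeroing all-missing and invariant features, mean-imputing the rest, and rescaling. It is an illustration only, not `fastprop`'s actual implementation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess(X):
    """Sketch of the fixed preprocessing steps described above."""
    X = np.asarray(X, dtype=float).copy()
    # Features that are entirely missing carry no information: set them to zero.
    all_nan = np.isnan(X).all(axis=0)
    X[:, all_nan] = 0.0
    # Impute the remaining missing values with the per-feature mean.
    means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    # Invariant (zero-variance) features are likewise uninformative: zero them.
    invariant = X.std(axis=0) == 0
    X[:, invariant] = 0.0
    # Rescale everything else to zero mean and unit variance.
    X[:, ~invariant] = StandardScaler().fit_transform(X[:, ~invariant])
    return X
```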
@@ -102,6 +104,9 @@ After installation, `fastprop` is accessible from the command line via `fastprop`

Try `fastprop --help` or `fastprop subcommand --help` for more information and see below.

> [!TIP]
> `fastprop` will use all of your CPUs for descriptor calculation by default - set the `MORDRED_NUM_PROC` environment variable to some other number to change this behavior.
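
For example, to cap descriptor calculation at four processes (shell syntax; only the variable name comes from the tip above):

```bash
export MORDRED_NUM_PROC=4
```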
### Configuration File [recommended]
See `examples/example_fastprop_train_config.yml` for a configuration file showing all options that can be configured during training.
It covers everything shown in the [Configurable Parameters](#configurable-parameters) section.
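
As a sketch of what such a file might look like — the first four keys appear verbatim in the `examples/example_fastprop_train_config.yml` hunk later in this diff, while the remaining key names are assumptions based on the Configurable Parameters list:

```yaml
# sketch only -- see examples/example_fastprop_train_config.yml for the real template
optimize: False        # confirmed key: run hyperparameter optimization
descriptor_set: all    # confirmed key: all or optimized
standardize: False     # confirmed key: rdMolStandardize.Cleanup before descriptors
enable_cache: True     # confirmed key: cache calculated descriptors
problem_type: regression  # assumed key name
hidden_size: 1800         # assumed key name
```

This would then be run with something like `fastprop train config.yml` (subcommand name assumed; check `fastprop --help`).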
63 changes: 0 additions & 63 deletions benchmarks/pgp/benchmark_data.csv.1

This file was deleted.

2 changes: 2 additions & 0 deletions examples/example_fastprop_train_config.yml
@@ -26,6 +26,8 @@ optimize: False # True
#
# Which set of descriptors to calculate (either all or optimized)
descriptor_set: all
# Call rdMolStandardize.Cleanup on molecules before calculating descriptors
standardize: False
# Allow caching of descriptors
enable_cache: True
#
3 changes: 1 addition & 2 deletions examples/oom_training.py
@@ -227,8 +227,7 @@ def __init__(
# mock the target scaler used for reporting some human-readable metrics
self.target_scaler = SimpleNamespace(n_features_in_=1, inverse_transform=lambda i: np.array(i))

-def setup(self, stage=None):
-    ... # skip feature scaling and dataset splitting
+def setup(self, stage=None): ... # skip feature scaling and dataset splitting

def _init_dataloader(self, shuffle, idxs):
return TorchDataloader(
