Version 1.1 🎉 #26

Merged: 18 commits, Jan 16, 2025
13 changes: 9 additions & 4 deletions README.md
@@ -12,8 +12,8 @@
</p>

# Announcements
-## alphaXiv
-The `fastprop` paper is freely available online at [arxiv.org/abs/2404.02058](https://arxiv.org/abs/2404.02058) and we are conducting open source peer review on [alphaXiv](https://alphaxiv.org/abs/2404.02058) - comments are appreciated!
+## alphaXiv Paper
+The companion academic paper describing `fastprop` is freely available online at [alphaXiv](https://www.alphaxiv.org/abs/2404.02058).
+The source for the paper is stored in this repository under the `paper` directory.

## Initial Release :tada:
@@ -22,7 +22,7 @@ Please try `fastprop` on your datasets and let us know what you think.
Feature requests and bug reports are **very** appreciated!

# Installing `fastprop`
-`fastprop` supports Mac, Windows, and Linux on Python versions 3.8 to 3.12.
+`fastprop` supports Mac, Windows, and Linux on Python version 3.8 or newer.
Installing from `pip` is the best way to get `fastprop`, but if you need to check out a specific GitHub branch or you want to contribute to `fastprop`, a source installation is recommended.
Pending interest from users, a `conda` package will be added.

@@ -69,14 +69,15 @@ There are four distinct steps in `fastprop` that define its framework:
_or_
- Load precomputed descriptors: filepath where descriptors have already been cached, either manually or by `fastprop`
2. Preprocessing
+- standardize: call `rdkit`'s `rdMolStandardize.Cleanup` function on the input molecules before calculating descriptors (`False` by default)
- _not configurable_: `fastprop` will always rescale input features, set invariant and missing features to zero, and impute missing values with the per-feature mean
3. Training
- Number of Repeats: how many times to split/train/test on the dataset (the random seed is incremented by 1 for each repeat).

_and_
- Number of FNN layers: how many repeated fully connected layers to use (default 2; each layer has the hidden size below)
- Hidden Size: number of neurons per FNN layer (default 1800)
-- Clamp Input: Enable/Disable input clamp to +/-3 to aid in extrapolation (default False).
+- Clamp Input: Enable/Disable input clamping to +/-3 (winsorization) to aid in extrapolation (default False; see the sketch after this list).

_or_
- Hyperparameter optimization: runs hyperparameter optimization to identify the optimal number of layers and hidden size
@@ -86,6 +87,7 @@ There are four distinct steps in `fastprop` that define its framework:
- Learning rate
- Batch size
- Problem type (one of: regression, binary, multiclass (start labels from 0), multilabel)
+- Training, Validation, and Testing fraction (set testing to zero to use all data for training and validation)
4. Prediction
- Input SMILES: either a single SMILES string or a file of SMILES strings, one per line
- Output format: filepath to write the results to, or nothing to print to stdout (the default)
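
Winsorization here just means hard-limiting each (already rescaled) feature to a fixed range so that out-of-distribution inputs cannot drive extreme activations. A minimal standalone sketch of the idea, not `fastprop`'s internal code (the function name is illustrative):

```python
import numpy as np

def clamp_features(x: np.ndarray, limit: float = 3.0) -> np.ndarray:
    """Winsorize rescaled descriptors to [-limit, +limit].

    After standardization most in-distribution features fall within a
    few standard deviations of zero, so the clamp only alters extreme
    (likely out-of-distribution) values.
    """
    return np.clip(x, -limit, limit)
```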
@@ -102,6 +104,9 @@ After installation, `fastprop` is accessible from the command line via `fastprop`.

Try `fastprop --help` or `fastprop subcommand --help` for more information and see below.

+> [!TIP]
+> `fastprop` will use all of your CPUs for descriptor calculation by default - set the `MORDRED_NUM_PROC` environment variable to some other number to change this behavior.
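
For example, a minimal way to pin the worker count from Python (setting the variable in your shell works just as well; the value `4` here is arbitrary):

```python
import os

# Must be set before descriptor calculation begins so the worker
# pool is created with the requested size.
os.environ["MORDRED_NUM_PROC"] = "4"  # use 4 processes instead of every CPU
```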

### Configuration File [recommended]
See `examples/example_fastprop_train_config.yaml` for a configuration file that shows all of the options that can be configured during training.
It covers everything shown in the [Configurable Parameters](#configurable-parameters) section.
63 changes: 0 additions & 63 deletions benchmarks/pgp/benchmark_data.csv.1

This file was deleted.

2 changes: 2 additions & 0 deletions examples/example_fastprop_train_config.yml
@@ -26,6 +26,8 @@ optimize: False # True
#
# Which set of descriptors to calculate (either all or optimized)
descriptor_set: all
+# Call rdMolStandardize.Cleanup on molecules before calculating descriptors
+standardize: False
# Allow caching of descriptors
enable_cache: True
#
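
The new `standardize` key corresponds to calling `rdkit`'s `rdMolStandardize.Cleanup` on each input molecule before descriptor calculation. A minimal sketch of what that cleanup amounts to, using `rdkit`'s public API (the `standardize_smiles` helper and its skip-on-parse-failure behavior are illustrative, not `fastprop`'s internal code):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_smiles(smiles_list):
    """Clean up molecules (normalization, reionization, etc.) before
    calculating descriptors, mirroring the `standardize: True` option."""
    cleaned = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # unparseable SMILES; skipping is an illustrative choice
        cleaned.append(rdMolStandardize.Cleanup(mol))
    return cleaned
```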
3 changes: 1 addition & 2 deletions examples/oom_training.py
@@ -227,8 +227,7 @@ def __init__(
        # mock the target scaler used for reporting some human-readable metrics
        self.target_scaler = SimpleNamespace(n_features_in_=1, inverse_transform=lambda i: np.array(i))

-    def setup(self, stage=None):
-        ... # skip feature scaling and dataset splitting
+    def setup(self, stage=None): ...  # skip feature scaling and dataset splitting

    def _init_dataloader(self, shuffle, idxs):
        return TorchDataloader(