Added MSA and separated Metric from Dataset (URI-ABD#226)
* feat: added msa and separated metric from dataset

* docs: updated main README

* fmt: python formatting

* fix: corrected overlapp_with method for SquishyBall

* feat: added min and max methods for Number trait

* wip: disconnecting msa and pancakes
nishaq503 authored Jan 19, 2025
1 parent e034a28 commit 75cf0c0
Showing 217 changed files with 17,384 additions and 8,121 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ logs
.tmp-earthly-out
.vscode/settings.json
.ruff_cache
*.svg

################################################################################
# Rust. Generated by Cargo #
43 changes: 30 additions & 13 deletions Cargo.toml
@@ -3,45 +3,62 @@ members = [
"crates/abd-clam",
"crates/distances",
"crates/symagen",
"crates/results/chaoda",
"crates/results/cakes",
"crates/results/chaoda",
"crates/results/rite-solutions",
"crates/results/msa",
"pypi/distances",
"pypi/results/cakes",
"benches/utils",
"benches/cakes",
]
resolver = "2"

[workspace.dependencies]
abd-clam = { version = "0.31.0", path = "crates/abd-clam" }
abd-clam = { version = "0.32.0", path = "crates/abd-clam" }
distances = { version = "1.8.0", path = "crates/distances" }
symagen = { version = "0.5.0", path = "crates/symagen" }

rayon = "1.8"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
bincode = "1.3"
ftlog = "0.2.0"
# bitcode = { version = "0.5" }
bitcode = { git = "https://github.com/nishaq503/bitcode.git", rev = "1c393ad97288555fc3fe41b292b2bd826486a992" }
libm = "0.2"
ndarray = { version = "0.15.6", features = ["rayon", "approx"] }
ndarray-npy = "0.8.0"
ordered-float = "4.2"
flate2 = { version = "1.0", features = ["zlib"] }
ndarray = { version = "0.16", features = ["rayon", "approx"] }
ndarray-npy = "0.9"
csv = { version = "1.3.0" }
flate2 = { version = "1.0" }
# For GCD and LCM calculations.
num-integer = "0.1"
# For reading fasta files.
bio = "2.0"
# For a faster implementation of Levenshtein distance.
stringzilla = "3.10"
# For CLI tools
clap = { version = "4.5", features = ["derive"] }
# For low-latency logging from multiple threads.
ftlog = { version = "0.2" }
# For reading and writing HDF5 files.
hdf5 = { package = "hdf5-metno", version = "0.9.0" }

# Python wrapper dependencies
numpy = "0.20.0"
pyo3 = { version = "0.20", features = ["extension-module", "abi3-py39"] }
pyo3-ffi = { version = "0.20", features = ["extension-module", "abi3-py39"] }
# For Python Wrappers
numpy = "0.23"
pyo3 = { version = "0.23", features = ["extension-module", "abi3-py39"] }
pyo3-ffi = { version = "0.23", features = ["extension-module", "abi3-py39"] }

[profile.test]
opt-level = 3
debug = true
overflow-checks = true

[profile.release]
# debug = true
opt-level = 3
strip = true
lto = true
codegen-units = 1

[profile.bench]
opt-level = 3
debug = true
overflow-checks = true
8 changes: 5 additions & 3 deletions Earthfile
@@ -31,7 +31,7 @@ ENV PATH="${RYE_HOME}/shims:${PATH}"

# This target prepares the recipe.json file for the build stage.
chef-prepare:
COPY --dir crates pypi .
COPY --dir benches crates pypi .
COPY Cargo.toml .
RUN cargo chef prepare
SAVE ARTIFACT recipe.json
@@ -42,6 +42,7 @@ chef-cook:
RUN cargo chef cook --release
COPY Cargo.toml pyproject.toml requirements.lock requirements-dev.lock ruff.toml rustfmt.toml .
# TODO: Replace with recursive globbing, blocked on https://github.com/earthly/earthly/issues/1230
COPY --dir benches .
COPY --dir crates .
COPY --dir pypi .
RUN rye sync --no-lock
@@ -67,17 +68,18 @@ lint:
# Apply any automated fixes.
fix:
FROM +chef-cook
RUN cargo fmt --all
RUN cargo fmt --all --all-features
RUN rye fmt --all
RUN cargo clippy --fix --allow-no-vcs
RUN rye lint --fix
SAVE ARTIFACT benches AS LOCAL ./
SAVE ARTIFACT crates AS LOCAL ./
SAVE ARTIFACT pypi AS LOCAL ./

# This target runs the tests.
test:
FROM +chef-cook
RUN cargo test --release --lib --bins --examples --tests --all-features
RUN cargo test -r -p abd-clam --all-features -p distances -p symagen
# TODO: switch to --all, blocked on https://github.com/astral-sh/rye/issues/853
RUN rye test --package abd-distances

24 changes: 19 additions & 5 deletions README.md
@@ -5,7 +5,7 @@ The Rust implementation of CLAM.
As of writing this document, the project is still in a pre-1.0 state.
This means that the API is not yet stable and breaking changes may occur frequently.

## Components
## Rust Crates and Python Packages

This repository is a workspace that contains the following crates:

@@ -16,14 +16,28 @@ and the following Python packages:

- `abd-distances`: A Python wrapper for the `distances` crate, providing drop-in replacements for the distance functions in `scipy.spatial.distance`. See [here](python/distances/README.md) for more information.

## License
## Reproducing Results from Papers

- MIT
This repository contains CLI tools to reproduce results from some of our papers.

### CAKES

This paper is currently under review at SIMODS.
See [here](benches/cakes/README.md) for running Rust code to reproduce the results for the CAKES algorithms, and [here](benches/py-cakes/README.md) for running some Python code to generate plots from the results of running the Rust code.

### MSA

TODO

### PANCAKES

TODO

## Publications

- [CHESS](https://arxiv.org/abs/1908.08551)
- [CHAODA](https://arxiv.org/abs/2103.11774)
- [CHESS](https://arxiv.org/abs/1908.08551): Hierarchical Clustering and Ranged Nearest Neighbors Search
- [CHAODA](https://arxiv.org/abs/2103.11774): Anomaly Detection
- [PANCAKES](https://arxiv.org/pdf/2409.12161): Compression and Compressive Search

## Citation

16 changes: 16 additions & 0 deletions benches/cakes/Cargo.toml
@@ -0,0 +1,16 @@
[package]
name = "bench-cakes"
version = "0.1.0"
edition = "2021"

[dependencies]
clap = { version = "4.5.16", features = ["derive"] }
bench-utils = { path = "../utils" }
ftlog = { workspace = true }
bitcode = { workspace = true }
abd-clam = { workspace = true, features = ["disk-io"] }
distances = { workspace = true }
rand = { workspace = true }
rayon = { workspace = true }
stringzilla = "3.9.5"
augurs-dtw = { version = "0.8.1", features = ["parallel"] }
40 changes: 40 additions & 0 deletions benches/cakes/src/README.md
@@ -0,0 +1,40 @@
# Benchmarks for CAKES Search Algorithms

This crate provides a CLI to run benchmarks for the CAKES search algorithms and reproduce the results from our paper.

## Reproducing the Results

Let's say you have data from the [ANN-Benchmarks suite](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets) in a directory `../data/input` and you want to run the benchmarks for the CAKES search algorithms on the `sift` dataset.
You can run the following command:

```bash
cargo run --release --package bench-cakes -- \
--inp-dir ../data/input/ \
--dataset sift \
--out-dir ../data/output/ \
--seed 42 \
--num-queries 10000 \
--max-power 7 \
--max-time 300 \
--balanced-data \
--permuted-trees
```

This will run the CAKES search algorithms on the `sift` dataset with 10000 search queries.
The results will be saved in the directory `../data/output/`.
The dataset will be augmented by powers of 2 up to 2^7.
Each algorithm will be run for at least 300 seconds.
The `--balanced-data` flag will build trees with balanced partitions.
The `--permuted-trees` flag will permute the dataset into depth-first order after building the tree.
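
As a rough illustration of what "augmented by powers of 2 up to 2^7" could mean for dataset sizes, here is a minimal sketch. It is not taken from the `bench-cakes` source: the function names are hypothetical, and it assumes that the multipliers start at 2^0 (the original dataset) and that augmentation multiplies the cardinality by each multiplier.

```rust
/// Returns the multipliers 2^0, 2^1, ..., 2^max_power.
///
/// Assumption: the sequence starts at 2^0 (the unaugmented dataset).
/// The actual CLI may start at a different power.
fn augmentation_multipliers(max_power: u32) -> Vec<usize> {
    (0..=max_power).map(|p| 1_usize << p).collect()
}

/// Hypothetical helper: the cardinality of each augmented dataset, given the
/// cardinality of the original dataset.
fn augmented_cardinalities(base: usize, max_power: u32) -> Vec<usize> {
    augmentation_multipliers(max_power)
        .into_iter()
        .map(|m| base * m)
        .collect()
}
```

Under these assumptions, `--max-power 7` corresponds to eight dataset sizes, the largest being 128 times the cardinality of the original dataset.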

There are several other available options.
Running the following command will provide documentation on how to use the CLI:

```bash
cargo run --release --package bench-cakes -- --help
```

## Plotting the Results

The outputs from the benchmarks can be plotted using the Python package we provide at `../py-cakes`.
See the associated README for more information.