Added MSA and separated Metric from Dataset #226

Merged: 6 commits, Jan 19, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ logs
.tmp-earthly-out
.vscode/settings.json
.ruff_cache
*.svg

################################################################################
# Rust. Generated by Cargo #
43 changes: 30 additions & 13 deletions Cargo.toml
@@ -3,45 +3,62 @@ members = [
"crates/abd-clam",
"crates/distances",
"crates/symagen",
"crates/results/chaoda",
"crates/results/cakes",
"crates/results/chaoda",
"crates/results/rite-solutions",
"crates/results/msa",
"pypi/distances",
"pypi/results/cakes",
"benches/utils",
"benches/cakes",
]
resolver = "2"

[workspace.dependencies]
abd-clam = { version = "0.31.0", path = "crates/abd-clam" }
abd-clam = { version = "0.32.0", path = "crates/abd-clam" }
distances = { version = "1.8.0", path = "crates/distances" }
symagen = { version = "0.5.0", path = "crates/symagen" }

rayon = "1.8"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
bincode = "1.3"
ftlog = "0.2.0"
# bitcode = { version = "0.5" }
bitcode = { git = "https://github.com/nishaq503/bitcode.git", rev = "1c393ad97288555fc3fe41b292b2bd826486a992" }
libm = "0.2"
ndarray = { version = "0.15.6", features = ["rayon", "approx"] }
ndarray-npy = "0.8.0"
ordered-float = "4.2"
flate2 = { version = "1.0", features = ["zlib"] }
ndarray = { version = "0.16", features = ["rayon", "approx"] }
ndarray-npy = "0.9"
csv = { version = "1.3.0" }
flate2 = { version = "1.0" }
# For GCD and LCM calculations.
num-integer = "0.1"
# For reading fasta files.
bio = "2.0"
# For a faster implementation of Levenshtein distance.
stringzilla = "3.10"
# For CLI tools
clap = { version = "4.5", features = ["derive"] }
# For low-latency logging from multiple threads.
ftlog = { version = "0.2" }
# For reading and writing HDF5 files.
hdf5 = { package = "hdf5-metno", version = "0.9.0" }

# Python wrapper dependencies
numpy = "0.20.0"
pyo3 = { version = "0.20", features = ["extension-module", "abi3-py39"] }
pyo3-ffi = { version = "0.20", features = ["extension-module", "abi3-py39"] }
# For Python Wrappers
numpy = "0.23"
pyo3 = { version = "0.23", features = ["extension-module", "abi3-py39"] }
pyo3-ffi = { version = "0.23", features = ["extension-module", "abi3-py39"] }

[profile.test]
opt-level = 3
debug = true
overflow-checks = true

[profile.release]
# debug = true
opt-level = 3
strip = true
lto = true
codegen-units = 1

[profile.bench]
opt-level = 3
debug = true
overflow-checks = true
8 changes: 5 additions & 3 deletions Earthfile
@@ -31,7 +31,7 @@ ENV PATH="${RYE_HOME}/shims:${PATH}"

# This target prepares the recipe.json file for the build stage.
chef-prepare:
COPY --dir crates pypi .
COPY --dir benches crates pypi .
COPY Cargo.toml .
RUN cargo chef prepare
SAVE ARTIFACT recipe.json
@@ -42,6 +42,7 @@ chef-cook:
RUN cargo chef cook --release
COPY Cargo.toml pyproject.toml requirements.lock requirements-dev.lock ruff.toml rustfmt.toml .
# TODO: Replace with recursive globbing, blocked on https://github.com/earthly/earthly/issues/1230
COPY --dir benches .
COPY --dir crates .
COPY --dir pypi .
RUN rye sync --no-lock
@@ -67,17 +68,18 @@ lint:
# Apply any automated fixes.
fix:
FROM +chef-cook
RUN cargo fmt --all
RUN cargo fmt --all --all-features
RUN rye fmt --all
RUN cargo clippy --fix --allow-no-vcs
RUN rye lint --fix
SAVE ARTIFACT benches AS LOCAL ./
SAVE ARTIFACT crates AS LOCAL ./
SAVE ARTIFACT pypi AS LOCAL ./

# This target runs the tests.
test:
FROM +chef-cook
RUN cargo test --release --lib --bins --examples --tests --all-features
RUN cargo test -r -p abd-clam --all-features -p distances -p symagen
# TODO: switch to --all, blocked on https://github.com/astral-sh/rye/issues/853
RUN rye test --package abd-distances

24 changes: 19 additions & 5 deletions README.md
@@ -5,7 +5,7 @@ The Rust implementation of CLAM.
As of writing this document, the project is still in a pre-1.0 state.
This means that the API is not yet stable and breaking changes may occur frequently.

## Components
## Rust Crates and Python Packages

This repository is a workspace that contains the following crates:

@@ -16,14 +16,28 @@ and the following Python packages:

- `abd-distances`: A Python wrapper for the `distances` crate, providing drop-in replacements for the distance functions in `scipy.spatial.distance`. See [here](python/distances/README.md) for more information.

## License
## Reproducing Results from Papers

- MIT
This repository contains CLI tools to reproduce results from some of our papers.

### CAKES

This paper is currently under review at SIMODS.
See [here](benches/cakes/README.md) for running Rust code to reproduce the results for the CAKES algorithms, and [here](benches/py-cakes/README.md) for running some Python code to generate plots from the results of running the Rust code.

### MSA

TODO

### PANCAKES

TODO

## Publications

- [CHESS](https://arxiv.org/abs/1908.08551)
- [CHAODA](https://arxiv.org/abs/2103.11774)
- [CHESS](https://arxiv.org/abs/1908.08551): Hierarchical Clustering and Ranged Nearest Neighbors Search
- [CHAODA](https://arxiv.org/abs/2103.11774): Anomaly Detection
- [PANCAKES](https://arxiv.org/pdf/2409.12161): Compression and Compressive Search

## Citation

16 changes: 16 additions & 0 deletions benches/cakes/Cargo.toml
@@ -0,0 +1,16 @@
[package]
name = "bench-cakes"
version = "0.1.0"
edition = "2021"

[dependencies]
clap = { version = "4.5.16", features = ["derive"] }
bench-utils = { path = "../utils" }
ftlog = { workspace = true }
bitcode = { workspace = true }
abd-clam = { workspace = true, features = ["disk-io"] }
distances = { workspace = true }
rand = { workspace = true }
rayon = { workspace = true }
stringzilla = "3.9.5"
augurs-dtw = { version = "0.8.1", features = ["parallel"] }
40 changes: 40 additions & 0 deletions benches/cakes/src/README.md
@@ -0,0 +1,40 @@
# Benchmarks for CAKES Search Algorithms

This crate provides a CLI to run benchmarks for the CAKES search algorithms and reproduce the results from our paper.

## Reproducing the Results

Let's say you have data from the [ANN-Benchmarks suite](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets) in a directory `../data/input` and you want to run the benchmarks for the CAKES search algorithms on the `sift` dataset.
You can run the following command:

```bash
cargo run --release --package bench-cakes -- \
--inp-dir ../data/input/ \
--dataset sift \
--out-dir ../data/output/ \
--seed 42 \
--num-queries 10000 \
--max-power 7 \
--max-time 300 \
--balanced-data \
--permuted-trees
```

This will run the CAKES search algorithms on the `sift` dataset with 10000 search queries.
The results will be saved in the directory `../data/output/`.
The dataset will be augmented by powers of 2 up to 2^7.
Each algorithm will be run for at least 300 seconds.
The `--balanced-data` flag will build trees with balanced partitions.
The `--permuted-trees` flag will permute the dataset into depth-first order after building the tree.
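
To illustrate what permuting a dataset into depth-first order means, here is a minimal sketch. It assumes a simple binary tree whose leaves hold indices into the dataset. The names `Node`, `depth_first_order`, and `permute` are hypothetical and are not part of the `abd-clam` API. The actual implementation may differ. The presumed benefit is that, after reordering, the items of every subtree occupy one contiguous range of the dataset.

```rust
/// A hypothetical binary tree over dataset indices (not abd-clam's cluster type).
enum Node {
    /// A leaf stores the indices of the items it contains.
    Leaf(Vec<usize>),
    /// A branch has a left and a right child.
    Branch(Box<Node>, Box<Node>),
}

/// Appends the item indices of `node` to `out` in depth-first, left-to-right order.
fn depth_first_order(node: &Node, out: &mut Vec<usize>) {
    match node {
        Node::Leaf(items) => out.extend_from_slice(items),
        Node::Branch(left, right) => {
            depth_first_order(left, out);
            depth_first_order(right, out);
        }
    }
}

/// Returns a copy of `data` in which position `i` holds the item originally at `order[i]`.
fn permute<T: Clone>(data: &[T], order: &[usize]) -> Vec<T> {
    order.iter().map(|&i| data[i].clone()).collect()
}

fn main() {
    // A toy tree over five items with indices 0..5.
    let tree = Node::Branch(
        Box::new(Node::Leaf(vec![3, 0])),
        Box::new(Node::Branch(
            Box::new(Node::Leaf(vec![4])),
            Box::new(Node::Leaf(vec![1, 2])),
        )),
    );
    let data = vec!["a", "b", "c", "d", "e"];

    let mut order = Vec::new();
    depth_first_order(&tree, &mut order);
    let permuted = permute(&data, &order);

    // After permutation, the left subtree's items occupy positions 0..2
    // and the right subtree's items occupy positions 2..5.
    println!("{order:?} -> {permuted:?}");
}
```

In this toy example, the left subtree's items end up in positions 0..2 and the right subtree's in positions 2..5, so a subtree can be described by an offset and a length rather than a list of indices.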

There are several other available options.
Running the following command will provide documentation on how to use the CLI:

```bash
cargo run --release --package bench-cakes -- --help
```

## Plotting the Results

The outputs from the benchmarks can be plotted using the Python package we provide at `../py-cakes`.
See the associated README for more information.