[WIP] Using bitmagic for internal Nodegraph representation #1221

Closed
wants to merge 16 commits into from
7 changes: 1 addition & 6 deletions .github/workflows/build_wheel.yml
@@ -18,7 +18,6 @@ jobs:
linux-x86_64,
linux-aarch64,
linux-ppc64le,
linux-s390x,
macos-x86_64,
]
include:
@@ -34,10 +33,6 @@ jobs:
os: ubuntu-18.04
arch: ppc64le
macos_target: ''
- build: linux-s390x
os: ubuntu-18.04
arch: s390x
macos_target: ''
- build: macos-x86_64
os: macos-latest
arch: x86_64
@@ -65,10 +60,10 @@ jobs:
env:
CIBW_BUILD: "cp39-*"
- CIBW_SKIP: "*-win32 *-manylinux_i686 *-musllinux_ppc64le *-musllinux_s390x"
+ CIBW_SKIP: "*-win32 *-manylinux_i686"
CIBW_BEFORE_BUILD: 'source .ci/install_cargo.sh'
CIBW_ENVIRONMENT: 'PATH="$HOME/.cargo/bin:$PATH"'
CIBW_ENVIRONMENT_MACOS: ${{ matrix.macos_target }}
CIBW_BUILD_VERBOSITY: 3
CIBW_ARCHS_LINUX: ${{ matrix.arch }}
CIBW_ARCHS_MACOS: ${{ matrix.arch }}

55 changes: 55 additions & 0 deletions doc/developer.md
@@ -263,6 +263,61 @@ For the Rust core library we use `rMAJOR.MINOR.PATCH`
The Rust version is not automated,
and must be bumped in `src/core/Cargo.toml`.

## Nodegraph compatibility with khmer

For more information, check the [binary formats](https://khmer.readthedocs.io/en/latest/dev/binary-file-formats.html) section in khmer.

### Version 4 (same as khmer)

The header has the format below, with fields listed in order of file offset. Value
macro definitions are given in parentheses.

| Field | Len | Off | Value |
| ----------------- | --- | --- | ------------------------------------------- |
| Magic string | 4 | 0 | ``OXLI`` (``SAVED_SIGNATURE``) |
| Version | 1 | 4 | ``0x04`` (``SAVED_FORMAT_VERSION``) |
| File Type | 1 | 5 | ``0x02`` (``SAVED_HASHBITS``) |
| K-size | 4 | 6 | k-mer length. [``unsigned int``] |
| Number of Tables | 1 | 10 | Number of Nodegraph tables. [``uint8_t``] |
| Occupied Bins | 8 | 11 | Number of occupied bins |

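The header layout above can be decoded with a single fixed-size read. The following is a minimal sketch, not the khmer or sourmash implementation: it assumes the fields are packed back to back with no padding and stored little-endian (as on x86 hosts), and the names `HEADER_V4` and `parse_header_v4` are hypothetical.

```python
import struct

# Assumption: little-endian, no padding between fields.
# Field order and lengths follow the Version 4 header table:
# magic (4), version (1), file type (1), k-size (4), tables (1), occupied bins (8).
HEADER_V4 = struct.Struct("<4sBBIBQ")


def parse_header_v4(buf: bytes) -> dict:
    """Decode the fixed-size Version 4 header from the start of `buf`."""
    magic, version, file_type, ksize, n_tables, occupied = HEADER_V4.unpack_from(buf, 0)
    if magic != b"OXLI":
        raise ValueError(f"bad magic string: {magic!r}")
    if version != 0x04:
        raise ValueError(f"unsupported format version: {version}")
    if file_type != 0x02:
        raise ValueError(f"not a Nodegraph file (type {file_type})")
    return {
        "ksize": ksize,
        "n_tables": n_tables,
        "occupied_bins": occupied,
    }


# Round-trip a hand-built header: k=21, 3 tables, 1000 occupied bins.
example = HEADER_V4.pack(b"OXLI", 0x04, 0x02, 21, 3, 1000)
header = parse_header_v4(example)
```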
Then follow the Nodegraph's tables. For each table:

| Field | Len | Off | Value |
| ----------------- | ------ | --- | -------------------------------------------- |
| Table size | 8 | 0 | Length of table, **in bits** (``uint64_t``). |
| Bins | N/8+1 | 8 | This table's bytes, length given by previous field, divided by 8, plus 1 (``uint8_t``). |

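Because the table size is stored in bits while the bins are stored as bytes, a reader has to convert between the two before each read. The sketch below illustrates that step under two assumptions: `N/8+1` is integer division, and the size field is little-endian. `table_nbytes` and `read_tables_v4` are hypothetical helper names.

```python
import io
import struct


def table_nbytes(size_in_bits: int) -> int:
    """Bytes stored for one table: N/8 + 1 (assuming integer division)."""
    return size_in_bits // 8 + 1


def read_tables_v4(stream, n_tables: int) -> list:
    """Read `n_tables` (size_in_bits, raw_bytes) pairs from a binary stream."""
    tables = []
    for _ in range(n_tables):
        # 8-byte table size, in bits (assumed little-endian).
        (size_in_bits,) = struct.unpack("<Q", stream.read(8))
        nbytes = table_nbytes(size_in_bits)
        data = stream.read(nbytes)
        if len(data) != nbytes:
            raise ValueError("truncated table")
        tables.append((size_in_bits, data))
    return tables


# Two small all-zero tables of 19 and 23 bits, serialized back to back.
payload = b"".join(
    struct.pack("<Q", bits) + bytes(table_nbytes(bits)) for bits in (19, 23)
)
tables = read_tables_v4(io.BytesIO(payload), 2)
```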
### Version 5

Version 5 is incompatible with khmer Nodegraphs because it uses
[BitMagic](http://bitmagic.io) to serialize the tables.
It also stores the number of unique k-mers,
which both khmer and sourmash compute when adding new elements
but do not serialize in the version 4 binary format.

The header has the format below, with fields listed in order of file offset. Value
macro definitions are given in parentheses.

| Field | Len | Off | Value |
| ----------------- | --- | --- | ----------------------------------------- |
| Magic string | 4 | 0 | ``OXLI`` (``SAVED_SIGNATURE``) |
| Version | 1 | 4 | ``0x05`` (``SAVED_FORMAT_VERSION``) |
| File Type | 1 | 5 | ``0x02`` (``SAVED_HASHBITS``) |
| K-size | 4 | 6 | k-mer length. [``unsigned int``] |
| Unique k-mers | 8 | 10 | Number of unique k-mers. [``uint64_t``] |
| Number of Tables | 1 | 18 | Number of Nodegraph tables. [``uint8_t``] |
| Occupied Bins | 8 | 19 | Number of occupied bins |

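A quick way to sanity-check the `Off` column is to accumulate the field lengths in the listed order. This sketch assumes the fields are packed back to back with no padding; `FIELDS_V5` and `field_offsets` are hypothetical names used only for illustration.

```python
import struct

# (name, struct format code) in the order the Version 5 header table lists them.
FIELDS_V5 = [
    ("magic", "4s"),
    ("version", "B"),
    ("file_type", "B"),
    ("ksize", "I"),
    ("unique_kmers", "Q"),
    ("n_tables", "B"),
    ("occupied_bins", "Q"),
]


def field_offsets(fields):
    """Return ({name: offset}, total_size) for tightly packed fields."""
    offsets = {}
    pos = 0
    for name, code in fields:
        offsets[name] = pos
        # "<" disables alignment padding, so sizes are the raw field lengths.
        pos += struct.calcsize("<" + code)
    return offsets, pos


offsets, header_size = field_offsets(FIELDS_V5)
# unique_kmers starts at 10, n_tables at 18, occupied_bins at 19;
# the tables begin at byte 27.
```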
Then follow the Nodegraph's tables. Each table is serialized using the
BitMagic format, and must be deserialized using BitMagic's deserialization methods.
For each table:

| Field | Len | Off | Value |
| ----------------- | --- | --- | -------------------------------------------- |
| Table size | 8 | 0 | Length of table, **in bytes** (``uint64_t``). |
| Bins | N | 8 | This table's BitMagic bit-vector. Length given by previous field (``BVector``). |

## Common errors and solutions

### Cannot import name `to_bytes` from `sourmash.minhash`
3 changes: 3 additions & 0 deletions flake.nix
@@ -91,12 +91,15 @@
openssl
pkgconfig

cmake

git
stdenv.cc.cc.lib
(python310.withPackages (ps: with ps; [ virtualenv tox setuptools ]))
(python39.withPackages (ps: with ps; [ virtualenv setuptools ]))
(python38.withPackages (ps: with ps; [ virtualenv setuptools ]))

rust-bindgen
rust-cbindgen

wasmtime
2 changes: 2 additions & 0 deletions include/sourmash.h
@@ -292,6 +292,8 @@ uintptr_t nodegraph_noccupied(const SourmashNodegraph *ptr);

uintptr_t nodegraph_ntables(const SourmashNodegraph *ptr);

void nodegraph_save_khmer(const SourmashNodegraph *ptr, const char *filename);

void nodegraph_save(const SourmashNodegraph *ptr, const char *filename);

const uint8_t *nodegraph_to_buffer(const SourmashNodegraph *ptr,
1 change: 1 addition & 0 deletions src/core/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -88,3 +88,4 @@ wasm-bindgen-test = "0.3.0"

### These crates don't compile on wasm
[target.'cfg(not(all(target_arch = "wasm32", target_vendor="unknown")))'.dependencies]
bitmagic = { version = "0.2.0", git = "https://github.com/luizirber/bitmagic-rs", branch = "sync_send" }
17 changes: 17 additions & 0 deletions src/core/src/ffi/nodegraph.rs
@@ -207,6 +207,23 @@ unsafe fn nodegraph_save(ptr: *const SourmashNodegraph, filename: *const c_char)
}
}

ffi_fn! {
unsafe fn nodegraph_save_khmer(ptr: *const SourmashNodegraph, filename: *const c_char) -> Result<()> {
    let ng = SourmashNodegraph::as_rust(ptr);

    // FIXME use buffer + len instead of c_str
    let c_str = {
        assert!(!filename.is_null());

        CStr::from_ptr(filename)
    };

    ng.write_v4(&mut std::fs::File::create(c_str.to_str()?)?)?;

    Ok(())
}
}

ffi_fn! {
unsafe fn nodegraph_to_buffer(ptr: *const SourmashNodegraph, compression: u8, size: *mut usize) -> Result<*const u8> {
let ng = SourmashNodegraph::as_rust(ptr);
1 change: 1 addition & 0 deletions src/core/src/sketch/mod.rs
@@ -1,6 +1,7 @@
pub mod hyperloglog;
pub mod minhash;

#[cfg(not(target_arch = "wasm32"))]
pub mod nodegraph;

use serde::{Deserialize, Serialize};