Skip to content

Latest commit

 

History

History
331 lines (247 loc) · 20.5 KB

README.md

File metadata and controls

331 lines (247 loc) · 20.5 KB

fastfilter: Binary fuse & xor filters for Zig Hexops logo

CI

comparison

Binary fuse filters & xor filters are probabilistic data structures which allow for quickly checking whether an element is part of a set.

Both are faster and more concise than Bloom filters, and smaller than Cuckoo filters. Binary fuse filters are a bleeding-edge development and are competitive with Facebook's ribbon filters:

Benefits of Zig implementation

This is a Zig implementation, which provides many practical benefits:

  1. Iterator-based: you can populate xor or binary fuse filters using an iterator, without keeping your entire key set in-memory and without it being a contiguous array of keys. This can reduce memory usage when populating filters substantially.
  2. Distinct allocators: you can provide separate Zig std.mem.Allocator implementations for the filter itself and population, enabling interesting opportunities like mmap-backed population of filters with low physical memory usage.
  3. Generic implementation: use Xor(u8), Xor(u16), BinaryFuse(u8), BinaryFuse(u16), or experiment with more exotic variants like Xor(u4) thanks to Zig's bit-width integers and generic type system.

Zig's safety-checking and checked overflows has also enabled us to improve the upstream C/Go implementations where overflow and undefined behavior went unnoticed.[1]

Usage

Decide if xor or binary fuse filters fit your use case better: should I use binary fuse filters or xor filters?

Get your keys into u64 values. If you have strings, structs, etc. then use something like Zig's std.hash_map.getAutoHashFn to convert your keys to u64 first. ("It is not important to have a good hash function, but collisions should be unlikely (~1/2^64).")

Create a build.zig.zon file in your project (replace $LATEST_COMMIT with the latest commit hash):

.{
    .name = "mypkg",
    .version = "0.1.0",
    .dependencies = .{
        .fastfilter = .{
            .url = "https://github.com/hexops/fastfilter/archive/$LATEST_COMMIT.tar.gz",
        },
    },
}

Run zig build in your project, and the compiler instruct you to add a .hash = "..." field next to .url.

Then use the dependency in your build.zig:

pub fn build(b: *std.Build) void {
    ...
    exe.addModule("fastfilter", b.dependency("fastfilter", .{
        .target = target,
        .optimize = optimize,
    }).module("fastfilter"));
}

In your main.zig, make use of the library:

const std = @import("std");
const testing = std.testing;
const fastfilter = @import("fastfilter");

test "mytest" {
    const allocator = std.heap.page_allocator;

    // Initialize the binary fuse filter with room for 1 million keys.
    const size = 1_000_000;
    var filter = try fastfilter.BinaryFuse8.init(allocator, size);
    defer filter.deinit(allocator);

    // Generate some consecutive keys.
    var keys = try allocator.alloc(u64, size);
    defer allocator.free(keys);
    for (keys, 0..) |key, i| {
        _ = key;
        keys[i] = i;
    }

    // Populate the filter with our keys. You can't update a xor / binary fuse filter after the
    // fact, instead you should build a new one.
    try filter.populate(allocator, keys[0..]);

    // Now we can quickly test for containment. So fast!
    try testing.expect(filter.contain(1) == true);
}

(you can just add this project as a Git submodule in yours for now, as Zig's official package manager is still under way.)

Binary fuse filters automatically deduplicate any keys during population. If you are using a different filter type (you probably shouldn't be!) then keys must be unique or else filter population will fail. You can use the fastfilter.AutoUnique(u64)(keys) helper to deduplicate (in typically O(N) time complexity), see the tests in src/unique.zig for usage examples.

Serialization

To serialize the filters, you only need to encode these struct fields:

pub fn BinaryFuse(comptime T: type) type {
    return struct {
        ...
        seed: u64,
        segment_length: u32,
        segment_length_mask: u32,
        segment_count: u32,
        segment_count_length: u32,
        fingerprints: []T,
        ...

T will be the chosen fingerprint size, e.g. u8 for BinaryFuse8 or Xor8.

Look at std.io.Writer and std.io.BitWriter for ideas on actual serialization.

Similarly, for xor filters you only need these struct fields:

pub fn Xor(comptime T: type) type {
    return struct {
        seed: u64,
        blockLength: u64,
        fingerprints: []T,
        ...

Should I use binary fuse filters or xor filters?

If you're not sure, start with BinaryFuse8 filters. They're fast, and have a false-positive probability rate of 1/256 (or 0.4%).

There are many tradeoffs, primarily between:

  • Memory usage
  • Containment check time
  • Population / creation time & memory usage

See the benchmarks section for a comparison of the tradeoffs between binary fuse filters and xor filters, as well as how larger bit sizes (e.g. BinaryFuse(u16)) consume more memory in exchange for a lower false-positive probability rate.

Note that fuse filters are not to be confused with binary fuse filters, the former have issues with construction, often failing unless you have a large number of unique keys. Binary fuse filters do not suffer from this and are generally better than traditional ones in several ways. For this reason, we consider traditional fuse filters deprecated.

Note about extremely large datasets

This implementation supports key iterators, so you do not need to have all of your keys in-memory, see BinaryFuse8.populateIter and Xor8.populateIter.

If you intend to use a xor filter with datasets of 100m+ keys, there is a possible faster implementation for construction found in the C implementation xor8_buffered_populate which is not implemented here.

Changelog

The API is generally finalized, but we may make some adjustments as Zig changes or we learn of more idiomatic ways to express things. We will release v1.0 once Zig v1.0 is released.

v0.11.0

  • fastfilter is now available via the Zig package manager.
  • Updated to the latest version of Zig nightly 0.12.0-dev.706+62a0fbdae

v0.10.3

  • Updated to the latest version of Zig 0.12.0-dev.706+62a0fbdae (build.zig .path -> .source change.)

v0.10.2

  • Fixed a few correctness / integer overflow/underflow possibilities where we were inconsistent with the Go/C implementations of binary fuse filters.
  • Added debug-mode checks for iterator correctness (wraparound behavior.)

v0.10.1

  • Updated to the latest version of Zig 0.12.0-dev.706+62a0fbdae

v0.10.0

  • All types are now unmanaged (allocator must be passed via parameters)
  • Renamed util.sliceIterator to fastfilter.SliceIterator
  • SliceIterator is now unmanaged / does not store an allocator.
  • SliceIterator now stores []const T instead of []T internally.
  • BinaryFuseFilter.max_iterations is now a constant.
  • Added fastfilter.MeasuredAllocator for measuring allocations.
  • Improved usage example.
  • Properly free xorfilter/fusefilter fingerprints.
  • Updated benchmark to latest Zig version.

v0.9.3

  • Fixed potential integer overflow.

v0.9.2

  • Handle duplicated keys automatically
  • Added a std.build.Pkg definition
  • Fixed an unlikely bug
  • Updated usage instructions
  • Updated to Zig v0.10.0-dev.1736

v0.9.1

  • Updated to Zig v0.10.0-dev.36

v0.9.0

  • Renamed repository github.com/hexops/xorfilter -> github.com/hexops/fastfilter to account for binary fuse filters.
  • Implemented bleeding-edge (paper not yet published) "Binary Fuse Filters: Fast and Smaller Than Xor Filters" algorithm by Thomas Mueller Graf, Daniel Lemire
  • BinaryFuse filters are now recommended by default, are generally better than Xor and Fuse filters.
  • Deprecated traditional Fuse filters (BinaryFuse are much better.)
  • Added much improved benchmarking suite with more details on memory consumption during filter population, etc.

v0.8.0

initial release with support for Xor and traditional Fuse filters of varying bit sizes, key iterators, serialization, and a slice de-duplication helper.

Benchmarks

Benchmarks were ran on both a 2019 Macbook Pro and Windows 10 desktop machine using e.g.:

zig run -O ReleaseFast src/benchmark.zig -- --xor 8 --num-keys 1000000
Benchmarks: 2019 Macbook Pro, Intel i9 (1M - 100M keys)
  • CPU: 2.3 GHz 8-Core Intel Core i9
  • Memory: 16 GB 2667 MHz DDR4
  • Zig version: 0.12.0-dev.706+62a0fbdae
Algorithm # of keys populate contains(k) false+ prob. bits per entry peak populate filter total
binaryfuse8 1000000 37.5ms 24.0ns 0.00391115 9.04 22 MiB 1 MiB
binaryfuse16 1000000 45.5ms 24.0ns 0.00001524 18.09 24 MiB 2 MiB
binaryfuse32 1000000 56.0ms 24.0ns 0 36.18 28 MiB 4 MiB
xor2 1000000 108.0ms 25.0ns 0.2500479 9.84 52 MiB 1 MiB
xor4 1000000 99.0ms 25.0ns 0.06253865 9.84 52 MiB 1 MiB
xor8 1000000 103.4ms 25.0ns 0.0039055 9.84 52 MiB 1 MiB
xor16 1000000 104.7ms 26.0ns 0.00001509 19.68 52 MiB 2 MiB
xor32 1000000 102.2ms 25.0ns 0 39.36 52 MiB 4 MiB
binaryfuse8 10000000 621.2ms 36.0ns 0.0039169 9.02 225 MiB 10 MiB
binaryfuse16 10000000 666.6ms 102.0ns 0.0000147 18.04 245 MiB 21 MiB
binaryfuse32 10000000 769.0ms 135.0ns 0 36.07 286 MiB 43 MiB
xor2 10000000 1.9s 43.0ns 0.2500703 9.84 527 MiB 11 MiB
xor4 10000000 2.0s 41.0ns 0.0626137 9.84 527 MiB 11 MiB
xor8 10000000 1.9s 42.0ns 0.0039369 9.84 527 MiB 11 MiB
xor16 10000000 2.2s 106.0ns 0.0000173 19.68 527 MiB 23 MiB
xor32 10000000 2.2s 140.0ns 0 39.36 527 MiB 46 MiB
binaryfuse8 100000000 7.4s 145.0ns 0.003989 9.01 2 GiB 107 MiB
binaryfuse16 100000000 8.4s 169.0ns 0.000016 18.01 2 GiB 214 MiB
binaryfuse32 100000000 10.2s 173.0ns 0 36.03 2 GiB 429 MiB
xor2 100000000 28.5s 144.0ns 0.249843 9.84 5 GiB 117 MiB
xor4 100000000 27.4s 154.0ns 0.062338 9.84 5 GiB 117 MiB
xor8 100000000 28.0s 153.0ns 0.004016 9.84 5 GiB 117 MiB
xor16 100000000 29.5s 161.0ns 0.000012 19.68 5 GiB 234 MiB
xor32 100000000 29.4s 157.0ns 0 39.36 5 GiB 469 MiB

Legend:

  • contains(k): The time taken to check if a key is in the filter
  • false+ prob.: False positive probability, the probability that a containment check will erroneously return true for a key that has not actually been added to the filter.
  • bits per entry: The amount of memory in bits the filter uses to store a single entry.
  • peak populate: Amount of memory consumed during filter population, excluding keys themselves (8 bytes * num_keys.)
  • filter total: Amount of memory consumed for filter itself in total (bits per entry * entries.)
Benchmarks: Windows 10, AMD Ryzen 9 3900X (1M - 100M keys)
  • CPU: 3.79Ghz AMD Ryzen 9 3900X
  • Memory: 32 GB 2133 MHz DDR4
  • Zig version: 0.12.0-dev.706+62a0fbdae
Algorithm # of keys populate contains(k) false+ prob. bits per entry peak populate filter total
binaryfuse8 1000000 44.6ms 24.0ns 0.00390796 9.04 22 MiB 1 MiB
binaryfuse16 1000000 48.9ms 25.0ns 0.00001553 18.09 24 MiB 2 MiB
binaryfuse32 1000000 49.9ms 25.0ns 0.00000001 36.18 28 MiB 4 MiB
xor2 1000000 77.3ms 25.0ns 0.25000163 9.84 52 MiB 1 MiB
xor4 1000000 80.0ms 25.0ns 0.06250427 9.84 52 MiB 1 MiB
xor8 1000000 76.0ms 25.0ns 0.00391662 9.84 52 MiB 1 MiB
xor16 1000000 83.7ms 26.0ns 0.00001536 19.68 52 MiB 2 MiB
xor32 1000000 79.1ms 27.0ns 0 39.36 52 MiB 4 MiB
fuse8 1000000 69.4ms 25.0ns 0.00390663 9.10 49 MiB 1 MiB
fuse16 1000000 71.5ms 27.0ns 0.00001516 18.20 49 MiB 2 MiB
fuse32 1000000 71.1ms 27.0ns 0 36.40 49 MiB 4 MiB
binaryfuse8 10000000 572.3ms 33.0ns 0.0038867 9.02 225 MiB 10 MiB
binaryfuse16 10000000 610.6ms 108.0ns 0.0000127 18.04 245 MiB 21 MiB
binaryfuse32 10000000 658.2ms 144.0ns 0 36.07 286 MiB 43 MiB
xor2 10000000 1.2s 39.0ns 0.249876 9.84 527 MiB 11 MiB
xor4 10000000 1.2s 39.0ns 0.0625026 9.84 527 MiB 11 MiB
xor8 10000000 1.2s 41.0ns 0.0038881 9.84 527 MiB 11 MiB
xor16 10000000 1.3s 117.0ns 0.0000134 19.68 527 MiB 23 MiB
xor32 10000000 1.3s 147.0ns 0 39.36 527 MiB 46 MiB
fuse8 10000000 1.1s 36.0ns 0.0039089 9.10 499 MiB 10 MiB
fuse16 10000000 1.1s 112.0ns 0.0000172 18.20 499 MiB 21 MiB
fuse32 10000000 1.1s 145.0ns 0 36.40 499 MiB 43 MiB
binaryfuse8 100000000 6.9s 167.0ns 0.00381 9.01 2 GiB 107 MiB
binaryfuse16 100000000 7.2s 171.0ns 0.000009 18.01 2 GiB 214 MiB
binaryfuse32 100000000 8.5s 174.0ns 0 36.03 2 GiB 429 MiB
xor2 100000000 16.8s 166.0ns 0.249868 9.84 5 GiB 117 MiB
xor4 100000000 18.9s 183.0ns 0.062417 9.84 5 GiB 117 MiB
xor8 100000000 19.1s 168.0ns 0.003873 9.84 5 GiB 117 MiB
xor16 100000000 16.9s 171.0ns 0.000021 19.68 5 GiB 234 MiB
xor32 100000000 19.4s 189.0ns 0 39.36 5 GiB 469 MiB
fuse8 100000000 19.6s 167.0ns 0.003797 9.10 4 GiB 108 MiB
fuse16 100000000 20.8s 171.0ns 0.000015 18.20 4 GiB 216 MiB
fuse32 100000000 21.5s 176.0ns 0 36.40 4 GiB 433 MiB

Legend:

  • contains(k): The time taken to check if a key is in the filter
  • false+ prob.: False positive probability, the probability that a containment check will erroneously return true for a key that has not actually been added to the filter.
  • bits per entry: The amount of memory in bits the filter uses to store a single entry.
  • peak populate: Amount of memory consumed during filter population, excluding keys themselves (8 bytes * num_keys.)
  • filter total: Amount of memory consumed for filter itself in total (bits per entry * entries.)

Related readings

Special thanks

If it was not for the above people, I (@slimsag) would not have been able to write this implementation and learn from the excellent C implementation. Please credit the above people if you use this library.