
feat(storage): add Bloom filter based event lookups #1679

Merged · 27 commits · Jan 24, 2024
Conversation

@kkovaacs (Contributor) commented on Jan 16, 2024

This PR implements a new algorithm for serving starknet_getEvents.


New Algorithm

Instead of relying on SQLite to evaluate the filter, we now simply iterate over the block range covered by the filter, fetch the events for each block, and evaluate the filter in code.

To help with performance we're adding a per-block Bloom filter that contains event addresses and keys, and we use it to avoid scanning event data for blocks that are known not to contain matches. Bloom filters take 2 KiB per block and are stored compressed in the database. An LRU cache helps us avoid reloading recently used Bloom filters.
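The per-block filter can be sketched as follows. This is a minimal illustration, not pathfinder's implementation: the 2 KiB size matches the PR, but the hash count and the use of std's `DefaultHasher` are my assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 16_384; // 2 KiB per block, as in the PR
const NUM_HASHES: u64 = 12;        // illustrative; not the PR's actual parameter

/// Toy per-block Bloom filter over event addresses and keys.
struct BlockBloomFilter {
    bits: Vec<bool>,
}

impl BlockBloomFilter {
    fn new() -> Self {
        Self { bits: vec![false; FILTER_BITS] }
    }

    /// Derive NUM_HASHES bit positions for an item by seeding the hasher.
    fn indices(item: &[u8]) -> impl Iterator<Item = usize> + '_ {
        (0..NUM_HASHES).map(move |seed| {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            item.hash(&mut h);
            (h.finish() % FILTER_BITS as u64) as usize
        })
    }

    fn insert(&mut self, item: &[u8]) {
        for i in Self::indices(item) {
            self.bits[i] = true;
        }
    }

    /// `false` means "definitely absent"; `true` means "possibly present".
    fn may_contain(&self, item: &[u8]) -> bool {
        Self::indices(item).all(|i| self.bits[i])
    }
}
```

When indexing a block, each event would contribute its emitting contract address and every key prefixed with the key's index (the `"addr:"`/`"0:"` encodings in the test are illustrative only); a query then probes just the patterns it cares about, skipping the block entirely on a definite miss.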

The algorithm also has configurable limits on:

  • the number of Bloom filters loaded from the database,
  • the number of blocks scanned for matching events.

Upon hitting either of those limits we return the current result set and a continuation token that can be used to pick up where we left off on a subsequent request. Both of these limits (along with the Bloom filter LRU cache size) are configurable. These limits effectively allow setting an upper bound on how long a single query may take.
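The limit-and-continuation loop described above can be sketched like this; all type and function names here are hypothetical, not pathfinder's actual API:

```rust
/// Hypothetical continuation token: the next block to resume from.
struct ContinuationToken {
    next_block: u64,
}

struct ScanOutcome {
    events: Vec<String>,
    continuation: Option<ContinuationToken>,
}

/// Iterate the requested block range, consulting a per-block Bloom filter
/// before scanning, and stop early when either configured limit is hit.
fn scan_events(
    from_block: u64,
    to_block: u64,
    max_blocks_scanned: u64,
    max_filters_loaded: u64,
    filter_matches: impl Fn(u64) -> bool,     // Bloom filter check (may false-positive)
    load_block_events: impl Fn(u64) -> Vec<String>,
) -> ScanOutcome {
    let mut events = Vec::new();
    let mut blocks_scanned = 0u64;
    let mut filters_loaded = 0u64;

    for block in from_block..=to_block {
        filters_loaded += 1;
        if filters_loaded > max_filters_loaded {
            // Limit hit: hand back what we have plus a resume point.
            return ScanOutcome {
                events,
                continuation: Some(ContinuationToken { next_block: block }),
            };
        }
        if !filter_matches(block) {
            continue; // definitely no matching events in this block
        }
        blocks_scanned += 1;
        if blocks_scanned > max_blocks_scanned {
            return ScanOutcome {
                events,
                continuation: Some(ContinuationToken { next_block: block }),
            };
        }
        events.extend(load_block_events(block));
    }
    ScanOutcome { events, continuation: None }
}
```

Because the token simply records where iteration stopped, repeated requests eventually cover the whole range while each individual request stays bounded.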

Storage

A Bloom filter with the parameters we've chosen takes 2 KiB per block. Compression brings that down to an average of 1430 bytes per block, so that's roughly 700 MiB for current mainnet in the database. On the other hand, this allows us to drop the starknet_events table (and related indexes), which currently takes up slightly more than 20% of our database.
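A back-of-the-envelope check of the ~700 MiB figure; the block count of roughly 500,000 mainnet blocks at the time is my assumption, not a number from the PR:

```rust
// Average compressed Bloom filter size per block, from the PR description.
const COMPRESSED_BYTES_PER_BLOCK: u64 = 1430;
// Approximate mainnet block count in early 2024 (assumption for illustration).
const ASSUMED_BLOCK_COUNT: u64 = 500_000;

/// Total Bloom filter storage in MiB under the assumptions above.
fn total_mib() -> u64 {
    COMPRESSED_BYTES_PER_BLOCK * ASSUMED_BLOCK_COUNT / (1024 * 1024)
}
```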

Migration

The migration step takes considerable time. There are two major steps:

  • Computing and storing the Bloom filters for events (using the starknet_events table): this step takes slightly less than half an hour with mainnet on my setup.
  • Dropping the starknet_events table (and related indexes). Unfortunately this step required us to switch to journal_mode=delete for the migration, because otherwise the tables being dropped would be copied into the WAL, which is very expensive for the ~180 GiB of data involved.
| Network | Size before | Size after | Migration duration | Logs |
|---|---|---|---|---|
| sepolia-testnet | 804 MiB | 651 MiB | 1s | sepolia-testnet.txt |
| goerli-testnet | 213 GiB | 164 GiB | 10m 48s | goerli-testnet.txt |
| mainnet | 790 GiB | 602 GiB | 59m 12s | mainnet.txt |

"Size after" is after a vacuum (not included in migration).
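The migration ordering above can be sketched as a SQLite fragment. This is a sketch only: `starknet_events` is the table named in the PR, but `starknet_event_filters` and the exact statements are my placeholders, not the PR's actual schema revision.

```sql
-- Run the migration outside WAL mode so pages of the dropped table are not
-- copied into the write-ahead log (~180 GiB of data would be involved).
PRAGMA journal_mode = DELETE;

-- 1. Compute and store one 2 KiB Bloom filter per block from starknet_events
--    (done in application code in the PR; table name here is hypothetical).
-- CREATE TABLE starknet_event_filters (block_number INTEGER PRIMARY KEY, bloom BLOB);

-- 2. Drop the old event table and its indexes.
DROP TABLE starknet_events;
```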

@Mirko-von-Leipzig (Contributor) left a comment:

I honestly love this. Quite a bunch of comments, but most are just suggestions or questions.

Review threads (all resolved):

  • Cargo.lock
  • crates/pathfinder/src/bin/pathfinder/config.rs
  • crates/storage/Cargo.toml
  • crates/storage/src/schema/revision_0046.rs (2 threads)
  • crates/storage/src/connection/event.rs (5 threads)
@CHr15F0x (Member) left a comment:
Looks good, looking forward to the final version 🤞

Otherwise dropping large tables is prohibitively slow (and needs too much
disk space).
This change adds a new table storing a Bloom filter per block for the events
emitted in that block. The filter stores the keys (prefixed with their index)
and the address of the contract that emitted the event, and can be used to
avoid loading the receipts for blocks that do not contain matching events.
FIXME: should remove bloom when purging a block
When a filter fails to load from storage, just treat it as matching.
Similar to `block_id()` but can omit a DB lookup if the block id is
already a hash.
This way we avoid having to propagate the cache object around and make
the fact that we use a cache transparent to the rest of the codebase.
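The commit above hides the Bloom filter cache inside the storage layer so callers never see it. A minimal LRU cache sketch in std-only Rust (names and the linear-scan implementation are illustrative, not pathfinder's):

```rust
/// Tiny LRU cache: entries kept in recency order, most recently used last.
/// Kept behind the storage connection so the rest of the codebase stays
/// unaware that caching happens at all.
struct LruCache<V> {
    capacity: usize,
    entries: Vec<(u64, V)>, // (block number, cached filter), oldest first
}

impl<V: Clone> LruCache<V> {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: Vec::new() }
    }

    fn get(&mut self, key: u64) -> Option<V> {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            // Move the hit to the back (most recently used).
            let entry = self.entries.remove(pos);
            let value = entry.1.clone();
            self.entries.push(entry);
            Some(value)
        } else {
            None
        }
    }

    fn put(&mut self, key: u64, value: V) {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            self.entries.remove(pos);
        } else if self.entries.len() == self.capacity {
            self.entries.remove(0); // evict the least recently used entry
        }
        self.entries.push((key, value));
    }
}
```

A production version would use a proper O(1) LRU (e.g. a hash map plus a doubly linked list); the point here is only the interface: lookups and inserts keyed by block number, with bounded memory.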
When hitting the page scan limit the code now properly returns a
continuation token to the client (along with the events already collected).

This makes it possible for clients to eventually collect all events
they are interested in by making multiple requests -- while keeping
a reasonable limit on how long a single `starknet_getEvents` request
takes.
@kkovaacs kkovaacs marked this pull request as ready for review January 23, 2024 07:07
@kkovaacs kkovaacs requested a review from a team as a code owner January 23, 2024 07:07
Review threads (all resolved):

  • crates/pathfinder/src/bin/pathfinder/config.rs
  • crates/storage/src/connection.rs
  • crates/storage/src/connection/event.rs (3 threads)
  • CHANGELOG.md
@CHr15F0x (Member) left a comment:
👍

3 participants