feat(storage): add Bloom filter based event lookups #1679
Conversation
Force-pushed from 5982362 to 039318c
I honestly love this. Quite a bunch of comments, but most are just suggestions or questions.
Looks good, looking forward to the final version 🤞
Otherwise dropping large tables is prohibitively slow (and needs too much disk space).
This change adds a new table storing a Bloom filter for the events emitted in each block. The filter stores the keys (prefixed with their index) and the address of the contract that emitted the event, and can be used to avoid loading the receipts for blocks that do not contain matching events.
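A rough, hand-rolled sketch of such a per-block filter (the 2 KiB size matches the PR description below; the hash count and byte layout here are assumptions, not the PR's actual parameters):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 2048 * 8; // 2 KiB per block, as in this PR
const NUM_HASHES: u64 = 12; // assumed hash count, not the PR's actual parameter

pub struct BlockBloom {
    bits: Vec<u8>,
}

impl BlockBloom {
    pub fn new() -> Self {
        Self { bits: vec![0; FILTER_BITS / 8] }
    }

    fn bit_positions(item: &[u8]) -> impl Iterator<Item = usize> + '_ {
        (0..NUM_HASHES).map(move |seed| {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            item.hash(&mut h);
            (h.finish() as usize) % FILTER_BITS
        })
    }

    pub fn insert(&mut self, item: &[u8]) {
        for i in Self::bit_positions(item) {
            self.bits[i / 8] |= 1 << (i % 8);
        }
    }

    pub fn contains(&self, item: &[u8]) -> bool {
        Self::bit_positions(item).all(|i| self.bits[i / 8] & (1 << (i % 8)) != 0)
    }

    /// Add one event: the emitting contract address plus each key prefixed
    /// with its index, so the same value at key position 0 and at key
    /// position 1 maps to different filter entries.
    pub fn add_event(&mut self, from_address: &[u8; 32], keys: &[[u8; 32]]) {
        self.insert(from_address);
        for (idx, key) in keys.iter().enumerate() {
            let mut prefixed = Vec::with_capacity(33);
            prefixed.push(idx as u8);
            prefixed.extend_from_slice(key);
            self.insert(&prefixed);
        }
    }
}
```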
FIXME: should remove bloom when purging a block
When a filter fails to load from storage, just treat it as matching.
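In sketch form (`io::Result` stands in for whatever error type the storage layer actually uses):

```rust
/// `check` is the Bloom filter probe result, or an error if the filter
/// failed to load or decompress. Errors degrade to "might match": a false
/// positive only costs an unnecessary receipt scan, while a false negative
/// would silently drop events.
fn block_may_match(check: std::io::Result<bool>) -> bool {
    check.unwrap_or(true)
}
```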
Similar to `block_id()` but can omit a DB lookup if the block id is already a hash.
This way we avoid having to propagate the cache object around and make the fact that we use a cache transparent to the rest of the codebase.
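One way to keep the cache out of every signature is a module-private global, roughly like this (a plain `HashMap` stands in for the LRU cache, and all names here are made up):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Module-private cache: callers just ask for a block's filter and never
// see the cache itself, so no cache handle has to be threaded around.
static FILTER_CACHE: OnceLock<Mutex<HashMap<u64, Vec<u8>>>> = OnceLock::new();

fn bloom_filter(block: u64, load: impl FnOnce(u64) -> Vec<u8>) -> Vec<u8> {
    let cache = FILTER_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut cache = cache.lock().unwrap();
    cache.entry(block).or_insert_with(|| load(block)).clone()
}
```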
…ansactions and receipt
When hitting the page scan limit the code now properly returns a continuation token to the client (along with the events already collected). This makes it possible for clients to eventually collect all events they are interested in by making multiple requests -- while keeping a reasonable limit on how long a single `starknet_getEvents` request takes.
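The token only needs to identify where the scan stopped. A hypothetical encoding (the actual token format in the PR may differ) is the block number plus an event offset within that block:

```rust
/// Hypothetical continuation token: resume scanning at `block`, skipping
/// the first `offset` events in it (they were returned already).
#[derive(Debug, PartialEq)]
struct ContinuationToken {
    block: u64,
    offset: usize,
}

impl ContinuationToken {
    fn encode(&self) -> String {
        format!("{}-{}", self.block, self.offset)
    }

    fn decode(s: &str) -> Option<Self> {
        let (block, offset) = s.split_once('-')?;
        Some(Self {
            block: block.parse().ok()?,
            offset: offset.parse().ok()?,
        })
    }
}
```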
Force-pushed from b9a3480 to ffdcc94
👍
Co-authored-by: Mirko von Leipzig <[email protected]>
This PR implements a new algorithm for serving `starknet_getEvents`.
New Algorithm
Instead of relying on SQLite to evaluate the filter, we now simply iterate over the block range covered by the filter, fetch the events for each block, and then evaluate the filter in code.
To help with performance we add a per-block Bloom filter containing event addresses and keys, and use it to avoid scanning event data for blocks that are known not to contain matches. Bloom filters take 2 KiB per block and are stored compressed in the database. An LRU cache helps us avoid reloading recently used Bloom filters.
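On the query side, a block can be skipped as soon as the requested address or any key position definitely has no match. A sketch of that check (the index-prefix convention follows the commit message above; the `contains` closure stands in for the Bloom filter probe):

```rust
/// `contains` probes the per-block Bloom filter. `keys[i]` lists the
/// accepted values for key position `i`; an empty list matches anything,
/// mirroring the starknet_getEvents filter semantics.
fn block_may_contain(
    contains: impl Fn(&[u8]) -> bool,
    from_address: Option<&[u8; 32]>,
    keys: &[Vec<[u8; 32]>],
) -> bool {
    if let Some(addr) = from_address {
        if !contains(addr.as_slice()) {
            return false; // emitting contract definitely absent
        }
    }
    keys.iter().enumerate().all(|(idx, alternatives)| {
        alternatives.is_empty()
            || alternatives.iter().any(|key| {
                let mut prefixed = Vec::with_capacity(33);
                prefixed.push(idx as u8); // prefix key with its index
                prefixed.extend_from_slice(key);
                contains(&prefixed)
            })
    })
}
```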
The algorithm also has limits on:
- the number of blocks scanned while serving a single request
- the number of Bloom filters loaded from storage while serving a single request
Upon hitting either of these limits we return the current result set and a continuation token that can be used to continue where we left off in a subsequent request. Both limits (along with the Bloom filter LRU cache size) are configurable, which effectively puts an upper bound on how long a single query can take.
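A sketch of the resulting driver loop (names are made up; only the blocks-scanned limit is shown, and the Bloom filter check and event loading are abstracted into closures):

```rust
struct Continuation {
    block: u64,
}

/// Scan `from..=to`, skipping blocks whose Bloom filter rules out a match.
/// Stops early once `max_blocks_to_scan` blocks have been scanned and
/// returns a resume point alongside the events collected so far.
fn scan_events<E>(
    from: u64,
    to: u64,
    max_blocks_to_scan: usize,
    block_may_match: impl Fn(u64) -> bool, // Bloom filter pre-check
    load_matching_events: impl Fn(u64) -> Vec<E>, // loads receipts + applies filter
) -> (Vec<E>, Option<Continuation>) {
    let mut events = Vec::new();
    let mut scanned = 0usize;
    for block in from..=to {
        if scanned == max_blocks_to_scan {
            // Limit hit: return what we have plus a continuation point.
            return (events, Some(Continuation { block }));
        }
        scanned += 1;
        if !block_may_match(block) {
            continue; // definitely no match: skip loading receipts
        }
        events.extend(load_matching_events(block));
    }
    (events, None)
}
```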
Storage
A Bloom filter with the parameters we've chosen takes 2 KiB per block. Compression brings that down to an average of 1430 bytes per block, so that's roughly 700 MiB for current mainnet in the database. On the other hand, this allows us to drop the `starknet_events` table (and related indexes), which currently takes up slightly more than 20% of our database.
Migration
The migration step takes considerable time. There are two major steps:
1. Computing the Bloom filters (this involves scanning the whole `starknet_events` table): this step takes slightly less than half an hour with mainnet on my setup.
2. Dropping the `starknet_events` table (and related stuff). Unfortunately this step required us to switch to `journal_mode=delete` for the migration, because otherwise the tables being dropped are copied to the WAL, which is very expensive for the ~180 GiB of data involved.

"Size after" is after a vacuum (not included in the migration).
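For illustration, the journal-mode dance looks roughly like this with rusqlite (the table name and the function are hypothetical, not the actual migration code):

```rust
use rusqlite::Connection;

fn migrate(conn: &Connection) -> rusqlite::Result<()> {
    // In WAL mode every page of a dropped table is first copied into the
    // -wal file; with ~180 GiB involved that is prohibitively expensive,
    // so switch to rollback journaling for the duration of this migration.
    let mode: String = conn.query_row("PRAGMA journal_mode=DELETE", [], |row| row.get(0))?;
    assert_eq!(mode.to_lowercase(), "delete");

    // Hypothetical destination table for the per-block filters.
    conn.execute_batch(
        "CREATE TABLE starknet_event_filters (
             block_number INTEGER PRIMARY KEY,
             bloom        BLOB NOT NULL
         );",
    )?;

    // ... step 1: scan receipts, compute and insert one compressed
    //     Bloom filter per block ...

    // Step 2: drop the old events table and its indexes.
    conn.execute_batch("DROP TABLE starknet_events;")?;
    Ok(())
}
```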