
feat: Create initial Block Streamer service #428

Merged - 40 commits merged into main from 418-created-dedicated-block-stream-per-indexer on Dec 11, 2023

Conversation

@morgsmccauley (Collaborator) commented on Nov 23, 2023

This PR introduces a new Rust service called Block Streamer. It is essentially a refactored version of Historical Backfill which, rather than stopping after the 'manual filtering' portion of the backfill, continues indefinitely. Eventually, this service will be able to serve requests, allowing control over these Block Streams. I've outlined the major changes below.

trait S3Client

Low-level S3 requests have been abstracted behind a trait. This allows us to accept the trait as an argument and inject mock implementations in tests, so we no longer need real access keys and can make more concrete assertions.
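
Roughly, the shape is something like the following (an illustrative sketch only; the exact trait methods and signatures differ in the actual code):

```rust
use async_trait::async_trait;
use std::collections::HashMap;

// Illustrative trait boundary for low-level S3 access; method names here are
// hypothetical, not necessarily the ones used in the PR.
#[async_trait]
pub trait S3Client: Send + Sync {
    async fn list_objects(&self, bucket: &str, prefix: &str) -> anyhow::Result<Vec<String>>;
    async fn get_object(&self, bucket: &str, key: &str) -> anyhow::Result<String>;
}

// In tests we can inject a mock instead of a real AWS client, so no access
// keys are needed and assertions can be made against known data.
pub struct InMemoryS3Client {
    pub objects: HashMap<String, String>,
}

#[async_trait]
impl S3Client for InMemoryS3Client {
    async fn list_objects(&self, _bucket: &str, prefix: &str) -> anyhow::Result<Vec<String>> {
        Ok(self
            .objects
            .keys()
            .filter(|key| key.starts_with(prefix))
            .cloned()
            .collect())
    }

    async fn get_object(&self, _bucket: &str, key: &str) -> anyhow::Result<String> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| anyhow::anyhow!("object not found: {}", key))
    }
}
```

Anything that needs S3 (e.g. DeltaLakeClient) takes a `T: S3Client`, so production code passes the AWS-backed implementation while tests pass the mock.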

DeltaLakeClient

The majority of the S3-related logic, queryapi_coordinator/src/s3.rs included, has been refactored into DeltaLakeClient. This client encapsulates all logic relating to the interaction with, and consumption of, the near-delta-lake S3 bucket. Now, only a single method needs to be called from BlockStream in order to get the relevant block heights for a given contract_id/pattern. There's been a fair amount of refactoring across the methods themselves; I've added tests to ensure they still behave as expected.
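
Building on the S3Client trait sketched above, the consumer-facing shape is roughly this (the method name and signature here are hypothetical):

```rust
pub struct DeltaLakeClient<T: S3Client> {
    s3_client: T,
    // plus bucket/network configuration
}

impl<T: S3Client> DeltaLakeClient<T> {
    pub fn new(s3_client: T) -> Self {
        DeltaLakeClient { s3_client }
    }

    // The single entry point BlockStream needs: given a contract ID/pattern and
    // a starting block height, return the matching heights from the
    // near-delta-lake index files.
    pub async fn list_matching_block_heights(
        &self,
        contract_pattern: &str,
        start_block_height: u64,
    ) -> anyhow::Result<Vec<u64>> {
        // 1. resolve the index-file key(s) for the pattern
        // 2. fetch and parse them via self.s3_client
        // 3. flatten the per-file height lists, filtered from start_block_height
        todo!("omitted in this sketch")
    }
}
```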

BlockStream

This is the refactored version of Historical Backfill. Not much has changed in this version; the main change is that it now uses DeltaLakeClient. It will eventually be expanded to provide more control over the Stream.
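
As a sketch of the eventual control surface (hypothetical; this PR only starts streams, with proper start/stop control to come later):

```rust
// Hypothetical handle for a single Block Stream: the stream runs as a spawned
// task, and cancelling the handle aborts it.
pub struct BlockStream {
    task: Option<tokio::task::JoinHandle<anyhow::Result<()>>>,
}

impl BlockStream {
    pub fn start(/* delta_lake_client, redis_client, indexer config, ... */) -> Self {
        let task = tokio::spawn(async move {
            // 1. publish the historical heights returned by DeltaLakeClient
            // 2. continue following new blocks via near-lake-framework
            Ok::<(), anyhow::Error>(())
        });

        BlockStream { task: Some(task) }
    }

    pub fn cancel(&mut self) {
        if let Some(task) = self.task.take() {
            task.abort();
        }
    }
}
```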

indexer_rules_* & storage

Currently these exist as separate crates within the indexer/ workspace. Rather than creating a workspace for block-streamer, I've added the respective files for each to the single crate. This is probably fine for storage, now called redis, but given indexer_rules_* is also used in the Registry Contract, I'll probably extract it in a later PR to avoid duplication.

@morgsmccauley linked an issue on Nov 23, 2023 that may be closed by this pull request
@morgsmccauley force-pushed the 418-created-dedicated-block-stream-per-indexer branch 3 times, most recently from e6ff42e to fb6f0cf on November 24, 2023 03:33
@morgsmccauley changed the title from "418 created dedicated block stream per indexer" to "feat: Create initial Block Streamer service" on Nov 24, 2023
Comment on lines +79 to +80
// TODO do in parallel?
// TODO only list objects without .json extension
morgsmccauley (Collaborator, Author):

We can speed this up by only listing folders/common_prefixes. Currently, we list all matching objects, and if an object is a file, listing it just returns itself.

darunrs (Collaborator):

I'm a tiny bit confused about the application of this function. What kinds of objects are we looking for? By "common_prefixes", do you mean "block." or "shard."? I believe ListObject allows matching the entire prefix against some pattern, using *, ., and ? symbols.

morgsmccauley (Collaborator, Author):

common_prefixes is an S3 concept; it's essentially the 'sub-folders'. list_all_objects returns both 'files' and 'folders', and in the original implementation (which has mostly been carried over here) we would then call list on each file/folder. Listing a file is pointless as it just returns itself; we only need to list folders - hence the comment.
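
Concretely, S3 returns common_prefixes when you list with a delimiter; the sketch below (assuming the aws-sdk-s3 crate, and not the exact code in this PR) lists only the 'folders' under a prefix instead of every object:

```rust
// List only the "sub-folders" under `prefix`: with Delimiter="/" S3 groups keys
// into CommonPrefixes rather than returning each object individually.
// (Pagination via continuation tokens is omitted for brevity.)
async fn list_common_prefixes(
    s3_client: &aws_sdk_s3::Client,
    bucket: &str,
    prefix: &str,
) -> anyhow::Result<Vec<String>> {
    let response = s3_client
        .list_objects_v2()
        .bucket(bucket)
        .prefix(prefix)
        .delimiter("/")
        .send()
        .await?;

    Ok(response
        .common_prefixes
        .unwrap_or_default()
        .into_iter()
        .filter_map(|common_prefix| common_prefix.prefix)
        .collect())
}
```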


#[tokio::main]
async fn main() -> anyhow::Result<()> {
tracing_subscriber::registry()
morgsmccauley (Collaborator, Author):

Starting a random Block Stream just to serve as an integration test. Eventually this will start an endpoint which provides control over the Block Streams.
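
For reference, the entry point is roughly this shape (illustrative only; the hard-coded stream details are omitted):

```rust
use tracing_subscriber::prelude::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Structured logging for the service.
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .init();

    // For now, kick off a single hard-coded Block Stream as a smoke/integration
    // test; later this becomes a server exposing control over Block Streams.
    // e.g. block_stream.start(...).await?;

    Ok(())
}
```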

@@ -0,0 +1,141 @@
use near_lake_framework::near_indexer_primitives::{
morgsmccauley (Collaborator, Author):

All rules/ code is the same as what we have currently, just copied over to this new crate.

@@ -0,0 +1,79 @@
pub mod matcher;
morgsmccauley (Collaborator, Author):

Ah, except this file, which differs from the original. indexer_rules_type has been merged in here. This will eventually be extracted.

@morgsmccauley force-pushed the 418-created-dedicated-block-stream-per-indexer branch from 57a8d0f to 9574e54 on November 24, 2023 04:03
@morgsmccauley marked this pull request as ready for review on November 24, 2023 04:04
@morgsmccauley requested a review from a team as a code owner on November 24, 2023 04:04
contract_pattern,
);

// TODO Remove all block heights after start_block_height
morgsmccauley (Collaborator, Author):

This is probably a bug - we fetch all block heights within the day in which the specified block height occurs. If the block height is finalised in the middle of the day, we unnecessarily include all the block heights prior.

We should probably remove all block heights which are lower than start_block_height.

darunrs (Collaborator):

The code comment says all block heights after start. I assume you meant before? Also, is the list of block heights in order? Maybe we can get a slice of the list starting from start_date? Or when we flush to redis, we can filter there.

morgsmccauley (Collaborator, Author):

Ah yes, I mean before.

The list of block heights is in order. Yeah, we can slice it from the start_block_height, I just wanted to avoid changing the code too much, we can address in a future PR :).
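
For what it's worth, since the heights are sorted the trim can be a binary-search slice rather than a linear filter (illustrative sketch only):

```rust
// Drop every height strictly below start_block_height from a sorted list.
// partition_point returns the index of the first height >= start_block_height.
fn trim_heights_before(block_heights: Vec<u64>, start_block_height: u64) -> Vec<u64> {
    let start_index = block_heights.partition_point(|height| *height < start_block_height);
    block_heights[start_index..].to_vec()
}

// e.g. trim_heights_before(vec![10, 20, 30], 25) == vec![30]
```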

}
}

pub async fn get_nearest_block_date(
morgsmccauley (Collaborator, Author):

This previously used JSON RPC, but has been refactored to use S3 to avoid maintaining multiple clients.
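
A rough sketch of the S3-based approach, reusing the hypothetical S3Client trait from earlier (the near-lake key layout and timestamp field shown here are assumptions, not necessarily what the PR does):

```rust
// Fetch a block's date from the near-lake S3 bucket instead of JSON RPC.
async fn get_block_date(
    s3_client: &impl S3Client, // hypothetical trait from the earlier sketch
    block_height: u64,
) -> anyhow::Result<chrono::NaiveDate> {
    // near-lake stores blocks under a zero-padded height "folder".
    let key = format!("{:0>12}/block.json", block_height);
    let body = s3_client.get_object("near-lake-data-mainnet", &key).await?;

    // Pull the header timestamp (nanoseconds) out of the block JSON.
    let block: serde_json::Value = serde_json::from_str(&body)?;
    let timestamp_nanos = block["header"]["timestamp"]
        .as_u64()
        .ok_or_else(|| anyhow::anyhow!("missing header.timestamp"))?;

    chrono::DateTime::<chrono::Utc>::from_timestamp(
        (timestamp_nanos / 1_000_000_000) as i64,
        (timestamp_nanos % 1_000_000_000) as u32,
    )
    .map(|date_time| date_time.date_naive())
    .ok_or_else(|| anyhow::anyhow!("invalid timestamp"))
}
```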

@darunrs (Collaborator) left a comment:

Lots of code to look through. I believe I understand/see the changes outside of the rule engine, which I need to pore over some more. Some questions, but otherwise looks great! I salute you for your efforts in refactoring coordinator, sir.

}?;

tracing::debug!(
"Flushing {} block heights from index files to historical Stream for indexer: {}",
darunrs (Collaborator):

Are we still calling it a historical stream?

morgsmccauley (Collaborator, Author):

Ah, no. We only have a single stream so we don't need to distinguish between historical/real-time; I'll update these logs in a future PR.


let (sender, mut stream) = near_lake_framework::streamer(lake_config);

while let Some(streamer_message) = stream.recv().await {
darunrs (Collaborator):

So, I believe this is where you now continue "real-time" processing? When a new historical backfill request comes in, will we cancel and restart the task?

morgsmccauley (Collaborator, Author):

Yes and no. This will eventually become 'real-time', but it continues from where the Databricks blocks finished, which can be at most a day behind.

pub fn new(s3_client: T) -> Self {
DeltaLakeClient {
s3_client,
// hardcode to mainnet for
darunrs (Collaborator):

Incomplete comment? Why do we hardcode here?

"Block {} not found on S3, attempting to fetch next block",
current_block_height
);
current_block_height += 1;
darunrs (Collaborator):

So the retry count here isn't that we're retrying to get the same block but instead retrying to find the closest matching block by trying to get the number 1 over? What scenario would we need to try finding the next block over, up to 20 times?

morgsmccauley (Collaborator, Author):

I can't remember the exact reason, but the protocol can sometimes skip blocks, so they aren't guaranteed to be consecutive. In that case we just find the closest block thereafter.
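
In other words, the loop is effectively doing something like this (a sketch, again reusing the hypothetical S3Client trait from above):

```rust
// Heights aren't guaranteed to be consecutive, so walk forward from the given
// start height until a block actually exists in S3, bounded by max_attempts.
async fn find_next_existing_block_height(
    s3_client: &impl S3Client,
    start_block_height: u64,
    max_attempts: u64,
) -> anyhow::Result<u64> {
    for offset in 0..max_attempts {
        let current_block_height = start_block_height + offset;
        let key = format!("{:0>12}/block.json", current_block_height);

        if s3_client.get_object("near-lake-data-mainnet", &key).await.is_ok() {
            return Ok(current_block_height);
        }

        tracing::debug!(
            "Block {} not found on S3, attempting to fetch next block",
            current_block_height
        );
    }

    anyhow::bail!(
        "no block found within {} heights of {}",
        max_attempts,
        start_block_height
    )
}
```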

.flat_map(|index_file| index_file.heights)
.collect();

let pattern_has_multiple_contracts = contract_pattern.chars().any(|c| c == ',' || c == '*');
darunrs (Collaborator):

I had a feeling we supported multiple contracts since I've seen code related to it, but I never tried it. I don't think the UI lets you put in multiple. Is this functionality we just have partial support for?

morgsmccauley (Collaborator, Author):

The UI does/should let you input multiple :)
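
For context, the pattern is a comma-separated list of account IDs where an entry may use a `*` wildcard (e.g. `*.near, app.example.near`); a simplified matcher might look like this (not the exact code in the rules crate):

```rust
// Simplified wildcard matcher for a comma-separated contract pattern such as
// "*.near, app.example.near". The real rules code is more complete than this.
fn matches_contract_pattern(contract_pattern: &str, account_id: &str) -> bool {
    contract_pattern
        .split(',')
        .map(str::trim)
        .any(|pattern| match pattern.split_once('*') {
            // "prefix*suffix" wildcard, e.g. "*.near" or "app.*"
            Some((prefix, suffix)) => {
                account_id.len() >= prefix.len() + suffix.len()
                    && account_id.starts_with(prefix)
                    && account_id.ends_with(suffix)
            }
            // exact account ID
            None => pattern == account_id,
        })
}
```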

redis::Client::open(redis_connection_str).expect("can create redis client")
}

pub fn generate_real_time_stream_key(prefix: &str) -> String {
darunrs (Collaborator):

Are these still used in block-streamer?

@morgsmccauley merged commit 3a9cfd2 into main on Dec 11, 2023
2 checks passed
@morgsmccauley deleted the 418-created-dedicated-block-stream-per-indexer branch on December 11, 2023 20:34
Successfully merging this pull request may close these issues: Created dedicated block stream per Indexer

2 participants