DPLT-929 historical filtering #81

Merged: gabehamilton merged 12 commits into main from DPLT-929-historical-filtering on Jun 2, 2023
Conversation

gabehamilton (Collaborator) commented May 30, 2023

Previous historical filtering added SQS messages for all blocks since start_block_height, up to 3600 blocks in the past.

This PR handles historical filtering for IndexerFunctions with an IndexerRule that matches Actions via an affected_account_id filter. First, matching index files are retrieved from S3 and messages for the blocks they list are added to the SQS queue. Second, blocks after the last block found in an index file are filtered manually: each block is fetched from S3 and processed for matches.

Because historical processing occurs in a spawned thread, the task's state can no longer be held across an await point once the block is fetched (it is no longer Send), so the remaining processing must be synchronous (outcomes_reducer_sync).

More detailed error handling to come in https://pagodaplatform.atlassian.net/browse/DPLT-1012, after reviewing runtime errors and deciding what handling is appropriate.
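
For illustration, a minimal sketch of the two-phase flow described above; the helpers (fetch_indexed_block_heights, fetch_block, block_matches, enqueue_block) are stand-in stubs, not the PR's actual functions:

// Illustrative sketch of the two-phase backfill described above; the helpers
// below are stubs, not the PR's actual functions.
async fn fetch_indexed_block_heights(_start: u64) -> Vec<u64> { vec![] } // reads S3 index files
async fn fetch_block(_height: u64) -> String { String::new() }          // pulls a raw block from S3
fn block_matches(_block: &str, _account_id: &str) -> bool { false }     // affected_account_id check
async fn enqueue_block(_height: u64) {}                                 // sends one SQS message

async fn process_historical_messages(start_block_height: u64, latest_block: u64, account_id: &str) {
    // Phase 1: block heights already listed in S3 index files go straight to SQS.
    let indexed = fetch_indexed_block_heights(start_block_height).await;
    for height in &indexed {
        enqueue_block(*height).await;
    }

    // Phase 2: blocks newer than the last indexed height have no index entry yet,
    // so each one is fetched from S3 and checked for matching actions.
    let last_indexed = indexed.last().copied().unwrap_or(start_block_height);
    for height in (last_indexed + 1)..=latest_block {
        let block = fetch_block(height).await;
        if block_matches(&block, account_id) {
            enqueue_block(height).await;
        }
    }
}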

gabehamilton marked this pull request as draft on May 30, 2023 15:37
gabehamilton marked this pull request as ready for review on May 30, 2023 21:28
None
}
},
Some(function_name) => match unescape(&args["filter_json"].to_string()) {
gabehamilton (Collaborator, Author) commented May 31, 2023:
This block is cargo fmt changes.

}
}

// #[tokio::test]
gabehamilton (Collaborator, Author):

Working on some new tests in another PR.

indexer_function: indexer_function.clone(),
};

match opts::send_to_indexer_queue(queue_client, queue_url, vec![msg]).await {
Collaborator:

This could be quite slow, sending all these requests/messages individually. I wonder if there is a batch API?

Collaborator:

Also, send_to_indexer_queue feels out of place in opts; why does it live there?

gabehamilton (Collaborator, Author):

Batch is a good idea; it looks like it accepts up to 10 messages at once.
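
For reference, a sketch of what chunked batch sending might look like with a 2023-era aws-sdk-sqs 0.x; the plain-string message bodies and minimal error handling are simplifying assumptions, not the PR's actual code:

use aws_sdk_sqs::{model::SendMessageBatchRequestEntry, Client};

// Sketch: send message bodies in chunks of up to 10, the SendMessageBatch
// per-call limit. Plain strings stand in for serialized queue messages.
async fn send_batched(client: &Client, queue_url: &str, bodies: Vec<String>) {
    for chunk in bodies.chunks(10) {
        let entries = chunk
            .iter()
            .enumerate()
            .map(|(i, body)| {
                SendMessageBatchRequestEntry::builder()
                    .id(i.to_string()) // ids only need to be unique within the batch
                    .message_body(body.clone())
                    .build()
            })
            .collect::<Vec<_>>();

        if let Err(err) = client
            .send_message_batch()
            .queue_url(queue_url)
            .set_entries(Some(entries))
            .send()
            .await
        {
            tracing::error!("Failed to send SQS batch: {:?}", err);
        }
    }
}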

gabehamilton (Collaborator, Author):

I agree that send_to_indexer_queue should be separated. It's an artifact of Opts originally being shared code between alertexer queue handlers. It is likely to be shared again when indexer-js-queue-handler moves to Rust; at that point, send_to_indexer_queue can move into a shared AWS-related module.

@@ -0,0 +1,115 @@
use futures::future::try_join_all;
Collaborator:

Can we just add reduce_indexer_rule_matches_from_outcomes_sync and build_indexer_rule_match_sync to outcomes_reducer.rs rather than creating an entirely new file? They are basically the same.


pub fn spawn_historical_message_thread(
block_height: BlockHeight,
new_indexer_function: &mut IndexerFunction,
Collaborator:

Suggested change:
-    new_indexer_function: &mut IndexerFunction,
+    new_indexer_function: &IndexerFunction,

I don't think this needs to be mutable?

gabehamilton (Collaborator, Author):

It's for the provisioning flag.

gabehamilton (Collaborator, Author):

Ah, you're right, I was thinking of a different spot.

}
}

async fn fetch_text_file_from_s3(s3_bucket: &str, key: String, s3_client: S3Client) -> String {
Collaborator:

Nit: the multiple levels of match make things really hard to read here. We can flatten this by using early returns like so:

async fn fetch_text_file_from_s3(s3_bucket: &str, key: String, s3_client: S3Client) -> String {
    let get_object_output = s3_client
        .get_object()
        .bucket(s3_bucket)
        .key(key.clone())
        .send()
        .await;

    if get_object_output.is_err() {
        tracing::error!(target: crate::INDEXER, "Error fetching S3 file {}: {:?}", key.clone(), get_object_output.err());
        return "".to_string();
    }

    let get_object_output = get_object_output.unwrap();

    let bytes = get_object_output.body.collect().await;
    if bytes.is_err() {
        tracing::error!(target: crate::INDEXER, "Error fetching index file {}: {:?}", key.clone(), bytes.err());
        return "".to_string();
    }

    let bytes = bytes.unwrap();

    let file_contents = String::from_utf8(bytes.to_vec());
    if file_contents.is_err() {
        tracing::error!(target: crate::INDEXER, "Error parsing index file {}: {:?}", key.clone(), file_contents.err());
        return "".to_string();
    }

    let file_contents = file_contents.unwrap();

    tracing::debug!(target: crate::INDEXER, "Fetched S3 file {}", key.clone(),);

    file_contents
}

Not asking you to refactor things in this PR, but let's keep it in mind for future PRs please 🙏🏽
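
As a possible future shape (an assumption on my part, not something this PR implements), the same fetch written to return a Result pushes error handling to the caller and removes even the early-return boilerplate; this sketch assumes anyhow as a dependency and S3Client aliasing aws_sdk_s3::Client:

use anyhow::Context;
use aws_sdk_s3::Client as S3Client;

// Sketch: same fetch, but the caller decides how to handle a missing or
// unreadable file instead of receiving an empty string.
async fn fetch_text_file_from_s3(
    s3_bucket: &str,
    key: &str,
    s3_client: &S3Client,
) -> anyhow::Result<String> {
    let output = s3_client
        .get_object()
        .bucket(s3_bucket)
        .key(key)
        .send()
        .await
        .with_context(|| format!("Error fetching S3 file {key}"))?;

    let bytes = output
        .body
        .collect()
        .await
        .with_context(|| format!("Error reading S3 file {key}"))?;

    let file_contents = String::from_utf8(bytes.to_vec())
        .with_context(|| format!("Error parsing S3 file {key} as UTF-8"))?;

    tracing::debug!(target: crate::INDEXER, "Fetched S3 file {}", key);
    Ok(file_contents)
}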

.collect::<Vec<u64>>()
}

async fn filter_matching_blocks_manually(
Collaborator:

I wonder if we could add additional metadata to the index files, i.e. the method names, to avoid us having to pull down every block to inspect it. This seems very expensive to do here.

gabehamilton (Collaborator, Author):

The index files do have method names; that filter type just isn't implemented in this PR. However, the "manual" filtering handles blocks between what has been indexed and the latest block (as of when processing was spawned). We just added a new metadata file that records the latest indexed block, and an upcoming PR will use it; that will reduce the maximum manual-filtering window to an hour.

It's still an expensive operation.

We also repeat it on the execution side, pulling the block down again. Once we add data extraction to the indexing (pulling out the matching action, for instance), we can put that data in the SQS message, avoiding the additional fetch for many IndexerFunctions.
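
To make that idea concrete, a hypothetical payload shape that carries the extracted action so the execution side can skip refetching the block; the type and field names are assumptions, not the PR's actual queue message:

use serde::{Deserialize, Serialize};

// Hypothetical SQS payload carrying the matched action extracted at filter
// time. Field names are illustrative, not the PR's actual message struct.
#[derive(Serialize, Deserialize)]
struct HistoricalQueueMessage {
    block_height: u64,
    indexer_function_name: String,
    // Populated during filtering; None falls back to fetching the block again.
    matched_action: Option<serde_json::Value>,
}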

gabehamilton (Collaborator, Author):

Maybe this needs a better method name though, e.g. filter_unindexed_blocks.

gabehamilton merged commit 45633ba into main on Jun 2, 2023
gabehamilton deleted the DPLT-929-historical-filtering branch on Jun 2, 2023 17:44
roshaans mentioned this pull request on Jun 6, 2023
gabehamilton added a commit that referenced this pull request on Jun 26, 2023