Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(indexer): Move state_changes from StreamerMessage root to IndexerShard instead #6255

Merged
merged 14 commits into from
Feb 16, 2022

Conversation

khorolets
Copy link
Member

@khorolets khorolets commented Feb 7, 2022

  • Extract indexer::streamer::types to separate crate indexer-primitives for publishing
  • Add GetStateChangesWithCauseInBlockForTrackedShards to client-primitives and corresponding view_client handler to return StateChangesWithCause for each shard separately
  • Breaking changes in StreamerMessage struct, there is no state_changes in the root anymore
  • Increase the streaming channel size from 16 to 100 (reindexing speed)

@khorolets khorolets requested a review from frol February 7, 2022 19:35
@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from f10e529 to 8c9d7d6 Compare February 7, 2022 19:36
chain/client/src/view_client.rs Outdated Show resolved Hide resolved
chain/client/src/view_client.rs Outdated Show resolved Hide resolved
chain/client/src/view_client.rs Outdated Show resolved Hide resolved
chain/indexer-primitives/Cargo.toml Outdated Show resolved Hide resolved
chain/indexer/CHANGELOG.md Outdated Show resolved Hide resolved
chain/indexer/Cargo.toml Outdated Show resolved Hide resolved
chain/indexer/src/streamer/mod.rs Outdated Show resolved Hide resolved
chain/indexer/src/streamer/mod.rs Outdated Show resolved Hide resolved
chain/indexer/src/streamer/mod.rs Outdated Show resolved Hide resolved
chain/indexer/src/streamer/utils.rs Outdated Show resolved Hide resolved
chain/indexer-primitives/Cargo.toml Outdated Show resolved Hide resolved
@khorolets khorolets requested a review from matklad as a code owner February 8, 2022 09:09
@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from 9e694b2 to 4916e1b Compare February 8, 2022 09:09
Copy link
Collaborator

@frol frol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

P.S. Run fmt to fix CI

chain/indexer/CHANGELOG.md Outdated Show resolved Hide resolved
chain/indexer/CHANGELOG.md Outdated Show resolved Hide resolved
chain/indexer/src/streamer/fetchers.rs Outdated Show resolved Hide resolved
chain/client/src/view_client.rs Outdated Show resolved Hide resolved
core/primitives/src/types.rs Outdated Show resolved Hide resolved
@@ -0,0 +1,19 @@
[package]
name = "near-indexer-primitives"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a readme explaining why this crates needs to exist, what should belong to this crate, and what shouldn't belong to this crate? "this crates holds types" doesn't answer those questions for me -- we might as well keep them in indexer, and we also might add some logic here, why not?

It seems to me that what actually happens is that "this crate defines over-the-wire schema for JSON messages used by indexer HTTP API". Ie, it seems that this is not about Rust, but rather about a network protocol/API we expose. That is, if true, a very important bit of information, and it currently is missing from the description.

🤔 wait, I think the above is not true. I've just tried removing Serialize, Deserialize from all intexer messages, and that seems to work -- nothing in this repository seems to rely on serialization. I guess, I am just confused as to why this particular bunch of types needs special treatment and is worth a separate crate :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matklad I'll add a little README.

We are making NEAR Lake - in a few words it's an indexer that is going to save all the data from the network to AWS S3 in JSON files. We are starting some NEAR Lake Framework which is going to read from AWS S3 and stream all the data. Both of the mentioned projects will use indexer primitives, in nearest future, there are going to be even more projects.

The first goal is to make it possible that projects like NEAR Lake Framework and the ones that use it as a dependency won't require an entire nearcore in dependencies in order to actually use only those primitives.

And we are going to extend the number of types in indexer primitives and adjust the serde_json parameters to improve the way the types are serialized.

I hope that makes sense. Let me know if you have questions regarding this extraction to a separate crate :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this makes sense! And yeah, I want to stress that the salient point of this thing then is not the Rust types, but rather the JSON schema, and all of the related backwards compatibility concerns.

And yeah, this totally needs to be a separate crate. I wonder though if -types or -views would be a better public name than -primitives. In today's usage primitives is essentially a kitchen sink, and it defines much more than just data schema.

I also wonder if, long term, this should live in some other repository. It feels odd that this defines a data schema, but the actual data isn't used by the repo (so, it's easy for nearcore developers to break this stuff without realizing this). Contrast this with views, which also do define the schema, but this schema is than used by the code in this repository (the node's RPC API). It seems nicer if indexer-framework defines just the Rust API (without Serialize), and let's the consumers define their own data schema.

🤔 yeah, I guess long-term I can see two worlds:

  • In one world, "indexer schema" is something defined completely outside of nearcore as a part of the lake framework. In this world, there's a separate indexer-types crate, but it lives in the repo which implements the actual indexer. We can also imagine two different lakes by two different providers with different schema
  • In another world, we say that "indexer schema" is something universal, and is as much a part of nearcore as JSON RPC API. In this world, I would expect this to be a module in the crate that defines JSON RPC API (so, near_primitives::views::indexer today). In this world, while there might be alternative implementations of the lake, we'd expect them to use the same schema.

But yeah, this is long-term, starting with moving this to a separate crate might make sense (but also do consider just creating a views submodule)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matklad TL;DR: There are several things at play, and you properly identified the long-term plans we actually have.

The types defined in this crate are Rust types for Indexer Framework, which is now tightly coupled with nearcore. You don't see any use of these APIs in nearcore repo because it is designed to be a library (you can find the usage in indexer-for-* projects out there), and we want to stabilize this API in the long term in terms of Rust types and while we already have our own top-level structure, there are still quite some types used in depth (e.g. Transaction and ExecutionOutcome types). Given there is an initiative to decouple JSON RPC types #5516, we decided to postpone solving all the problems.

Second, indeed, we want to be able to serialize the types for near-lake and deserialize it on the reader side without pulling the whole nearcore (we could have get away with pulling the whole nearcore, but due to #3803 it makes things a bit harder on our side, since we often receive questions about M1 support in our tooling and we use M1 ourselves)

Third, there is a plan to further solidify the APIs and extract near-indexer into a separate repo, but to make that happen, nearcore should be published as a crate, and its APIs also stable.

@matklad
Copy link
Contributor

matklad commented Feb 8, 2022

unrelated, but s/IndexerChunkView/IndexerChunk might make naming a bit more consistent.

@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from 417e4a4 to f277032 Compare February 8, 2022 12:10
@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from 130ce16 to abf99e1 Compare February 8, 2022 16:25
@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from abf99e1 to 204fec7 Compare February 8, 2022 18:18
@@ -8,7 +8,6 @@ pub use near_primitives::{types, views};
pub struct StreamerMessage {
pub block: views::BlockView,
pub shards: Vec<IndexerShard>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have to worry about backwards compatibility ? is this message used in communication between two different binaries ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit above in this PR we have talked about it, let me know if that doesn't answer your question

#6255 (comment)

receipt_execution_outcomes: vec![],
state_changes: state_changes
.remove(&shard_id)
.expect("StateChanges for given shard should be present"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But some shards (chunks) might be missing in a block, right ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, really, my bad, thanks!

@@ -42,4 +41,5 @@ pub struct IndexerShard {
pub shard_id: types::ShardId,
pub chunk: Option<IndexerChunkView>,
pub receipt_execution_outcomes: Vec<IndexerExecutionOutcomeWithReceipt>,
pub state_changes: views::StateChangesView,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any tests for this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a conversation re: indexer and tests with @matklad recently

Here's the start #6195 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so - no tests ? :-(

@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from 204fec7 to 8670763 Compare February 9, 2022 07:30
@khorolets khorolets requested a review from mm-near February 9, 2022 07:30
Copy link
Collaborator

@bowenwang1996 bowenwang1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine as is, but I'd like to note the following:

@khorolets
Copy link
Member Author

This is fine as is, but I'd like to note the following:

Thanks for the heads up @bowenwang1996! In #6250 you're saying it should be possible to retrieve StateChanges on the fly, so it means we would be able to return StateChanges from the NEAR Indexer side.

cc @frol

@khorolets khorolets force-pushed the refactor/indexer-state-changes-by-shard-ids branch from bd87154 to 5dbd4ac Compare February 15, 2022 11:17
pub fn streamer(&self) -> mpsc::Receiver<streamer::StreamerMessage> {
let (sender, receiver) = mpsc::channel(16);
pub fn streamer(&self) -> mpsc::Receiver<StreamerMessage> {
let (sender, receiver) = mpsc::channel(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was it changed to 100? can you add a comment ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the PR description

Increase the streaming channel size from 16 to 100 (reindexing speed)

@@ -42,4 +41,5 @@ pub struct IndexerShard {
pub shard_id: types::ShardId,
pub chunk: Option<IndexerChunkView>,
pub receipt_execution_outcomes: Vec<IndexerExecutionOutcomeWithReceipt>,
pub state_changes: views::StateChangesView,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so - no tests ? :-(

@near-bulldozer near-bulldozer bot merged commit ea731e3 into master Feb 16, 2022
@near-bulldozer near-bulldozer bot deleted the refactor/indexer-state-changes-by-shard-ids branch February 16, 2022 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants