From 746ea70dcb589e92ba6cc7836a743e656c7553b5 Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Tue, 19 Sep 2023 11:30:14 -0700
Subject: [PATCH 01/28] Create nep-0508.md

Create NEP template for resharding
---
 neps/nep-0508.md | 108 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 neps/nep-0508.md

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
new file mode 100644
index 000000000..6718ebc17
--- /dev/null
+++ b/neps/nep-0508.md
@@ -0,0 +1,108 @@
+---
+NEP: 508
+Title: Resharding phase 2
+Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong
+Status: Approved
+DiscussionsTo: https://github.com/nearprotocol/neps/pull/0000
+Type: Developer Tools
+Version: 1.0.0
+Created: 2022-09-19
+LastUpdated: 2023-09-19
+---
+
+## Note
+
+Please refer to [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md) for resharding phase 1, where we performed resharding from 1 shard to 4 shards.
+
+## Summary
+
+[Provide a short human-readable (~200 words) description of the proposal. A reader should get from this section a high-level understanding of the issue this NEP is addressing.]
+
+## Motivation
+
+[Explain why this proposal is necessary, how it will benefit the NEAR protocol or community, and what problems it solves. Also describe why the existing protocol specification is inadequate to address the problem that this NEP solves, and what potential use cases or outcomes.]
+
+## Specification
+
+[Explain the proposal as if you were teaching it to another developer. This generally means describing the syntax and semantics, naming new concepts, and providing clear examples. The specification needs to include sufficient detail to allow interoperable implementations to be built by following only the provided specification. In cases where it is infeasible to specify all implementation details upfront, broadly describe what they are.]
+
+## Reference Implementation
+
+[This technical section is required for Protocol proposals but optional for other categories. A draft implementation should demonstrate a minimal implementation that assists in understanding or implementing this proposal. Explain the design in sufficient detail that:
+
+- Its interaction with other features is clear.
+- Where possible, include a Minimum Viable Interface subsection expressing the required behavior and types in a target programming language. (i.e. traits and structs for rust, interfaces and classes for javascript, function signatures and structs for c, etc.)
+- It is reasonably clear how the feature would be implemented.
+- Corner cases are dissected by example.
+- For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the current code. It should at least solve the key technical challenges.
+
+The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
+
+## Security Implications
+
+[Explicitly outline any security concerns in relation to the NEP, and potential ways to resolve or mitigate them. At the very least, well-known relevant threats must be covered, e.g. person-in-the-middle, double-spend, XSS, CSRF, etc.]
+
+## Alternatives
+
+[Explain any alternative designs that were considered and the rationale for not choosing them. Why is your design superior?]
+
+## Future possibilities
+
+[Describe any natural extensions and evolutions to the NEP proposal, and how they would impact the project. Use this section as a tool to help fully consider all possible interactions with the project in your proposal. This is also a good place to "dump ideas" if they are out of scope for the NEP but otherwise related. Note that having something written down in the future-possibilities section is not a reason to accept the current or a future NEP. Such notes should be in the section on motivation or rationale in this or subsequent NEPs. The section merely provides additional information.]
+
+## Consequences
+
+[This section describes the consequences after applying the decision. All consequences should be summarized here, not just the "positive" ones. Record any concerns raised throughout the NEP discussion.]
+
+### Positive
+
+- p1
+
+### Neutral
+
+- n1
+
+### Negative
+
+- n1
+
+### Backwards Compatibility
+
+[All NEPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The author must explain how they propose to deal with these incompatibilities. Submissions without a sufficient backwards compatibility treatise may be rejected outright.]
+
+## Unresolved Issues (Optional)
+
+[Explain any issues that warrant further discussion. Considerations
+
+- What parts of the design do you expect to resolve through the NEP process before this gets merged?
+- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
+- What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?]
+
+## Changelog
+
+[The changelog section provides historical context for how the NEP developed over time. Initial NEP submission should start with version 1.0.0, and all subsequent NEP extensions must follow [Semantic Versioning](https://semver.org/). Every version should have the benefits and concerns raised during the review. The author does not need to fill out this section for the initial draft. Instead, the assigned reviewers (Subject Matter Experts) should create the first version during the first technical review. After the final public call, the author should then finalize the last version of the decision context.]
+
+### 1.0.0 - Initial Version
+
+> Placeholder for the context about when and who approved this NEP version.
+
+#### Benefits
+
+> List of benefits filled by the Subject Matter Experts while reviewing this version:
+
+- Benefit 1
+- Benefit 2
+
+#### Concerns
+
+> Template for Subject Matter Experts review for this version:
+> Status: New | Ongoing | Resolved
+
+| #   | Concern | Resolution | Status |
+| --: | :------ | :--------- | -----: |
+| 1   |         |            |        |
+| 2   |         |            |        |
+
+## Copyright
+
+Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
From 2d80281570769317f2cf6295b9962d934650ea45 Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Tue, 19 Sep 2023 12:38:52 -0700
Subject: [PATCH 02/28] Update nep-0508.md

Updating draft
---
 neps/nep-0508.md | 58 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 6718ebc17..70fd709b4 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -10,33 +10,53 @@ Created: 2022-09-19
 LastUpdated: 2023-09-19
 ---
 
-## Note
+## Summary
 
-Please refer to [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md) for resharding phase 1, where we performed resharding from 1 shard to 4 shards.
+In essence, this NEP is extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which focused splitting one shard into multiple shards.
 
-## Summary
+We are introducing the second phase of resharding, which supports one shard splitting into two within one epoch at pre-determined split boundary.
 
-[Provide a short human-readable (~200 words) description of the proposal. A reader should get from this section a high-level understanding of the issue this NEP is addressing.]
+While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding.
 
 ## Motivation
 
-[Explain why this proposal is necessary, how it will benefit the NEAR protocol or community, and what problems it solves. Also describe why the existing protocol specification is inadequate to address the problem that this NEP solves, and what potential use cases or outcomes.]
+Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded. In addition, with state sync and stateless validation, validators do not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size.
 
 ## Specification
 
-[Explain the proposal as if you were teaching it to another developer. This generally means describing the syntax and semantics, naming new concepts, and providing clear examples. The specification needs to include sufficient detail to allow interoperable implementations to be built by following only the provided specification. In cases where it is infeasible to specify all implementation details upfront, broadly describe what they are.]
+### High level assumptions
+* Some form of State sync (centralized or decentralized) is enabled.
+* Flat state is enabled.
+* Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
+* Merkle Patricia Trie is underlying data structure for the protocol.
+* Minimal epoch gap between two resharding events is X.
 
-## Reference Implementation
+### High level requirements
+* Resharding should work even when validators stop tracking all shards.
+* Resharding should work after stateless validation is enabled.
+* Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
+* Resharding should not require additional hardware from nodes.
+* Resharding should be fault tolerant
+  * Chain must not stall in case of resharding failure.
+  * A validator should be able to recover in case they go offline during resharding.
+* No transaction or receipt should be lost during resharding.
+* Resharding should work regardless of number of existing shards.
+
+### Required protocol changes
+
+TBD. e.g. configuration changes we have to introduce
 
-[This technical section is required for Protocol proposals but optional for other categories. A draft implementation should demonstrate a minimal implementation that assists in understanding or implementing this proposal. Explain the design in sufficient detail that:
+### Required state changes
 
-- Its interaction with other features is clear.
-- Where possible, include a Minimum Viable Interface subsection expressing the required behavior and types in a target programming language. (i.e. traits and structs for rust, interfaces and classes for javascript, function signatures and structs for c, etc.)
-- It is reasonably clear how the feature would be implemented.
-- Corner cases are dissected by example.
-- For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the current code. It should at least solve the key technical challenges.
+TBD. e.g. additional/updated data a node has to maintain
+
+### Resharding flow
+
+TBD. how resharding happens at the high level
+
+## Reference Implementation
 
-The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
+TBD
 
 ## Security Implications
 
@@ -48,7 +68,13 @@ The section should return to the examples given in the previous section, and exp
 
 ## Future possibilities
 
-[Describe any natural extensions and evolutions to the NEP proposal, and how they would impact the project. Use this section as a tool to help fully consider all possible interactions with the project in your proposal. This is also a good place to "dump ideas" if they are out of scope for the NEP but otherwise related. Note that having something written down in the future-possibilities section is not a reason to accept the current or a future NEP. Such notes should be in the section on motivation or rationale in this or subsequent NEPs. The section merely provides additional information.]
+As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to:
+* automatic determination of split boundary
+* automatic shard splitting and merging based on traffic
+
+Other useful features that can be considered as a follow up:
+* account colocation for low latency across account calls
+* shard on demand
 
 ## Consequences
 
@@ -68,7 +94,7 @@ The section should return to the examples given in the previous section, and exp
 
 ### Backwards Compatibility
 
-[All NEPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The author must explain how they propose to deal with these incompatibilities. Submissions without a sufficient backwards compatibility treatise may be rejected outright.]
+We do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as the target shard of a receipt and the home shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary.
 
 ## Unresolved Issues (Optional)

From c100504ea796fe93992e9a38897022a0a5de698f Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Tue, 19 Sep 2023 12:40:14 -0700
Subject: [PATCH 03/28] Update nep-0508.md

Update Status and DiscussionsTo
---
 neps/nep-0508.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 70fd709b4..2f77b8c49 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -2,9 +2,9 @@
 NEP: 508
 Title: Resharding phase 2
 Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong
-Status: Approved
-DiscussionsTo: https://github.com/nearprotocol/neps/pull/0000
-Type: Developer Tools
+Status: Draft
+DiscussionsTo: https://github.com/near/nearcore/issues/8992
+Type: Protocol
 Version: 1.0.0
 Created: 2022-09-19
 LastUpdated: 2023-09-19
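The split boundary that the draft above keeps referring to can be pictured concretely: a shard covers a contiguous range of the account id space, ranges are delimited by boundary accounts, and splitting a shard amounts to inserting one new boundary account. A minimal sketch, with toy types and toy account names rather than nearcore's actual `ShardLayout`:

```rust
// Simplified sketch, not nearcore's actual types: shards are contiguous
// account-id ranges delimited by sorted boundary accounts.
struct ShardLayout {
    boundary_accounts: Vec<String>, // n boundaries define n + 1 shards
}

impl ShardLayout {
    /// The shard of an account is the index of the first boundary that
    /// sorts after it; a boundary account itself starts the next shard.
    fn account_to_shard(&self, account_id: &str) -> usize {
        self.boundary_accounts
            .iter()
            .position(|b| account_id < b.as_str())
            .unwrap_or(self.boundary_accounts.len())
    }
}

fn main() {
    // A split is one new boundary account ("k" here): accounts outside the
    // split shard keep their assignment.
    let before = ShardLayout { boundary_accounts: vec!["h".into(), "p".into()] };
    let after = ShardLayout { boundary_accounts: vec!["h".into(), "k".into(), "p".into()] };
    assert_eq!(before.account_to_shard("alice.near"), 0);
    assert_eq!(after.account_to_shard("alice.near"), 0); // untouched shard
    assert_eq!(before.account_to_shard("james.near"), 1);
    assert_eq!(after.account_to_shard("james.near"), 1); // left child of the split
}
```

Only the shard being split changes its account range; every other shard keeps its boundaries, which is what makes a single-shard split tractable within one epoch.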
From 396c4269b63d63384c3544121561f0a4b614eacb Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Tue, 19 Sep 2023 14:12:30 -0700
Subject: [PATCH 04/28] Update nep-0508.md

Additional changes
---
 neps/nep-0508.md | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 2f77b8c49..3df4810cb 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -25,6 +25,7 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 ## Specification
 
 ### High level assumptions
+
 * Some form of State sync (centralized or decentralized) is enabled.
 * Flat state is enabled.
 * Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
@@ -32,6 +33,7 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 ### High level requirements
+
 * Resharding should work even when validators stop tracking all shards.
 * Resharding should work after stateless validation is enabled.
@@ -42,6 +44,12 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * No transaction or receipt should be lost during resharding.
 * Resharding should work regardless of number of existing shards.
 
+### Out of scope
+
+* Dynamic resharding
+* Shard determination logic (shard boundary is still determined by string value)
+* TBD
+
 ### Required protocol changes
 
 TBD. e.g. configuration changes we have to introduce
@@ -64,7 +72,12 @@ TBD
 
 ## Alternatives
 
-[Explain any alternative designs that were considered and the rationale for not choosing them. Why is your design superior?]
+* Why is this design the best in the space of possible designs?
+  * TBD
+* What other designs have been considered and what is the rationale for not choosing them?
+  * TBD
+* What is the impact of not doing this?
+  * TBD
 
 ## Future possibilities
 
@@ -78,19 +91,23 @@ Other useful features that can be considered as a follow up:
 
 ## Consequences
 
-[This section describes the consequences after applying the decision. All consequences should be summarized here, not just the "positive" ones. Record any concerns raised throughout the NEP discussion.]
-
 ### Positive
 
-- p1
+* Workload across shards will be more evenly distributed.
+* Required space to maintain state (either in memory or on persistent disk) will be smaller.
+* State sync overhead will be smaller.
+* TBD
 
 ### Neutral
 
-- n1
+* Number of shards is expected to increase.
+* Underlying trie structure and data structure are not going to change.
+* Resharding will create dependency on flat storage and state sync.
 
 ### Negative
 
-- n1
+* The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated by a person.
+* During resharding, a node is expected to do more work as it will have to apply changes twice (for the current shard and future shard).
 
 ### Backwards Compatibility

From 1f1d107fbb2239a91b882ca5644e825dda3bd376 Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Thu, 21 Sep 2023 08:57:58 -0700
Subject: [PATCH 05/28] Update nep-0508.md

Updating based on @wacban 's feedback
---
 neps/nep-0508.md | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 3df4810cb..be7c17137 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -1,6 +1,6 @@
 ---
 NEP: 508
-Title: Resharding phase 2
+Title: Resharding v2
 Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong
 Status: Draft
 DiscussionsTo: https://github.com/near/nearcore/issues/8992
@@ -12,9 +12,11 @@ LastUpdated: 2023-09-19
 
 In essence, this NEP is extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which focused splitting one shard into multiple shards.
 
-We are introducing the second phase of resharding, which supports one shard splitting into two within one epoch at pre-determined split boundary.
+We are introducing resharding v2, which supports one shard splitting into two within one epoch at pre-determined split boundary. The NEP includes performance improvements to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (to be specific, splitting shard 3 into two).
+
+While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding.
 
-While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding.
 ## Motivation
@@ -29,7 +31,7 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * Some form of State sync (centralized or decentralized) is enabled.
 * Flat state is enabled.
 * Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
-* Merkle Patricia Trie is underlying data structure for the protocol.
+* Merkle Patricia Trie is the underlying data structure for the protocol state.
 * Minimal epoch gap between two resharding events is X.
@@ -37,17 +39,27 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * Resharding should work even when validators stop tracking all shards.
 * Resharding should work after stateless validation is enabled.
 * Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
-* Resharding should not require additional hardware from nodes.
+* ~~Resharding should not require additional hardware from nodes.~~
+  * This needs to be assessed during test
 * Resharding should be fault tolerant
   * Chain must not stall in case of resharding failure.
   * A validator should be able to recover in case they go offline during resharding.
+  * For now, our aim is at least allowing a validator to join back after resharding is finished.
 * No transaction or receipt should be lost during resharding.
 * Resharding should work regardless of number of existing shards.
+* There should be no more place (in any apps or tools) where number of shard is hardcoded.
 
 ### Out of scope
 
 * Dynamic resharding
+  * automatically scheduling resharding based on shard usage/capacity
+  * automatically determining the shard layout
+* merging shards
+* shard reshuffling
+* shard boundary adjustment
 * Shard determination logic (shard boundary is still determined by string value)
+* Advanced failure handling
+  * If a validator goes offline during resharding, it can join back immediately and move forward as long as enough time is left to reperform resharding.
 * TBD
 
 ### Required protocol changes
@@ -108,6 +120,7 @@ Other useful features that can be considered as a follow up:
 
 * The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated by a person.
 * During resharding, a node is expected to do more work as it will have to apply changes twice (for the current shard and future shard).
+* Increased potential for apps and tools to break without proper shard layout change handling.
 
 ### Backwards Compatibility

From 68a9b62fc1f92178b0e34f0f8640bcb7bef4b2f3 Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Wed, 25 Oct 2023 07:22:25 -0700
Subject: [PATCH 06/28] Update nep-0508.md

fix lint errors
---
 neps/nep-0508.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index be7c17137..c690f9b31 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -94,10 +94,12 @@ TBD
 ## Future possibilities
 
 As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to:
+
 * automatic determination of split boundary
 * automatic shard splitting and merging based on traffic
 
 Other useful features that can be considered as a follow up:
+
 * account colocation for low latency across account calls
 * shard on demand
 
 ## Consequences
@@ -130,9 +132,9 @@
 [Explain any issues that warrant further discussion. Considerations
 
-- What parts of the design do you expect to resolve through the NEP process before this gets merged?
-- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
-- What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?]
+* What parts of the design do you expect to resolve through the NEP process before this gets merged?
+* What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
+* What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?]
 
 ## Changelog
@@ -146,8 +148,8 @@ We do not expect anything to break with this change. Yet, shard splitting can in
 #### Benefits
 
 > List of benefits filled by the Subject Matter Experts while reviewing this version:
 
-- Benefit 1
-- Benefit 2
+* Benefit 1
+* Benefit 2
 
 #### Concerns

From 051e5aab0625329b7d0618c179229a37478ebaff Mon Sep 17 00:00:00 2001
From: wacban
Date: Thu, 26 Oct 2023 15:11:42 +0100
Subject: [PATCH 07/28] Update nep-0508.md

styling improvement
---
 neps/nep-0508.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index c690f9b31..b1c07b14e 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -12,7 +12,7 @@ LastUpdated: 2023-09-19
 
 ## Summary
 
-In essence, this NEP is extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which focused splitting one shard into multiple shards.
+In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which focused splitting one shard into multiple shards.

From c0ae86db7e4ca52d62e5cad547b1f36a283959cf Mon Sep 17 00:00:00 2001
From: wacban
Date: Fri, 27 Oct 2023 13:01:23 +0100
Subject: [PATCH 08/28] reference implementation description and minor nits

---
 neps/nep-0508.md | 62 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 10 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index b1c07b14e..504cb0abd 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -12,27 +12,27 @@ LastUpdated: 2023-09-19
 
 ## Summary
 
-In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which focused splitting one shard into multiple shards.
+This proposal introduces a new implementation for resharding and a new shard layout for the production networks.
 
-We are introducing resharding v2, which supports one shard splitting into two within one epoch at pre-determined split boundary. The NEP includes performance improvements to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (to be specific, splitting shard 3 into two).
+In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards.
 
-While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding.
+We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. The NEP includes performance improvements to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (to be specific, splitting shard 3 into two).
+
+While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding.
 
 ## Motivation
 
-Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded. In addition, with state sync and stateless validation, validators do not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size.
+Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded. In addition, with state sync and stateless validation, validators will not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size.
 
 ## Specification
 
 ### High level assumptions
 
-* Some form of State sync (centralized or decentralized) is enabled.
 * Flat state is enabled.
 * Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
 * Merkle Patricia Trie is the underlying data structure for the protocol state.
 * Minimal epoch gap between two resharding events is X.
+* Some form of State Sync (centralized or decentralized) is enabled.
 
 ### High level requirements
 
 * Resharding should work even when validators stop tracking all shards.
 * Resharding should work after stateless validation is enabled.
 * Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
 * ~~Resharding should not require additional hardware from nodes.~~
   * This needs to be assessed during test
 * Resharding should be fault tolerant
-  * Chain must not stall in case of resharding failure.
+  * Chain must not stall in case of resharding failure. TODO - this seems impossible under current assumptions because the shard layout for an epoch is committed to the chain before resharding is finished
   * A validator should be able to recover in case they go offline during resharding.
   * For now, our aim is at least allowing a validator to join back after resharding is finished.
 * No transaction or receipt should be lost during resharding.
 * Resharding should work regardless of number of existing shards.
-* There should be no more place (in any apps or tools) where number of shard is hardcoded.
+* There should be no more place (in any apps or tools) where the number of shards is hardcoded.
 
 ### Out of scope
 
 * Dynamic resharding
   * automatically scheduling resharding based on shard usage/capacity
   * automatically determining the shard layout
 * merging shards
 * shard reshuffling
 * shard boundary adjustment
-* Shard determination logic (shard boundary is still determined by string value)
+* Shard Layout determination logic (shard boundaries are still determined offline and hardcoded)
 * Advanced failure handling
   * If a validator goes offline during resharding, it can join back immediately and move forward as long as enough time is left to reperform resharding.
 * TBD
 
 ### Required protocol changes
 
 TBD. e.g. configuration changes we have to introduce
 
+A new protocol version will be introduced specifying the new shard layout.
+
 ### Required state changes
 
 TBD. e.g. additional/updated data a node has to maintain
 
+* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns.
+* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time.
+
 ### Resharding flow
 
 TBD. how resharding happens at the high level
 
+* The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation.
+* In epoch T the protocol version upgrade date will pass and nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout.
+* In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout.
+* In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end.
+* In epoch T + 2, the chain will switch to the new shard layout.
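+
+As a rough illustration of this timeline (a hypothetical helper, not the actual EpochConfig machinery):
+
+```rust
+// Sketch only: which shard layout is active in a given epoch, where
+// `upgrade_epoch` is the epoch T in which the new protocol version wins
+// the vote. The new layout is set for T + 2, and T + 1 is when nodes
+// split the state in the background while still using the old layout.
+fn shard_layout_version(epoch: u64, upgrade_epoch: u64) -> u32 {
+    if epoch >= upgrade_epoch + 2 {
+        2 // new layout: the chain has switched over
+    } else {
+        1 // old layout; during T + 1 the split is being computed
+    }
+}
+
+fn main() {
+    let t = 100;
+    assert_eq!(shard_layout_version(t, t), 1);     // epoch T: vote passes
+    assert_eq!(shard_layout_version(t + 1, t), 1); // epoch T + 1: splitting
+    assert_eq!(shard_layout_version(t + 2, t), 2); // epoch T + 2: switch
+}
+```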
+
 ## Reference Implementation
 
-TBD
+The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.
+
+### Flat Storage
+The old implementation of resharding relied on iterating over the full state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on the flat storage in order to build the children shards quicker. Based on benchmarks, splitting one shard by using flat storage can take up to 15min.
+
+The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout. The old implementation didn't handle this case because the flat storage didn't exist back then.
+
+In order to ensure consistent view of the flat storage while splitting the state the node will maintain a snapshot of the flat state and related columns. The existing implementation of flat state snapshots used in State Sync will be adjusted for this purpose.
+
+### Handling receipts, gas burnt and balance burnt
+
+When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correctly handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact that in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts, gas burnt and balance burnt that works for arbitrary splitting of shards, regardless of the previous shard layout.
+
+### New shard layout
+
+A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - Shard 3 will be split by introducing a new boundary account. The new boundary account will be determined by analysis the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used.
+
+### Fixed shards
+
+Fixed shards is a feature of the protocol that allows for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature is only used for testing, it was never used in production and there is no need for it in production. This feature unfortunately breaks the contiguity of shards. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards makes it particularly hard to reason about and implement efficient resharding. In order to simplify the code and new resharding implementation the fixed shards feature was removed ahead of this NEP.
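+
+Returning to the receipt handling described above, the reassignment can be sketched as follows (illustrative types only, not the actual nearcore implementation; gas burnt and balance burnt must be re-aggregated along the same partition so that the children's totals add up to the parent's):
+
+```rust
+struct Receipt {
+    receiver_id: String,
+    // gas_burnt, balance_burnt, ... omitted in this sketch
+}
+
+/// Every receipt goes to exactly one child, decided by the boundary
+/// account, so nothing is lost and nothing is duplicated.
+fn split_receipts(parent: Vec<Receipt>, boundary: &str) -> (Vec<Receipt>, Vec<Receipt>) {
+    parent
+        .into_iter()
+        .partition(|r| r.receiver_id.as_str() < boundary)
+}
+```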
+
+### Transaction pool
+
+The transaction pool is sharded i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transactions from parent shard to children shards right before the new shard layout takes effect. This was implemented ahead of this NEP.
 
 ## Security Implications
@@ -120,6 +160,14 @@ TBD
 * What is the impact of not doing this?
   * TBD
 
+## Integration with State Sync
+
+TBD
+
+## Integration with Stateless Validation
+
+TBD
+
 ## Future possibilities

From 6a8fedc052b279f9447bec2246fa6ae78ab767d2 Mon Sep 17 00:00:00 2001
From: wacban
Date: Fri, 27 Oct 2023 13:17:08 +0100
Subject: [PATCH 09/28] fix lint

---
 neps/nep-0508.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 504cb0abd..a46c09d17 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -90,6 +90,7 @@ TBD
 The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.
 
 ### Flat Storage
+
 The old implementation of resharding relied on iterating over the full state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on the flat storage in order to build the children shards quicker. Based on benchmarks, splitting one shard by using flat storage can take up to 15min.

From 3f850d25cd79f5d606e8f13efee5cb492ba216d8 Mon Sep 17 00:00:00 2001
From: wacban
Date: Fri, 27 Oct 2023 15:00:56 +0100
Subject: [PATCH 10/28] alternatives

---
 neps/nep-0508.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index a46c09d17..b9fe6c776 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -120,11 +120,14 @@ TBD
 ## Alternatives
 
 * Why is this design the best in the space of possible designs?
-  * TBD
+  * This design is the simplest, most robust and safe while meeting all of the requirements.
 * What other designs have been considered and what is the rationale for not choosing them?
-  * TBD
+  * Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted for the much simpler one using flat storage given it is already quite performant.
+  * Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. This approach has two major drawbacks. 1) It would require a massive migration. 2) We would need to maintain the old and the new trie structure forever.
+  * Changing the storage structure by having the storage key take the format of account_id.node_hash. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible.
+  * Changing the storage structure by having the key take the format of only node_hash. This is a feasible approach but it adds complexity to the garbage collection and data deletion. We opted for the much simpler one by using the existing scheme of prefixing storage entries by shard uid.
 * What is the impact of not doing this?
-  * TBD
+  * We need resharding in order to scale up the system. Without resharding, shards would eventually grow so big (in either storage or cpu usage) that a single node would not be able to handle it.
@@ -150,6 +153,7 @@ Other useful features that can be considered as a follow up:
 
 * account colocation for low latency across account calls
+* removal of shard uids and introducing globally unique shard ids
 * shard on demand
@@ -168,7 +172,7 @@ Other useful features that can be considered as a follow up:
 ### Negative
 
 * The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated by a person.
-* During resharding, a node is expected to do more work as it will have to apply changes twice (for the current shard and future shard).
+* During resharding, a node is expected to do more work as it will first need to copy a lot of data around and then will have to apply changes twice (for the current shard and future shard).
 * Increased potential for apps and tools to break without proper shard layout change handling.

From b621aab999025026686a8f4349ca96e1ea170941 Mon Sep 17 00:00:00 2001
From: wacban
Date: Tue, 31 Oct 2023 13:51:52 +0000
Subject: [PATCH 11/28] state sync and stateless validation

---
 neps/nep-0508.md | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index b9fe6c776..a176abebe 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -42,7 +42,7 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * ~~Resharding should not require additional hardware from nodes.~~
   * This needs to be assessed during test
 * Resharding should be fault tolerant
-  * Chain must not stall in case of resharding failure. TODO - this seems impossible under current assumptions because the shard layout for an epoch is committed to the chain before resharding is finished
   * A validator should be able to recover in case they go offline during resharding.
   * For now, our aim is at least allowing a validator to join back after resharding is finished.
 * No transaction or receipt should be lost during resharding.
 * Resharding should work regardless of number of existing shards.
@@ -103,15 +103,19 @@ When resharding, extra care should be taken when handling receipts in order to e
 
 ### New shard layout
 
-A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - Shard 3 will be split by introducing a new boundary account. The new boundary account will be determined by analysis the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used.
+A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - Shard 3 - will be split by introducing a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used.
 
 ### Fixed shards
 
-Fixed shards is a feature of the protocol that allows for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature is only used for testing, it was never used in production and there is no need for it in production. This feature unfortunately breaks the contiguity of shards. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards makes it particularly hard to reason about and implement efficient resharding. In order to simplify the code and new resharding implementation the fixed shards feature was removed ahead of this NEP.
+Fixed shards is a feature of the protocol that allows for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature is only used for testing, it was never used in production and there is no need for it in production. This feature unfortunately breaks the contiguity of shards. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards makes it particularly hard to reason about and implement efficient resharding.
+
+This was implemented ahead of this NEP.
 
 ### Transaction pool
 
-The transaction pool is sharded i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transactions from parent shard to children shards right before the new shard layout takes effect. This was implemented ahead of this NEP.
+The transaction pool is sharded i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transactions from parent shard to children shards right before the new shard layout takes effect.
+
+This was implemented ahead of this NEP.
@@ -133,11 +137,17 @@ The transaction pool is sharded i.e. it groups transactions by the shard where e
 
 ## Integration with State Sync
 
-TBD
+There are two known issues in the integration of resharding and state sync:
+* When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch. It cannot be done on the children shards as on chain the block was applied on the parent shards and the trie related gas costs would be different.
+* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of the shard but it's generated on the parent shard layout and so may not be verified.
+
+In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the abovementioned issues separately.
 
 ## Integration with Stateless Validation
 
-TBD
+The Stateless Validation requires that chunk producers provide proofs of correctness of the transition function from one state root to another. That proof for the first block after the new shard layout takes place will need to prove that the entire state split was correct as well as the state transition.
+
+In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the abovementioned issues separately.
@@ -170,12 +180,12 @@ Other useful features that can be considered as a follow up:
 
 ### Neutral
 
 * Number of shards is expected to increase.
 * Underlying trie structure and data structure are not going to change.
-* Resharding will create dependency on flat storage and state sync.
+* Resharding will create dependency on flat storage, flat state snapshots and state sync. TODO - what dependency on state sync?
 
 ### Negative
 
 * The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated by a person.
-* During resharding, a node is expected to do more work as it will first need to copy a lot of data around and then will have to apply changes twice (for the current shard and future shard).
+* During resharding, a node is expected to do more work as it will first need to copy a lot of data around and then will have to apply changes twice (for the current shard and the future shard).
 * Increased potential for apps and tools to break without proper shard layout change handling.

From 1a0ffee3b5daee6284b338b116a6a12afea87fda Mon Sep 17 00:00:00 2001
From: wacban
Date: Tue, 31 Oct 2023 13:58:00 +0000
Subject: [PATCH 12/28] fix lint

---
 neps/nep-0508.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index a176abebe..6e204eba8 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -136,6 +136,7 @@ This was implemented ahead of this NEP.
 ## Integration with State Sync
 
 There are two known issues in the integration of resharding and state sync:
+
 * When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch. It cannot be done on the children shards as on chain the block was applied on the parent shards and the trie related gas costs would be different.
 * When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of the shard but it's generated on the parent shard layout and so may not be verified.
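The ShardUId migration described in the transaction pool section above can be reduced to a small sketch. The exact nearcore definition differs in detail, but the idea is that the identifier carries the shard layout version, so "shard 3 before the split" and "shard 3 after the split" never collide as map keys:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardUId {
    version: u32,  // bumped with every new shard layout
    shard_id: u32, // index within that layout
}

fn main() {
    let old = ShardUId { version: 1, shard_id: 3 };
    let new = ShardUId { version: 2, shard_id: 3 };
    // Same ShardId, different shards: keying the transaction pool by
    // ShardUId keeps their transactions separate across the resharding.
    assert_ne!(old, new);
}
```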
From 99a3845525b329e4cf98a295b1ed4f912d50a1ef Mon Sep 17 00:00:00 2001
From: wacban
Date: Wed, 1 Nov 2023 17:39:11 +0000
Subject: [PATCH 13/28] remove state sync from dependencies

---
 neps/nep-0508.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 6e204eba8..7be8f698e 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -32,7 +32,6 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * Flat state is enabled.
 * Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
 * Merkle Patricia Trie is the underlying data structure for the protocol state.
 * Minimal epoch gap between two resharding events is X.
-* Some form of State Sync (centralized or decentralized) is enabled.
 
 ### High level requirements
@@ -173,7 +173,7 @@ Other useful features that can be considered as a follow up:
 
 ### Neutral
 
 * Number of shards is expected to increase.
 * Underlying trie structure and data structure are not going to change.
-* Resharding will create dependency on flat storage, flat state snapshots and state sync. TODO - what dependency on state sync?
+* Resharding will create dependency on flat state snapshots.

From 26ec5a5d0c39e3f8c25bd70fd9aff5e74cff0cff Mon Sep 17 00:00:00 2001
From: wacban
Date: Wed, 1 Nov 2023 18:02:06 +0000
Subject: [PATCH 14/28] per comments

---
 neps/nep-0508.md | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 7be8f698e..c551a54dd 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -70,15 +70,15 @@ A new protocol version will be introduced specifying the new shard layout.
 
 ### Required state changes
 
 TBD. e.g. additional/updated data a node has to maintain
 
-* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns.
-* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time.
+* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns. As the main database and the snapshot diverge this will cause some extent of storage overhead.
+* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time. The State and FlatState columns will grow up to 2x. The processing overhead should be minimal as the chunks will still be executed only on the parent shards. There will be increased load on the database while applying changes to both the parent and the children shards.
 
 ### Resharding flow
 
 TBD. how resharding happens at the high level
 
 * The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation.
-* In epoch T the protocol version upgrade date will pass and nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout.
+* In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout.
 * In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout.
 * In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end.
 * In epoch T + 2, the chain will switch to the new shard layout.
@@ -90,6 +90,11 @@ TBD
 
 The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.
 
+### Code pointers to the proposed implementation
+
+* [new shard layout](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/core/primitives/src/shard_layout.rs#L161)
+* [the main logic for splitting states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/resharding.rs#L248)
+* [the main logic for applying chunks to split states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/chain.rs#L5180)
+
@@ -107,19 +113,19 @@ When resharding, extra care should be taken when handling receipts in order to e
 ### New shard layout
 
-A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - Shard 3 - will be split by introducing a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used.
+The first release of the resharding v2 will contain a new shard layout where one of the existing shards will be split into two smaller shards. Furthermore additional reshardings can be scheduled in subsequent neard releases without additional NEPs unless the need for it arises. A new shard layout will be determined and will be scheduled and executed in the production networks. Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used.
 
 ### Fixed shards
 
 Fixed shards is a feature of the protocol that allows for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature is only used for testing, it was never used in production and there is no need for it in production. This feature unfortunately breaks the contiguity of shards. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards makes it particularly hard to reason about and implement efficient resharding.
 
-This was implemented ahead of this NEP.
+This was implemented ahead of this NEP and the fixed shards feature was **removed**.
 
 ### Transaction pool
 
-The transaction pool is sharded i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transactions from parent shard to children shards right before the new shard layout takes effect.
+The transaction pool is sharded i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transactions from parent shard to children shards right before the new shard layout takes effect. The ShardUId contains the version of the shard layout which allows differentiating between shards in different shard layouts.
 
-This was implemented ahead of this NEP.
+This was implemented ahead of this NEP and the transaction pool is now fully **migrated** to ShardUId.

From 0c3b123947df9c1274324847ddc665dd261a60df Mon Sep 17 00:00:00 2001
From: Shreyan Gupta
Date: Tue, 14 Nov 2023 20:38:32 +0530
Subject: [PATCH 15/28] Update nep-0508.md (#517)

Creating a PR so that we can discuss changes...
---
 neps/nep-0508.md | 143 +++++++++++++++++++++++------------------------
 1 file changed, 69 insertions(+), 74 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index c551a54dd..9706f60d6 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -20,44 +20,38 @@ While the new approach addresses critical limitations left unsolved in NEP-40 an
 ## Motivation
 
-Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded. In addition, with state sync and stateless validation, validators will not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size.
+Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded with respect to total state size and number of transactions. In addition, with state sync and stateless validation, validators will not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size. With future in-memory tries, it's also important to limit the size of individual shards.
 
 ## Specification
 
 ### High level assumptions
 
-* Flat state is enabled.
-* Shard split boundary is predetermined. In other words, necessity of shard splitting is manually decided.
+* Flat storage is enabled.
+* Shard split boundary is predetermined and hardcoded. In other words, necessity of shard splitting is manually decided.
+* For the time being, resharding as an event is only going to happen once, but we would still like to have the infrastructure in place to handle future resharding events with ease.
 * Merkle Patricia Trie is the underlying data structure for the protocol state.
-* Minimal epoch gap between two resharding events is X.
+* Epoch is at least 6 hrs long for resharding to complete.
 
 ### High level requirements
 
 * Resharding should work even when validators stop tracking all shards.
 * Resharding should work after stateless validation is enabled.
 * Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
-* ~~Resharding should not require additional hardware from nodes.~~
-  * This needs to be assessed during test
-* Resharding should be fault tolerant
-  * A validator should be able to recover in case they go offline during resharding.
-  * For now, our aim is at least allowing a validator to join back after resharding is finished.
+* Resharding should work efficiently within the limits of the current hardware requirements for nodes.
+* Potential failures in resharding may require intervention from the node operator to recover.
 * No transaction or receipt should be lost during resharding.
 * Resharding should work regardless of number of existing shards.
-* There should be no more place (in any apps or tools) where the number of shards is hardcoded.
+* No apps, tools or code should hardcode the number of shards to 4.
 
 ### Out of scope
 
 * Dynamic resharding
   * automatically scheduling resharding based on shard usage/capacity
   * automatically determining the shard layout
-* merging shards
-* shard reshuffling
-* shard boundary adjustment
-* Shard Layout determination logic (shard boundaries are still determined offline and hardcoded)
-* Advanced failure handling
-  * If a validator goes offline during resharding, it can join back immediately and move forward as long as enough time is left to reperform resharding.
-* TBD
+* Merging shards or boundary adjustments
+* Shard reshuffling
+* Shard Layout determination logic. Shard boundaries are still determined offline and hardcoded.
 
 ### Required protocol changes
 
-TBD. e.g. configuration changes we have to introduce
-
-A new protocol version will be introduced specifying the new shard layout.
+A new protocol version will be introduced specifying the new shard layout which would be picked up by the resharding logic to split the shard.
 
 ### Required state changes
 
-TBD. e.g. additional/updated data a node has to maintain
-
-* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns. As the main database and the snapshot diverge this will cause some extent of storage overhead.
-* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time. The State and FlatState columns will grow up to 2x. The processing overhead should be minimal as the chunks will still be executed only on the parent shards. There will be increased load on the database while applying changes to both the parent and the children shards.
+* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns. As the main database and the snapshot diverge this will cause some extent of storage overhead.
+* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time. The State and FlatState columns will grow up to approx 2x the size. The processing overhead should be minimal as the chunks will still be executed only on the parent shards. There will be increased load on the database while applying changes to both the parent and the children shards.
 
 ### Resharding flow
 
-TBD. how resharding happens at the high level
-
 * The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation.
 * In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout.
 * In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout.
-* In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end.
+* In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end first via catchup, and later as part of block postprocessing state application.
 * In epoch T + 2, the chain will switch to the new shard layout.
 
 ## Reference Implementation
 
-The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.
+The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed the major differences and additions.
 
 ### Code pointers to the proposed implementation
 
 * [new shard layout](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/core/primitives/src/shard_layout.rs#L161)
 * [the main logic for splitting states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/resharding.rs#L248)
 * [the main logic for applying chunks to split states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/chain.rs#L5180)
+* [the main logic for garbage collecting state from parent shard](https://github.com/near/nearcore/blob/eb824087fdd0763f00e695e303d7ad6c56f96538/chain/chain/src/store.rs#L2325)
 
 ### Flat Storage
 
-The old implementation of resharding relied on iterating over the full state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch.
The new implementation relies on the flat storage in order to build the children shards quicker. Based on benchmarks, splitting one shard by using flat storage can take up to 15min. +The old implementaion of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting shard 3 by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain block production rate. -The new implementation will also propagate the flat storage for the children shards and keep it up to the with the chain until the switch to the new shard layout. The old implementation didn't handle this case because the flat storage didn't exist back then. +The new implementation will also propagate the flat storage for the children shards and keep it up to the with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then. -In order to ensure consistent view of the flat storage while splitting the state the node will maintain a snapshot of the flat state and related columns. The existing implementation of flat state snapshots used in State Sync will be adjusted for this purpose. +In order to ensure consistent view of the flat storage while splitting the state the node will maintain a snapshot of the flat state and related columns as of the last block of the epoch prior to resharding. The existing implementation of flat state snapshots used in State Sync will be used for this purpose. ### Handling receipts, gas burnt and balance burnt -When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correclty handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts, gas burnt and balance burnt that works for arbitrary splitting of shards, regardless of the previous shard layout. +When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correclty handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts to the child shards, gas burnt, and balance burnt, that works for arbitrary splitting of shards, regardless of the previous shard layout. ### New shard layout -The first release of the resharding v2 will contain a new shard layout where one of the existing shard will be split into two smaller shards. Furthermore additional reshardings can be scheduled in subsequent neard releases without additional NEPs unless the need for it arises. A new shard layout will be determined and will be scheduled and executed in the production networks. 
Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used. +The first release of the resharding v2 will contain a new shard layout where one of the existing shards will be split into two smaller shards. Furthermore additional reshardings can be scheduled in subsequent neard releases without additional NEPs unless the need for it arises. A new shard layout can be determined and will be scheduled and executed with the next protocol upgrade. Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account that will be determined by analysing the storage and gas usage metrics within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used based on requirements. -### Fixed shards +### Removal of Fixed shards -Fixed shards is a feature of the protocol that allows for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature is only used for testing, it was never used in production and there is no need for it in production. This feature unfortunately breaks the contiguity of shards. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards makes it particularly hard to reason about and implement efficient resharding. +Fixed shards was a feature of the protocol that allowed for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature was only used for testing and was never used in production. Fixed shards fature unfortunately breaks the contiguity of shards and is not compatible with the new resharding flow. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards made it particularly hard to reason about and implement efficient resharding. This was implemented ahead of this NEP and the fixed shards feature was **removed**. +### Garbage collection + +In epoch T+2 once resharding is completed, we can delete the trie state and the flat state related to the parent shard. In practice, this is handled as part of the garbage collection code. While garbage collecting the last block of epoch T+1, we go ahead and clear all the data associated with the parent shard from the trie cache, flat storage, and RocksDB state associated with trie state and flat storage. + ### Transaction pool -The transaction pool is sharded e.i. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead and a transaction pool resharding was implemented to reassign transaction from parent shard to children shards right before the new shard layout takes effect. 
The ShardUId contains the version of the shard layout which allows differentiating between shards in different shard layouts. +The transaction pool is sharded i.e. it groups transactions by the shard where each transaction should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead, and a transaction pool resharding was implemented to reassign transaction from parent shard to children shards right before the new shard layout takes effect. The ShardUId contains the version of the shard layout which allows differentiating between shards in different shard layouts. This was implemented ahead of this NEP and the transaction pool is now fully **migrated** to ShardUId. -## Security Implications +## Alternatives -[Explicitly outline any security concerns in relation to the NEP, and potential ways to resolve or mitigate them. At the very least, well-known relevant threats must be covered, e.g. person-in-the-middle, double-spend, XSS, CSRF, etc.] +### Why is this design the best in the space of possible designs? -## Alternatives +This design is the simplest, most robust and safe while meeting all of the requirements. + +### What other designs have been considered and what is the rationale for not choosing them? + +* Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted in for the much simpler one using flat storage given it is already quite performant. +* Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. This approach has two major drawbacks without providing too many benefits over the previous approach of splitting by each trie record type. + 1) It would require a massive migration of trie. + 2) We would need to maintain the old and the new trie structure forever. +* Changing the storage structure by having the storage key to have the format of `account_id.node_hash`. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible. +* Changing the storage structure by having the key format as only node_hash and dropping the ShardUId prefix. This is a feasible approach but it adds complexity to the garbage collection and data deletion, specially when nodes would start tracking only one shard. We opted in for the much simpler one by using the existing scheme of prefixing storage entries by shard uid. -* Why is this design the best in the space of possible designs? - * This design is the simplest, most robust and safe while meeting all of the requirements. -* What other designs have been considered and what is the rationale for not choosing them? - * Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted in for the much simpler one using flat storage given it is already quite performant. 
- * Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. This approach has two major drawbacks. 1) It would require a massive migration. 2) We would need to maintain the old and the new trie structure forever.
- * Changing the storage structure by having the storage key to have the format of account_id.node_hash. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible.
- * Changing the storage structure by having the key to have the format of only node_hash. This is a feasible approach but it adds complexity to the garbage collection and data deletion. We opted in for the much simpler one by using the existing scheme of prefixing storage entries by shard uid.
-* What is the impact of not doing this?
- * We need resharding in order to scale up the system. Without resharding eventually shards would grow so big (in either storage or cpu usage) that a single node would not be able to handle it.

### What is the impact of not doing this?

We need resharding in order to scale up the system. Without resharding eventually shards would grow so big (in either storage or cpu usage) that a single node would not be able to handle it. Additionally, this clears up the path to implement in-memory tries as we need to store the whole trie structure in limited RAM. In the future smaller shard size would lead to faster syncing of shard data when nodes start tracking just one shard.

## Integration with State Sync

There are two known issues in the integration of resharding and state sync:

-* When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch. It cannot be done on the children shard as on chain the block was applied on the parent shards and the trie related gas costs would be different.
+* When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch. It cannot be done on the children shard as on chain the block was applied on the parent shards and the trie related gas costs would be different.
* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of the shard but it's generated on the parent shard layout and so may not be verified.

-In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the abovementioned issues separately.
+In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above mentioned issues separately. We believe at least some of the issues can be mitigated by the implementation of the new pre-state root and chunk execution design.

## Integration with Stateless Validation

-The Stateless Validation requires that chunk producers provide proofs of correctness of the transition function from one state root to another. That proof for the first block after the new shard layout takes place will need to prove that the entire state split was correct as well as the state transition.
+The Stateless Validation requires that chunk producers provide proof of correctness of the transition function from one state root to another. That proof for the first block after the new shard layout takes place will need to prove that the entire state split was correct as well as the state transition.

-In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the abovementioned issues separately.
+In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately.

## Future possibilities

+### Dynamic resharding
+
As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to:

-* automatic determination of split boundary
-* automatic shard splitting and merging based on traffic
+* Automatic determination of split boundary based on parameters like traffic, gas usage, state size, etc.
+* Automatic shard splitting and merging
+
+### Localization of resharding event to specific shard
+
+As of today, at the RocksDB storage layer, we have the ShardUId, i.e. the ShardId along with the ShardVersion, as a prefix in the key of trie state and flat state. During a resharding event, we increment the ShardVersion by one, and effectively remap all the current parent shards to new child shards. This implies we can't use the same underlying key value pairs in the store and instead would need to duplicate the values with the new ShardUId prefix, even if a shard is unaffected and not split.
+
+In the future, we would like to potentially change the schema in a way such that only the shard that is splitting is impacted by a resharding event, so as to avoid additional work done by nodes tracking other shards.
+
-Other useful features that can be considered as a follow up:
+### Other useful features

-* account colocation for low latency across account call
-* removal of shard uids and introducing globally unique shard ids
-* shard on demand
+* Account colocation for low latency across account call
+* Removal of shard uids and introducing globally unique shard ids
+* Shard on demand

## Consequences

### Positive

* Workload across shards will be more evenly distributed.
-* Required space to maintain state (either in memory or in persistent disk) will be smaller.
-* State sync overhead will be smaller.
-* TBD
+* Required space to maintain state (either in memory or in persistent disk) will be smaller. This is useful for in-memory tries.
+* State sync overhead will be smaller with smaller state size.

### Neutral

-* Number of shards is expected to increase.
+* Number of shards would increase.
* Underlying trie structure and data structure are not going to change.
-* Resharding will create dependency on flat state snapshots.
+* Resharding will create dependency on flat state snapshots.

### Negative

-* The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated by a person.
-* During resharding, a node is expected to do more work as it will first need to copy a lot of data around the then will have to apply changes twice (for the current shard and the future shard).
+* The resharding process, as of now, is not fully automated.
Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated and tracked. +* During resharding, a node is expected to require more resources as it will first need to copy state data from the parent shard to the child shard, and then will have to apply trie and flat state changes twice, once for the parent shard and once for the child shards. * Increased potential for apps and tools to break without proper shard layout change handling. ### Backwards Compatibility We do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as target shard of a receipt and belonging shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary. -## Unresolved Issues (Optional) - -[Explain any issues that warrant further discussion. Considerations - -* What parts of the design do you expect to resolve through the NEP process before this gets merged? -* What parts of the design do you expect to resolve through the implementation of this feature before stabilization? -* What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?] - ## Changelog [The changelog section provides historical context for how the NEP developed over time. Initial NEP submission should start with version 1.0.0, and all subsequent NEP extensions must follow [Semantic Versioning](https://semver.org/). Every version should have the benefits and concerns raised during the review. The author does not need to fill out this section for the initial draft. Instead, the assigned reviewers (Subject Matter Experts) should create the first version during the first technical review. After the final public call, the author should then finalize the last version of the decision context.] From bfe52dbbcacf7be46d246f0f7c827e20d219c5e5 Mon Sep 17 00:00:00 2001 From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com> Date: Tue, 14 Nov 2023 11:19:25 -0800 Subject: [PATCH 16/28] Update nep-0508.md minor changes --- neps/nep-0508.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index 9706f60d6..25266ade1 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -6,8 +6,8 @@ Status: Draft DiscussionsTo: https://github.com/near/nearcore/issues/8992 Type: Protocol Version: 1.0.0 -Created: 2022-09-19 -LastUpdated: 2023-09-19 +Created: 2023-09-19 +LastUpdated: 2023-11-14 --- ## Summary @@ -16,7 +16,7 @@ This proposal introduces a new implementation for resharding and a new shard lay In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards. -We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. The NEP includes performance improvement to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (To be specific, spliting shard 3 into two). +We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. 
The NEP includes performance improvement to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (To be specific, spliting the largest shard into two). While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseable future, it does not serve all usecases, such as dynamic resharding. @@ -36,13 +36,11 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star ### High level requirements -* Resharding should work even when validators stop tracking all shards. -* Resharding should work after stateless validation is enabled. -* Resharding should be fast enough so that both state sync and resharding can happen within one epoch. +* Resharding must be fast enough so that both state sync and resharding can happen within one epoch. * Resharding should work efficiently within the limits of the current hardware requirements for nodes. * Potential failures in resharding may require intervention from node operator to recover. -* No transaction or receipt should be lost during resharding. -* Resharding should work regardless of number of existing shards. +* No transaction or receipt must be lost during resharding. +* Resharding must work regardless of number of existing shards. * No apps, tools or code should hardcode the number of shards to 4. ### Out of scope @@ -84,7 +82,7 @@ The implementation heavily re-uses the implementation from [NEP-40](https://gith ### Flat Storage -The old implementaion of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting shard 3 by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain block production rate. +The old implementaion of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting the largest shard by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain block production rate. The new implementation will also propagate the flat storage for the children shards and keep it up to the with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then. @@ -148,6 +146,13 @@ The Stateless Validation requires that chunk producers provide proof of correctn In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately. +## Future fast-followups +### Resharding should work even when validators stop tracking all shards. 
+As mentioned above under the 'Integration with State Sync' section, initial release of resharding v2 will happen before the full implementation of state sync and we plan to tackle the integration between resharding and state sync after the next shard split (Won't need a separate NEP as the integration does not require a protocol change.)
+
+### Resharding should work after stateless validation is enabled.
+As mentioned above under the 'Integration with Stateless Validation' section, initial release of resharding v2 will happen before the full implementation of stateless validation and we plan to tackle the integration between resharding and stateless validation after the next shard split (May need a separate NEP depending on implementation detail.)
+
## Future possibilities

### Dynamic resharding

From 26a88a3bcc2e7391c851b6f5652e2293f0c553ee Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Wed, 15 Nov 2023 07:27:17 -0800
Subject: [PATCH 17/28] Update nep-0508.md

fix lint errors
---
 neps/nep-0508.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 25266ade1..798acbd8a 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -147,10 +147,13 @@ The Stateless Validation requires that chunk producers provide proof of correctn
 In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately.

 ## Future fast-followups
-### Resharding should work even when validators stop tracking all shards.
+
+### Resharding should work even when validators stop tracking all shards
+
 As mentioned above under the 'Integration with State Sync' section, initial release of resharding v2 will happen before the full implementation of state sync and we plan to tackle the integration between resharding and state sync after the next shard split (Won't need a separate NEP as the integration does not require a protocol change.)

-### Resharding should work after stateless validation is enabled.
+### Resharding should work after stateless validation is enabled
+
 As mentioned above under the 'Integration with Stateless Validation' section, initial release of resharding v2 will happen before the full implementation of stateless validation and we plan to tackle the integration between resharding and stateless validation after the next shard split (May need a separate NEP depending on implementation detail.)

 ## Future possibilities

From 295339162554ab575facf2420fb84819b11bac40 Mon Sep 17 00:00:00 2001
From: wacban
Date: Fri, 1 Dec 2023 10:54:33 +0000
Subject: [PATCH 18/28] Update nep-0508.md - per comments (#521)

---
 neps/nep-0508.md | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 798acbd8a..28a4ff006 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -50,7 +50,6 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star
 * automatically determining the shard layout
 * Merging shards or boundary adjustments
 * Shard reshuffling
-* Shard Layout determination logic. Shard boundaries are still determined offline and hardcoded.
### Required protocol changes @@ -75,10 +74,10 @@ The implementation heavily re-uses the implementation from [NEP-40](https://gith ### Code pointers to the proposed implementation -* [new shard layout](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/core/primitives/src/shard_layout.rs#L161) -* [the main logic for splitting states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/resharding.rs#L248) -* [the main logic for applying chunks to split states](https://github.com/near/nearcore/blob/bc0eb3c6607f4ae865526a25b80706ec4e081fdc/chain/chain/src/chain.rs#L5180) -* [the main logic for garbage collecting state from parent shard](https://github.com/near/nearcore/blob/eb824087fdd0763f00e695e303d7ad6c56f96538/chain/chain/src/store.rs#L2325) +* [new shard layout](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/core/primitives/src/shard_layout.rs#L161) +* [the main logic for splitting states](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/resharding.rs#L280) +* [the main logic for applying chunks to split states](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/update_shard.rs#L315) +* [the main logic for garbage collecting state from parent shard](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/store.rs#L2335) ### Flat Storage @@ -158,6 +157,10 @@ As mentioned above under 'Integration with Statelss Validation' section, initial ## Future possibilities +### Further reshardings + +This NEP introduces both an implementation of resharding and an actual resharding to be done in the production networks. Further reshardings can also be performed in the future by adding a new shard layout and setting the shard layout for the desired protocol version in the `AllEpochConfig`. + ### Dynamic resharding As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to: @@ -199,7 +202,9 @@ In the future, we would like to potentially change the schema in a way such that ### Backwards Compatibility -We do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as target shard of a receipt and belonging shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary. +Any tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended way for fixing it is querying an RPC node for the shard layout of the relevant epoch and using that information in place of the previously hardcoded shard layout or number of shards. + +Within nearcore we do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as target shard of a receipt and belonging shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary. 
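As a sketch of the recommended approach above, the snippet below asks a node for the protocol config and reads the shard layout from it instead of hardcoding the number of shards. It assumes the `EXPERIMENTAL_protocol_config` JSON-RPC method and a `shard_layout` field in the response; both should be verified against the current RPC schema. It also assumes the `reqwest` (with the blocking feature) and `serde_json` crates.

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let request = json!({
        "jsonrpc": "2.0",
        "id": "dontcare",
        "method": "EXPERIMENTAL_protocol_config",
        "params": { "finality": "final" }
    });
    let response: Value = reqwest::blocking::Client::new()
        .post("https://rpc.mainnet.near.org")
        .json(&request)
        .send()?
        .json()?;
    // Use the layout reported by the node instead of assuming 4 shards.
    println!("shard layout: {}", response["result"]["shard_layout"]);
    Ok(())
}
```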
## Changelog From 003fdef0e0b35de683e5ae2f7286f9e4b0b5efef Mon Sep 17 00:00:00 2001 From: wacban Date: Fri, 1 Dec 2023 11:41:42 +0000 Subject: [PATCH 19/28] Update nep-0508.md --- neps/nep-0508.md | 35 ++++++++++++++++++++++++++++++----- 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index 28a4ff006..d55d8428f 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -83,7 +83,7 @@ The implementation heavily re-uses the implementation from [NEP-40](https://gith The old implementaion of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting the largest shard by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain block production rate. -The new implementation will also propagate the flat storage for the children shards and keep it up to the with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then. +The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then. In order to ensure consistent view of the flat storage while splitting the state the node will maintain a snapshot of the flat state and related columns as of the last block of the epoch prior to resharding. The existing implementation of flat state snapshots used in State Sync will be used for this purpose. @@ -97,7 +97,16 @@ The first release of the resharding v2 will contain a new shard layout where one ### Removal of Fixed shards -Fixed shards was a feature of the protocol that allowed for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature was only used for testing and was never used in production. Fixed shards fature unfortunately breaks the contiguity of shards and is not compatible with the new resharding flow. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards made it particularly hard to reason about and implement efficient resharding. +Fixed shards was a feature of the protocol that allowed for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature was only used for testing and was never used in production. Fixed shards feature unfortunately breaks the contiguity of shards and is not compatible with the new resharding flow. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards made it particularly hard to reason about and implement efficient resharding. 
For example in a shard layout with boundary accounts [`b`, `d`] the account space is cleanly divided into three shards, each spanning a contiguous range of account ids:
* 0 - `:b`
* 1 - `b:d`
* 2 - `d:`

Now if we add a fixed shard `f` to the same shard layout, then we'll have 4 shards but none of them is contiguous. Accounts such as `aaa.f`, `ccc.f`, `eee.f` that would otherwise belong to shards 0, 1 and 2 respectively are now all assigned to the fixed shard and create holes in the shard account ranges.

It's also worth noting that there is no benefit to having accounts colocated in the same shard. Any transaction or receipts is treated the same way regardless of crossing shard boundary.

This was implemented ahead of this NEP and the fixed shards feature was **removed**.

@@ -119,6 +128,7 @@ This design is the simplest, most robust and safe while meeting all of the requi

### What other designs have been considered and what is the rationale for not choosing them?

+#### Alternative implementations
* Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted in for the much simpler one using flat storage given it is already quite performant.
* Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. This approach has two major drawbacks without providing too many benefits over the previous approach of splitting by each trie record type.
 1) It would require a massive migration of trie.
 2) We would need to maintain the old and the new trie structure forever.
* Changing the storage structure by having the storage key to have the format of `account_id.node_hash`. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible.
* Changing the storage structure by having the key format as only node_hash and dropping the ShardUId prefix. This is a feasible approach but it adds complexity to the garbage collection and data deletion, specially when nodes would start tracking only one shard. We opted in for the much simpler one by using the existing scheme of prefixing storage entries by shard uid.

+#### Other considerations
+* Dynamic Resharding - we have decided to not implement the full dynamic resharding at this time. Instead we hardcode the shard layout and schedule it manually. The reasons are as follows:
 + * We prefer incremental process of introducing resharding to make sure that it is robust and reliable, as well as give the community the time to adjust.
 + * Each resharding increases the potential total load on the system. We don't want to allow it to grow until full sharding is in place and we can handle that increase.
+* Extended shard layout adjustments - we have decided to only implement shard splitting and not implement any other operations. The reasons are as follows:
 + * In this iteration we only want to perform splitting.
 + * The extended adjustments are currently not justified. Both merging and boundary moving may be useful in the future when the traffic patterns change and some shard become underutilized. In the nearest future we only predict needing to reduce the size of the heaviest shards.
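For illustration, here is a minimal sketch of the existing scheme of prefixing storage entries by shard uid that the alternatives above refer to; the struct and byte layout are assumptions for illustration, not the exact nearcore definitions:

```rust
/// Illustrative stand-in for nearcore's ShardUId: the shard layout version
/// together with the shard id.
struct ShardUId {
    version: u32,
    shard_id: u32,
}

/// Every state entry is keyed by the ShardUId prefix followed by the trie
/// node hash, so each shard owns one contiguous key range in storage.
fn state_key(shard_uid: &ShardUId, node_hash: &[u8; 32]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 32);
    key.extend_from_slice(&shard_uid.version.to_le_bytes());
    key.extend_from_slice(&shard_uid.shard_id.to_le_bytes());
    key.extend_from_slice(node_hash);
    key
}

fn main() {
    let key = state_key(&ShardUId { version: 1, shard_id: 3 }, &[0u8; 32]);
    assert_eq!(key.len(), 40);
}
```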
+ ### What is the impact of not doing this? We need resharding in order to scale up the system. Without resharding eventually shards would grow so big (in either storage or cpu usage) that a single node would not be able to handle it. Additionally, this clears up the path to implement in-memory tries as we need to store the whole trie structure in limited RAM. In the future smaller shard size would lead to faster syncing of shard data when nodes start tracking just one shard. @@ -143,7 +161,7 @@ In this NEP we propose that resharding should be rolled out first, before any re The Stateless Validation requires that chunk producers provide proof of correctness of the transition function from one state root to another. That proof for the first block after the new shard layout takes place will need to prove that the entire state split was correct as well as the state transition. -In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately. +In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately. This issue was discussed with the stateless validation experts and we are cautiously optimistic that the integration will be possible. The most concerning part is the proof size and we believe that it should be small enough thanks to the resharding touching relatively small number of trie nodes - on the order of the depth of the trie. ## Future fast-followups @@ -166,7 +184,14 @@ This NEP introduces both an implementation of resharding and an actual reshardin As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to: * Automatic determination of split boundary based on parameters like traffic, gas usage, state size, etc. -* Automatic shard splitting and merging +* Automatic scheduling of resharding events + +### Extended shard layout adjustments + +In this NEP we only propose supporting splitting shards. This operation should be more than sufficient for the near future but eventually we may want to add support for more sophisticated adjustments such as: + +* Merging shards together +* Moving the boundary account between two shards ### Localization of resharding event to specific shard @@ -193,10 +218,10 @@ In the future, we would like to potentially change the schema in a way such that * Number of shards would increase. * Underlying trie structure and data structure are not going to change. * Resharding will create dependency on flat state snapshots. +* The resharding process, as of now, is not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated and tracked. ### Negative -* The resharding process, as of now, is not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated and tracked. * During resharding, a node is expected to require more resources as it will first need to copy state data from the parent shard to the child shard, and then will have to apply trie and flat state changes twice, once for the parent shard and once for the child shards. * Increased potential for apps and tools to break without proper shard layout change handling. 
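As a side note on the split, merge and boundary-move operations discussed above, each reduces to a simple edit of the boundary account list under a boundary-account based layout. This is a hedged sketch for intuition only, not part of the implementation:

```rust
fn split_shard(mut boundaries: Vec<String>, new_boundary: String) -> Vec<String> {
    // Splitting a shard inserts one new boundary account.
    boundaries.push(new_boundary);
    boundaries.sort();
    boundaries
}

fn merge_shards(boundaries: Vec<String>, removed: &str) -> Vec<String> {
    // Removing a boundary merges the two shards adjacent to it.
    boundaries.into_iter().filter(|b| b.as_str() != removed).collect()
}

fn main() {
    let layout = vec!["b".to_string(), "d".to_string()];
    let split = split_shard(layout.clone(), "c".to_string());
    assert_eq!(split, vec!["b", "c", "d"]);
    // Moving a boundary is a merge followed by a split.
    let moved = split_shard(merge_shards(layout, "d"), "e".to_string());
    assert_eq!(moved, vec!["b", "e"]);
}
```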
From 3f8d42525bd59b8ff6193ca54a5a4d311fbdbe58 Mon Sep 17 00:00:00 2001 From: wacban Date: Fri, 1 Dec 2023 11:45:38 +0000 Subject: [PATCH 20/28] Update nep-0508.md --- neps/nep-0508.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index d55d8428f..561655184 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -227,7 +227,7 @@ In the future, we would like to potentially change the schema in a way such that ### Backwards Compatibility -Any tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended way for fixing it is querying an RPC node for the shard layout of the relevant epoch and using that information in place of the previously hardcoded shard layout or number of shards. +Any light clients, tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended way for fixing it is querying an RPC node for the shard layout of the relevant epoch and using that information in place of the previously hardcoded shard layout or number of shards. Within nearcore we do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as target shard of a receipt and belonging shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary. From 2aa8dbe3c905b3231ef9885f504848d67e2d8678 Mon Sep 17 00:00:00 2001 From: wacban Date: Fri, 1 Dec 2023 11:47:50 +0000 Subject: [PATCH 21/28] lints --- neps/nep-0508.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index 561655184..37c755734 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -100,6 +100,7 @@ The first release of the resharding v2 will contain a new shard layout where one Fixed shards was a feature of the protocol that allowed for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature was only used for testing and was never used in production. Fixed shards feature unfortunately breaks the contiguity of shards and is not compatible with the new resharding flow. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards made it particularly hard to reason about and implement efficient resharding. For example in a shard layout with boundary accounts [`b`, `d`] the account space is cleanly divided into three shards, each spanning a contiguous range and account ids: + * 0 - `:b` * 1 - `b:d` * 2 - `d:` @@ -129,6 +130,7 @@ This design is the simplest, most robust and safe while meeting all of the requi ### What other designs have been considered and what is the rationale for not choosing them? #### Alternative implementations + * Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted in for the much simpler one using flat storage given it is already quite performant. * Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. 
This approach has two major drawbacks without providing too many benefits over the previous approach of splitting by each trie record type.
 1) It would require a massive migration of trie.
 2) We would need to maintain the old and the new trie structure forever.
 * Changing the storage structure by having the storage key to have the format of `account_id.node_hash`. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible.
 * Changing the storage structure by having the key format as only node_hash and dropping the ShardUId prefix. This is a feasible approach but it adds complexity to the garbage collection and data deletion, specially when nodes would start tracking only one shard. We opted in for the much simpler one by using the existing scheme of prefixing storage entries by shard uid.

 #### Other considerations
+
 * Dynamic Resharding - we have decided to not implement the full dynamic resharding at this time. Instead we hardcode the shard layout and schedule it manually. The reasons are as follows:
  * We prefer incremental process of introducing resharding to make sure that it is robust and reliable, as well as give the community the time to adjust.
  * Each resharding increases the potential total load on the system. We don't want to allow it to grow until full sharding is in place and we can handle that increase.

From f447163929462fb144cfcc4e16296320da0d9c88 Mon Sep 17 00:00:00 2001
From: walnut-the-cat <122475853+walnut-the-cat@users.noreply.github.com>
Date: Fri, 1 Dec 2023 09:24:31 -0800
Subject: [PATCH 22/28] Update nep-0508.md

added explanation on potential features that can be introduced with
resharding
---
 neps/nep-0508.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 37c755734..b7af1e49f 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -203,10 +203,9 @@ As of today, at the RocksDB storage layer, we have the ShardUId, i.e. the ShardI

 In the future, we would like to potentially change the schema in a way such that only the shard that is splitting is impacted by a resharding event, so as to avoid additional work done by nodes tracking other shards.
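To illustrate the remapping described above, here is a small sketch of how bumping the ShardVersion assigns fresh ShardUIds, and therefore fresh storage key prefixes, to the shards of the new layout; the types and numbers are illustrative:

```rust
/// Illustrative only: ShardUId combines the shard layout version with the
/// shard id and serves as the storage key prefix described above.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ShardUId {
    version: u32,
    shard_id: u32,
}

/// On a resharding event the layout version is bumped, so every shard, split
/// or not, ends up under a new ShardUId prefix and its data has to be
/// duplicated under the new prefix.
fn child_shard_uids(parent: ShardUId, child_ids: &[u32]) -> Vec<ShardUId> {
    child_ids
        .iter()
        .map(|&shard_id| ShardUId { version: parent.version + 1, shard_id })
        .collect()
}

fn main() {
    // E.g. shard 3 of layout version 1 maps to shards 3 and 4 of version 2.
    let parent = ShardUId { version: 1, shard_id: 3 };
    let children = child_shard_uids(parent, &[3, 4]);
    assert_eq!(children[0], ShardUId { version: 2, shard_id: 3 });
    assert_eq!(children[1], ShardUId { version: 2, shard_id: 4 });
}
```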
### Other useful features + * Removal of shard uids and introducing globally unique shard ids * Account colocation for low latency across account call - In case we start considering synchronous execution environment, colocating associated accounts (e.g. cross contract call between them) in the same shard can increase the efficiency * Shard purchase/reservation - When someone wants to secure entirety of limitation on a single shard (e.g. state size limit), they can 'purchase/reserve' a shard so it can be dedicated for them (similar to how Aurora is set up) From d372e54c20f855557822910cff47f627db507a90 Mon Sep 17 00:00:00 2001 From: wacban Date: Thu, 7 Dec 2023 10:24:09 +0000 Subject: [PATCH 24/28] Apply suggestions from code review by mfornet Co-authored-by: Marcelo Fornet --- neps/nep-0508.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index 3175216b2..5e44547cd 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -16,9 +16,9 @@ This proposal introduces a new implementation for resharding and a new shard lay In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards. -We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. The NEP includes performance improvement to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (To be specific, spliting the largest shard into two). +We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. The NEP includes performance improvement to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (To be specific, splitting the largest shard into two). -While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseable future, it does not serve all usecases, such as dynamic resharding. +While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding. ## Motivation @@ -30,8 +30,8 @@ Currently, NEAR protocol has four shards. With more partners onboarding, we star * Flat storage is enabled. * Shard split boundary is predetermined and hardcoded. In other words, necessity of shard splitting is manually decided. -* For the time being resharding as an event is only going to happen once but we would still like to have the infrastrcture in place to handle future resharding events with ease. -* Merkle Patricia Trie is the undelying data structure for the protocol state. +* For the time being resharding as an event is only going to happen once but we would still like to have the infrastructure in place to handle future resharding events with ease. +* Merkle Patricia Trie is the underlying data structure for the protocol state. * Epoch is at least 6 hrs long for resharding to complete. ### High level requirements @@ -62,7 +62,7 @@ A new protocol version will be introduced specifying the new shard layout which ### Resharding flow -* The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation. 
+* The new shard layout will be agreed on offline by the protocol team and hardcoded in the reference implementation.
* In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout.
* In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout.
* In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end, first via catchup and later as part of block postprocessing state application.
@@ -81,7 +81,7 @@ The implementation heavily re-uses the implementation from [NEP-40](https://gith

### Flat Storage

-The old implementaion of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting the largest shard by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain block production rate.
+The old implementation of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time, but since then the state has grown considerably and it is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards more quickly. Based on benchmarks, splitting the largest shard by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain the block production rate.

The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then.

In order to ensure consistent view of the flat storage while splitting the state
@@ -89,11 +89,11 @@

### Handling receipts, gas burnt and balance burnt

-When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correclty handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts to the child shards, gas burnt, and balance burnt, that works for arbitrary splitting of shards, regardless of the previous shard layout.
+When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correctly handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact that in the first resharding there was only a single parent shard to begin with.
The new implementation will provide a more generic and robust way of reassigning the receipts, gas burnt, and balance burnt to the child shards, one that works for arbitrary splitting of shards, regardless of the previous shard layout.

### New shard layout

-The first release of the resharding v2 will contain a new shard layout where one of the existing shards will be split into two smaller shards. Furthermore additional reshardings can be scheduled in subsequent neard releases without additional NEPs unless the need for it arises. A new shard layout can be determined and will be scheduled and executed with the next protocol upgrade. Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account that will be determined by analysing the storage and gas usage metrics within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used based on requirements.
+The first release of resharding v2 will contain a new shard layout where one of the existing shards will be split into two smaller shards. Furthermore, additional reshardings can be scheduled in subsequent releases without additional NEPs unless the need for one arises. A new shard layout can be determined and will be scheduled and executed with the next protocol upgrade. Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account, determined by analysing the storage and gas usage metrics within the shard and selecting a point that divides the shard roughly in half in accordance with those metrics. Other metrics can also be used based on requirements.

### Removal of Fixed shards

@@ -107,7 +107,7 @@ For example in a shard layout with boundary accounts [`b`, `d`] the account spac

Now if we add a fixed shard `f` to the same shard layout, we'll have 4 shards but none of them is contiguous. Accounts such as `aaa.f`, `ccc.f`, `eee.f` that would otherwise belong to shards 0, 1 and 2 respectively are now all assigned to the fixed shard and create holes in the shard account ranges.

-It's also worth noting that there is no benefit to having accounts colocated in the same shard. Any transaction or receipts is treated the same way regardless of crossing shard boundary.
+It's also worth noting that there is no benefit to having accounts colocated in the same shard. Any transaction or receipt is treated the same way regardless of whether it crosses a shard boundary.

This was implemented ahead of this NEP and the fixed shards feature was **removed**.

@@ -125,7 +125,7 @@ This was implemented ahead of this NEP and the transaction pool is now fully **m

### Why is this design the best in the space of possible designs?

-This design is the simplest, most robust and safe while meeting all of the requirements.
+This design is simple, robust, safe, and meets all requirements.

### What other designs have been considered and what is the rationale for not choosing them?

@@ -158,7 +158,7 @@ There are two known issues in the integration of resharding and state sync:

* When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch.
It cannot be done on the children shards, as on chain the block was applied on the parent shards and the trie-related gas costs would be different.
* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of the shard, but it is generated on the parent shard layout and so may not be verifiable.

-In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above mentioned issues separately. We believe atleast some of the issues can be mitigated by the implementation of new pre-state root and chunk execution design.
+In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above-mentioned issues separately. We believe at least some of the issues can be mitigated by the implementation of the new pre-state root and chunk execution design.

## Integration with Stateless Validation

@@ -174,7 +174,7 @@ As mentioned above under 'Integration with State Sync' section, initial release

### Resharding should work after stateless validation is enabled

-As mentioned above under 'Integration with Statelss Validation' section, initial release of resharding v2 will happen before the full implementation of stateless validation and we plan to tackle the integration between resharding and stateless validation after the next shard split (May need a separate NEP depending on implemetnation detail.)
+As mentioned above in the 'Integration with Stateless Validation' section, the initial release of resharding v2 will happen before the full implementation of stateless validation, and we plan to tackle the integration between resharding and stateless validation after the next shard split (this may need a separate NEP depending on implementation details).

## Future possibilities

From be0ce88ff3e57a8e589ff4f6f6ecf826223f18f5 Mon Sep 17 00:00:00 2001
From: wacban
Date: Mon, 11 Dec 2023 15:09:40 +0000
Subject: [PATCH 25/28] Update nep-0508.md

added storage overhead numbers

---
 neps/nep-0508.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index 5e44547cd..97bc8c408 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -59,6 +59,7 @@ A new protocol version will be introduced specifying the new shard layout which

* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns. As the main database and the snapshot diverge, this will cause some storage overhead.
* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time. The State and FlatState columns will grow up to approximately 2x in size. The processing overhead should be minimal as the chunks will still be executed only on the parent shards. There will be increased load on the database while applying changes to both the parent and the children shards.
+* The total storage overhead is estimated to be on the order of 100GB for mainnet RPC nodes and 2TB for mainnet archival nodes. For testnet the overhead is expected to be much smaller.
### Resharding flow From 95b6d29a2aed07284113fae6d3018242e4a156bb Mon Sep 17 00:00:00 2001 From: wacban Date: Tue, 12 Dec 2023 10:14:50 +0000 Subject: [PATCH 26/28] added the new shard layout and a note that more reshardings will happen --- neps/nep-0508.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index 97bc8c408..b9b84fc32 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -64,6 +64,8 @@ A new protocol version will be introduced specifying the new shard layout which ### Resharding flow * The new shard layout will be agreed on offline by the protocol team and hardcoded in the reference implementation. + * The first resharding will be scheduled soon after this NEP is merged. The new shard layout boundary accounts will be: ["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]. + * Subsequent reshardings will be scheduled as needed, without further NEPs, unless significant changes are introduced. * In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout. * In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout. * In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end first via catchup, and later as part of block postprocessing state application. From b2ceb5ea59a51b542de29b9d72382e0f10c8c6f7 Mon Sep 17 00:00:00 2001 From: wacban Date: Tue, 12 Dec 2023 10:15:43 +0000 Subject: [PATCH 27/28] better formatting --- neps/nep-0508.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/neps/nep-0508.md b/neps/nep-0508.md index b9b84fc32..d79f69fc2 100644 --- a/neps/nep-0508.md +++ b/neps/nep-0508.md @@ -64,7 +64,7 @@ A new protocol version will be introduced specifying the new shard layout which ### Resharding flow * The new shard layout will be agreed on offline by the protocol team and hardcoded in the reference implementation. - * The first resharding will be scheduled soon after this NEP is merged. The new shard layout boundary accounts will be: ["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]. + * The first resharding will be scheduled soon after this NEP is merged. The new shard layout boundary accounts will be: ```["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]```. * Subsequent reshardings will be scheduled as needed, without further NEPs, unless significant changes are introduced. * In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout. * In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout. 
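For illustration, here is a sketch of how the boundary accounts listed above partition the account space into five shards. The assignment rule shown (lexicographic comparison, with a boundary account opening the shard to its right) is an assumption consistent with the boundary-account description in this NEP, not a verbatim copy of the nearcore implementation:

```rust
// With N boundary accounts the layout has N + 1 shards. An account belongs to
// shard i, where i is the number of boundary accounts that are
// lexicographically less than or equal to the account id.
fn account_to_shard_id(account_id: &str, boundaries: &[&str]) -> usize {
    boundaries.iter().filter(|&&boundary| boundary <= account_id).count()
}

fn main() {
    let boundaries = ["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"];
    assert_eq!(account_to_shard_id("aaa.near", &boundaries), 0);
    assert_eq!(account_to_shard_id("aurora", &boundaries), 1); // a boundary account opens a new shard
    assert_eq!(account_to_shard_id("bob.near", &boundaries), 2);
    assert_eq!(account_to_shard_id("sweat_welcome.near", &boundaries), 3);
    assert_eq!(account_to_shard_id("zzz.near", &boundaries), 4);
}
```

Because the rule is a plain lexicographic range check, every shard remains a contiguous range of the account space, in line with the removal of fixed shards discussed earlier.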
From 4469e45ae4870b0020ea21909f799be409d3fe04 Mon Sep 17 00:00:00 2001
From: wacban
Date: Tue, 12 Dec 2023 11:33:26 +0000
Subject: [PATCH 28/28] Added info about rpc to query the shard layout

---
 neps/nep-0508.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/neps/nep-0508.md b/neps/nep-0508.md
index d79f69fc2..36fffe5ce 100644
--- a/neps/nep-0508.md
+++ b/neps/nep-0508.md
@@ -233,7 +233,7 @@ In the future, we would like to potentially change the schema in a way such that

### Backwards Compatibility

-Any light clients, tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended way for fixing it is querying an RPC node for the shard layout of the relevant epoch and using that information in place of the previously hardcoded shard layout or number of shards.
+Any light clients, tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended fix is to query an RPC node for the shard layout of the relevant epoch and to use that information in place of the previously hardcoded shard layout or number of shards. The shard layout can be queried by using the `EXPERIMENTAL_protocol_config` RPC endpoint and reading the `shard_layout` field from the result. A dedicated endpoint may be added in the future as well.

Within nearcore we do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity around replayability. For instance, as the target shard of a receipt and the shard an account belongs to can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary.
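To illustrate the recommended mitigation from the Backwards Compatibility section, here is a minimal sketch of querying the shard layout over JSON-RPC. The `EXPERIMENTAL_protocol_config` method and the `shard_layout` result field are named in this NEP; the remaining request and response details follow standard NEAR JSON-RPC conventions and should be treated as assumptions:

```rust
// Sketch: build the request for EXPERIMENTAL_protocol_config and read the
// shard layout from the result instead of hardcoding it.
use serde_json::{json, Value};

fn shard_layout_request() -> Value {
    json!({
        "jsonrpc": "2.0",
        "id": "dontcare",
        "method": "EXPERIMENTAL_protocol_config",
        // Config as of the latest final block; a block reference could be
        // used instead to obtain the layout of a historical epoch.
        "params": { "finality": "final" }
    })
}

fn shard_layout(response: &Value) -> Option<&Value> {
    // The exact structure inside `shard_layout` is version dependent, so it
    // is returned here as raw JSON.
    response.pointer("/result/shard_layout")
}

fn main() {
    // POST `shard_layout_request()` to an RPC node of the relevant network
    // and pass the parsed response to `shard_layout`.
    println!("{}", shard_layout_request());
}
```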