diff --git a/neps/nep-0508.md b/neps/nep-0508.md new file mode 100644 index 000000000..36fffe5ce --- /dev/null +++ b/neps/nep-0508.md @@ -0,0 +1,267 @@ +--- +NEP: 508 +Title: Resharding v2 +Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong +Status: Draft +DiscussionsTo: https://github.com/near/nearcore/issues/8992 +Type: Protocol +Version: 1.0.0 +Created: 2023-09-19 +LastUpdated: 2023-11-14 +--- + +## Summary + +This proposal introduces a new implementation for resharding and a new shard layout for the production networks. + +In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards. + +We are introducing resharding v2, which supports one shard splitting into two within one epoch at a pre-determined split boundary. The NEP includes performance improvement to make resharding feasible under the current state as well as actual resharding in mainnet and testnet (To be specific, splitting the largest shard into two). + +While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for foreseeable future, it does not serve all use cases, such as dynamic resharding. + +## Motivation + +Currently, NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become over-crowded with respect to total state size and number of transactions. In addition, with state sync and stateless validation, validators will not need to track all shards and validator hardware requirements can be greatly reduced with smaller shard size. With future in-memory tries, it's also important to limit the size of individual shards. + +## Specification + +### High level assumptions + +* Flat storage is enabled. +* Shard split boundary is predetermined and hardcoded. In other words, necessity of shard splitting is manually decided. +* For the time being resharding as an event is only going to happen once but we would still like to have the infrastructure in place to handle future resharding events with ease. +* Merkle Patricia Trie is the underlying data structure for the protocol state. +* Epoch is at least 6 hrs long for resharding to complete. + +### High level requirements + +* Resharding must be fast enough so that both state sync and resharding can happen within one epoch. +* Resharding should work efficiently within the limits of the current hardware requirements for nodes. +* Potential failures in resharding may require intervention from node operator to recover. +* No transaction or receipt must be lost during resharding. +* Resharding must work regardless of number of existing shards. +* No apps, tools or code should hardcode the number of shards to 4. + +### Out of scope + +* Dynamic resharding + * automatically scheduling resharding based on shard usage/capacity + * automatically determining the shard layout +* Merging shards or boundary adjustments +* Shard reshuffling + +### Required protocol changes + +A new protocol version will be introduced specifying the new shard layout which would be picked up by the resharding logic to split the shard. + +### Required state changes + +* For the duration of the resharding the node will need to maintain a snapshot of the flat state and related columns. As the main database and the snapshot diverge this will cause some extent of storage overhead. +* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in the old and new layout at the same time. The State and FlatState columns will grow up to approx 2x the size. The processing overhead should be minimal as the chunks will still be executed only on the parent shards. There will be increased load on the database while applying changes to both the parent and the children shards. +* The total storage overhead is estimated to be on the order of 100GB for mainnet RPC nodes and 2TB for mainnet archival nodes. For testnet the overhead is expected to be much smaller. + +### Resharding flow + +* The new shard layout will be agreed on offline by the protocol team and hardcoded in the reference implementation. + * The first resharding will be scheduled soon after this NEP is merged. The new shard layout boundary accounts will be: ```["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]```. + * Subsequent reshardings will be scheduled as needed, without further NEPs, unless significant changes are introduced. +* In epoch T, past the protocol version upgrade date, nodes will vote to switch to the new protocol version. The new protocol version will contain the new shard layout. +* In epoch T, in the last block of the epoch, the EpochConfig for epoch T+2 will be set. The EpochConfig for epoch T+2 will have the new shard layout. +* In epoch T + 1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain up until the epoch end first via catchup, and later as part of block postprocessing state application. +* In epoch T + 2, the chain will switch to the new shard layout. + +## Reference Implementation + +The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed the major differences and additions. + +### Code pointers to the proposed implementation + +* [new shard layout](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/core/primitives/src/shard_layout.rs#L161) +* [the main logic for splitting states](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/resharding.rs#L280) +* [the main logic for applying chunks to split states](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/update_shard.rs#L315) +* [the main logic for garbage collecting state from parent shard](https://github.com/near/nearcore/blob/c9836ab5b05c229da933d451fe8198d781f40509/chain/chain/src/store.rs#L2335) + +### Flat Storage + +The old implementation of resharding relied on iterating over the full trie state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time but since then the state has grown considerably and this implementation is now too slow to fit within a single epoch. The new implementation relies on iterating through the flat storage in order to build the children shards quicker. Based on benchmarks, splitting the largest shard by using flat storage can take around 15 min without throttling and around 3 hours with throttling to maintain the block production rate. + +The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout in the next epoch. The old implementation didn't handle this case because the flat storage didn't exist back then. + +In order to ensure consistent view of the flat storage while splitting the state the node will maintain a snapshot of the flat state and related columns as of the last block of the epoch prior to resharding. The existing implementation of flat state snapshots used in State Sync will be used for this purpose. + +### Handling receipts, gas burnt and balance burnt + +When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be correctly handled. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts to the child shards, gas burnt, and balance burnt, that works for arbitrary splitting of shards, regardless of the previous shard layout. + +### New shard layout + +The first release of the resharding v2 will contain a new shard layout where one of the existing shards will be split into two smaller shards. Furthermore additional reshardings can be scheduled in subsequent releases without additional NEPs unless the need for it arises. A new shard layout can be determined and will be scheduled and executed with the next protocol upgrade. Resharding will typically happen by splitting one of the existing shards into two smaller shards. The new shard layout will be created by adding a new boundary account that will be determined by analysing the storage and gas usage metrics within the shard and selecting a point that will divide the shard roughly in half in accordance to the mentioned metrics. Other metrics can also be used based on requirements. + +### Removal of Fixed shards + +Fixed shards was a feature of the protocol that allowed for assigning specific accounts and all of their recursive sub accounts to a predetermined shard. This feature was only used for testing and was never used in production. Fixed shards feature unfortunately breaks the contiguity of shards and is not compatible with the new resharding flow. A sub account of a fixed shard account can fall in the middle of account range that belongs to a different shard. This property of fixed shards made it particularly hard to reason about and implement efficient resharding. + +For example in a shard layout with boundary accounts [`b`, `d`] the account space is cleanly divided into three shards, each spanning a contiguous range and account ids: + +* 0 - `:b` +* 1 - `b:d` +* 2 - `d:` + +Now if we add a fixed shard `f` to the same shard layout, then any we'll have 4 shards but neither is contiguous. Accounts such as `aaa.f`, `ccc.f`, `eee.f` that would otherwise belong to shards 0, 1 and 2 respectively are now all assigned to the fixed shard and create holes in the shard account ranges. + +It's also worth noting that there is no benefit to having accounts colocated in the same shard. Any transaction or receipt is treated the same way regardless of crossing shard boundary. + +This was implemented ahead of this NEP and the fixed shards feature was **removed**. + +### Garbage collection + +In epoch T+2 once resharding is completed, we can delete the trie state and the flat state related to the parent shard. In practice, this is handled as part of the garbage collection code. While garbage collecting the last block of epoch T+1, we go ahead and clear all the data associated with the parent shard from the trie cache, flat storage, and RocksDB state associated with trie state and flat storage. + +### Transaction pool + +The transaction pool is sharded i.e. it groups transactions by the shard where each transaction should be converted to a receipt. The transaction pool was previously sharded by the ShardId. Unfortunately ShardId is insufficient to correctly identify a shard across a resharding event as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead, and a transaction pool resharding was implemented to reassign transaction from parent shard to children shards right before the new shard layout takes effect. The ShardUId contains the version of the shard layout which allows differentiating between shards in different shard layouts. + +This was implemented ahead of this NEP and the transaction pool is now fully **migrated** to ShardUId. + +## Alternatives + +### Why is this design the best in the space of possible designs? + +This design is simple, robust, safe, and meets all requirements. + +### What other designs have been considered and what is the rationale for not choosing them? + +#### Alternative implementations + +* Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster but it is more complex and it would take longer to implement. We opted in for the much simpler one using flat storage given it is already quite performant. +* Changing the trie structure to have the account id first and type of record later. This change would allow for much faster resharding by only iterating over the nodes on the boundary. This approach has two major drawbacks without providing too many benefits over the previous approach of splitting by each trie record type. + 1) It would require a massive migration of trie. + 2) We would need to maintain the old and the new trie structure forever. +* Changing the storage structure by having the storage key to have the format of `account_id.node_hash`. This structure would make it much easier to split the trie on storage level because the children shards are simple sub-ranges of the parent shard. Unfortunately we found that the migration would not be feasible. +* Changing the storage structure by having the key format as only node_hash and dropping the ShardUId prefix. This is a feasible approach but it adds complexity to the garbage collection and data deletion, specially when nodes would start tracking only one shard. We opted in for the much simpler one by using the existing scheme of prefixing storage entries by shard uid. + +#### Other considerations + +* Dynamic Resharding - we have decided to not implement the full dynamic resharding at this time. Instead we hardcode the shard layout and schedule it manually. The reasons are as follows: + * We prefer incremental process of introducing resharding to make sure that it is robust and reliable, as well as give the community the time to adjust. + * Each resharding increases the potential total load on the system. We don't want to allow it to grow until full sharding is in place and we can handle that increase. +* Extended shard layout adjustments - we have decided to only implement shard splitting and not implement any other operations. The reasons are as follows: + * In this iteration we only want to perform splitting. + * The extended adjustments are currently not justified. Both merging and boundary moving may be useful in the future when the traffic patterns change and some shard become underutilized. In the nearest future we only predict needing to reduce the size of the heaviest shards. + +### What is the impact of not doing this? + +We need resharding in order to scale up the system. Without resharding eventually shards would grow so big (in either storage or cpu usage) that a single node would not be able to handle it. Additionally, this clears up the path to implement in-memory tries as we need to store the whole trie structure in limited RAM. In the future smaller shard size would lead to faster syncing of shard data when nodes start tracking just one shard. + +## Integration with State Sync + +There are two known issues in the integration of resharding and state sync: + +* When syncing the state for the first epoch where the new shard layout is used. In this case the node would need to apply the last block of the previous epoch. It cannot be done on the children shard as on chain the block was applied on the parent shards and the trie related gas costs would be different. +* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of the shard but it's generated on the parent shard layout and so may not be verified. + +In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above mentioned issues separately. We believe at least some of the issues can be mitigated by the implementation of new pre-state root and chunk execution design. + +## Integration with Stateless Validation + +The Stateless Validation requires that chunk producers provide proof of correctness of the transition function from one state root to another. That proof for the first block after the new shard layout takes place will need to prove that the entire state split was correct as well as the state transition. + +In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above mentioned issues separately. This issue was discussed with the stateless validation experts and we are cautiously optimistic that the integration will be possible. The most concerning part is the proof size and we believe that it should be small enough thanks to the resharding touching relatively small number of trie nodes - on the order of the depth of the trie. + +## Future fast-followups + +### Resharding should work even when validators stop tracking all shards + +As mentioned above under 'Integration with State Sync' section, initial release of resharding v2 will happen before the full implementation of state sync and we plan to tackle the integration between resharding and state sync after the next shard split (Won't need a separate NEP as the integration does not require protocol change.) + +### Resharding should work after stateless validation is enabled + +As mentioned above under 'Integration with Stateless Validation' section, the initial release of resharding v2 will happen before the full implementation of stateless validation and we plan to tackle the integration between resharding and stateless validation after the next shard split (May need a separate NEP depending on implementation detail.) + +## Future possibilities + +### Further reshardings + +This NEP introduces both an implementation of resharding and an actual resharding to be done in the production networks. Further reshardings can also be performed in the future by adding a new shard layout and setting the shard layout for the desired protocol version in the `AllEpochConfig`. + +### Dynamic resharding + +As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes the following but not limited to: + +* Automatic determination of split boundary based on parameters like traffic, gas usage, state size, etc. +* Automatic scheduling of resharding events + +### Extended shard layout adjustments + +In this NEP we only propose supporting splitting shards. This operation should be more than sufficient for the near future but eventually we may want to add support for more sophisticated adjustments such as: + +* Merging shards together +* Moving the boundary account between two shards + +### Localization of resharding event to specific shard + +As of today, at the RocksDB storage layer, we have the ShardUId, i.e. the ShardId along with the ShardVersion, as a prefix in the key of trie state and flat state. During a resharding event, we increment the ShardVersion by one, and effectively remap all the current parent shards to new child shards. This implies we can't use the same underlying key value pairs for store and instead would need to duplicate the values with the new ShardUId prefix, even if a shard is unaffected and not split. + +In the future, we would like to potentially change the schema in a way such that only the shard that is splitting is impacted by a resharding event, so as to avoid additonal work done by nodes tracking other shards. + +### Other useful features + +* Removal of shard uids and introducing globally unique shard ids +* Account colocation for low latency across account call - In case we start considering synchronous execution environment, colocating associated accounts (e.g. cross contract call between them) in the same shard can increase the efficiency +* Shard purchase/reservation - When someone wants to secure entirety of limitation on a single shard (e.g. state size limit), they can 'purchase/reserve' a shard so it can be dedicated for them (similar to how Aurora is set up) + +## Consequences + +### Positive + +* Workload across shards will be more evenly distributed. +* Required space to maintain state (either in memory or in persistent disk) will be smaller. This is useful for in-memory tries. +* State sync overhead will be smaller with smaller state size. + +### Neutral + +* Number of shards would increase. +* Underlying trie structure and data structure are not going to change. +* Resharding will create dependency on flat state snapshots. +* The resharding process, as of now, is not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be manually curated and tracked. + +### Negative + +* During resharding, a node is expected to require more resources as it will first need to copy state data from the parent shard to the child shard, and then will have to apply trie and flat state changes twice, once for the parent shard and once for the child shards. +* Increased potential for apps and tools to break without proper shard layout change handling. + +### Backwards Compatibility + +Any light clients, tooling or frameworks external to nearcore that have the current shard layout or the current number of shards hardcoded may break and will need to be adjusted in advance. The recommended way for fixing it is querying an RPC node for the shard layout of the relevant epoch and using that information in place of the previously hardcoded shard layout or number of shards. The shard layout can be queried by using the `EXPERIMENTAL_protocol_config` rpc endpoint and reading the `shard_layout` field from the result. A dedicated endpoint may be added in the future as well. + +Within nearcore we do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity on replayability. For instance, as target shard of a receipt and belonging shard of an account can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary. + +## Changelog + +[The changelog section provides historical context for how the NEP developed over time. Initial NEP submission should start with version 1.0.0, and all subsequent NEP extensions must follow [Semantic Versioning](https://semver.org/). Every version should have the benefits and concerns raised during the review. The author does not need to fill out this section for the initial draft. Instead, the assigned reviewers (Subject Matter Experts) should create the first version during the first technical review. After the final public call, the author should then finalize the last version of the decision context.] + +### 1.0.0 - Initial Version + +> Placeholder for the context about when and who approved this NEP version. + +#### Benefits + +> List of benefits filled by the Subject Matter Experts while reviewing this version: + +* Benefit 1 +* Benefit 2 + +#### Concerns + +> Template for Subject Matter Experts review for this version: +> Status: New | Ongoing | Resolved + +| # | Concern | Resolution | Status | +| --: | :------ | :--------- | -----: | +| 1 | | | | +| 2 | | | | + +## Copyright + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).