[ProjectTracking]: Stateless validation Mainnet Release #46

Open
27 of 52 tasks
Longarithm opened this issue Feb 2, 2024 · 12 comments
Comments

@Longarithm
Member

Longarithm commented Feb 2, 2024

Goals

Stateless validation is an alternative design for phase 2 of sharding that does not require state rollback on mainnet. It involves significant changes to several parts of the protocol. The goal is to deliver it to mainnet, using StatelessNet to iterate on the design.

See also #5.

All open issues for stateless validation are listed here.

Tasks

Ordered by priority. With all this completed, stateless validation can be released.
The work also includes monitoring and fixing issues that appear in StatelessNet.

Timeline view

Task list

Side tasks

Links to external documentation and discussions

Estimated effort

30 weeks on Core team side

Assumptions

  • Community validators participate in StatelessNet which helps to ensure that stateless validation works properly.

Pre-requisites

Release order

Stateless validation is anticipated to be the most significant launch since the Near MainNet rollout, and it's crucial to keep it isolated from other feature releases. To ensure a smooth rollout, I'd like to gather input from the team on what absolutely needs to be deployed to MainNet prior to Stateless validation.

Currently, these are the projects that should launch on MainNet before stateless validation (ref), because they require changes to the trie structure:

  • Congestion control
  • Yield/resume

Decisions made

  • More than 2/3 of endorsements are required for a chunk to be included ref @mooori
  • State witness size hard limit is: TBD @jancionear
  • State witness size soft limit is: TBD @jancionear
  • Top 100 validators will be both BP and CP. The remaining validators will be chunk validators. @birchmd
  • Shard shuffling will be disabled during the initial release ref @bowenwang1996
  • Cloud-based state sync will be used for the initial launch ref @tayfunelmas
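
The endorsement threshold and the role split above can be sketched as follows. This is a minimal illustration, not the nearcore implementation. It assumes that endorsements are weighted by stake and that validators are ranked by stake; the function names and role labels are hypothetical. Whether the top 100 validators also act as chunk validators is not modeled here.

```python
from typing import Dict, List, Set


def chunk_has_enough_endorsements(stakes: Dict[str, int], endorsers: Set[str]) -> bool:
    """Return True if strictly more than 2/3 of the stake endorsed the chunk.

    Assumption: endorsements are stake-weighted. Integer arithmetic avoids
    floating-point rounding at the exact 2/3 boundary.
    """
    total = sum(stakes.values())
    endorsed = sum(stake for validator, stake in stakes.items() if validator in endorsers)
    return endorsed * 3 > total * 2


def assign_roles(stakes: Dict[str, int], top_n: int = 100) -> Dict[str, List[str]]:
    """Give the top `top_n` validators by stake the BP and CP roles.

    Everyone else becomes a chunk validator. Ties are broken by account name
    so the result is deterministic (an assumption made for this sketch).
    """
    ranked = sorted(stakes, key=lambda validator: (-stakes[validator], validator))
    roles: Dict[str, List[str]] = {}
    for index, validator in enumerate(ranked):
        if index < top_n:
            roles[validator] = ["block_producer", "chunk_producer"]
        else:
            roles[validator] = ["chunk_validator"]
    return roles
```

Note the strict inequality: a chunk endorsed by exactly 2/3 of the stake is not included under this reading of "more than 2/3".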

Out of scope

@walnut-the-cat walnut-the-cat moved this to Ready to be prioritised in Near One project tracking Feb 8, 2024
@walnut-the-cat walnut-the-cat moved this from Ready to be prioritised to Ideas for projects in Near One project tracking Feb 8, 2024
@walnut-the-cat walnut-the-cat moved this from Ideas for projects to In Progress in Near One project tracking Feb 8, 2024
@walnut-the-cat

Need to update the format to make sure it follows the tracking issue template

@walnut-the-cat

February 26th status update

[Note] The team's effort is split between this issue and 20. Items listed in this issue focus on readiness for the stateless validation MainNet release, whereas items listed in 20 focus on the success of the Stake Wars program.

  • [Completed] Orphan state witness pool
  • [Completed] StateTransitionData disk usage optimization
  • [Completed] Handle production of state witness for first chunk after genesis
  • [Completed] Fix improper panic when a Client receives ChunkStateWitness with invalid shard_id
  • [WIP] Network cost analysis
  • [Upcoming] 2x latency in critical path analysis
  • [Upcoming] State witness size limit
  • [Upcoming] Stateless validation NEP draft completion and schedule SME review

@Longarithm Longarithm self-assigned this Mar 8, 2024
@Longarithm
Member Author

Longarithm commented Mar 8, 2024

March 8th status update

  • [Completed] State witness soft size limit
  • [Completed] Started state witness compression testing, 30-70% size reduction achievable
  • [Completed] Benchmarking showing 600 TPS on single shard with memtrie and no caches
  • [WIP] Proposal for validator roles and rewards - waiting to discuss
  • [WIP] TestLoop framework prepared and draft of memtrie + stateless validation test is made.
  • [Upcoming] Discussion on contract cache (large values cache)
  • [Upcoming] State witness hard size limit
  • [Upcoming] Stack overflow & memtrie internal failure investigation
  • [Upcoming] Stateless validation NEP draft completion and schedule SME review
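
One way to read the difference between the soft and hard state witness size limits is sketched below. This is an assumption-laden illustration, not the nearcore logic: it assumes the soft limit stops a chunk producer from applying further receipts once the recorded witness size reaches it (so the last receipt may overshoot), while the hard limit is a strict bound checked during validation. The limit values are placeholders, since the issue lists both as TBD, and all names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Placeholder values only: the issue lists both limits as TBD.
SOFT_LIMIT_BYTES = 4_000_000
HARD_LIMIT_BYTES = 8_000_000


def apply_receipts_under_soft_limit(
    receipts: Iterable[Tuple[str, int]],
    soft_limit: int = SOFT_LIMIT_BYTES,
) -> Tuple[List[str], int]:
    """Apply receipts in order until the witness size reaches the soft limit.

    Each receipt is (receipt_id, bytes_it_adds_to_the_witness). The check runs
    before each receipt, so the final applied receipt can push the size past
    the soft limit. Remaining receipts are left for a later chunk.
    """
    applied: List[str] = []
    witness_size = 0
    for receipt_id, recorded_bytes in receipts:
        if witness_size >= soft_limit:
            break
        applied.append(receipt_id)
        witness_size += recorded_bytes
    return applied, witness_size


def witness_within_hard_limit(witness_size: int, hard_limit: int = HARD_LIMIT_BYTES) -> bool:
    """Validation-side check: a witness above the hard limit is rejected."""
    return witness_size <= hard_limit
```

The gap between the two limits is what absorbs the overshoot from the last applied receipt in this model.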

@Longarithm
Member Author

Longarithm commented Mar 22, 2024

March 22nd status update

  • [Completed] Discussed validator roles and rewards.
  • [Completed] Conclusion on reducing state witness size - will apply compression for now
  • [Completed] Investigation of losing all chunks on statelessnet - related to network.
  • [WIP] Enabling memtrie on statelessnet. Working on state sync integration & missing trie value fix
  • [WIP] Investigating how to resolve network concerns: buffer overflows, low bandwidth for distant nodes.
  • [Upcoming] Stateless validation NEP draft completion and schedule SME review
  • [Upcoming] Implementation of validator roles and rewards
  • [Upcoming] State witness hard size limit - including validation

@walnut-the-cat

April 9th status update

@walnut-the-cat

walnut-the-cat commented Apr 11, 2024

Stateless validation April 11th status update

  • Forknet is running with in-memtrie enabled, but without any traffic. Once @marcelo-gonzalez fixes the traffic generator, we will start sending mainnet traffic and see what breaks (@staffik , @Longarithm , @robin-near )
  • @jancionear is currently taking approach number 1 to address a corner case that can happen during node deletion for the soft limit and hard limit (ref). @jancionear will follow up with more analysis
  • The timeline view will be used to track the stateless validation MVP. The team should add to the view any newly created issues that must be resolved before the MainNet launch
  • With resharding, compression seems to have less impact (40% -> 15%) in general, but it's still significant for max witness size. Max witness size with compression in the last 24 hours is ~2.2MB
  • For network optimization, we are investigating ways to improve node-to-node latency by using protocols other than TCP (e.g. UDP, QUIC). At the same time, we will implement Reed-Solomon coding for the state witness. (@shreyan-gupta , @saketh-are )
  • Reward calculation requires 1 full engineering week to implement. @birchmd will drive it.
  • [P1] With the current assignment model, top validators are likely to validate/track multiple chunks. We may want to see how things can be optimized post MainNet launch.
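
To illustrate the idea behind erasure-coding a state witness, here is a minimal sketch that uses a single XOR parity part. This is deliberately not Reed-Solomon: it is a much simpler code that tolerates the loss of only one part, whereas Reed-Solomon can be configured to tolerate several. The function names and the part count are hypothetical and not taken from nearcore.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(witness: bytes, data_parts: int) -> List[bytes]:
    """Split `witness` into `data_parts` equal parts plus one XOR parity part."""
    part_len = -(-len(witness) // data_parts)  # ceiling division
    padded = witness.ljust(part_len * data_parts, b"\x00")
    parts = [padded[i * part_len:(i + 1) * part_len] for i in range(data_parts)]
    parity = reduce(xor_bytes, parts)
    return parts + [parity]


def decode(parts: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the witness when at most one part (data or parity) is missing."""
    missing = [i for i, part in enumerate(parts) if part is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one missing part")
    restored = list(parts)
    if missing:
        present = [part for part in parts if part is not None]
        # XOR of all surviving parts equals the missing part.
        restored[missing[0]] = reduce(xor_bytes, present)
    data = b"".join(restored[:-1])  # drop the parity part
    return data[:original_len]      # drop the zero padding
```

With such a scheme, a validator can reconstruct the witness as soon as enough parts arrive, rather than waiting for one large message from a single sender.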

@walnut-the-cat

Stateless validation April 16th update

  • [DONE] Fixed an incorrect read of a non-existent key, which caused chunk misses in ForkNet
  • [DONE] State witness compression work is completed
  • [WIP] TCP optimization is being tested in Forknet.
  • [WIP] PRs for State witness hard limit and soft limit implementation are out
  • [WIP] State witness Reed-Solomon encoding
  • [WIP] Address two security concerns
    • Sending witnesses for old epochs
    • Replaying same state witness to overwhelm validators
  • Forknet status
    • Forknet ran for more than 4 hours with reasonable performance under MainNet traffic. Shard 2 suffered 20-30% chunk misses and we observed several block misses. However, the chain never experienced 100% chunk misses.
    • TODOs
      • Integrate with State witness size limit and compression
      • Better chunk tracing system
      • Forknet-control & forknet-test set up for performance comparison
      • Merge code to send x% of mainnet traffic
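
The two security concerns listed above (witnesses for old epochs, and replay of the same witness) can be illustrated with a small admission filter. This is a sketch under stated assumptions, not the fix that was merged: the `WitnessHeader` fields, the `WitnessFilter` class, and the policy of rejecting anything older than the current epoch are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class WitnessHeader:
    """Hypothetical identifying fields of a state witness."""
    epoch_height: int
    shard_id: int
    height_created: int
    chunk_hash: str


@dataclass
class WitnessFilter:
    """Drops witnesses from old epochs and witnesses that were already seen."""
    current_epoch_height: int
    seen: Set[WitnessHeader] = field(default_factory=set)

    def accept(self, header: WitnessHeader) -> bool:
        # Concern 1: witnesses for old epochs are rejected before any costly work.
        if header.epoch_height < self.current_epoch_height:
            return False
        # Concern 2: an identical witness is processed at most once.
        if header in self.seen:
            return False
        self.seen.add(header)
        return True
```

A real implementation would also need to bound the size of `seen`, for example by pruning entries when the epoch advances; that is left out of this sketch.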

@walnut-the-cat

Stateless validation April 19th update

Tasks

  • [DONE] Replaying same state witness to overwhelm validators @Anton Puhach (pugachag)
  • [Code complete] State witness hard limit and soft limit implementation @Jan Ciołek (jancionear)
  • [In test] Improve network connection performance on burst traffic @Saketh Are
  • [WIP] State witness Reed-Solomon encoding @Shreyan Gupta (shreyan-gupta)
  • [WIP] Missing main transition state proof for old block during shard shuffling @Adam Chudaś (staffik)
  • [PR review] Sending witnesses for old epochs @Anton Puhach (pugachag)
  • [TODO] State witness size limit for incoming receipts @Alex Logunov (Longarithm)
  • [TODO] Make contract code inclusion in state witness deterministic @Alex Logunov (Longarithm)

ForkNet

  • Shard shuffling is now enabled
  • Reduced the maximum number of stored log files from 20 to 5 to avoid running out of disk space
  • @Razvan Barbascu (vanbarbascu) is exploring ways for continuous back up of ForkNet ref
  • @Marcelo Diop-Gonzalez shared PR to set interval between transactions to control traffic volume ref
  • @Marcelo Diop-Gonzalez will work on a way to send synthetic traffic to Forknet
  • @Adam Chudaś (staffik) will verify the following features are working as expected:
    • TCP optimization
    • State witness size limit
    • State witness compression
  • Working with @Andrei Mustuc (andrei-near) to explore a way to completely turn off all ForkNet nodes outside of business hours ref
  • @Shreyan Gupta (shreyan-gupta) will investigate why we ended up with 5 shards instead of 6 shards in ForkNet

cc. @Tayfun Elmas (tayfunelmas) , @Bowen Wang

@walnut-the-cat

Stateless validation April 23rd update

Tasks

  • [DONE] Replaying same state witness to overwhelm validators @Anton Puhach (pugachag)
  • [DONE] State witness hard limit and soft limit implementation @Jan Ciołek (jancionear)
  • [In test] Improve network connection performance on burst traffic @Saketh Are
  • [WIP] State witness Reed-Solomon encoding @Shreyan Gupta (shreyan-gupta)
  • [ETA: April 26th] Missing main transition state proof for old block during shard shuffling @Adam Chudaś (staffik)
  • [Pending until Anton is back] Sending witnesses for old epochs @Anton Puhach (pugachag)
  • [WIP - part of congestion control] State witness size limit for incoming receipts @wacban
  • [TODO] Make contract code inclusion in state witness deterministic @Alex Logunov (Longarithm)

ForkNet

  • Shard shuffling is turned off while the investigation is in progress.
  • @Razvan Barbascu (vanbarbascu) is exploring ways for continuous back up of ForkNet ref
  • @Marcelo Diop-Gonzalez is working on
    • Better MainNet traffic pace control
    • Longer MainNet traffic replay period
    • MainNet traffic replay period from congested hours
  • @Adam Chudaś (staffik) will verify the following features are working as expected:
    • TCP optimization
  • @jancionear will verify the following features are working as expected:
    • State witness size limit
    • State witness compression
  • @jancionear and @staffik to test a way to send synthetic traffic along with MainNet traffic
  • Working with @Andrei Mustuc (andrei-near) to explore a way to completely turn off all ForkNet nodes outside of business hours ref
  • @Shreyan Gupta (shreyan-gupta) will investigate why we ended up with 5 shards instead of 6 shards in ForkNet

cc. @Tayfun Elmas (tayfunelmas) , @Bowen Wang

@walnut-the-cat

Stateless validation April 25th update

Tasks

  • [DONE] Test improvement made with TCP optimization. (link)
  • [DONE] ChunkExtra missing due to lazy Flat head catch up (link)
  • [WIP] State witness Reed-Solomon encoding @Shreyan Gupta (shreyan-gupta)
  • [WIP] Missing main transition state proof for old block during shard shuffling @Adam Chudaś (staffik)
  • [Pending until Anton is back] Sending witnesses for old epochs @Anton Puhach (pugachag)
  • [WIP - part of congestion control] State witness size limit for incoming receipts @wacban
  • [TODO] Make contract code inclusion in state witness deterministic @Alex Logunov (Longarithm)

ForkNet

  • Forknet ran from April 2tth 19:00 UTC to April 26th 01:00 UTC, then got stuck. @robin-near suspects it's due to dump data that was not deleted during reset, but confirmation is needed. Once the chain stalled, we stopped the nodes and the traffic generator.
  • This week, we spent a significant amount of engineering time investigating issues that are less relevant to stateless validation itself, e.g. incorrect ForkNet setup, state sync failure, etc. To address this, we plan to do the following:
    • @VanBarbascu and @marcelo-gonzalez will look into simple automation of ForkNet so we can reliably start and tear down ForkNet without human error
    • Core team will focus on known issues and confirm fixes with more scoped tests, such as AdverseNet and Nayduck. ForkNet will be used only in the last step, for final confirmation on a larger network

cc. @tayfunelmas , @bowenwang1996

@walnut-the-cat

Stateless validation April 30th update

Task

  • [DONE] Verify state witness compression works as expected @jancionear
  • [DONE] Store N latest invalid state witnesses @jancionear
  • [WIP] Limit transaction size @jancionear
  • [WIP] Missing transition state proof for old block @staffik
  • [WIP] More integration tests using Nayduck and testloop @pugachAG , @staffik , @robin-near
  • [WIP] security concern - sending state witness for old epoch @pugachAG
  • [WIP] Congestion control for stateless validation @wacban
  • [WIP] Make contract code inclusion deterministic @Longarithm
  • [Code complete, Testing] Reed-Solomon encoding for state witness @shreyan-gupta

ForkNet

  • There was confusion about what needs to be delivered first in the automation effort. @tayfunelmas and @walnut-the-cat put together a TODO list to clarify what to focus on first: ref @VanBarbascu , @marcelo-gonzalez
  • Currently, ForkNet is down and will soon be replaced with the next version, using Marcelo's script for fast construction

cc. @bowenwang1996 , @tayfunelmas

@walnut-the-cat

walnut-the-cat commented May 7, 2024

May 7th update

cc. @bowenwang1996 , @tayfunelmas

@walnut-the-cat walnut-the-cat moved this from In Progress to Done in Near One project tracking Aug 5, 2024

2 participants