fix(db-tool): Tool to run DB migrations #9333
Merged
Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```

![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f
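The genesis overwrite step from the repro instructions can be scripted. This is a minimal sketch, assuming the three fields live at the top level of the genesis JSON (as the `.epoch_length` style paths suggest); the function name `overwrite_genesis` and the way the layout is passed in are illustrative and not part of nearcore.

```python
import json


def overwrite_genesis(genesis_path, shard_layout):
    """Apply the three genesis overrides listed in the repro instructions.

    `shard_layout` is whatever JSON value the print_shard_layout_all test
    produced; this helper (a hypothetical name) does not validate it.
    """
    with open(genesis_path) as f:
        genesis = json.load(f)

    genesis["epoch_length"] = 10
    genesis["use_production_config"] = True
    genesis["shard_layout"] = shard_layout

    with open(genesis_path, "w") as f:
        json.dump(genesis, f, indent=2)
    return genesis
```

The helper rewrites the file in place, so it should run after the localnet setup has been generated and before neard is started for the first time.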
- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open to a better suggestion for how to rename those.
- In the LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?

![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)
…ves (#9295) This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability. But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)
Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint. The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba
The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage. Also add a metric for the latest block processed during catchup. Also fix `view-state apply-range` tool not to fail because of getting delayed indices. Also reduce verbosity of the inlining migration.
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads. Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined). Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.
This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals). In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.
The metric has been deprecated since 1.30. Users should use near_peer_message_received_by_type_total instead.
There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that.

- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds.
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR)

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
#9315) Users have had enough time to update their config files to no longer specify network.external_address. The comment dictates the warning should be removed by the end of 2022 which was half a year ago.
…it fns (#9313)

The base on_locust_init() function sets `environment.master_funding_account`, and other init functions expect it to be set when they're run. When that isn't the case, you can get this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been started, and it might be related to this issue (which has been closed due to inactivity): locustio/locust#1730. That bug mentions that `User`s get started before on_locust_init() runs, but maybe for similar reasons, we can't guarantee the order in which each on_locust_init() function will run.

This doesn't seem to happen every time, and it hasn't really been triggered on MacOS, only on Linux. But this makes it kind of a blocker for setting this test up on cloud VMs (where this bug has been observed).
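One way to make init handlers independent of the order in which they fire is to have each of them call an idempotent base-init helper first. The sketch below illustrates that idea only and is not necessarily what this change does: it uses `types.SimpleNamespace` as a stand-in for locust's `Environment`, and the names `ensure_base_init` and `social_on_locust_init` as well as the placeholder account value are hypothetical.

```python
from types import SimpleNamespace


def ensure_base_init(environment):
    """Idempotent base init: set the funding account once, reuse it afterwards."""
    if not hasattr(environment, "master_funding_account"):
        # Placeholder value; the real test derives this from its CLI options.
        environment.master_funding_account = "funding-account-placeholder"
    return environment.master_funding_account


def social_on_locust_init(environment):
    """A dependent init handler that no longer assumes it runs second."""
    funding_account = ensure_base_init(environment)
    environment.social_funding_account = funding_account


# Stand-in for a locust Environment on which no handler has run yet.
env = SimpleNamespace()
social_on_locust_init(env)  # works even though the base init never ran on its own
ensure_base_init(env)       # calling the base init afterwards changes nothing
```

Because the helper only writes the attribute when it is missing, the handlers can fire in either order without raising the `AttributeError` shown above.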
…9289) No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`
…s. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these.

Also, re-enable two tests which are now fixed.
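The two spawn-related problems listed above can be illustrated with a small standalone script. This is a sketch of the general pattern rather than the code from the nayduck tests; `worker` and `run_children` are illustrative names.

```python
import multiprocessing


def worker(counter):
    """Child entry point: receives the shared value as an argument."""
    with counter.get_lock():
        counter.value += 1


def run_children(num_children=2):
    # Force the spawn start method so the behaviour matches macOS on any OS.
    ctx = multiprocessing.get_context("spawn")
    # Shared memory is created here and handed to each child explicitly,
    # instead of being read from a module-level global.
    counter = ctx.Value("i", 0)
    procs = [ctx.Process(target=worker, args=(counter,)) for _ in range(num_children)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


# Without this guard, every spawned child would re-import this module
# and try to start children of its own.
if __name__ == "__main__":
    print(run_children())
```

With fork, a child inherits the parent's globals, which is why relying on a shared global happened to work on Linux; with spawn, the child starts from a fresh interpreter and only sees what is passed to it.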
In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next. The fix is to check what is the protocol version supported in the binary and depending on it reshard from V0 to V1 or from V1 to V2.
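The selection logic described above can be sketched as follows. The function name and both parameters are hypothetical; in particular, the protocol version at which SimpleNightshadeV2 becomes active is passed in rather than hard-coded, because it is not stated here.

```python
def resharding_layout_versions(binary_protocol_version, simple_nightshade_v2_version):
    """Return (from_layout_version, to_layout_version) for the resharding test.

    Resharding only works between a shard layout version and the immediate
    next one, so the starting layout depends on what the binary supports.
    """
    if binary_protocol_version >= simple_nightshade_v2_version:
        return (1, 2)
    return (0, 1)
```

Either way the two versions differ by exactly one, which is the constraint the shard split map imposes.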
This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).

Also includes some minor refactoring around database tool.

<details>
<summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help

ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)
2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0
```

</details>
### Suggested Review Path

1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture

![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows

- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters

To be finalized after further testing in larger topologies:

- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources

- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions

- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
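The "Handle RoutedMessage" and "Handle response" flows above can be illustrated with a toy model. This is a conceptual sketch only: the real components are written in Rust, and the bounded capacity, the eviction policy, the `record`/`take` method names, and the shape of the routing view (a dict from target to candidate next hops) are all assumptions made for the illustration.

```python
from collections import OrderedDict


class RouteBackCache:
    """Toy route-back cache: remembers which peer a routed message came from."""

    def __init__(self, capacity=1000):
        self._capacity = capacity
        self._prev_hop = OrderedDict()

    def record(self, message_hash, previous_hop):
        """Store the previous hop, evicting the oldest entry when full."""
        self._prev_hop[message_hash] = previous_hop
        self._prev_hop.move_to_end(message_hash)
        while len(self._prev_hop) > self._capacity:
            self._prev_hop.popitem(last=False)

    def take(self, message_hash):
        """Return and forget the previous hop, or None if it is unknown."""
        return self._prev_hop.pop(message_hash, None)


def select_next_hop(routing_view, target):
    """Pick the first known next hop towards `target`, or None if unreachable."""
    candidates = routing_view.get(target, [])
    return candidates[0] if candidates else None
```

On a routed message a node would call `record` and then forward to `select_next_hop(...)`; on the response it would call `take` with the original message's hash and relay the response to the returned peer.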
@frol I went through the related code, found this is the only required edit as we already set up logging services in the nearcore.
Recommend future readers to stop considering these parameters, because heavy flat storage migration already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121.
`list[...]` in type hints only works for python 3.9 and up. For older python versions, we should use `typing.List[...]`. I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version. This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.
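As a small illustration of the difference, the following annotations work on the older Python versions mentioned above, whereas writing `list[str]` in the same positions raises a `TypeError` on Python 3.8 when the function is defined. The function `account_ids` and its naming scheme are made up for the example.

```python
from typing import Dict, List


def account_ids(prefix: str, count: int) -> List[str]:
    """Build `count` account names; List[str] is valid on older Pythons too."""
    return [f"{prefix}{i}.test" for i in range(count)]


def group_by_length(names: List[str]) -> Dict[int, List[str]]:
    """Group names by their length, using typing generics for the annotation."""
    groups: Dict[int, List[str]] = {}
    for name in names:
        groups.setdefault(len(name), []).append(name)
    return groups
```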
…9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
This update brings a lot of new changes:

- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool
Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. test repro instructions: ``` - get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT - generate localnet setup with 4 shards and 1 validator - in the genesis file overwrite: - .epoch_length=10 - .use_production_config=true - .shard_layout=$SHARD_LAYOUT - build neard with nightly not enabled - run neard for at least one epoch - build neard with nightly enabled - run neard - watch resharding happening (only enabled debug logs for "catchup" target) - see new shard layout in the debug page ``` ![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f) resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f
- Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. - In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. - In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. - Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? ![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)
Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithtmetic` lint with a more general `clippy::arithmentic_side_effects` lint. The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba
The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage. Also add a metric for the latest block processed during catchup. Also fix `view-state apply-range` tool not to fail because of getting delayed indices. Also reduce verbosity of the inlining migration.
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. - Removed a few variables in tracing spans that were redundant - already included in parent span. - Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value. - Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. - Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. - **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more that a few times per block, then if should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? For any of those I can be convinced otherwise, please shout. 
new log lines look like this: ``` 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs ``` (with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235). Also includes some minor refactoring around database tool. <details> <summary>Example executions</summary> ``` ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help Run performance test for State column reads Usage: neard database state-perf [OPTIONS] Options: -s, --samples <SAMPLES> Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000] -w, --warmup-samples <WARMUP_SAMPLES> Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000] -h, --help Print help ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf 2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62 2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. 
num_instances=1 Start State perf test Generate 11000 requests to State █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished requests generation █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished State perf test overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%) block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%) block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%) block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%) block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%) block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%) block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%) block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%) block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%) block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%) block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%) block_read_count: 10, samples: 22 (0.22%): | avg 
observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%) 2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0 ``` </details>
### Suggested Review Path

1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture

![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows

- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters

To be finalized after further testing in larger topologies:

- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources

- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions

- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
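To make the "Network Topology Changes" flow above more concrete, here is a minimal sketch of a distance-vector recomputation. It is not the RoutingTableV2 code from this PR: `RoutingSketch`, `NodeId`, and every method name are hypothetical, distances are plain hop counts, and signed-edge verification, edge expiry, and demux batching are all left out. It only illustrates rebuilding the local vector from the peers' advertised vectors and reporting whether it changed (the point at which a node would broadcast).

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-in for a peer identifier; not the type used in nearcore.
type NodeId = u32;

/// Hop counts that a node advertises for each destination it can reach.
type DistanceVector = BTreeMap<NodeId, u32>;

#[derive(Default)]
struct RoutingSketch {
    local_id: NodeId,
    /// Latest vector received from each directly connected peer.
    peer_vectors: BTreeMap<NodeId, DistanceVector>,
    /// Last vector computed locally; compared against to decide on a broadcast.
    local_vector: DistanceVector,
    /// For each destination, the peers that lie on a shortest known route.
    next_hops: BTreeMap<NodeId, BTreeSet<NodeId>>,
}

impl RoutingSketch {
    fn new(local_id: NodeId) -> Self {
        let mut sketch = Self { local_id, ..Default::default() };
        sketch.recompute();
        sketch
    }

    /// A peer connected or sent a new vector. Returns true if the local
    /// vector changed, i.e. a broadcast to all peers would be needed.
    fn on_peer_vector(&mut self, peer: NodeId, vector: DistanceVector) -> bool {
        self.peer_vectors.insert(peer, vector);
        self.recompute()
    }

    /// A peer disconnected: drop its vector and rebuild the routes.
    fn on_peer_disconnected(&mut self, peer: NodeId) -> bool {
        self.peer_vectors.remove(&peer);
        self.recompute()
    }

    /// Rebuild the local vector and next hops from scratch.
    fn recompute(&mut self) -> bool {
        let mut vector = DistanceVector::new();
        let mut hops: BTreeMap<NodeId, BTreeSet<NodeId>> = BTreeMap::new();
        vector.insert(self.local_id, 0);
        for (&peer, advertised) in &self.peer_vectors {
            for (&dest, &dist) in advertised {
                if dest == self.local_id {
                    continue;
                }
                // Going through `peer` costs one extra hop.
                let candidate = dist + 1;
                match vector.get(&dest).copied() {
                    // An already known route is strictly shorter: ignore.
                    Some(best) if best < candidate => {}
                    // Equally short: remember `peer` as an alternative hop.
                    Some(best) if best == candidate => {
                        hops.entry(dest).or_default().insert(peer);
                    }
                    // No route yet, or this one is shorter: replace.
                    _ => {
                        vector.insert(dest, candidate);
                        hops.insert(dest, BTreeSet::from([peer]));
                    }
                }
            }
        }
        let changed = vector != self.local_vector;
        self.local_vector = vector;
        self.next_hops = hops;
        changed
    }
}
```

In this sketch a node with id 1 that receives `{2: 0, 3: 1}` from peer 2 ends up with distance 1 to node 2 and distance 2 to node 3, both via peer 2; receiving the same vector again reports no change, so nothing would be re-broadcast. Rebuilding from scratch on every update keeps the sketch short; it says nothing about how the real component amortizes that work.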
…9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
nikurt added a commit to nikurt/nearcore that referenced this pull request on Jul 26, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:

```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```

![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download).
I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name.
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite?

![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint. The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations.
I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads. Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined). Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).

In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30. Users should use near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that.

- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds.
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR)

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer specify network.external_address. The comment dictates the warning should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

the base on_locust_init() function sets `environment.master_funding_account`, and other init functions expect it to be set when they're run. When that isn't the case, you can get this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been started, and it might be related to this issue (which has been closed due to inactivity): locustio/locust#1730. That bug mentions that `User`s get started before on_locust_init() runs, but maybe for similar reasons, we can't guarantee the order in which each on_locust_init() function will run. This doesn't seem to happen every time, and it hasn't really been triggered on MacOS, only on Linux.
But this makes it kind of a blocker for setting this test up on cloud VMs (where this bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it.

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next.

The fix is to check what is the protocol version supported in the binary and depending on it reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).

Also includes some minor refactoring around database tool.
<details>
<summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help

ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)
2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0
```

</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path

1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.
### Architecture

![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows

- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters

To be finalized after further testing in larger topologies:

- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources

- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions

- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code, found this is the only required edit as we already set up logging services in the nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommend future readers to stop considering these parameters, because heavy flat storage migration already happened on all nodes in the ecosystem. So this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up. For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see near#9125) but it seems we are somewhat dependent on older Ubuntu versions for now.
At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:

- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations
PeerManagerActor for routing decisions - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message - Connection started - When two nodes A and B connect, each spawns a PeerActor managing the connection - A sends a partially signed edge, which B then signs to produce a complete signed edge - B adds the signed edge to its local routing table, triggering re-computation of routes - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge - Connection stopped - Node A loses connection to some node B (either B stopped running, or the specific connection was broken) - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has - If B is still running, it will go through the same steps described for A - If B is not running, the other nodes connected to it will process a disconnection (just like A) ### Configurable Parameters To be finalized after further testing in larger topologies: - Minimum interval between routing table reconstruction: 1 second - Time after which edges are considered expired: 30 minutes - How often to refresh the nonces on edges: 10 minutes - How often to check consistency of routing table's local edges with the connection pool: every 1 minute ### Resources - [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg) - [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion #### Future Extensions - [ ] Set up metrics we want to collect - [ ] Implement a debug-ui view showing contents of the V2 routing table - [ ] Implement pruning of non-validator leafs - [ ] Add handling of unreliable peers - [ ] Deprecate the old RoutingTable - [ ] Deprecate 
negative/tombstone edges * feat(state-sync): Add config for number of downloads during catchup (near#9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state. * Merge * Merge * fmt * fmt * fmt * fmt * fmt * fmt --------- Co-authored-by: wacban <[email protected]> Co-authored-by: Simonas Kazlauskas <[email protected]> Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com> Co-authored-by: Jakob Meier <[email protected]> Co-authored-by: Anton Puhach <[email protected]> Co-authored-by: Michal Nazarewicz <[email protected]> Co-authored-by: Marcelo Diop-Gonzalez <[email protected]> Co-authored-by: robin-near <[email protected]> Co-authored-by: Saketh Are <[email protected]> Co-authored-by: Yasir <[email protected]> Co-authored-by: Aleksandr Logunov <[email protected]> Co-authored-by: Razvan Barbascu <[email protected]> Co-authored-by: Jure Bajic <[email protected]>
nikurt added a commit that referenced this pull request on Aug 24, 2023
* fix(db-tool): Tool to run DB migrations * feat: simple nightshade v2 - shard layout with 5 shards (#9274) Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. test repro instructions: ``` - get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT - generate localnet setup with 4 shards and 1 validator - in the genesis file overwrite: - .epoch_length=10 - .use_production_config=true - .shard_layout=$SHARD_LAYOUT - build neard with nightly not enabled - run neard for at least one epoch - build neard with nightly enabled - run neard - watch resharding happening (only enabled debug logs for "catchup" target) - see new shard layout in the debug page ``` ![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f) resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f * refactor: small refactorings and improvements (#9296) - Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). 
I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?

![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability. But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned).

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint. The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations.
I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba * fix(state-sync): Always use flat storage when catching up (#9311) The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage. Also add a metric for the latest block processed during catchup. Also fix `view-state apply-range` tool not to fail because of getting delayed indices. Also reduce verbosity of the inlining migration. * fix(state-snapshot): Tool to make DB snapshots (#9308) Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com> * chore(estimator): remove TTN read estimation (#9307) Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads. Remove the gas estimation for it. More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined). Now we only need a single number reported. The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered. 
```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals). In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30. Users should use near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that.
- Removed a few variables in tracing spans that were redundant because they are already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span, which immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds.
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:
```
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
1.075s DEBUG do_apply_chunks{block_height=23
block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs ``` (with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) On a sidenote, I quite like tracing spans but we may be overdoing it a bit. * nearcore: remove old deprecation notice about network.external_address (#9315) Users have had enough time to update their config files to no longer specify network.external_address. The comment dictates the warning should be removed by the end of 2022 which was half a year ago. * fix(state-sync): Test showing that state sync can't always generate state parts (#9294) Extracted a test from #9237 . No fix is available yet. * fix(locust): wait for base on_locust_init() to finish before other init fns (#9313) the base on_locust_init() function sets `environment.master_funding_account`, and other init functions expect it to be set when they're run. When that isn't the case, you can get this sort of error: ``` Traceback (most recent call last): File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire handler(**kwargs) File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init funding_account = environment.master_funding_account AttributeError: 'Environment' object has no attribute 'master_funding_account ``` This error can even happen in the master, before the workers have been started, and it might be related to this issue (which has been closed due to inactivity): locustio/locust#1730. That bug mentions that `User`s get started before on_locust_init() runs, but maybe for similar reasons, we can't guarantee the order in which each on_locust_init() function will run. This doesn't seem to happen every time, and it hasn't really been triggered on MacOS, only on Linux. 
But this makes it kind of a blocker for setting this test up on cloud VMs (where this bug has been observed) * fix(state-sync): Simplify storage format of state sync dump progress (#9289) No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x` * Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314) Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn: 1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code; 2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation. This PR fixes these. Also, re-enable two tests which are now fixed. * fix: fixed nayduck test state_sync_fail.py for nightly build (#9320) In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next. The fix is to check what is the protocol version supported in the binary and depending on it reshard from V0 to V1 or from V1 to V2. * feat: add database tool subcommand for State read perf testing (#9276) This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235). Also includes some minor refactoring around database tool. 
<details> <summary>Example executions</summary> ``` ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help Run performance test for State column reads Usage: neard database state-perf [OPTIONS] Options: -s, --samples <SAMPLES> Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000] -w, --warmup-samples <WARMUP_SAMPLES> Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000] -h, --help Print help ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf 2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62 2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. num_instances=1 Start State perf test Generate 11000 requests to State █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished requests generation █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished State perf test overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%) block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%) block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%) block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%) block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%) block_read_count: 4, samples: 1361 
(13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%) block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%) block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%) block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%) block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%) block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%) block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%) 2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0 ``` </details> * RoutingTable V2: Distance Vector Routing (#9187) ### Suggested Review Path 1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component. 2. Check out the architecture diagram and event flows documented below. 3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol. 4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol. 5. Return to the EdgeCache and review its implementation. 6. Revisit the call-sites outside of the routing folder. 
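For reviewers new to distance-vector routing, the ingestion step mentioned in item 4 can be sketched roughly as follows. This is the textbook algorithm under simplifying assumptions (hop counts only, no signed edges, no expiry), not the RoutingTableV2 implementation; `PeerId`, `DistanceVector`, `build_routes` and `consider` are hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

// Hypothetical simplification: peers are identified by short static names.
type PeerId = &'static str;

/// Distances advertised by one direct peer: destination -> hop count from that peer.
type DistanceVector = BTreeMap<PeerId, u32>;

/// Keep `via` as the next hop for `dest` only if it is strictly shorter
/// than the best route recorded so far.
fn consider(routes: &mut BTreeMap<PeerId, (PeerId, u32)>, dest: PeerId, via: PeerId, dist: u32) {
    match routes.get(&dest) {
        Some(&(_, best)) if best <= dist => {}
        _ => {
            routes.insert(dest, (via, dist));
        }
    }
}

/// Combine the vectors advertised by direct peers into
/// destination -> (next hop, total hop count) for the local node.
fn build_routes(
    local: PeerId,
    advertised: &BTreeMap<PeerId, DistanceVector>,
) -> BTreeMap<PeerId, (PeerId, u32)> {
    let mut routes = BTreeMap::new();
    for (&peer, vector) in advertised {
        // A direct peer is always reachable in one hop.
        consider(&mut routes, peer, peer, 1);
        for (&dest, &dist) in vector {
            if dest == local {
                continue; // no route to ourselves is needed
            }
            // Going through `peer` costs one extra hop.
            consider(&mut routes, dest, peer, dist + 1);
        }
    }
    routes
}

fn main() {
    let mut advertised = BTreeMap::new();
    advertised.insert("B", DistanceVector::from([("A", 1), ("B", 0), ("D", 1), ("E", 2)]));
    advertised.insert("C", DistanceVector::from([("C", 0), ("E", 1)]));

    let routes = build_routes("A", &advertised);
    for (dest, (via, dist)) in &routes {
        println!("to {dest}: next hop {via}, {dist} hop(s)");
    }
}
```

In this toy topology node A reaches E through C rather than B, because C advertises the shorter distance. The real component additionally has to verify the advertised routes, which is where the EdgeCache from items 3 and 5 comes in.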
### Architecture ![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png) ### Event Flows - Network Topology Changes - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector - These are triggered by PeerActor and flow into PeerManagerActor then into the demux - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2 - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector - If the local DistanceVector changes, it is then broadcast to all peers - Handle RoutedMessage - Received by the PeerActor, which calls into PeerManagerActor for routing decisions - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache - Select a "next hop" from the RoutingTableView and forward the message - Handle response to a RoutedMessage - Received by the PeerActor, which calls into PeerManagerActor for routing decisions - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message - Connection started - When two nodes A and B connect, each spawns a PeerActor managing the connection - A sends a partially signed edge, which B then signs to produce a complete signed edge - B adds the signed edge to its local routing table, triggering re-computation of routes - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge - Connection stopped - Node A loses connection to some node B (either B stopped running, or the specific connection was broken) - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has - If B is still running, it will go through the same steps described for A - If B is 
not running, the other nodes connected to it will process a disconnection (just like A) ### Configurable Parameters To be finalized after further testing in larger topologies: - Minimum interval between routing table reconstruction: 1 second - Time after which edges are considered expired: 30 minutes - How often to refresh the nonces on edges: 10 minutes - How often to check consistency of routing table's local edges with the connection pool: every 1 minute ### Resources - [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg) - [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion #### Future Extensions - [ ] Set up metrics we want to collect - [ ] Implement a debug-ui view showing contents of the V2 routing table - [ ] Implement pruning of non-validator leafs - [ ] Add handling of unreliable peers - [ ] Deprecate the old RoutingTable - [ ] Deprecate negative/tombstone edges * fix: use logging instead of print statements (#9277) @frol I went through the related code, found this is the only required edit as we already set up logging services in the nearcore. * refactor: todo to remove flat storage creation parameters (#9250) Recommend future readers to stop considering these parameters, because heavy flat storage migration already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121. * refactor(loadtest): backwards compatible type hints (#9323) `list[...]` in type hints only works for python 3.9 and up. For older python versions, we should use `typing.List[...]`. I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. 
At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version. This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5. * feat(state-sync): Add config for number of downloads during catchup (#9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state. * chore: Update RocksDB to 0.21 (#9298) This update brings a lot of new changes: - Update to RocksDB 8.1.1 - `io_uring` enabled which can be tested - Added `load_latest` to open RocksDB with the latest options file - and other fixes No degradation was seen using a `perf-state` tool * fix(db-tool): Tool to run DB migrations * fix(db-tool): Tool to run DB migrations * fix(db-tool): Tool to run DB migrations * fmt * fmt * fix(db-tool): Tool to run DB migrations * feat: simple nightshade v2 - shard layout with 5 shards (#9274) Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 
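To illustrate why reusing the same structure with one extra boundary account is enough to go from 4 to 5 shards, here is a minimal sketch. It assumes accounts are assigned to shards purely by lexicographic comparison against a sorted list of boundary accounts; `BoundaryLayout`, `layout` and the boundary names are hypothetical and are not the actual `ShardLayout::V1` definition or the real boundary accounts.

```rust
/// Hypothetical, simplified stand-in for a boundary-account shard layout.
struct BoundaryLayout {
    /// Layout version, bumped when the boundaries change.
    version: u32,
    /// Sorted boundary accounts; n boundaries define n + 1 shards.
    boundary_accounts: Vec<String>,
}

/// Build a layout from a version and a sorted list of boundary accounts.
fn layout(version: u32, boundaries: &[&str]) -> BoundaryLayout {
    BoundaryLayout {
        version,
        boundary_accounts: boundaries.iter().map(|b| b.to_string()).collect(),
    }
}

impl BoundaryLayout {
    fn num_shards(&self) -> usize {
        self.boundary_accounts.len() + 1
    }

    /// The shard index is the number of boundaries that sort at or before the
    /// account, so an account equal to a boundary falls into the upper shard.
    fn shard_for(&self, account_id: &str) -> usize {
        self.boundary_accounts
            .partition_point(|b| b.as_str() <= account_id)
    }
}

fn main() {
    // Made-up boundaries: the second layout only adds "p.near".
    let v1 = layout(1, &["d.near", "m.near", "t.near"]);
    let v2 = layout(2, &["d.near", "m.near", "p.near", "t.near"]);

    for l in [&v1, &v2] {
        println!("layout v{} has {} shards", l.version, l.num_shards());
        for account in ["alice.near", "nancy.near", "quinn.near", "zed.near"] {
            println!("  {account} -> shard {}", l.shard_for(account));
        }
    }
}
```

With these made-up boundaries, only the shard containing the new boundary is split: accounts below `p.near` keep their shard index, while accounts at or above it move up by one.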
test repro instructions: ``` - get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT - generate localnet setup with 4 shards and 1 validator - in the genesis file overwrite: - .epoch_length=10 - .use_production_config=true - .shard_layout=$SHARD_LAYOUT - build neard with nightly not enabled - run neard for at least one epoch - build neard with nightly enabled - run neard - watch resharding happening (only enabled debug logs for "catchup" target) - see new shard layout in the debug page ``` ![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f) resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f * refactor: small refactorings and improvements (#9296) - Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. - In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. - In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. - Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint. The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup. Also fix the `view-state apply-range` tool so that it does not fail because of getting delayed indices. Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that.
- Removed a few variables in tracing spans that were redundant because they are already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span, which immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds.
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:
```
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
1.075s DEBUG do_apply_chunks{block_height=23
block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs ``` (with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) On a sidenote, I quite like tracing spans but we may be overdoing it a bit. * fix(state-sync): Test showing that state sync can't always generate state parts (#9294) Extracted a test from #9237 . No fix is available yet. * feat: add database tool subcommand for State read perf testing (#9276) This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235). Also includes some minor refactoring around database tool. <details> <summary>Example executions</summary> ``` ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help Run performance test for State column reads Usage: neard database state-perf [OPTIONS] Options: -s, --samples <SAMPLES> Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000] -w, --warmup-samples <WARMUP_SAMPLES> Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000] -h, --help Print help ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf 2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62 2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. 
num_instances=1 Start State perf test Generate 11000 requests to State █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished requests generation █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished State perf test overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%) block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%) block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%) block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%) block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%) block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%) block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%) block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%) block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%) block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%) block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%) block_read_count: 10, samples: 22 (0.22%): | avg 
observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%) 2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0 ``` </details> * RoutingTable V2: Distance Vector Routing (#9187) ### Suggested Review Path 1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component. 2. Check out the architecture diagram and event flows documented below. 3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol. 4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol. 5. Return to the EdgeCache and review its implementation. 6. Revisit the call-sites outside of the routing folder. ### Architecture ![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png) ### Event Flows - Network Topology Changes - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector - These are triggered by PeerActor and flow into PeerManagerActor then into the demux - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2 - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector - If the local DistanceVector changes, it is then broadcast to all peers - Handle RoutedMessage - Received by the PeerActor, which calls into PeerManagerActor for routing decisions - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache - Select a "next hop" from the RoutingTableView and forward the message - Handle response to a RoutedMessage - Received by the PeerActor, which calls into 
PeerManagerActor for routing decisions - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message - Connection started - When two nodes A and B connect, each spawns a PeerActor managing the connection - A sends a partially signed edge, which B then signs to produce a complete signed edge - B adds the signed edge to its local routing table, triggering re-computation of routes - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge - Connection stopped - Node A loses connection to some node B (either B stopped running, or the specific connection was broken) - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has - If B is still running, it will go through the same steps described for A - If B is not running, the other nodes connected to it will process a disconnection (just like A) ### Configurable Parameters To be finalized after further testing in larger topologies: - Minimum interval between routing table reconstruction: 1 second - Time after which edges are considered expired: 30 minutes - How often to refresh the nonces on edges: 10 minutes - How often to check consistency of routing table's local edges with the connection pool: every 1 minute ### Resources - [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg) - [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion #### Future Extensions - [ ] Set up metrics we want to collect - [ ] Implement a debug-ui view showing contents of the V2 routing table - [ ] Implement pruning of non-validator leafs - [ ] Add handling of unreliable peers - [ ] Deprecate the old RoutingTable - [ ] Deprecate 
negative/tombstone edges * feat(state-sync): Add config for number of downloads during catchup (#9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state. * Merge * Merge * fmt * fmt * fmt * fmt * fmt * fmt --------- Co-authored-by: wacban <[email protected]> Co-authored-by: Simonas Kazlauskas <[email protected]> Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com> Co-authored-by: Jakob Meier <[email protected]> Co-authored-by: Anton Puhach <[email protected]> Co-authored-by: Michal Nazarewicz <[email protected]> Co-authored-by: Marcelo Diop-Gonzalez <[email protected]> Co-authored-by: robin-near <[email protected]> Co-authored-by: Saketh Are <[email protected]> Co-authored-by: Yasir <[email protected]> Co-authored-by: Aleksandr Logunov <[email protected]> Co-authored-by: Razvan Barbascu <[email protected]> Co-authored-by: Jure Bajic <[email protected]>
nikurt added a commit to nikurt/nearcore that referenced this pull request on Aug 24, 2023
* fix(db-tool): Tool to run DB migrations * feat: simple nightshade v2 - shard layout with 5 shards (near#9274) Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. test repro instructions: ``` - get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT - generate localnet setup with 4 shards and 1 validator - in the genesis file overwrite: - .epoch_length=10 - .use_production_config=true - .shard_layout=$SHARD_LAYOUT - build neard with nightly not enabled - run neard for at least one epoch - build neard with nightly enabled - run neard - watch resharding happening (only enabled debug logs for "catchup" target) - see new shard layout in the debug page ``` ![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f) resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f * refactor: small refactorings and improvements (near#9296) - Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). 
I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open to better suggestions on how to rename those. - In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. - In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. - Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? ![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68) * refactor: refactoring and commenting some resharding code (near#9299) * near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295) This allows us to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability. But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.) * rust: 1.70.0 -> 1.71.0 (near#9302) Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint. The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. 
I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba * fix(state-sync): Always use flat storage when catching up (near#9311) The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage. Also add a metric for the latest block processed during catchup. Also fix `view-state apply-range` tool not to fail because of getting delayed indices. Also reduce verbosity of the inlining migration. * fix(state-snapshot): Tool to make DB snapshots (near#9308) Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com> * chore(estimator): remove TTN read estimation (near#9307) Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads. Remove the gas estimation for it. More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined). Now we only need a single number reported. The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered. 
``` thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5 stack backtrace: 0: rust_begin_unwind at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5 1: core::panicking::panic_fmt at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14 2: core::panicking::panic at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5 3: runtime_params_estimator::touching_trie_node_read 4: runtime_params_estimator::touching_trie_node 5: runtime_params_estimator::run_estimation 6: runtime_params_estimator::main ``` We "fix" it by removing the code. * feat: expose more RocksDB properties (near#9279) This expose more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals). In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details. * chain: remove deprecated near_peer_message_received_total metric (near#9312) The metric has been deprecated since 1.30. Users should use near_peer_message_received_by_type_total instead. * refactor: improvements to logging (near#9309) There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. - Removed a few variables in tracing spans that were redundant - already included in parent span. - Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value. 
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. - Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. - **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? For any of those I can be convinced otherwise, please shout. new log lines look like this: ``` 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs 1.075s DEBUG do_apply_chunks{block_height=23 
block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs ``` (with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) On a sidenote, I quite like tracing spans but we may be overdoing it a bit. * nearcore: remove old deprecation notice about network.external_address (near#9315) Users have had enough time to update their config files to no longer specify network.external_address. The comment dictates the warning should be removed by the end of 2022 which was half a year ago. * fix(state-sync): Test showing that state sync can't always generate state parts (near#9294) Extracted a test from near#9237 . No fix is available yet. * fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313) the base on_locust_init() function sets `environment.master_funding_account`, and other init functions expect it to be set when they're run. When that isn't the case, you can get this sort of error: ``` Traceback (most recent call last): File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire handler(**kwargs) File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init funding_account = environment.master_funding_account AttributeError: 'Environment' object has no attribute 'master_funding_account ``` This error can even happen in the master, before the workers have been started, and it might be related to this issue (which has been closed due to inactivity): locustio/locust#1730. That bug mentions that `User`s get started before on_locust_init() runs, but maybe for similar reasons, we can't guarantee the order in which each on_locust_init() function will run. This doesn't seem to happen every time, and it hasn't really been triggered on MacOS, only on Linux. 
But this makes it kind of a blocker for setting this test up on cloud VMs (where this bug has been observed) * fix(state-sync): Simplify storage format of state sync dump progress (near#9289) No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x` * Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314) Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn: 1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code; 2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation. This PR fixes these. Also, re-enable two tests which are now fixed. * fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320) In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next. The fix is to check what is the protocol version supported in the binary and depending on it reshard from V0 to V1 or from V1 to V2. * feat: add database tool subcommand for State read perf testing (near#9276) This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235). Also includes some minor refactoring around database tool. 
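The version check described for near#9320 above can be sketched roughly as follows. This is a minimal illustration, not the code in state_sync_fail.py: the function name `pick_resharding_step` and the `v2_protocol_version` parameter are hypothetical, and the protocol version at which SimpleNightshadeV2 activates is deliberately left as an input rather than assumed.

```python
from typing import Tuple


def pick_resharding_step(binary_protocol_version: int,
                         v2_protocol_version: int) -> Tuple[int, int]:
    """Return (from_layout_version, to_layout_version) for the test.

    A shard split map only describes the step from one shard layout
    version to the immediately following one, so the test must never
    try to jump straight from V0 to V2.
    """
    if binary_protocol_version >= v2_protocol_version:
        # The binary supports SimpleNightshadeV2: reshard from V1 to V2.
        return (1, 2)
    # Binary without the new layout: exercise the V0 -> V1 resharding.
    return (0, 1)


# Hypothetical version numbers, for illustration only.
step_old = pick_resharding_step(62, v2_protocol_version=135)
step_new = pick_resharding_step(135, v2_protocol_version=135)
```

Either way the chosen step spans exactly one layout version, which is the property the broken test violated.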
<details> <summary>Example executions</summary> ``` ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help Run performance test for State column reads Usage: neard database state-perf [OPTIONS] Options: -s, --samples <SAMPLES> Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000] -w, --warmup-samples <WARMUP_SAMPLES> Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000] -h, --help Print help ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf 2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62 2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. num_instances=1 Start State perf test Generate 11000 requests to State █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished requests generation █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000 Finished State perf test overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%) block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%) block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%) block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%) block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%) block_read_count: 4, samples: 1361 
(13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%) block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%) block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%) block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%) block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%) block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%) block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%) 2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0 ``` </details> * RoutingTable V2: Distance Vector Routing (near#9187) ### Suggested Review Path 1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component. 2. Check out the architecture diagram and event flows documented below. 3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol. 4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol. 5. Return to the EdgeCache and review its implementation. 6. Revisit the call-sites outside of the routing folder. 
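As background for item 4 above, here is a minimal sketch of generic distance vector ingestion. It is not the RoutingTableV2 implementation: it ignores signed-edge verification and route expiry, and the names `recompute_routes` and `advertised` are hypothetical.

```python
from typing import Dict, Tuple

# Maps a peer id to the number of hops needed to reach it.
DistanceVector = Dict[str, int]


def recompute_routes(local_id: str,
                     advertised: Dict[str, DistanceVector]
                     ) -> Tuple[DistanceVector, Dict[str, str]]:
    """Build the local distance vector and a next-hop table.

    `advertised` maps each directly connected peer to the distance
    vector that peer most recently advertised.
    """
    distances: DistanceVector = {local_id: 0}
    next_hop: Dict[str, str] = {}
    for peer in sorted(advertised):
        # The direct peer itself is one hop away.
        candidates = {peer: 1}
        for target, dist in advertised[peer].items():
            if target in (local_id, peer):
                continue
            # Anything the peer can reach costs one extra hop through it.
            candidates[target] = dist + 1
        for target, dist in candidates.items():
            if target not in distances or dist < distances[target]:
                distances[target] = dist
                next_hop[target] = peer
    return distances, next_hop


# Node A is connected to B and C; D and E are reachable only indirectly.
distances, next_hop = recompute_routes(
    "A",
    {
        "B": {"B": 0, "D": 1, "C": 2},
        "C": {"C": 0, "D": 2, "E": 1},
    },
)
```

In this toy topology A reaches D through B (two hops) and E through C (two hops). If the resulting local vector differs from the previous one, it would then be broadcast to all peers.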
### Architecture ![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png) ### Event Flows - Network Topology Changes - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector - These are triggered by PeerActor and flow into PeerManagerActor then into the demux - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2 - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector - If the local DistanceVector changes, it is then broadcast to all peers - Handle RoutedMessage - Received by the PeerActor, which calls into PeerManagerActor for routing decisions - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache - Select a "next hop" from the RoutingTableView and forward the message - Handle response to a RoutedMessage - Received by the PeerActor, which calls into PeerManagerActor for routing decisions - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message - Connection started - When two nodes A and B connect, each spawns a PeerActor managing the connection - A sends a partially signed edge, which B then signs to produce a complete signed edge - B adds the signed edge to its local routing table, triggering re-computation of routes - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge - Connection stopped - Node A loses connection to some node B (either B stopped running, or the specific connection was broken) - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has - If B is still running, it will go through the same steps described for A - If B is 
not running, the other nodes connected to it will process a disconnection (just like A) ### Configurable Parameters To be finalized after further testing in larger topologies: - Minimum interval between routing table reconstruction: 1 second - Time after which edges are considered expired: 30 minutes - How often to refresh the nonces on edges: 10 minutes - How often to check consistency of routing table's local edges with the connection pool: every 1 minute ### Resources - [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg) - [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion #### Future Extensions - [ ] Set up metrics we want to collect - [ ] Implement a debug-ui view showing contents of the V2 routing table - [ ] Implement pruning of non-validator leafs - [ ] Add handling of unreliable peers - [ ] Deprecate the old RoutingTable - [ ] Deprecate negative/tombstone edges * fix: use logging instead of print statements (near#9277) @frol I went through the related code, found this is the only required edit as we already set up logging services in the nearcore. * refactor: todo to remove flat storage creation parameters (near#9250) Recommend future readers to stop considering these parameters, because heavy flat storage migration already happened on all nodes in the ecosystem. So this case shouldn't complicate work like near#9121. * refactor(loadtest): backwards compatible type hints (near#9323) `list[...]` in type hints only works for python 3.9 and up. For older python versions, we should use `typing.List[...]`. I first thought we should require newer python for locust tests, also using `match` (see near#9125) but it seems we are somewhat dependent on older Ubuntu versions for now. 
At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version. This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5. * feat(state-sync): Add config for number of downloads during catchup (near#9318) We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state. * chore: Update RocksDB to 0.21 (near#9298) This update brings a lot of new changes: - Update to RocksDB 8.1.1 - `io_uring` enabled which can be tested - Added `load_latest` to open RocksDB with the latest options file - and other fixes No degradation was seen using a `perf-state` tool * fix(db-tool): Tool to run DB migrations * fix(db-tool): Tool to run DB migrations * fix(db-tool): Tool to run DB migrations * fmt * fmt * fix(db-tool): Tool to run DB migrations * feat: simple nightshade v2 - shard layout with 5 shards (near#9274) Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 
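To illustrate why one extra boundary account is enough to go from 4 to 5 shards, here is a minimal sketch. It assumes that shards are contiguous, lexicographically ordered ranges of account ids delimited by the boundary accounts; the helper `account_to_shard` and the boundary values are hypothetical and are not the real mainnet boundaries.

```python
from bisect import bisect_right
from typing import List


def account_to_shard(account_id: str, boundary_accounts: List[str]) -> int:
    """Map an account to a shard index given sorted boundary accounts.

    N boundaries split the account namespace into N + 1 shards. An account
    equal to a boundary falls into the shard that starts at that boundary.
    """
    return bisect_right(boundary_accounts, account_id)


# Hypothetical boundaries, for illustration only.
four_shards = ["d.near", "m.near", "t.near"]
# Same structure with one extra boundary account gives five shards.
five_shards = sorted(four_shards + ["p.near"])

shard_before = account_to_shard("r.near", four_shards)
shard_after = account_to_shard("r.near", five_shards)
```

With these made-up boundaries, `r.near` moves from shard 2 to shard 3 because the old shard 2 is split at `p.near`, while accounts below `m.near` keep their shard index.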
test repro instructions: ``` - get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT - generate localnet setup with 4 shards and 1 validator - in the genesis file overwrite: - .epoch_length=10 - .use_production_config=true - .shard_layout=$SHARD_LAYOUT - build neard with nightly not enabled - run neard for at least one epoch - build neard with nightly enabled - run neard - watch resharding happening (only enabled debug logs for "catchup" target) - see new shard layout in the debug page ``` ![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f) resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f * refactor: small refactorings and improvements (near#9296) - Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. - In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. - In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. - Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch. If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that.
- Removed a few variables in tracing spans that were redundant, already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds.
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:
```
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```
(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR)

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
<summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help

ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)
2023-07-12T10:21:46.995873Z INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <[email protected]>
Co-authored-by: Simonas Kazlauskas <[email protected]>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <[email protected]>
Co-authored-by: Anton Puhach <[email protected]>
Co-authored-by: Michal Nazarewicz <[email protected]>
Co-authored-by: Marcelo Diop-Gonzalez <[email protected]>
Co-authored-by: robin-near <[email protected]>
Co-authored-by: Saketh Are <[email protected]>
Co-authored-by: Yasir <[email protected]>
Co-authored-by: Aleksandr Logunov <[email protected]>
Co-authored-by: Razvan Barbascu <[email protected]>
Co-authored-by: Jure Bajic <[email protected]>