events.init.add_listener runs after Users are created on worker #1730

Closed
jbarz1 opened this issue Mar 15, 2021 · 11 comments
Labels
bug stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it

Comments

@jbarz1

jbarz1 commented Mar 15, 2021

Describe the bug

It appears that locust.User tasks are executed before all events.init.add_listener listeners are executed on the worker.

Expected behavior

locust.User tasks should execute only after all events.init.add_listener listeners have run.

Actual behavior

In the worker output below, `Running test` is logged several times before `Ran init`, so tasks start while the init listener is still running.

Steps to reproduce

# start worker
python3 -m locust -f test_locustfile.py --headless --worker --master-host=<ip>
# start master
python3 -m locust -f test_locustfile.py --headless --master --expect-workers=1

# test_locustfile.py
import logging
import time

from locust import User, events, task

log = logging.getLogger(__name__)


@events.init.add_listener
def on_locust_init(environment, **_kwargs):
    # Simulate slow initialization work on the worker.
    time.sleep(2)
    log.info('Ran init')


class CustomUser(User):
    @task
    def test(self):
        log.info('Running test')

Output from worker

Spawning 1 users at the rate 1 users/s (0 users already running)...
All users spawned: CustomUser: 1 (1 total running)
Running test
Running test
...
Running test
Ran init
Starting Locust 1.4.3
Running test
...
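
Until this is fixed in Locust itself, one possible workaround is to make each user wait on a flag that the init listener sets when it finishes. The sketch below is a toy model using only the standard library: `on_init`, `user_task`, and `run_demo` are hypothetical stand-ins, not Locust APIs. In a real locustfile the wait would probably live in `on_start`, and since Locust is gevent-based, `gevent.event.Event` may be the more natural primitive (an assumption, not something tested against 1.4.3).

```python
import threading
import time

init_done = threading.Event()  # set once the init listener has finished
log = []                       # records the order in which things happen


def on_init():
    """Stand-in for a slow @events.init.add_listener handler."""
    time.sleep(0.2)
    log.append("Ran init")
    init_done.set()


def user_task():
    """Stand-in for a User task that must not run before init completes."""
    init_done.wait(timeout=5)  # block until init has finished (or give up)
    log.append("Running test")


def run_demo():
    """Start the 'user' before the init handler, as in the bug report."""
    log.clear()
    init_done.clear()
    user = threading.Thread(target=user_task)
    init = threading.Thread(target=on_init)
    user.start()
    init.start()
    user.join()
    init.join()
    return list(log)


print(run_demo())  # ['Ran init', 'Running test']
```

Even though the user thread starts first, the gate forces `Ran init` to be recorded before `Running test`.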

Environment

  • OS:Linux
  • Python version: 3.7
  • Locust version: 1.4.3
  • Locust command line that you ran: given above
  • Locust file contents (anonymized if necessary): given above
@jbarz1 jbarz1 added the bug label Mar 15, 2021
@max-rocket-internet
Contributor

Seems related to #1718 also?

@cyberw
Collaborator

cyberw commented Mar 16, 2021

Maybe. We should absolutely wait until the init method has finished before starting the users. Anyone up for it?

@jbarz1
Author

jbarz1 commented Mar 17, 2021

I'll see if I can take a crack at it next week. But if someone else wants to attempt a fix, please go ahead 😄

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label May 17, 2021
@cyberw
Collaborator

cyberw commented May 17, 2021

@jbarz1 Have you had any time to look at it?

@cyberw cyberw removed the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label May 17, 2021
@delulu
Contributor

delulu commented May 26, 2021

The tricky part is that the init event is fired after the runner has been initialized, while the initialization of WorkerRunner immediately sends a "client_ready" message to the master, signalling that the worker is ready to execute tasks.

This defeats the purpose of the init event, and there seems to be no good way to wait for all init listeners to complete before tasks start executing.

Maybe we can add another method to decouple sending "client_ready" from the initialization of WorkerRunner.

@cyberw please let me know if this looks good to you, and I can add a fix.

runner = environment.create_worker_runner(options.master_host, options.master_port)

environment.events.init.fire(environment=environment, runner=runner, web_ui=web_ui)

self.client.send(Message("client_ready", None, self.client_id))
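
The ordering problem and the proposed decoupling can be sketched with a toy model. Everything here (`ToyClient`, `ToyWorkerRunner`, `announce_ready`, `start_worker`) is hypothetical and only mimics the three lines quoted above. It is not Locust's actual code, and the eventual fix may look different.

```python
class ToyClient:
    """Records messages instead of sending them to a master."""

    def __init__(self):
        self.sent = []

    def send(self, msg_type):
        self.sent.append(msg_type)


class ToyWorkerRunner:
    def __init__(self, client, timeline, announce_in_init):
        self.client = client
        self.timeline = timeline
        if announce_in_init:
            # Behaviour described in the comment above: ready is announced
            # as part of constructing the runner.
            self.announce_ready()

    def announce_ready(self):
        # Hypothetical separate method, so the caller decides when to announce.
        self.client.send("client_ready")
        self.timeline.append("client_ready sent")


def start_worker(announce_in_init):
    timeline = []
    runner = ToyWorkerRunner(ToyClient(), timeline, announce_in_init)
    timeline.append("init listeners ran")  # stand-in for events.init.fire(...)
    if not announce_in_init:
        runner.announce_ready()  # decoupled: announce only after init
    return timeline


print(start_worker(announce_in_init=True))
print(start_worker(announce_in_init=False))
```

With `announce_in_init=True` the master could start spawning users before the init listeners have run. With the decoupled variant, "client_ready" is only sent once they are done.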

@cyberw
Collaborator

cyberw commented May 26, 2021

Hmm. I can't really tell what makes sense here (without doing a lot of digging), but what you are saying sounds reasonable.

If you do make a PR with a good unit test (one that fails in the current implementation and succeeds in the new one) then I'll definitely review it.

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label Jul 26, 2021
@cyberw cyberw removed the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label Jul 26, 2021
cyberw added a commit that referenced this issue Aug 15, 2021
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label Sep 25, 2021
cyberw added a commit that referenced this issue Sep 25, 2021
@github-actions

github-actions bot commented Oct 5, 2021

This issue was closed because it has been stalled for 10 days with no activity. This does not necessarily mean that the issue is bad, but it most likely means that nobody is willing to take the time to fix it. If you have found Locust useful, then consider contributing a fix yourself!

@github-actions github-actions bot closed this as completed Oct 5, 2021
@AMANSINGHC

Any update on this issue, guys? It would be great if we could fix this. Thanks.

marcelo-gonzalez added a commit to marcelo-gonzalez/nearcore that referenced this issue Jul 14, 2023
…it fns

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).
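
One way to make init functions independent of the order in which they fire is to have each dependent handler call an idempotent base setup itself. The sketch below is a toy model and not necessarily what the referenced commit does: `ensure_base_init`, `base_on_init`, `social_on_init`, `fire`, and the placeholder account value are all hypothetical.

```python
from types import SimpleNamespace


def ensure_base_init(environment):
    """Idempotent base setup; safe to call from any init handler."""
    if not hasattr(environment, "master_funding_account"):
        environment.master_funding_account = "funding-account-placeholder"
    return environment.master_funding_account


def base_on_init(environment, **_kwargs):
    ensure_base_init(environment)


def social_on_init(environment, **_kwargs):
    # Does not assume base_on_init already ran; asks for the setup explicitly.
    funding_account = ensure_base_init(environment)
    environment.social_ready_with = funding_account


def fire(handlers, environment):
    """Toy event hook: calls handlers in whatever order it was given."""
    for handler in handlers:
        handler(environment=environment)
    return environment


# Either firing order leaves the environment in the same state.
env_a = fire([base_on_init, social_on_init], SimpleNamespace())
env_b = fire([social_on_init, base_on_init], SimpleNamespace())
print(env_a.social_ready_with == env_b.social_ready_with)  # True
```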
marcelo-gonzalez added a commit to near/nearcore that referenced this issue Jul 17, 2023
…it fns (#9313)

nikurt pushed a commit to near/nearcore that referenced this issue Jul 20, 2023
…it fns (#9313)

near-bulldozer bot added a commit to near/nearcore that referenced this issue Jul 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)


* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next. 

The fix is to check what is the protocol version supported in the binary and depending on it reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
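
The core idea of ingesting peers' DistanceVectors and deriving a local one can be sketched as a textbook distance-vector update. This is a toy illustration only: `next_hops` and `local_vector` are hypothetical names, distances are plain hop counts, and signed edges, nonces, expiry, and batching from the flows above are all ignored.

```python
def next_hops(local_id, peer_vectors):
    """Build a routing table from the vectors advertised by direct peers.

    peer_vectors maps peer_id -> {target_id: hops from that peer}.
    Returns {target_id: (next_hop_peer, total_hops)}.
    """
    table = {}
    for peer, vector in sorted(peer_vectors.items()):
        for target, hops in vector.items():
            if target == local_id:
                continue  # no route needed to ourselves
            candidate = hops + 1  # one extra hop to reach the peer
            best = table.get(target)
            if best is None or candidate < best[1]:
                table[target] = (peer, candidate)
    return table


def local_vector(local_id, table):
    """The vector this node would broadcast to its peers."""
    vector = {local_id: 0}
    for target, (_, hops) in table.items():
        vector[target] = hops
    return vector


# Node A is directly connected to B and C; D is reachable through either.
peers = {
    "B": {"B": 0, "C": 1, "D": 2},
    "C": {"C": 0, "B": 1, "D": 1},
}
table = next_hops("A", peers)
print(table)
print(local_vector("A", table))
```

In this example A routes to D via C, because C advertises the shorter distance. A node would rebroadcast its vector only when `local_vector` changes, matching the "if the local DistanceVector changes" step above.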

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found that this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Recommends that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.
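
A minimal illustration of the difference, with a hypothetical `total_balance` function that is not taken from the loadtest code: the `typing` generics below work on older interpreters, whereas writing `list[dict[str, int]]` in the signature raises a `TypeError` at definition time before Python 3.9.

```python
from typing import Dict, List


def total_balance(accounts: List[Dict[str, int]]) -> int:
    """Sum the 'balance' field of each account record."""
    return sum(account["balance"] for account in accounts)


print(total_balance([{"balance": 3}, {"balance": 4}]))  # 7
```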

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fmt

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for (it could be "download"), but either way it's very confusing. I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions for how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
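The "received a PeerMessage with new DistanceVector" step above can be pictured with a generic distance-vector update. This is only a sketch under simplifying assumptions: the names are hypothetical, it keeps a single best route per target, and it ignores signed edges, nonces and route expiry, so it should not be read as the RoutingTableV2 implementation.

```python
from typing import Dict, Tuple

# target peer id -> (distance in hops, next hop peer id)
Routes = Dict[str, Tuple[int, str]]


def merge_distance_vector(local: Routes, neighbor: str,
                          advertised: Dict[str, int], me: str) -> bool:
    """Merge the distances advertised by a direct `neighbor` into `local`.

    Returns True if `local` changed, which is the point at which a node
    would rebroadcast its own distance vector to its peers.
    """
    changed = False
    for target, dist in advertised.items():
        if target == me:
            continue  # never route to ourselves through a peer
        candidate = dist + 1  # one extra hop to reach the neighbor
        current = local.get(target)
        via_same_neighbor = current is not None and current[1] == neighbor
        if (current is None or candidate < current[0]
                or (via_same_neighbor and candidate != current[0])):
            local[target] = (candidate, neighbor)
            changed = True
    return changed
```

Returning a "changed" flag mirrors the flow described above, where the local DistanceVector is broadcast only when it actually changes.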

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* Merge

* fmt

---------

Co-authored-by: wacban <[email protected]>
Co-authored-by: Simonas Kazlauskas <[email protected]>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <[email protected]>
Co-authored-by: Anton Puhach <[email protected]>
Co-authored-by: Michal Nazarewicz <[email protected]>
Co-authored-by: Marcelo Diop-Gonzalez <[email protected]>
Co-authored-by: robin-near <[email protected]>
Co-authored-by: Saketh Are <[email protected]>
Co-authored-by: Yasir <[email protected]>
Co-authored-by: Aleksandr Logunov <[email protected]>
Co-authored-by: Razvan Barbascu <[email protected]>
Co-authored-by: Jure Bajic <[email protected]>
nikurt pushed a commit to nikurt/nearcore that referenced this issue Jul 26, 2023
…it fns (near#9313)

The base `on_locust_init()` function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before `on_locust_init()` runs, but maybe for similar
reasons, we can't guarantee the order in which each `on_locust_init()`
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).
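One generic way to make dependent init functions robust to listener ordering is to have each of them call an idempotent base init first. The sketch below illustrates that idea only; it is not necessarily how near#9313 solves it. The handler names, the placeholder account value, and the `SimpleNamespace` standing in for locust's `Environment` are all assumptions made for the example.

```python
import threading
from types import SimpleNamespace

_base_init_lock = threading.Lock()


def base_on_locust_init(environment, **kwargs):
    """Base init: sets the attribute that other init functions depend on."""
    with _base_init_lock:
        # Idempotent: running twice must not redo the (possibly slow) setup.
        if getattr(environment, "master_funding_account", None) is None:
            environment.master_funding_account = "funding.test"  # placeholder


def social_on_locust_init(environment, **kwargs):
    """Dependent init: ensures the base init ran, whatever order handlers fire in."""
    base_on_locust_init(environment)
    funding_account = environment.master_funding_account
    environment.social_ready = funding_account is not None


if __name__ == "__main__":
    env = SimpleNamespace()  # stand-in for locust's Environment object
    # Fire the handlers in the "wrong" order on purpose.
    for handler in (social_on_locust_init, base_on_locust_init):
        handler(environment=env)
    print(env.master_funding_account, env.social_ready)
```

The lock only matters if the base init can be entered concurrently; with a single-threaded event loop the idempotency check alone may be enough.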
nikurt added a commit to nikurt/nearcore that referenced this issue Jul 26, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for (it could be "download"), but either way it's very confusing. I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions for how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop the dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

The base `on_locust_init()` function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before `on_locust_init()` runs, but maybe for similar
reasons, we can't guarantee the order in which each `on_locust_init()`
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) failed on macOS because there `multiprocessing.Process` uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.
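A minimal sketch of the spawn-friendly pattern described in the two points above (the function names are made up for illustration and are not the nayduck proxy code): the entry point is guarded, and the shared value is passed to the child as an argument instead of being read from a module-level global.

```python
import multiprocessing


def worker(counter, increments):
    """Child entry point: receives the shared value explicitly as an argument."""
    for _ in range(increments):
        with counter.get_lock():
            counter.value += 1


def main():
    # Force "spawn" so the script behaves the same on Linux as on macOS/Windows.
    ctx = multiprocessing.get_context("spawn")
    counter = ctx.Value("i", 0)
    proc = ctx.Process(target=worker, args=(counter, 5))
    proc.start()
    proc.join()
    print("counter:", counter.value)


if __name__ == "__main__":
    # Without this guard, a spawned child re-imports the module and would
    # re-run the top-level code, including starting further processes.
    main()
```

Under fork the child inherits the parent's memory, which is why relying on a global appeared to work there; under spawn the child starts from a fresh interpreter and only sees what is passed to it.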

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next.

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
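
The "Network Topology Changes" flow above can be illustrated with a minimal Python sketch of a generic distance-vector update. This is not the RoutingTableV2 implementation: the function names and data shapes are hypothetical, and signed-edge verification, route expiry and batching through the demux are all left out.

```python
from typing import Dict, Optional, Tuple

# destination -> (distance in hops, next-hop peer); illustrative types only.
Routes = Dict[str, Tuple[int, Optional[str]]]


def recompute_routes(local_id: str,
                     peer_vectors: Dict[str, Dict[str, int]]) -> Routes:
    """Rebuild local routes from the vectors advertised by direct peers."""
    routes: Routes = {local_id: (0, None)}
    for peer, vector in peer_vectors.items():
        for dest, dist in vector.items():
            candidate = dist + 1  # one extra hop to reach the peer itself
            if dest not in routes or candidate < routes[dest][0]:
                routes[dest] = (candidate, peer)
    return routes


def local_distance_vector(routes: Routes) -> Dict[str, int]:
    """Drop the next-hop column; this is what would be advertised to peers."""
    return {dest: dist for dest, (dist, _) in routes.items()}


def needs_broadcast(old: Dict[str, int], new: Dict[str, int]) -> bool:
    """Broadcast only if the local distance vector actually changed."""
    return old != new


# Node A is connected to B and D; each peer advertises itself at distance 0.
peer_vectors = {
    "B": {"B": 0, "C": 1},
    "D": {"D": 0, "C": 2, "A": 1},
}
routes = recompute_routes("A", peer_vectors)
vector = local_distance_vector(routes)
```

Here the "next hop" for C is B, because the path through B is shorter than the one through D.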

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only edit required, since logging services are already set up in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommend that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. This case therefore shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for Python 3.9 and up.
For older Python versions, we should use `typing.List[...]`.

I first thought we should require a newer Python for locust tests, also using `match` (see near#9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least, I've been checking out code on GCP machines created by Terraform templates and needed to patch the type hints to get the code running without installing a new Python version.

This PR makes the code fully backward compatible again by simply using the `typing` module, which has been available since Python 3.5.
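
For illustration, a minimal sketch of the backward-compatible style. The `chunk` helper is made up for this example and is not taken from the loadtest code.

```python
from typing import List


def chunk(items: List[int], size: int) -> List[List[int]]:
    """Split items into consecutive groups of at most `size` elements.

    Annotating with `typing.List[...]` keeps this importable on Python
    versions older than 3.9, where `list[int]` in a signature raises
    a TypeError when the function is defined.
    """
    return [items[i:i + size] for i in range(0, len(items), size)]


groups = chunk([1, 2, 3, 4, 5], 2)
```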

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` is enabled and can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced a new protocol version called SimpleNightshadeV2, guarded it behind the Rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding SimpleNightshadeV2 behind the Rust feature; I'm not planning on adding the feature guard everywhere. I'm reusing the same ShardLayout::V1 structure, just with a bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new Rust feature.

I tested it manually and some sort of resharding did happen. I have yet to fully understand what exactly happened and whether it's any good, and to add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowed `stable_hash` to be unused. It's only unused on macOS, so we need to keep it, but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 
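
As a rough Python analogue of the time format shown above (the PR itself changes the Rust test formatter, which is not reproduced here), a formatter that prints seconds and milliseconds since a start time could look like the sketch below. `ElapsedFormatter` and `format_line` are hypothetical names.

```python
import logging


class ElapsedFormatter(logging.Formatter):
    """Prefix each record with seconds.milliseconds since `start_time`."""

    def __init__(self, start_time: float):
        super().__init__("%(elapsed)s %(levelname)s %(name)s: %(message)s")
        self.start_time = start_time

    def format(self, record: logging.LogRecord) -> str:
        # record.created is the record's creation time as a Unix timestamp.
        record.elapsed = f"{record.created - self.start_time:6.3f}s"
        return super().format(record)


def format_line(start_time: float, created: float, level: int,
                name: str, message: str) -> str:
    """Format one line with a fixed creation time (handy for testing)."""
    record = logging.LogRecord(name, level, pathname="", lineno=0,
                               msg=message, args=(), exc_info=None)
    record.created = created
    return ElapsedFormatter(start_time).format(record)


line = format_line(100.0, 101.075, logging.DEBUG, "runtime", "close")
```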

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <[email protected]>
Co-authored-by: Simonas Kazlauskas <[email protected]>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <[email protected]>
Co-authored-by: Anton Puhach <[email protected]>
Co-authored-by: Michal Nazarewicz <[email protected]>
Co-authored-by: Marcelo Diop-Gonzalez <[email protected]>
Co-authored-by: robin-near <[email protected]>
Co-authored-by: Saketh Are <[email protected]>
Co-authored-by: Yasir <[email protected]>
Co-authored-by: Aleksandr Logunov <[email protected]>
Co-authored-by: Razvan Barbascu <[email protected]>
Co-authored-by: Jure Bajic <[email protected]>
nikurt pushed a commit to near/nearcore that referenced this issue Aug 24, 2023
…it fns (#9313)

The base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).
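
One way to make a dependent init function wait for the base one is a shared event flag. The sketch below uses only the standard library and plain threads. The `Environment` class, the handler names and the placeholder account value are assumptions for illustration, and this is not necessarily how the commit implements the wait.

```python
import threading


class Environment:
    """Hypothetical stand-in for locust's Environment object."""


# Set once the base init handler has finished its work.
BASE_INIT_DONE = threading.Event()


def base_on_locust_init(environment, **kwargs):
    # The base handler provides state that the other handlers depend on.
    environment.master_funding_account = "funding-account"  # placeholder value
    BASE_INIT_DONE.set()


def social_on_locust_init(environment, **kwargs):
    # Wait for the base handler instead of assuming it already ran.
    if not BASE_INIT_DONE.wait(timeout=5):
        raise RuntimeError("base on_locust_init() did not finish in time")
    environment.social_funding_account = environment.master_funding_account


def run_handlers_out_of_order():
    """Start the dependent handler first to mimic the unlucky ordering."""
    environment = Environment()
    dependent = threading.Thread(target=social_on_locust_init,
                                 args=(environment,))
    dependent.start()
    base_on_locust_init(environment)
    dependent.join()
    return environment


env = run_handlers_out_of_order()
```

Waiting like this only helps when the handlers can run concurrently. If they are called one after another in the same thread, a dependent handler that runs first would just hit the timeout.
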
nikurt added a commit to near/nearcore that referenced this issue Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced a new protocol version called SimpleNightshadeV2, guarded it behind the Rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding SimpleNightshadeV2 behind the Rust feature; I'm not planning on adding the feature guard everywhere. I'm reusing the same ShardLayout::V1 structure, just with a bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new Rust feature.

I tested it manually and some sort of resharding did happen. I have yet to fully understand what exactly happened and whether it's any good, and to add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowed `stable_hash` to be unused. It's only unused on macOS, so we need to keep it, but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows dropping the dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237. No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

The base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run. This doesn't happen every time, and so far it has
been triggered on Linux but apparently not on macOS. Still, this makes
it something of a blocker for setting this test up on cloud VMs (where
this bug has been observed).
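
One way to make init functions independent of firing order, sketched here without Locust itself: every dependent init function first calls an idempotent helper that performs the base initialization if it has not run yet. The names `ensure_base_init`, `on_social_init`, and `fire`, as well as the placeholder account value, are hypothetical, and this is not necessarily how the fix in #9313 is implemented.

```python
from types import SimpleNamespace


def ensure_base_init(environment):
    """Run the base initialization at most once per environment."""
    if not hasattr(environment, "master_funding_account"):
        # Placeholder value (assumption); real code would set up an account.
        environment.master_funding_account = "funding-account-placeholder"
    return environment.master_funding_account


def on_locust_init(environment, **kwargs):
    # Base init handler.
    ensure_base_init(environment)


def on_social_init(environment, **kwargs):
    # Dependent handler: works no matter which handler fires first.
    funding_account = ensure_base_init(environment)
    environment.social_funding_account = funding_account


def fire(handlers, **kwargs):
    # Stand-in for an event hook that calls its handlers in some order.
    for handler in handlers:
        handler(**kwargs)


env = SimpleNamespace()
# Deliberately fire the dependent handler before the base one.
fire([on_social_init, on_locust_init], environment=env)
print(env.social_funding_account)
```

Because the helper checks for the attribute before setting it, running the base handler afterwards leaves the already initialized value untouched.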

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.
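
The two points above can be illustrated with a minimal, spawn-friendly sketch. It is not taken from the nayduck tests; `add_to_counter` and `run_child` are hypothetical names. The shared value is passed to the child as an argument instead of being read from a global, and the entry point is guarded.

```python
import multiprocessing


def add_to_counter(counter, amount):
    """Child-process entry point; the shared value arrives as an argument."""
    with counter.get_lock():
        counter.value += amount


def run_child(amount):
    # Request "spawn" explicitly so the behaviour is the same on Linux and macOS.
    ctx = multiprocessing.get_context("spawn")
    counter = ctx.Value("i", 0)
    child = ctx.Process(target=add_to_counter, args=(counter, amount))
    child.start()
    child.join()
    return counter.value


if __name__ == "__main__":
    # Without this guard, a spawned child would re-run this code when it
    # imports the main module.
    print(run_child(5))
```

With the fork start method a global would have been inherited by the child; with spawn the child starts from a fresh interpreter, so anything it needs has to be passed in explicitly.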

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next.

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.
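
The selection logic can be sketched as follows. This is only an illustration: `resharding_step` is a hypothetical helper, and the protocol version at which the V2 layout becomes available is passed in as a parameter rather than taken from the real test.

```python
def resharding_step(binary_protocol_version, v2_protocol_version):
    """Return (from_layout_version, to_layout_version) for the test.

    Resharding only works between adjacent layout versions, so the test
    picks the last step that the binary under test supports.
    """
    if binary_protocol_version >= v2_protocol_version:
        return (1, 2)
    return (0, 1)


# The threshold 100 is a made-up value used only for this example.
print(resharding_step(99, 100))
print(resharding_step(100, 100))
```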

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
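
The route computation in the event flows above can be illustrated with a textbook distance-vector merge. This is a minimal Python sketch, not the RoutingTableV2 implementation: it ignores signed edges, edge expiry, and batching, and `merge_distance_vectors` and the node names are hypothetical.

```python
def merge_distance_vectors(local_id, peer_vectors):
    """Compute hop counts and next hops from direct peers' advertised vectors.

    peer_vectors maps each directly connected peer id to that peer's
    advertised {destination: hop_count} table.
    """
    distances = {local_id: 0}
    next_hops = {}
    for peer_id, vector in peer_vectors.items():
        for dest, peer_distance in vector.items():
            if dest == local_id:
                continue
            # Reaching dest through this peer costs one extra hop.
            candidate = peer_distance + 1
            if dest not in distances or candidate < distances[dest]:
                distances[dest] = candidate
                next_hops[dest] = peer_id
    return distances, next_hops


peers = {
    "B": {"B": 0, "C": 1},
    "D": {"D": 0, "C": 3},
}
distances, next_hops = merge_distance_vectors("A", peers)
print(distances)  # {'A': 0, 'B': 1, 'C': 2, 'D': 1}
print(next_hops)  # {'B': 'B', 'C': 'B', 'D': 'D'}
```

The resulting `distances` table plays the role of the local DistanceVector that would be broadcast to peers when it changes, and `next_hops` corresponds to the "next hop" selection used when forwarding a routed message.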

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Recommends that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for Python 3.9 and up.
For older Python versions, we should use `typing.List[...]`.

I first thought we should require a newer Python for locust tests and also use `match` (see #9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on GCP machines created by terraform templates and needed to patch the type hints to get the code running without installing a new Python version.

This PR makes the code fully backward compatible again by simply using the `typing` module, which has been available since Python 3.5.
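
A small illustration of the backward-compatible style, using made-up helper functions (`account_ids` and `group_round_robin` are not from the loadtest code):

```python
from typing import Dict, List


def account_ids(prefix: str, count: int) -> List[str]:
    # `List[str]` works on older interpreters where `list[str]` is rejected.
    return [prefix + str(i) for i in range(count)]


def group_round_robin(accounts: List[str], num_groups: int) -> Dict[int, List[str]]:
    # Distribute accounts over groups by their position in the list.
    groups: Dict[int, List[str]] = {}
    for index, account in enumerate(accounts):
        groups.setdefault(index % num_groups, []).append(account)
    return groups


print(group_round_robin(account_ids("user", 3), 2))
```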

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `state-perf` tool.

* fix(db-tool): Tool to run DB migrations

* fmt

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for (it could be download), but either way it's very confusing. I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* Merge

* fmt

---------

Co-authored-by: wacban <[email protected]>
Co-authored-by: Simonas Kazlauskas <[email protected]>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <[email protected]>
Co-authored-by: Anton Puhach <[email protected]>
Co-authored-by: Michal Nazarewicz <[email protected]>
Co-authored-by: Marcelo Diop-Gonzalez <[email protected]>
Co-authored-by: robin-near <[email protected]>
Co-authored-by: Saketh Are <[email protected]>
Co-authored-by: Yasir <[email protected]>
Co-authored-by: Aleksandr Logunov <[email protected]>
Co-authored-by: Razvan Barbascu <[email protected]>
Co-authored-by: Jure Bajic <[email protected]>
nikurt pushed a commit to nikurt/nearcore that referenced this issue Aug 24, 2023
…it fns (near#9313)

nikurt added a commit to nikurt/nearcore that referenced this issue Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for (it could be download), but either way it's very confusing. I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to better suggestions on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop the dependencies on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported. (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability of RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 
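
To illustrate the idea of a time formatter that prints only the seconds and milliseconds elapsed since the test started, here is a minimal Python sketch built on the standard `logging` module. It is an analogy only: the actual change is in the Rust tracing setup, and the class name `ElapsedFormatter` is hypothetical.

```python
import logging
import time


class ElapsedFormatter(logging.Formatter):
    """Prefix each record with the time elapsed since `start`, e.g. ' 1.075s'."""

    def __init__(self, start=None):
        super().__init__("%(elapsed)s %(levelname)s %(name)s: %(message)s")
        # Assumption: "test start" is the moment the formatter is created,
        # unless an explicit start timestamp is passed in.
        self.start = time.time() if start is None else start

    def format(self, record):
        # record.created is the wall-clock time at which the record was made.
        record.elapsed = f"{record.created - self.start:6.3f}s"
        return super().format(record)


handler = logging.StreamHandler()
handler.setFormatter(ElapsedFormatter())
logger = logging.getLogger("runtime")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.debug("is next_block_epoch_start false")
```

Dropping the date and the nanoseconds shortens every line by a fixed amount, which is the main point of the change described above.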

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

The base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account'
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, so maybe, for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run. This doesn't seem to happen every time, and it
hasn't really been triggered on macOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed).
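
One way to make dependent init functions robust against listener ordering
is to have each of them call an idempotent base initializer first. The
sketch below is not the code from this commit and does not use locust
itself: `InitHook`, `Environment`, `base_init` and the account name are
hypothetical stand-ins used only to show the pattern.

```python
class InitHook:
    """Hypothetical stand-in for an init event; listeners may fire in any order."""

    def __init__(self):
        self._handlers = []

    def add_listener(self, handler):
        self._handlers.append(handler)
        return handler

    def fire(self, *, reverse=False, **kwargs):
        handlers = reversed(self._handlers) if reverse else self._handlers
        for handler in handlers:
            handler(**kwargs)


class Environment:
    """Hypothetical stand-in for the shared test environment object."""


def base_init(environment):
    # Idempotent: safe to call from every listener that needs the attribute.
    if hasattr(environment, "master_funding_account"):
        return
    environment.master_funding_account = "funding.test"  # placeholder value


init = InitHook()


@init.add_listener
def on_base_init(environment, **kwargs):
    base_init(environment)


@init.add_listener
def on_social_init(environment, **kwargs):
    base_init(environment)  # do not rely on listener order
    environment.social_funding = environment.master_funding_account


env = Environment()
init.fire(reverse=True, environment=env)  # worst case: dependent listener first
print(env.social_funding)
```

With this pattern the dependent listener no longer raises `AttributeError`,
whichever listener happens to run first.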

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

There is no reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`.

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) failed on Mac because, on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.
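
The two points above can be sketched with a small, generic Python example. It is not taken from the nayduck tests: the `worker` and `run` names are hypothetical, and the shared counter stands in for whatever shared memory the real tests use.

```python
import multiprocessing


def worker(counter, increments):
    # The shared value arrives as an argument, so this works with both the
    # fork and the spawn start methods (no reliance on inherited globals).
    for _ in range(increments):
        with counter.get_lock():
            counter.value += 1


def run(increments=5):
    counter = multiprocessing.Value("i", 0)
    proc = multiprocessing.Process(target=worker, args=(counter, increments))
    proc.start()
    proc.join()
    return counter.value


# Guard the entry point: with spawn, the child re-imports this module, and
# unguarded top-level code would run again inside the child.
if __name__ == "__main__":
    print(run())
```

Passing the shared object through `args` keeps the data flow explicit, and the guard keeps the child from re-executing the parent's startup code.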

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from V0 shard layout to V2 shard layout. This doesn't work because ShardLayout contains shard split map that only makes sense when resharding from a shard layout version to the immediate next. 

The fix is to check which protocol version the binary supports and, depending on it, reshard either from V0 to V1 or from V1 to V2.
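
A minimal sketch of that decision, assuming the test can query the protocol version supported by the binary. The function name, its parameters and the version numbers in the usage lines are hypothetical and are not the values used in state_sync_fail.py.

```python
from typing import Tuple


def pick_resharding_step(binary_protocol_version: int,
                         v2_protocol_version: int) -> Tuple[int, int]:
    """Return (from_layout_version, to_layout_version) for the test.

    Resharding only works between a layout version and the immediate next
    one, so the test never attempts to jump from V0 straight to V2.
    """
    if binary_protocol_version >= v2_protocol_version:
        return (1, 2)
    return (0, 1)


# Hypothetical version numbers, for illustration only.
print(pick_resharding_step(binary_protocol_version=99, v2_protocol_version=100))
print(pick_resharding_step(binary_protocol_version=100, v2_protocol_version=100))
```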

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
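
The route recomputation step in the flows above can be sketched as a textbook distance-vector update. This is a simplified illustration in Python, not the RoutingTableV2 implementation: it ignores signed edges, nonces and expiry, and the names `recompute_routes` and `should_broadcast` are hypothetical.

```python
from typing import Dict, Tuple

DistanceVector = Dict[str, int]  # destination node -> number of hops


def recompute_routes(
    local_id: str, peer_vectors: Dict[str, DistanceVector]
) -> Tuple[DistanceVector, Dict[str, str]]:
    """Build the local distance vector and a next-hop table.

    peer_vectors maps each directly connected peer to the vector it last
    advertised (each peer lists itself at distance 0).
    """
    distances: DistanceVector = {local_id: 0}
    next_hops: Dict[str, str] = {}
    for peer in sorted(peer_vectors):
        for dest, dist in peer_vectors[peer].items():
            if dest == local_id:
                continue
            candidate = dist + 1  # one extra hop to reach the peer
            if dest not in distances or candidate < distances[dest]:
                distances[dest] = candidate
                next_hops[dest] = peer
    return distances, next_hops


def should_broadcast(old: DistanceVector, new: DistanceVector) -> bool:
    # Only advertise when the local vector actually changed.
    return old != new


peers = {
    "B": {"B": 0, "C": 1, "D": 2},
    "C": {"C": 0, "D": 1},
}
distances, next_hops = recompute_routes("A", peers)
print(distances, next_hops)
```

When a peer disconnects, dropping its entry from `peer_vectors` and recomputing gives the updated vector to broadcast.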

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute
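
For readers who prefer to see the parameters as data, here is a small Python sketch that groups the four values listed above and shows how the edge expiry could be checked. The field names and the `is_edge_expired` helper are hypothetical; only the durations come from the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RoutingConfig:
    # Durations as listed above; field names are illustrative only.
    min_rebuild_interval: timedelta = timedelta(seconds=1)
    edge_expiry: timedelta = timedelta(minutes=30)
    nonce_refresh_interval: timedelta = timedelta(minutes=10)
    local_edge_check_interval: timedelta = timedelta(minutes=1)


def is_edge_expired(edge_created_at: datetime, now: datetime,
                    config: RoutingConfig = RoutingConfig()) -> bool:
    """An edge older than the expiry window no longer supports any route."""
    return now - edge_created_at > config.edge_expiry


created = datetime(2023, 7, 1, 12, 0)
print(is_edge_expired(created, created + timedelta(minutes=29)))
print(is_edge_expired(created, created + timedelta(minutes=31)))
```

Refreshing nonces every 10 minutes leaves a wide margin before the 30-minute expiry, so a live edge should not expire between refreshes.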

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found that this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Add a TODO recommending that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. This case therefore shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see near#9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.
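
A short illustration of the difference, with a hypothetical helper that is not part of the loadtest code:

```python
from typing import Dict, List

# On Python 3.8 and older, writing `list[str]` or `dict[str, int]` in a
# signature raises TypeError when the function is defined, so the
# `typing` aliases are used instead.


def total_balance(accounts: List[str], balances: Dict[str, int]) -> int:
    """Sum the balances of the given accounts; unknown accounts count as 0."""
    return sum(balances.get(account, 0) for account in accounts)


print(total_balance(["a", "b", "c"], {"a": 1, "b": 2}))
```

The `typing` forms keep working on newer Python versions as well, so no version check is needed.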

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response to that peer, back toward the originator of the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
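
The topology-change and routed-message flows above can be illustrated with a small toy model. This is a minimal sketch of textbook distance-vector routing, not the actual RoutingTableV2 code: `ToyRouter`, its fields, and its method names are all hypothetical, and it leaves out signed edges, the EdgeCache, and edge expiry entirely. It assumes each peer advertises a map from destination to hop count that includes itself at distance 0.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Hypothetical, simplified model of one node (not nearcore's types)."""

    me: str
    # Latest distance vector advertised by each directly connected peer:
    # peer -> {destination -> hop count as seen from that peer}.
    peer_vectors: dict = field(default_factory=dict)
    # Routing table view: destination -> (next hop, hop count from this node).
    view: dict = field(default_factory=dict)
    # Route-back cache: message id -> peer the message arrived from.
    route_back: dict = field(default_factory=dict)

    def local_vector(self):
        """Distances this node would advertise to its peers."""
        return {dest: hops for dest, (_, hops) in self.view.items()}

    def apply_batch(self, updates):
        """Apply a batch of (peer, vector) events, then rebuild the view.

        A vector of None models a disconnect. Returns True when the local
        distance vector changed, i.e. when it would need to be broadcast.
        """
        before = self.local_vector()
        for peer, vector in updates:
            if vector is None:
                self.peer_vectors.pop(peer, None)
            else:
                self.peer_vectors[peer] = dict(vector)
        view = {}
        for peer in sorted(self.peer_vectors):  # sorted for deterministic ties
            for dest, hops in self.peer_vectors[peer].items():
                if dest == self.me:
                    continue
                candidate = hops + 1  # one extra hop to reach the peer
                if dest not in view or candidate < view[dest][1]:
                    view[dest] = (peer, candidate)
        self.view = view
        return self.local_vector() != before

    def forward(self, msg_id, target, came_from):
        """Record the previous hop, then pick the next hop for the target."""
        self.route_back[msg_id] = came_from
        return self.view[target][0]

    def relay_response(self, msg_id):
        """Return the previous hop recorded for the original message."""
        return self.route_back.pop(msg_id)


router = ToyRouter(me="A")
changed = router.apply_batch([("B", {"B": 0, "C": 1}), ("D", {"D": 0, "C": 3})])
next_hop = router.forward("m1", target="C", came_from="D")
reply_to = router.relay_response("m1")
```

In this run, A learns that C is two hops away via B, forwards the message to B, and later relays the response to D, the peer it recorded as the previous hop.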

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute
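
For illustration, the four values above could be grouped into a single settings object. This is only a sketch: the class name `RoutingTimings`, its field names, and the helper method are hypothetical and do not correspond to actual nearcore config keys.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RoutingTimings:
    """Hypothetical grouping of the tentative values listed above."""

    # Minimum interval between routing table reconstructions.
    min_rebuild_interval: timedelta = timedelta(seconds=1)
    # Age after which an edge is considered expired.
    edge_expiry: timedelta = timedelta(minutes=30)
    # How often nonces on edges are refreshed.
    nonce_refresh_interval: timedelta = timedelta(minutes=10)
    # How often local edges are checked against the connection pool.
    local_edge_check_interval: timedelta = timedelta(minutes=1)

    def is_edge_expired(self, age: timedelta) -> bool:
        """True when an edge of the given age has passed the expiry threshold."""
        return age > self.edge_expiry


timings = RoutingTimings()
# An edge refreshed on schedule stays well below the expiry threshold.
fresh = timings.is_edge_expired(timings.nonce_refresh_interval)
stale = timings.is_edge_expired(timedelta(minutes=31))
```

With these values the nonce refresh interval is one third of the expiry time, so an edge backed by a live connection should be refreshed well before it would expire.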

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
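
The general idea of capping concurrent downloads can be sketched as follows. This is an illustration of the technique only, not the state sync implementation: `download_parts`, `fake_fetch`, and the `max_concurrent_downloads` parameter are hypothetical names, and a thread pool stands in for whatever mechanism the node actually uses.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def download_parts(part_ids, fetch, max_concurrent_downloads):
    """Fetch every part while running at most the configured number at once."""
    with ThreadPoolExecutor(max_workers=max_concurrent_downloads) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(fetch, part_ids))


# Instrumented stand-in for a real download, used to observe peak concurrency.
lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def fake_fetch(part_id):
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # simulate network latency
    with lock:
        stats["active"] -= 1
    return f"part-{part_id}"


parts = download_parts(range(6), fake_fetch, max_concurrent_downloads=2)
```

Lowering the limit makes the download take longer but leaves more capacity for other work, which is the trade-off the new config exposes.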

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <[email protected]>
Co-authored-by: Simonas Kazlauskas <[email protected]>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <[email protected]>
Co-authored-by: Anton Puhach <[email protected]>
Co-authored-by: Michal Nazarewicz <[email protected]>
Co-authored-by: Marcelo Diop-Gonzalez <[email protected]>
Co-authored-by: robin-near <[email protected]>
Co-authored-by: Saketh Are <[email protected]>
Co-authored-by: Yasir <[email protected]>
Co-authored-by: Aleksandr Logunov <[email protected]>
Co-authored-by: Razvan Barbascu <[email protected]>
Co-authored-by: Jure Bajic <[email protected]>