Weight concurrent streams by stake #25993
Conversation
Weight concurrent streams by stake for staked nodes. Fixed some compilation issues due to the merge.
Codecov Report
@@            Coverage Diff             @@
##           master   #25993     +/-   ##
=========================================
- Coverage    82.1%    82.1%    -0.1%
=========================================
  Files         628      631       +3
  Lines      171471   173734    +2263
=========================================
+ Hits       140878   142650    +1772
- Misses      30593    31084     +491
The changes themselves look great, but I'm very curious about how these numbers were chosen.
@@ -21,6 +25,8 @@ use {
     tokio::{task::JoinHandle, time::timeout},
 };

+const QUIC_TOTAL_STAKED_CONCURRENT_STREAMS: f64 = 100_000f64;
How was this number chosen? With 383.3M SOL staked on mainnet, that means every 3833 SOL staked gives you a stream, which means that any node with less than that amount staked gets 0 streams and counts as unstaked. According to Solana Beach, on mainnet, the 1776th staked node has just about enough stake for one stream, which means that ~90% of validators on the network get a stream (including a long tail of delinquent or totally unstaked validators).
So this number might even be a little too high, unless we know that a validator can handle a total of 100k concurrent streams.
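To make the threshold concrete, here is a minimal sketch of the stake-to-streams mapping being discussed, with stake expressed in SOL for readability; the constant comes from the diff above, while the helper name and the example figures are only illustrative, not code from this PR.

    const QUIC_TOTAL_STAKED_CONCURRENT_STREAMS: f64 = 100_000f64;

    // Streams granted to a connection, proportional to its share of total stake.
    fn streams_for_stake(stake: u64, total_stake: u64) -> u64 {
        ((stake as f64 / total_stake as f64) * QUIC_TOTAL_STAKED_CONCURRENT_STREAMS) as u64
    }

    fn main() {
        let total = 383_300_000; // ~383.3M SOL staked on mainnet, per the comment above
        println!("{}", streams_for_stake(4_000, total)); // 1: just above the ~3833 SOL threshold
        println!("{}", streams_for_stake(3_000, total)); // 0: below the threshold, treated as unstaked
    }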
100k streams * 1280 bytes per packet would be ~128MB and ~81MB for unstaked (128 * 500 * 1280) so that sounds pretty conservative in terms of packet memory use. Each stream and connection has other additional overhead as well though.
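As a rough sanity check of those figures, here is a sketch using the numbers from this thread; the 1280-byte packet size, the 100k/128 stream limits, and the 500 unstaked connections are all taken from the comments above rather than verified against the code.

    fn main() {
        let packet_size = 1_280u64;      // bytes per packet
        let staked_streams = 100_000u64; // QUIC_TOTAL_STAKED_CONCURRENT_STREAMS
        let unstaked_streams = 128u64;   // QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS
        let unstaked_conns = 500u64;     // assumed unstaked connection count from the comment

        // ~128 MB for staked streams, ~81 MB for unstaked streams
        println!("staked:   ~{} MB", staked_streams * packet_size / 1_000_000);
        println!("unstaked: ~{} MB", unstaked_streams * unstaked_conns * packet_size / 1_000_000);
    }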
@@ -2,7 +2,7 @@ pub const QUIC_PORT_OFFSET: u16 = 6;
 // Empirically found max number of concurrent streams
 // that seems to maximize TPS on GCE (higher values don't seem to
 // give significant improvement or seem to impact stability)
-pub const QUIC_MAX_CONCURRENT_STREAMS: usize = 2048;
+pub const QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS: usize = 128;
What impact does this have on TPU clients? Are there some stats about TPS or network performance with and without this change?
I do not see a huge impact with or without this change in my tpu-client test:
With this branch
[2022-06-17T20:47:16.938332781Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2022-06-17T20:47:16.938343148Z INFO solana_bench_tps::bench] http://35.247.120.58:8899 | 67938.42 | 1213244
[2022-06-17T20:47:16.938348907Z INFO solana_bench_tps::bench]
Average max TPS: 67938.42, 0 nodes had 0 TPS
[2022-06-17T20:47:16.938360601Z INFO solana_bench_tps::bench]
Highest TPS: 67938.42 sampling period 1s max transactions: 1213244 clients: 1 drop rate: 0.60
[2022-06-17T20:47:16.938367384Z INFO solana_bench_tps::bench] Average TPS: 19827.145
[2022-06-17T20:55:40.669179958Z INFO solana_bench_tps::bench] http://35.247.120.58:8899 | 55039.97 | 1311307
[2022-06-17T20:55:40.669184817Z INFO solana_bench_tps::bench]
Average max TPS: 55039.97, 0 nodes had 0 TPS
[2022-06-17T20:55:40.669194603Z INFO solana_bench_tps::bench]
Highest TPS: 55039.97 sampling period 1s max transactions: 1311307 clients: 1 drop rate: 0.29
[2022-06-17T20:55:40.669198742Z INFO solana_bench_tps::bench] Average TPS: 21661.62
[2022-06-18T01:07:46.093266884Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2022-06-18T01:07:46.093280556Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2022-06-18T01:07:46.093285691Z INFO solana_bench_tps::bench] http://35.247.120.58:8899 | 58969.85 | 1367411
[2022-06-18T01:07:46.093301930Z INFO solana_bench_tps::bench]
Average max TPS: 58969.85, 0 nodes had 0 TPS
[2022-06-18T01:07:46.093292778Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2022-06-18T01:07:46.093311329Z INFO solana_bench_tps::bench]
Highest TPS: 58969.85 sampling period 1s max transactions: 1367411 clients: 1 drop rate: 0.76
[2022-06-18T01:07:46.093349226Z INFO solana_bench_tps::bench] Average TPS: 22578.842
Master
[2022-06-17T21:33:02.909510265Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2022-06-17T21:33:02.909517556Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2022-06-17T21:33:02.909520144Z INFO solana_bench_tps::bench] http://35.247.120.58:8899 | 44006.95 | 1102674
[2022-06-17T21:33:02.909535908Z INFO solana_bench_tps::bench]
Average max TPS: 44006.95, 0 nodes had 0 TPS
[2022-06-17T21:33:02.909548109Z INFO solana_bench_tps::bench]
Highest TPS: 44006.95 sampling period 1s max transactions: 1102674 clients: 1 drop rate: 0.35
[2022-06-17T21:33:02.909526414Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2022-06-17T21:33:02.909555115Z INFO solana_bench_tps::bench] Average TPS: 18090.404
[2022-06-17T21:38:51.706520412Z INFO solana_bench_tps::bench] Token balance: 14373355930776120
[2022-06-17T21:38:51.706553141Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2022-06-17T21:38:51.706559304Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2022-06-17T21:38:51.706562073Z INFO solana_bench_tps::bench] http://35.247.120.58:8899 | 65728.62 | 1300790
[2022-06-17T21:38:51.706573521Z INFO solana_bench_tps::bench]
Average max TPS: 65728.62, 0 nodes had 0 TPS
[2022-06-17T21:38:51.706581605Z INFO solana_bench_tps::bench]
Highest TPS: 65728.62 sampling period 1s max transactions: 1300790 clients: 1 drop rate: 0.34
[2022-06-17T21:38:51.706595604Z INFO solana_bench_tps::bench] Average TPS: 21306.563
// Total stake and nodes => stake map
#[derive(Default)]
pub struct StakedNodes {
    pub total_stake: f64,
nit: why is this an f64? Seems like a u64 would make more sense.
I guess this is mostly due to the float arithmetic? Like:

    VarInt::from_u64(
        ((stake as f64 / total_stake as f64)
            * QUIC_TOTAL_STAKED_CONCURRENT_STREAMS)
            as u64,
    )
    .unwrap(),
should just flip the ops and drop the floats. floats are the devil
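A minimal sketch of what that could look like with the operations flipped so only integer arithmetic is needed; the constant is renamed to a u64 here, and the u128 widening and zero-stake guard are my additions for illustration, not code from this PR.

    const QUIC_TOTAL_STAKED_CONCURRENT_STREAMS: u64 = 100_000;

    fn streams_for_stake(stake: u64, total_stake: u64) -> u64 {
        if total_stake == 0 {
            return 0;
        }
        // Multiply before dividing so the small stake/total_stake ratio is never
        // truncated; widen to u128 since the product can exceed u64::MAX.
        ((stake as u128 * QUIC_TOTAL_STAKED_CONCURRENT_STREAMS as u128) / total_stake as u128) as u64
    }

VarInt::from_u64 could then wrap the result directly, with no round-trip through f64.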
I have done simultaneous tests using thin-client and rpc-client to target the cluster with bench-tps. The rpc-client triggers txn sends from staked nodes whereas the thin-client triggers txn sends from non-staked nodes. Here are some results:

Simultaneous thin-client and rpc-client

Thin-client
[2022-06-16T18:08:49.544484314Z INFO solana_bench_tps::bench] Token balance: 14373355930776120
[2022-06-17T18:01:54.980784127Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions

Rpc-client
[2022-06-16T18:08:49.526198869Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2022-06-17T18:01:57.359446137Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
* Weight concurrent streams by stake (#25993)

  Weight concurrent streams by stake for staked nodes

  Ported changes from #25056 after addressing merge conflicts and some refactoring

  (cherry picked from commit 61946a4)

* Updated quinn version to fix the compilation issue with the merge

* Fixed an issue in the Cargo.lock file

Co-authored-by: Lijun Wang <[email protected]>
@@ -49,7 +49,7 @@ pub(crate) fn configure_server(
     let config = Arc::get_mut(&mut server_config.transport).unwrap();

     // QUIC_MAX_CONCURRENT_STREAMS doubled, which was found to improve reliability
-    const MAX_CONCURRENT_UNI_STREAMS: u32 = (QUIC_MAX_CONCURRENT_STREAMS * 2) as u32;
+    const MAX_CONCURRENT_UNI_STREAMS: u32 = (QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS * 2) as u32;
unrelated. i can't help but ask why this doubling is here rather than in the SDK where the variable is declared?
@ryleung-solana can you please remove the doubling now that streams don't have head-of-line blocking? #26086
(seems there are no tests in this pr...? ref: #26312)
I don't think it makes sense for unstaked connections to be able to use 128 concurrent streams each because they should be less prioritized than staked connections. In order for a staked connection to be allowed to use 128 concurrent streams, it would need (128 / 100,000) * 391M = 500,480 SOL staked. Only the top 100 staked validators have that much SOL (source: https://www.validators.app/validators?locale=en&network=mainnet&order=stake&page=4)
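For reference, a quick sketch checking that arithmetic; the 391M total-stake figure is taken from the comment above, the helper is illustrative, and stake is in SOL.

    // Smallest stake s such that floor(s / total_stake * total_streams) >= streams.
    fn min_stake_for_streams(streams: u64, total_stake: u64, total_streams: u64) -> u64 {
        (streams * total_stake + total_streams - 1) / total_streams // ceiling division
    }

    fn main() {
        println!("{}", min_stake_for_streams(128, 391_000_000, 100_000)); // 500480
    }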
Weight concurrent streams by stake for staked nodes

Problem

Summary of Changes

Ported changes from #25056 after addressing merge conflicts and some refactoring

Fixes #26033