
Reduce the default number of IP echo server threads #354

Merged

7 commits merged into anza-xyz:master on Apr 1, 2024

Conversation


@steviez steviez commented Mar 21, 2024

Problem

The IP echo server currently spins up as many worker threads as there are hardware threads on the machine. This server is important for the services that entrypoint nodes provide. For just about every other node, however, this thread pool is heavily over-provisioned and wastes resources. Moreover, given everything a validator has to do, we don't want all of the system's threads to be able to be tied up serving these requests.

See #105 and #35 for more general context

Summary of Changes

  • Plumb the number of IP echo server threads through from the CLI
  • Adjust the default to only create 2 worker threads by default

Normal Nodes

I've examined logs for unstaked nodes running against mainnet; the logs show that my node receives roughly 200 of these requests a day. Data from a highly staked validator showed similar numbers.

Entrypoint Nodes

Examining the MNB nodes at entrypointX.mainnet-beta.solana.com:8001, I'm seeing no more than 240k requests per day (closer to 225k, but 240k keeps the numbers round). This can be determined by counting connection entries in the logs.

240,000 requests / (24 * 60 * 60) seconds ≈ 2.78 requests per second

Between this low request rate and the relatively little work performed by the worker threads in process_connection(), a small pool seemed appropriate.

So, for the general case, I think two threads (one listening on the TCP port, one processing requests) are sufficient. For our entrypoint nodes, we can pass the extra flag to bump up the thread pool size.

@steviez steviez changed the title Ip echo num threads Reduce the number of IP echo server threads by default Mar 21, 2024
@steviez steviez changed the title Reduce the number of IP echo server threads by default Reduce the default number of IP echo server threads Mar 21, 2024
@codecov-commenter

codecov-commenter commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 45.45%, with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (a916edb) to head (4ac1e04).
Report is 51 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #354     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         840      840             
  Lines      228105   228123     +18     
=========================================
+ Hits       186837   186842      +5     
- Misses      41268    41281     +13     

@steviez steviez marked this pull request as ready for review March 21, 2024 06:50

@gregcusack gregcusack left a comment


Code-wise looks good. Just wondering about thread usage; seems as though you've seen pretty low usage, which is good. Are you looking at raw thread usage rather than some aggregate requests per second? Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality seems to be that you kind of have to allocate for peak usage rather than mean/typical usage, even if you only hit the peak 5% of the time. Unless we're ok with the performance degradation that comes when the 2 IP echo threads get overloaded (though I'm not sure what this would look like). But if you're measuring actual, non-aggregate usage and it's nowhere near saturating 2 threads, then we're gtg.

Other thought: does reducing the number of threads handling connections make it easier to DoS the validator? May be a negligible difference since process_connection() doesn't do a ton, as you mentioned.

Review thread on net-utils/src/bin/ip_address_server.rs (resolved)

@steviez steviez left a comment


Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality seems to be that you kind of have to allocate for peak usage rather than mean/typical usage, even if you only hit the peak 5% of the time

So something that I failed to call out more clearly in the problem description is that this threadpool is most important for the gossip entrypoint nodes. These nodes (which we run) perform this special service; you might have seen the --entrypoint args that just about everyone has in their validator scripts.

On the other hand, validators / RPC nodes will not serve nearly as many of these requests. For example, my unstaked node received fewer than 200 of them yesterday over the entire 24-hour span. I just pinged someone with a highly staked node to get a datapoint there, but I expect it to be similar to mine.

The idea is to reduce the default, which fits almost every node. For the entrypoint nodes (that again, we run), we can use the flag to specify a non-default value if we desire.

Review thread on net-utils/src/bin/ip_address_server.rs (resolved)
@gregcusack

On the other hand, validators / RPC nodes will not be serving nearly as many of these requests. For example, my unstaked node received less than 200 of these yesterday over the entire 24 hour span. I just pinged someone with a highly staked node to get a datapoint there, but I expect it to be similar to mine.

ok ya, would be interested to know what the highly staked node you pinged sees. but ya, doesn't sound like these threads will get saturated.

The idea is to reduce the default, which fits almost every node. For the entrypoint nodes (that again, we run), we can use the flag to specify a non-default value if we desire.

Ya saw this so ya that's good.

@gregcusack

I honestly went a little back and forth on this, and as a result, was inconsistent 😅 . Given that I added this:

// There must be at least one worker so the value must be non-zero
static_assertions::const_assert!(DEFAULT_IP_ECHO_SERVER_THREADS > 0);

I'm actually leaning towards doing the .expect() directly in gossip_service as well.

ya I think the .expect() is ok as long as you have the assert. Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

@steviez

steviez commented Mar 22, 2024

Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

No, overflow is not a concern.

NonZeroUsize is guaranteed to have the same layout and bit validity as usize with the exception that 0 is not a valid instance. Option<NonZeroUsize> is guaranteed to be compatible with usize, including in FFI.

https://doc.rust-lang.org/std/num/type.NonZeroUsize.html#layout-1

@gregcusack

Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

No, overflow is not a concern.

NonZeroUsize is guaranteed to have the same layout and bit validity as usize with the exception that 0 is not a valid instance. Option<NonZeroUsize> is guaranteed to be compatible with usize, including in FFI.

https://doc.rust-lang.org/std/num/type.NonZeroUsize.html#layout-1

cool, ya, a usize won't accept a number greater than what it's sized for, and looks like value_t_or_exit will exit if the command-line arg passed in is too big. ok, sounds good

@gregcusack

ya maybe make that change to use .expect() in gossip and then lgtm

@steviez steviez force-pushed the ip_echo_num_threads branch from 6ffafa5 to 97b4ec8 Compare March 25, 2024 15:58
@steviez

steviez commented Mar 25, 2024

ya maybe make that change to use .expect() in gossip and then lgtm

Most recent push:

  • Rebased on tip of master
  • Addressed the .expect()
  • Enforced a minimum of two threads as well, with a comment justifying the thought process there

@steviez steviez requested a review from gregcusack March 25, 2024 16:00
gregcusack
gregcusack previously approved these changes Mar 25, 2024

@gregcusack gregcusack left a comment


just that one question for my own understanding. but lgtm!

Review thread on net-utils/src/ip_echo_server.rs (resolved)
behzadnouri
behzadnouri previously approved these changes Mar 30, 2024

@behzadnouri behzadnouri left a comment


lgtm

Review thread on net-utils/src/ip_echo_server.rs (resolved)
@steviez steviez dismissed stale reviews from behzadnouri and gregcusack via 4ac1e04 April 1, 2024 04:41
@steviez steviez requested a review from behzadnouri April 1, 2024 04:43

@behzadnouri behzadnouri left a comment


lgtm

@steviez steviez merged commit 79e316e into anza-xyz:master Apr 1, 2024
48 checks passed
@steviez steviez deleted the ip_echo_num_threads branch April 1, 2024 15:25