
Reduce the default number of IP echo server threads #354

Merged

7 commits merged into anza-xyz:master on Apr 1, 2024

Conversation


@steviez steviez commented Mar 21, 2024

Problem

The IP echo server currently spins up as many worker threads as there are hardware threads on the machine. This server is important for the services that entrypoint nodes provide. For just about every other node, however, this thread pool is heavily over-provisioned and wastes resources. Moreover, given everything a validator has to do, we don't want all of the system's threads to be able to be tied up serving these requests.

See #105 and #35 for more general context

Summary of Changes

  • Plumb the number of IP echo server threads through from the CLI
  • Adjust the default to only create 2 worker threads by default

Normal Nodes

I've examined logs for unstaked nodes running against mainnet; the logs show that my node receives roughly 200 of these requests a day. Data from a highly staked validator showed similar numbers.

Entrypoint Nodes

Examining the MNB nodes at entrypointX.mainnet-beta.solana.com:8001, I'm seeing no more than 240k requests per day (closer to 225k, but 240k keeps the numbers round). This can be determined by counting connection entries in the logs.

240,000 requests / (24 * 60 * 60) seconds ≈ 2.78 requests per second

Between this low request rate and the relatively little work performed by the worker threads in process_connection(), a small pool seemed appropriate.

So, for the general case, I think two threads (one listening on the TCP port, one processing requests) are sufficient. For our entrypoint nodes, we can pass the extra flag to bump up the thread pool size.

@steviez steviez changed the title Ip echo num threads Reduce the number of IP echo server threads by default Mar 21, 2024
@steviez steviez changed the title Reduce the number of IP echo server threads by default Reduce the default number of IP echo server threads Mar 21, 2024
@codecov-commenter

codecov-commenter commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 45.45%, with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (a916edb) to head (4ac1e04).
Report is 51 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #354     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         840      840             
  Lines      228105   228123     +18     
=========================================
+ Hits       186837   186842      +5     
- Misses      41268    41281     +13     

@steviez steviez marked this pull request as ready for review March 21, 2024 06:50

@gregcusack gregcusack left a comment


Code-wise looks good. Just wondering about thread usage; seems as though you've seen pretty low usage, which is good. Are you looking at raw thread usage rather than some aggregate requests per second? Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality seems to be that you kind of have to allocate for peak usage rather than mean/typical usage, even if you only hit the peak 5% of the time. Unless we're ok with the performance degradation that comes when the 2 IP echo threads get overloaded (though I'm not sure what this would look like). But if you're measuring actual, non-aggregate usage and it's nowhere near saturating 2 threads, then we're gtg.

Other thought: does reducing the number of threads handling connections make it easier to DoS the validator? May be a negligible difference since process_connection() doesn't do a ton, as you mentioned.

Review thread on net-utils/src/bin/ip_address_server.rs (resolved)

@steviez steviez left a comment


Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality seems to be that you kind of have to allocate for peak usage rather than mean/typical usage, even if you only hit the peak 5% of the time

So something that I failed to call out more clearly in the problem description is that this threadpool is most important for the gossip entrypoint nodes. These nodes (which we run) perform this special service; you might have seen the --entrypoint args that just about everyone has in their validator scripts.

On the other hand, validators / RPC nodes will not serve nearly as many of these requests. For example, my unstaked node received fewer than 200 of them yesterday over the entire 24-hour span. I just pinged someone with a highly staked node to get a datapoint there, but I expect it to be similar to mine.

The idea is to reduce the default, which fits almost every node. For the entrypoint nodes (that again, we run), we can use the flag to specify a non-default value if we desire.

Review thread on net-utils/src/bin/ip_address_server.rs (resolved)
@gregcusack

On the other hand, validators / RPC nodes will not be serving nearly as many of these requests. For example, my unstaked node received less than 200 of these yesterday over the entire 24 hour span. I just pinged someone with a highly staked node to get a datapoint there, but I expect it to be similar to mine.

ok ya, would be interested to know what the highly staked node you pinged sees. but ya, doesn't sound like these threads will get saturated.

The idea is to reduce the default, which fits almost every node. For the entrypoint nodes (that again, we run), we can use the flag to specify a non-default value if we desire.

Ya saw this so ya that's good.

@gregcusack

I honestly went a little back and forth on this, and as a result, was inconsistent 😅 . Given that I added this:

// There must be at least one worker so the value must be non-zero
static_assertions::const_assert!(DEFAULT_IP_ECHO_SERVER_THREADS > 0);

I'm actually leaning towards doing the .expect() directly in gossip_service as well.

ya I think the .expect() is ok as long as you have the assert. Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

@steviez

steviez commented Mar 22, 2024

Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

No, overflow is not a concern.

NonZeroUsize is guaranteed to have the same layout and bit validity as usize with the exception that 0 is not a valid instance. Option<NonZeroUsize> is guaranteed to be compatible with usize, including in FFI.

https://doc.rust-lang.org/std/num/type.NonZeroUsize.html#layout-1

@gregcusack

Do we have to worry about overflow? looks like NonZeroUsize::new(x) will return None if x is too big

No, overflow is not a concern.

NonZeroUsize is guaranteed to have the same layout and bit validity as usize with the exception that 0 is not a valid instance. Option<NonZeroUsize> is guaranteed to be compatible with usize, including in FFI.

https://doc.rust-lang.org/std/num/type.NonZeroUsize.html#layout-1

cool, ya, a usize won't accept a number greater than what it's sized for, and looks like value_t_or_exit will exit if the command-line arg passed in is too big. ok, sounds good

@gregcusack

ya maybe make that change to use .expect() in gossip and then lgtm

@steviez steviez force-pushed the ip_echo_num_threads branch from 6ffafa5 to 97b4ec8 Compare March 25, 2024 15:58
@steviez

steviez commented Mar 25, 2024

ya maybe make that change to use .expect() in gossip and then lgtm

Most recent push:

  • Rebased on tip of master
  • Addressed the .expect()
  • Enforced a minimum of two threads as well, with a comment justifying the thought process there

@steviez steviez requested a review from gregcusack March 25, 2024 16:00
gregcusack
gregcusack previously approved these changes Mar 25, 2024

@gregcusack gregcusack left a comment


just that one question for my own understanding. but lgtm!

Review thread on net-utils/src/ip_echo_server.rs (resolved)
behzadnouri
behzadnouri previously approved these changes Mar 30, 2024

@behzadnouri behzadnouri left a comment


lgtm

Review thread on net-utils/src/ip_echo_server.rs (resolved)
@steviez steviez dismissed stale reviews from behzadnouri and gregcusack via 4ac1e04 April 1, 2024 04:41
@steviez steviez requested a review from behzadnouri April 1, 2024 04:43

@behzadnouri behzadnouri left a comment


lgtm

@steviez steviez merged commit 79e316e into anza-xyz:master Apr 1, 2024
48 checks passed
@steviez steviez deleted the ip_echo_num_threads branch April 1, 2024 15:25