Reduce the default number of IP echo server threads #354
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #354 +/- ##
=========================================
- Coverage 81.9% 81.9% -0.1%
=========================================
Files 840 840
Lines 228105 228123 +18
=========================================
+ Hits 186837 186842 +5
- Misses 41268 41281 +13
Code-wise looks good. Just wondering about thread usage. Seems as though you've seen pretty low usage, which is good. Are you looking at raw thread usage rather than some aggregate requests per second? Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality is that you kind of have to allocate to peak usage rather than mean/typical usage, even if it only hits the peak 5% of the time. Unless we're ok with the performance degradation that comes when the 2 IP echo threads get overloaded (but I'm not sure what this would look like). But if you're measuring actual, non-aggregate usage and it's nowhere near saturating 2 threads, then we're gtg.
Other thought: does this make it easier to DoS the validator by reducing the number of threads handling connections? It may be a negligible difference since process_connection() doesn't do a ton, as you mentioned.
Just want to make sure there aren't some random spikes getting smoothed out in the viewed data. The sad reality is that you kind of have to allocate to peak usage rather than mean/typical usage, even if it only hits the peak 5% of the time
So something that I failed to call out more clearly in the problem description is that this threadpool is most important for the gossip entrypoint nodes. These nodes (which we run) perform this special service; you might have seen the --entrypoint args that just about everyone has in their validator scripts.
On the other hand, validators / RPC nodes will not be serving nearly as many of these requests. For example, my unstaked node received less than 200 of these yesterday over the entire 24 hour span. I just pinged someone with a highly staked node to get a datapoint there, but I expect it to be similar to mine.
The idea is to reduce the default, which fits almost every node. For the entrypoint nodes (that again, we run), we can use the flag to specify a non-default value if we desire.
ok ya would be interested to know what the highly staked node you pinged sees. but ya doesn't sound like these threads will get saturated.
Ya saw this so ya that's good.
ya I think the
No, overflow is not a concern.
https://doc.rust-lang.org/std/num/type.NonZeroUsize.html#layout-1
cool ya usize won't accept a number greater than what it's formatted for. and ya looks like
ya maybe make that change to use
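(For reference, a minimal sketch of how NonZeroUsize behaves when parsed from a string; this is not from the PR itself, it just illustrates why overflow isn't a concern: zero and out-of-range values are rejected at parse time.)

```rust
use std::num::NonZeroUsize;

fn main() {
    // A sensible thread count parses fine.
    let two: NonZeroUsize = "2".parse().unwrap();
    assert_eq!(two.get(), 2);

    // Zero is rejected because NonZeroUsize excludes it by construction.
    assert!("0".parse::<NonZeroUsize>().is_err());

    // A value that would overflow usize fails to parse instead of wrapping,
    // so an overflowed thread count can never reach the runtime.
    assert!("99999999999999999999999999999".parse::<NonZeroUsize>().is_err());
}
```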
The previous behavior was to have as many workers as the machine has threads; this is excessive and 2 is sufficient.
Force-pushed from 6ffafa5 to 97b4ec8 (most recent push).
just that one question for my own understanding. but lgtm!
lgtm
lgtm
Problem
The IP echo server currently spins up as many worker threads as there are threads on the machine. This server is important for the services that entrypoint nodes provide. However, for just about every other node, this thread pool is heavily over-provisioned and a waste of resources. Moreover, with everything a validator has to do, we don't want all of the system's threads to be tied up serving these requests.
See #105 and #35 for more general context
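As a rough illustration only (the constant and helper names below are hypothetical, not the actual agave code), the old default was effectively tied to the host's parallelism, while the new default is a small constant:

```rust
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical names for illustration; the real defaults/flags may differ.
const DEFAULT_IP_ECHO_SERVER_THREADS: usize = 2;

// Old behavior: one worker per hardware thread on the host.
fn old_default_worker_count() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

// New behavior: a small fixed pool, overridable for entrypoint nodes.
fn new_default_worker_count() -> usize {
    DEFAULT_IP_ECHO_SERVER_THREADS
}
```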
Summary of Changes
Normal Nodes
I've examined logs for unstaked nodes running against mainnet; the logs show that my node receives roughly 200 of these requests a day. I got data from a highly staked validator who saw similar numbers.
Entrypoint Nodes
Examining the MNB nodes that are entrypointX.mainnet-beta.solana.com:8001, I'm seeing no more than 240k requests per day (more like 225k, but 240k to keep rounder numbers). This can be determined by counting the number of "connection from" lines in the logs. Between this low request rate and the relatively little work performed by the worker threads in process_connection(), this seemed appropriate.
So, for the general case, I think two threads (one listening on the TCP port / one to process requests) is sufficient. For our entrypoint nodes, we can add the extra flag to bump up the thread pool size.
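Assuming the server runs on a tokio multi-thread runtime (as the solana-net-utils IP echo server does), a minimal sketch of wiring up a flag-overridable, default-of-2 worker count might look like the following; the function and thread names here are hypothetical, not the PR's actual code:

```rust
use std::num::NonZeroUsize;

/// Build the IP echo server runtime. `requested` would come from an optional
/// CLI flag that entrypoint nodes can set; everyone else gets the default.
fn build_ip_echo_runtime(
    requested: Option<NonZeroUsize>,
) -> std::io::Result<tokio::runtime::Runtime> {
    // Default to 2 workers: one to accept on the TCP port, one to process requests.
    let workers = requested.map(NonZeroUsize::get).unwrap_or(2);
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        .thread_name("solIpEchoSrvr") // hypothetical thread name
        .enable_all()
        .build()
}
```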