
getBlock spike triggers permanent catchup difficulty by rayon/crossbeam bug #22603

Closed
ryoqun opened this issue Jan 20, 2022 · 6 comments · Fixed by #26555

Comments

@ryoqun
Member

ryoqun commented Jan 20, 2022

Problem

Also, I finally spotted a pretty bad bug in a crate deep down our dependency tree (rayon => crossbeam), affecting RPC usage: https://github.com/crossbeam-rs/crossbeam/blob/cd453946becccd7e5b29f7305e0d45e8f6417347/crossbeam-epoch/src/internal.rs#L390 (in short, we need to bump this number by something like 128x, or add a proper adaptive amortization mechanism).

This is the root cause of why the blockstore-related RPC thread pool magically affects the blockstore_processor thread pool, which had been a mystery for months. It also explains another odd thing: the effect persists even after the RPC request burst ends.
Also, this performance degradation is proportional to the system core count. The irony here is that more CPU cores for RPC => more slowdown and more difficulty keeping caught up with the cluster.
For the moment, I think I have to focus on the stake credit_observed auto rewind, so please ping me if anyone wants to take this one (to ship it sooner).

ps: so RocksDB is innocent this time too, lol. This also explains why the SOLANA_RAYON_THREADS=8 hack works.

Technical mechanism of the problem

In short, rayon/crossbeam isn't written with the assumption that idling threads across various pools will far exceed the system's core count, and unfortunately that's exactly what we're doing.

Firstly, rayon's thread pool is backed by crossbeam-deque, which in turn is backed by crossbeam-epoch (a garbage-collection library), and each rayon worker thread periodically runs the GC bookkeeping code.

And the bookkeeping code isn't well optimized for our use case: part of it is a periodic linear scan over a (lock-free) linked list of all rayon worker threads in the OS process. So, as the number of threads increases, each scan takes longer, burdening every rayon thread. The PINNINGS_BETWEEN_COLLECT constant controls the frequency, and it's too short (i.e. too frequent) for us. That's why seemingly independent thread pools affect each other (that was the mystery part).
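
To make that concrete, here is a heavily simplified paraphrase of the bookkeeping path, not the real crossbeam-epoch implementation; only the constant name (and its value of 128 at the linked revision) comes from upstream, the surrounding struct is illustrative:

```rust
// Illustrative paraphrase of crossbeam-epoch's per-thread bookkeeping,
// not the actual code in the linked internal.rs.
const PINNINGS_BETWEEN_COLLECT: usize = 128;

struct Participant {
    pin_count: usize,
}

impl Participant {
    // Called from hot paths of every rayon worker (via crossbeam-deque).
    fn pin(&mut self) {
        self.pin_count += 1;
        // Every PINNINGS_BETWEEN_COLLECT pins, try to advance the global
        // epoch and collect garbage.
        if self.pin_count % PINNINGS_BETWEEN_COLLECT == 0 {
            self.try_advance_global_epoch_and_collect();
        }
    }

    fn try_advance_global_epoch_and_collect(&mut self) {
        // Walks the global (lock-free) list of *all* registered participants,
        // i.e. every rayon worker thread of every pool in the process, so the
        // cost grows with the total thread count, not with this pool's size.
    }
}
```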

And solana-validator's RPC subsystem spawns a new set of rayon threads whenever a blockstore-related JSON-RPC method is invoked.
That causes a sudden increase in alive rayon threads, so crossbeam's scanning cost goes up. Finally, once a certain threshold is crossed (depending on the system configuration), the catchup-related thread pools in the replay stage (the blockstore_processor and entry ones) waste too many cycles on GC bookkeeping to process transactions/entries in a timely manner.
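
Roughly, the problematic pattern looks like this (a hedged sketch, not the actual solana-rpc code; the function name and pool sizing are assumptions for illustration):

```rust
// Hedged sketch of the problematic pattern, not the actual solana-rpc code.
// A blockstore-heavy RPC handler builds its own rayon pool, so an RPC burst
// suddenly adds many more alive rayon worker threads for crossbeam-epoch
// to scan in every other pool's bookkeeping.
fn handle_get_block(slot: u64) {
    let pool = rayon::ThreadPoolBuilder::new()
        // No num_threads() given, so rayon defaults to the logical CPU
        // count: e.g. 16-32 extra threads per pool on validator hardware.
        .thread_name(|i| format!("getBlockRayon{i}"))
        .build()
        .expect("failed to build RPC rayon pool");

    pool.install(|| {
        // ... fetch and deserialize the block for `slot` in parallel ...
        let _ = slot;
    });
}
```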

Proposed Solution

  • Fix the upstream (I think it will take some effort to create demo code and get the actual fix merged there).
  • Monkey-patch crossbeam-deque as a short-term fix if the upstream route takes too long; see the sketch below.
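
For the monkey-patch route, one conventional approach in a Cargo workspace is a [patch] override pointing at a fork with the constant bumped; the fork URL and branch name below are placeholders, not an existing repo:

```toml
# Hypothetical Cargo.toml patch; the fork URL and branch are placeholders.
[patch.crates-io]
crossbeam-epoch = { git = "https://github.com/<our-fork>/crossbeam", branch = "bump-pinnings-between-collect" }
```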
@ryoqun
Member Author

ryoqun commented Jan 20, 2022

Also, this fix should contribute to overall latency reduction across the board (dunno about how much, though); I'm guessing just a few % decrease.

@linuskendall

@ryoqun We've seen this spike a lot, though less recently (possibly due to the move to different hardware). We have some older hardware running the SOLANA_RAYON_THREADS=8 hack; do you see any potential downsides to this param?

@ryoqun
Member Author

ryoqun commented Feb 9, 2022

@ryoqun We've seen this spike a lot, though less recently (possibly due to the move to different hardware). We have some older hardware running the SOLANA_RAYON_THREADS=8 hack; do you see any potential downsides to this param?

There should be none as long as your node is keeping up with the cluster (the hack slightly increases the likelihood of being unable to catch up).

@ryoqun
Member Author

ryoqun commented Feb 11, 2022

@steviez hi, just want to know how the fix for this is going. I might be able to work on this soon.

@steviez
Contributor

steviez commented Feb 14, 2022

Hi @ryoqun - Some stuff came up, so I haven't spent as much time on this as I would have liked, but I'm currently working on a simplified, standalone test that exhibits the issue (I agree with you that having this would be useful) before starting to tinker with rayon.

--

Out of curiosity, I dug through the codebase and wanted to see how many worker threads we make with rayon to understand what proportion of additional threads the getBlock requests were adding.

  • I'll define USER_RAYON = SOLANA_RAYON_THREADS (if set) || (num_cpu_threads / 2), per the logic in here (see the sketch after this list).
  • I'll also add that if we don't specify the number of threads to rayon::ThreadPoolBuilder, it will use num_cpu_threads by default.
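
A minimal sketch of that USER_RAYON definition, illustrative only and not the actual solana-rayon-threadlimit source:

```rust
use std::env;

// Illustrative sketch, assuming the num_cpus crate: SOLANA_RAYON_THREADS
// wins if set and parseable, otherwise half the logical CPU count is used
// (with a floor of 1).
fn user_rayon_threads() -> usize {
    env::var("SOLANA_RAYON_THREADS")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or_else(|| num_cpus::get() / 2)
        .max(1)
}
```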

Here is a catalogue of where we spawn Rayon threads:

Adding all of these up and:

  • Assuming our nodes meet our minimum requirements, such that min(USER_RAYON, 8) = 8
  • Counting thread_local! rayon pools separately (these include the threads spawned for getBlock) and as unknown, since this is a bit harder to quantify given that the number of calling threads dictates it
(2 * num_cpu_threads) + (8.5 * USER_RAYON) + 8 + NUM_THREAD_LOCAL

On a 16 physical core / 32 thread machine:

(2 * 32) + (8.5 * 32/2) + 8 = 208 + NUM_THREAD_LOCAL rayon worker threads

On that same 16 physical core / 32 thread machine with SOLANA_RAYON_THREADS=8:

(2 * 32) + (8.5 * 8) + 8 = 140 + NUM_THREAD_LOCAL rayon worker threads

Sticking with the 32-thread machine, one getBlock call momentarily adds 16 threads. Compared to the totals above, that is a decent proportion.

Also, rayon worker cleanup issue aside, I wonder if 16 threads is overkill for fetching blocks from Rocks, especially on a validator under heavy load.
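
One possible mitigation along those lines (a sketch only, not claiming this is what the eventual fix does) is a single small, lazily-initialized pool shared by all blockstore RPC requests, so a getBlock burst cannot multiply the process-wide rayon thread count:

```rust
use std::sync::OnceLock;

// Sketch of one possible mitigation, not necessarily what #26555 does:
// one small pool shared by all blockstore RPC requests instead of a
// full-width pool per calling thread.
static RPC_BLOCKSTORE_POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();

fn rpc_blockstore_pool() -> &'static rayon::ThreadPool {
    RPC_BLOCKSTORE_POOL.get_or_init(|| {
        rayon::ThreadPoolBuilder::new()
            .num_threads(4) // deliberately small; tune for the workload
            .thread_name(|i| format!("rpcBlockstore{i}"))
            .build()
            .expect("failed to build shared RPC blockstore pool")
    })
}
```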

@ryoqun
Member Author

ryoqun commented Aug 4, 2022

(oops, actually this isn't fixed by #26706 yet; the real PR is #26555)
