
getBlock spike triggers permanent catchup difficulty by rayon/crossbeam bug #22603

Closed
ryoqun opened this issue Jan 20, 2022 · 6 comments · Fixed by #26555

Comments

@ryoqun
Member

ryoqun commented Jan 20, 2022

Problem

Also, I finally spotted a pretty bad bug in a crate deep down our dependency tree (rayon => crossbeam), affecting RPC usage: https://github.com/crossbeam-rs/crossbeam/blob/cd453946becccd7e5b29f7305e0d45e8f6417347/crossbeam-epoch/src/internal.rs#L390 (in short, we need to bump this number by something like 128x, or add a proper adaptive amortization mechanism).

This is the root cause of why the blockstore-related RPC thread pool magically affects the blockstore_processor thread pool, which had been a mystery for months. It also explains another odd thing: the effect persists even after the RPC request burst ends.
Also, this performance degradation is proportional to the system core count. The irony here is that more CPU cores for RPC => more slowdown and more difficulty keeping caught up with the cluster.
For the moment, I think I have to focus on the stake credit_observed auto rewind, so please ping me if anyone wants to take this one (to ship it sooner).

ps: so RocksDB is innocent this time too, lol. This also explains why the SOLANA_RAYON_THREADS=8 hack works.

Technical mechanism of the problem

In short, rayon/crossbeam isn't written with the assumption that idling threads across various pools will far exceed the system's core count, and unfortunately that's exactly what we're doing.

Firstly, rayon's thread pool is backed by crossbeam-deque, which in turn is backed by crossbeam-epoch (a garbage-collection library), and each rayon worker thread periodically runs the GC bookkeeping code.

And the bookkeeping code isn't well optimized for our use case: part of it is a periodic linear scan over a (lock-free) linked list of all rayon worker threads in the OS process. So, as the number of threads increases, each scan takes longer, burdening every rayon thread. The PINNINGS_BETWEEN_COLLECT constant controls the frequency, and it's too short (i.e. too frequent) for us. That's why seemingly independent thread pools affect each other (that was the mystery part).
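
To make that concrete, here is a heavily simplified paraphrase of the bookkeeping path, not the real crossbeam-epoch implementation; only the constant name (and its value of 128 at the linked revision) comes from upstream, the surrounding struct is illustrative:

```rust
// Illustrative paraphrase of crossbeam-epoch's per-thread bookkeeping,
// not the actual code in the linked internal.rs.
const PINNINGS_BETWEEN_COLLECT: usize = 128;

struct Participant {
    pin_count: usize,
}

impl Participant {
    // Called from hot paths of every rayon worker (via crossbeam-deque).
    fn pin(&mut self) {
        self.pin_count += 1;
        // Every PINNINGS_BETWEEN_COLLECT pins, try to advance the global
        // epoch and collect garbage.
        if self.pin_count % PINNINGS_BETWEEN_COLLECT == 0 {
            self.try_advance_global_epoch_and_collect();
        }
    }

    fn try_advance_global_epoch_and_collect(&mut self) {
        // Walks the global (lock-free) list of *all* registered participants,
        // i.e. every rayon worker thread of every pool in the process, so the
        // cost grows with the total thread count, not with this pool's size.
    }
}
```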

And solana-validator's RPC subsystem spawns a new set of rayon threads whenever a blockstore-related JSON-RPC method is invoked.
That causes a sudden increase in alive rayon threads, so crossbeam's scanning cost goes up. Finally, once a certain threshold is crossed (depending on the system configuration), the catchup-related thread pools in the replay stage (the blockstore_processor and entry ones) waste too many cycles on GC bookkeeping to process transactions/entries in a timely manner.
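
Roughly, the problematic pattern looks like this (a hedged sketch, not the actual solana-rpc code; the function name and pool sizing are assumptions for illustration):

```rust
// Hedged sketch of the problematic pattern, not the actual solana-rpc code.
// A blockstore-heavy RPC handler builds its own rayon pool, so an RPC burst
// suddenly adds many more alive rayon worker threads for crossbeam-epoch
// to scan in every other pool's bookkeeping.
fn handle_get_block(slot: u64) {
    let pool = rayon::ThreadPoolBuilder::new()
        // No num_threads() given, so rayon defaults to the logical CPU
        // count: e.g. 16-32 extra threads per pool on validator hardware.
        .thread_name(|i| format!("getBlockRayon{i}"))
        .build()
        .expect("failed to build RPC rayon pool");

    pool.install(|| {
        // ... fetch and deserialize the block for `slot` in parallel ...
        let _ = slot;
    });
}
```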

Proposed Solution

  • Fix the upstream (I think it will take some effort to create demo code and get the actual fix merged there).
  • Monkey-patch crossbeam-deque as a short-term fix if the upstream route takes too long; see the sketch below.
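
For the monkey-patch route, one conventional approach in a Cargo workspace is a [patch] override pointing at a fork with the constant bumped; the fork URL and branch name below are placeholders, not an existing repo:

```toml
# Hypothetical Cargo.toml patch; the fork URL and branch are placeholders.
[patch.crates-io]
crossbeam-epoch = { git = "https://github.com/<our-fork>/crossbeam", branch = "bump-pinnings-between-collect" }
```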
@ryoqun
Member Author

ryoqun commented Jan 20, 2022

Also, this fix should contribute to overall latency reduction across the board (dunno about how much, though); I'm guessing just a few % decrease.

@linuskendall

@ryoqun We've seen this spike a lot, though less recently (possibly due to the move to different hardware). We have some older hardware running the SOLANA_RAYON_THREADS=8 hack; do you see any potential downsides to this param?

@ryoqun
Member Author

ryoqun commented Feb 9, 2022

@ryoqun We've seen this spike a lot, though less recently (possibly due to the move to different hardware). We have some older hardware running the SOLANA_RAYON_THREADS=8 hack; do you see any potential downsides to this param?

There should be none as long as your node is keeping up with the cluster (the hack slightly increases the likelihood of being unable to catch up).

@ryoqun
Member Author

ryoqun commented Feb 11, 2022

@steviez hi, just want to know how the fix for this is going. I might be able to work on this soon.

@steviez
Contributor

steviez commented Feb 14, 2022

Hi @ryoqun - Some stuff came up, so I haven't spent as much time on this as I would have liked, but I'm currently working on a simplified, standalone test that exhibits the issue (I agree with you that having this would be useful) before starting to tinker with rayon.

--

Out of curiosity, I dug through the codebase and wanted to see how many worker threads we make with rayon to understand what proportion of additional threads the getBlock requests were adding.

  • I'll define USER_RAYON = SOLANA_RAYON_THREADS (if set) || (num_cpu_threads / 2), per the logic in here (see the sketch after this list).
  • I'll also add that if we don't specify the number of threads to rayon::ThreadPoolBuilder, it will use num_cpu_threads by default.
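
A minimal sketch of that USER_RAYON definition, illustrative only and not the actual solana-rayon-threadlimit source:

```rust
use std::env;

// Illustrative sketch, assuming the num_cpus crate: SOLANA_RAYON_THREADS
// wins if set and parseable, otherwise half the logical CPU count is used
// (with a floor of 1).
fn user_rayon_threads() -> usize {
    env::var("SOLANA_RAYON_THREADS")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or_else(|| num_cpus::get() / 2)
        .max(1)
}
```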

Here is a catalogue of where we spawn Rayon threads:

Adding all of these up and:

  • Assuming our nodes meet our minimum requirements, such that min(USER_RAYON, 8) = 8
  • Counting thread_local! rayon pools separately (these include the threads spawned for getBlock) and as unknown, since this is a bit harder to quantify given that the number of calling threads dictates it
(2 * num_cpu_threads) + (8.5 * USER_RAYON) + 8 + NUM_THREAD_LOCAL

On a 16 physical core / 32 thread machine:

(2 * 32) + (8.5 * 32/2) + 8 = 208 + NUM_THREAD_LOCAL rayon worker threads

On that same 16 physical core / 32 thread machine with SOLANA_RAYON_THREADS=8:

(2 * 32) + (8.5 * 8) + 8 = 140 + NUM_THREAD_LOCAL rayon worker threads

Sticking with the 32-thread machine, one getBlock call momentarily adds 16 threads. Compared to the totals above, that is a decent proportion.

Also, rayon worker cleanup issue aside, I wonder if 16 threads is overkill for fetching blocks from Rocks, especially on a validator under heavy load.
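
One possible mitigation along those lines (a sketch only, not claiming this is what the eventual fix does) is a single small, lazily-initialized pool shared by all blockstore RPC requests, so a getBlock burst cannot multiply the process-wide rayon thread count:

```rust
use std::sync::OnceLock;

// Sketch of one possible mitigation, not necessarily what #26555 does:
// one small pool shared by all blockstore RPC requests instead of a
// full-width pool per calling thread.
static RPC_BLOCKSTORE_POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();

fn rpc_blockstore_pool() -> &'static rayon::ThreadPool {
    RPC_BLOCKSTORE_POOL.get_or_init(|| {
        rayon::ThreadPoolBuilder::new()
            .num_threads(4) // deliberately small; tune for the workload
            .thread_name(|i| format!("rpcBlockstore{i}"))
            .build()
            .expect("failed to build shared RPC blockstore pool")
    })
}
```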

@ryoqun
Member Author

ryoqun commented Aug 4, 2022

(oops, actually this isn't fixed by #26706 yet; the real PR is #26555)
