
Remove the Blockstore thread pool used for fetching Entries #34768

Merged: 6 commits into solana-labs:master on Feb 27, 2024

Conversation

@steviez (Contributor) commented Jan 12, 2024

Problem

Retrieving entries from the Blockstore is necessary for several use cases:

  1. Feeding entries to ReplayStage
  2. Feeding entries to CompletedDataSetService
  3. Feeding entries to RPC calls such as getBlock

While these operations all occur in different threads, the implementation of the Blockstore fetch function utilizes a thread pool to parallelize fetching / deshredding / deserializing shreds into entries. That thread pool is currently sized to scale with the number of cores on the machine. There are several problems with this (a rough sketch of the old pattern follows the list below):

  • The thread pool is shared between all of these cases, but replay is the most critical and should not be subject to jitter/latency because another operation is using the pool
  • The thread pool is heavily overprovisioned; CPU utilization of these threads on both a validator and an RPC node in the public pool indicates that many of the threads in the pool see little to no activity
    • Looking at CPU utilization per thread, some of the threads in the pool are seeing < 0.1% usage
  • In steady-state operation, blocks are streamed to the validator and get replayed immediately. ReplayStage is greedy and tries to replay entries as soon as possible; this means the block is fetched in many small chunks rather than one call that fetches the entire block at once. Many small chunks spread over time mean there is little need for heavy parallelization
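
For illustration, here is a rough sketch of the pre-change pattern described above. This is not the actual solana-ledger code: Entry, fetch_range, and the pool construction details are stand-ins, with only the core-count sizing and one-completed-range-per-task behavior taken from the description.

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Placeholder types/helpers so the sketch stands alone.
struct Entry;

fn fetch_range(_range: &(u64, u64)) -> Vec<Entry> {
    // Read the shreds in the range, deshred, and bincode-deserialize them.
    Vec::new()
}

// Rough shape of the old approach: a pool sized to the machine's core count,
// with each completed range handed to its own rayon task.
fn fetch_entries_parallel(completed_ranges: &[(u64, u64)]) -> Vec<Entry> {
    let num_threads = std::thread::available_parallelism().map_or(1, |n| n.get());
    let pool = ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .thread_name(|i| format!("solBstore{i:02}"))
        .build()
        .unwrap();

    pool.install(|| {
        completed_ranges
            .par_iter()            // one completed range per rayon task
            .flat_map(fetch_range) // fetch + deshred + deserialize per range
            .collect()
    })
}
```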

Contributes to anza-xyz#35

Summary of Changes

Remove the thread pool that the Entry fetch method had been using. This method previously parallelized over completed ranges (a completed range is a range of shreds that should be deserialized into a Vec<Entry> together), giving one completed range to one rayon thread. Now, the method looks up all completed ranges it might want to fetch with a single call to rocksdb multi_get().
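
To contrast, here is a correspondingly rough sketch of the post-change shape of the fetch. The real code goes through the Blockstore column abstraction (LedgerColumn::multi_get_bytes(), per the comments further down); the sketch below calls rust-rocksdb's DB::multi_get directly so it stands alone, and the key encoding and the deshred/deserialize steps are simplified or elided.

```rust
use rocksdb::{Error, DB};

// Sketch only: pull every shred payload needed for a set of completed ranges
// with a single batched rocksdb lookup, then process the results serially on
// the calling thread (no rayon pool involved).
fn fetch_shred_payloads(
    db: &DB,
    slot: u64,
    completed_ranges: &[(u32, u32)],
) -> Result<Vec<Vec<u8>>, Error> {
    // Build the full key list up front: one key per shred index in every range.
    let keys: Vec<Vec<u8>> = completed_ranges
        .iter()
        .flat_map(|&(start, end)| (start..=end).map(move |index| shred_key(slot, index)))
        .collect();

    // One multi_get covering all ranges, instead of one lookup per rayon task.
    let mut payloads = Vec::with_capacity(keys.len());
    for result in db.multi_get(&keys) {
        if let Some(bytes) = result? {
            // The real code turns these payloads into Shreds, deshreds each
            // completed range into one buffer, and bincode-deserializes that
            // buffer into a Vec<Entry>; elided here.
            payloads.push(bytes);
        }
    }
    Ok(payloads)
}

// Hypothetical key encoding: (slot, shred index) as big-endian bytes.
fn shred_key(slot: u64, index: u32) -> Vec<u8> {
    let mut key = Vec::with_capacity(16);
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(&u64::from(index).to_be_bytes());
    key
}
```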

Performance Impact

Background on several metrics of interest, all reported on a per-slot basis (a simplified sketch of how they relate follows the list):

  • replay-slot-stats.fetch_entries: The total amount of time spent fetching entries
  • replay-slot-stats.num_execute_batches: The total number of calls ReplayStage makes to blockstore_processor::confirm_slot() that result in transactions getting executed. This is also the number of times the Blockstore method to fetch entries is called
  • replay-slot-stats.confirmation_time_us: The total amount of time spent within blockstore_processor::confirm_slot(); this is inclusive of both fetch entry time and everything else (i.e., actual tx execution)
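
To make the relationship between these three numbers concrete, here is a hypothetical, highly simplified accumulator; the field and function names are illustrative, not the actual ReplaySlotStats code.

```rust
// Illustrative only: per-slot counters mirroring the metrics described above.
#[derive(Default, Debug)]
struct SlotStats {
    fetch_entries_us: u64,     // total time spent fetching entries for the slot
    num_execute_batches: u64,  // confirm_slot() calls that executed transactions
    confirmation_time_us: u64, // total time inside confirm_slot(), fetch included
}

impl SlotStats {
    // Every confirm_slot()-style call adds one batch, its fetch time, and its
    // total wallclock time to the slot's running totals; num_execute_batches
    // therefore equals the number of entry-fetch calls made for the slot.
    fn record_confirm_call(&mut self, fetch_us: u64, total_us: u64) {
        self.fetch_entries_us += fetch_us;
        self.num_execute_batches += 1;
        self.confirmation_time_us += total_us;
    }
}

fn main() {
    let mut stats = SlotStats::default();
    // e.g. a slot replayed in ~70 small chunks as its shreds stream in
    for _ in 0..70 {
        stats.record_confirm_call(/* fetch_us */ 80, /* total_us */ 1_000);
    }
    println!("{stats:?}");
}
```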

Summarizing some key points from some of the comments below:

  • The removal of this thread pool does increase replay-slot-stats.fetch_entries. Average numbers on my node would suggest an increase from ~3.5 ms to ~5.5 ms
  • replay-slot-stats.num_execute_batches has been observed to have an average value of ~70 over the past two weeks
  • Noting that we're spending ~2 ms more to fetch (on average), that is ~2 ms / 70 ~= 29 us per call to fetch entries
  • The ability to complete replay of a slot is dependent on all the shreds having been received. Once the last shred has been received, there will be exactly 1 more call to confirm_slot() (and thus to fetch entries). So, in isolation, it might appear that we've "delayed" being able to complete a block by ~29 us (on average)
  • However, replay-slot-stats.confirmation_time_us looks pretty consistent for my node before and after making the change to remove the thread-pool. This value staying consistent would suggest that while fetch time is growing marginally, we are at least breaking even by avoiding the thread-pool

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 86a3b03 to 164a27d on January 12, 2024 19:07
@steviez (Contributor, Author) commented Jan 12, 2024

Here are some general notes about the graphs I'll share:

  • The metric being examined here is replay-slot-stats.fetch_entries time; the unit of this metric is microseconds
  • The purple trace is my test node 83TTSHyHsrM9XHNSVAy5C3uXwifAxXMa1GwrHCVRNv77
    • All ledger components on one disk + --enable-rpc-transaction-history
  • The cyan trace is 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 (bv1) running v1.17.13
    • Snapshots + OS on one drive, rest of ledger on another
  • The orange trace is Ninja1spj6n9t5hVYgF3PdnYz2PLnkt7rvaw3firmjs, a 3rd-party node running v1.17.13
    • Zan mentioned having three NVMe drives; one for rocksdb / one for accounts / one for snapshots

Here is the last 12 hours of runtime with no averaging; the change was made on my test node around 08:15:
[chart: replay-slot-stats.fetch_entries, last 12 hours, no averaging]

  • All nodes show some large spikes, although my node and bv1 show both more frequent and more severe spikes than Ninja

Here is the same chart, but with a little averaging (10s bin):
[chart: replay-slot-stats.fetch_entries, 10s averaging]

  • Hard to put definite numbers on it, but my node (purple) looks to have a higher floor than the other two

And finally the same chart with even more averaging (60 min bin):
[chart: replay-slot-stats.fetch_entries, 60 min averaging]

  • The higher level of averaging shows that 7Np and Ninja are pretty close on average, around 3.5 ms, whereas my node is hovering around 4.5 ms, give or take

Lastly, here is the last 24 hours; my node was previously running the tip of master prior to 08:00:
[chart: replay-slot-stats.fetch_entries, last 24 hours]

  • Throwing out the 07:00 and 08:00 datapoints, as these are likely skewed by the node shutdown / restart, we can see that my node was performing in line with (or maybe even slightly better than) the others prior to making the change, and does show a definite step up after the change.

codecov bot commented Jan 12, 2024

Codecov Report

Attention: Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.7%. Comparing base (8143fc3) to head (164a27d).
Report is 1 commit behind head on master.

❗ Current head 164a27d differs from the pull request's most recent head c89bdaa. Consider uploading reports for the commit c89bdaa to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34768     +/-   ##
=========================================
  Coverage    81.7%    81.7%             
=========================================
  Files         834      825      -9     
  Lines      224232   223106   -1126     
=========================================
- Hits       183351   182498    -853     
+ Misses      40881    40608    -273     

@steviez (Contributor, Author) commented Jan 12, 2024

After thinking about this a little more, I think measuring the effect of this PR is more nuanced than the analysis in my first comment. The metric I was examining, replay-slot-stats.fetch_entries, does seem to degrade from ~3.5ms to ~4.5ms with this change. However, a few points about this:

  • The degradation doesn't necessarily mean that replaying a slot is taking >=~1 ms more. That ~1 ms is split up across all of the fetches that occur in the slot. Over the last 2 days on mnb, I see a floor of about ~20 fetches per slot via the replay-slot-stats.num_execute_batches metric. When the last shred arrives (which is out of the control of the Blockstore), there is one fetch remaining for a slot. Assuming evenly distributed time per fetch, that is ~1 ms / 20 fetches = 50 us / fetch difference.
  • This change is reducing the number of threads; we have previously observed and noted negatives with over-provisioned rayon thread pools (a bunch of threads are woken up and try to steal work from each other when there is only enough work for a couple of threads in the first place)
  • While the fetch metric does grow by a small amount in absolute time, the more important metric is arguably the net wallclock time for replaying a slot, replay-slot-stats.replay_total_elapsed. That is, even if we're spending marginally more time fetching, if the total amount of time to replay the block is unchanged, then we're holding performance equal while using fewer resources, which would seemingly be a win. This value staying flat might suggest that while fetch time is growing marginally, we could be breaking even with fewer context switches / cache disruptions / etc.

This graph shows replay-slot-stats.replay_total_elapsed; my node is still purple and bv1 is cyan:
[chart: replay-slot-stats.replay_total_elapsed]
My node was previously doing marginally worse than bv1 but has been on par for the past 12+ hours.

@steviez (Contributor, Author) commented Jan 15, 2024

Marco from Triton chimed in with utilization of several nodes in the public mnb RPC pool:

$ ps -T -p 16256 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      1.0
solBstore01      1.0
solBstore02      0.9
solBstore03      0.8
solBstore04      0.8
solBstore05      0.7
solBstore06      0.7
solBstore07      0.6
solBstore08      0.6
solBstore09      0.5
solBstore10      0.5
solBstore11      0.5
solBstore12      0.4
solBstore13      0.4
solBstore14      0.3
solBstore15      0.3
solBstore16      0.3
solBstore17      0.3
solBstore18      0.2
solBstore19      0.2
solBstore20      0.2
solBstore21      0.2
solBstore22      0.2
solBstore23      0.1
solBstore24      0.1
solBstore25      0.1
solBstore26      0.1
solBstore27      0.1
solBstore28      0.1
solBstore29      0.1
solBstore30      0.1
solBstore31      0.0
...
solBstore55      0.0

AND

$ ps -T -p 16256 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      0.3
solBstore01      0.2
solBstore02      0.2
solBstore03      0.2
solBstore04      0.2
solBstore05      0.2
solBstore06      0.2
solBstore07      0.2
solBstore08      0.2
solBstore09      0.1
solBstore10      0.1
solBstore11      0.1
solBstore12      0.1
solBstore13      0.1
solBstore14      0.1
solBstore15      0.1
solBstore16      0.1
solBstore17      0.0
...
solBstore55      0.0

These numbers support my earlier comment about the thread pool being over-provisioned, even for an RPC node. If we don't rip this thread pool out completely, it certainly seems that putting a limit on the size of the pool, instead of allowing it to scale with the number of threads on the machine, would yield a win.
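
For reference, capping the pool rather than removing it would look roughly like the sketch below, using rayon's builder API. The cap of 8 is an arbitrary placeholder, and this is not what the PR ultimately does (the PR removes the pool entirely).

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

// Sketch: cap the entry-fetch pool at a fixed size instead of letting it
// scale with the machine's thread count (e.g. the 56 solBstore threads above).
fn build_entry_fetch_pool() -> ThreadPool {
    const MAX_ENTRY_FETCH_THREADS: usize = 8; // illustrative cap, not a tuned value

    ThreadPoolBuilder::new()
        .num_threads(MAX_ENTRY_FETCH_THREADS)
        .thread_name(|i| format!("solBstore{i:02}"))
        .build()
        .expect("failed to build entry-fetch thread pool")
}
```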

@stakeconomy (Contributor) commented Jan 15, 2024

Another one, from a busy RPC node in our shared pool:

$ ps -T -p 15263 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      0.8
solBstore01      0.7
solBstore02      0.7
solBstore03      0.6
solBstore04      0.6
solBstore05      0.5
solBstore06      0.5
solBstore07      0.4
solBstore08      0.4
solBstore09      0.4
solBstore10      0.4
solBstore11      0.3
solBstore12      0.3
solBstore13      0.3
solBstore14      0.3
solBstore15      0.2
solBstore16      0.2
solBstore17      0.2
solBstore18      0.2
solBstore19      0.2
solBstore20      0.1
solBstore21      0.1
solBstore22      0.1
solBstore23      0.1
solBstore24      0.1
solBstore25      0.1
solBstore26      0.0
---
solBstore55      0.0

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 164a27d to ab078cd on January 25, 2024 18:17
@github-actions github-actions bot added the stale label ([bot only] auto-close after a week) on Feb 9, 2024
@steviez steviez removed the stale label on Feb 11, 2024
@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from ab078cd to 25971c5 on February 17, 2024 19:28
@steviez steviez marked this pull request as ready for review February 26, 2024 18:49
@steviez steviez requested a review from t-nelson February 26, 2024 20:12
@steviez (Contributor, Author) commented Feb 26, 2024

Ok @t-nelson - I've let this one run long enough that I feel pretty good about it so formally marked it ready for review + requested review from ya.

@steviez steviez requested a review from bw-solana February 26, 2024 20:20
@t-nelson (Contributor) left a comment:

hell yeah!

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 25971c5 to 6ad51f0 on February 26, 2024 21:28
@bw-solana (Contributor) left a comment:

Looks pretty good to me. A couple of nits and one more important question about what happens in the error case.

@t-nelson (Contributor) left a comment:

nice! :shipit:


looks like we have some collect() to temporaries in there that might be removable in a follow up

@steviez steviez merged commit 09925a1 into solana-labs:master Feb 27, 2024
35 checks passed
@steviez steviez deleted the bstore_no_entry_fetch_tpool branch February 27, 2024 02:27
@steviez (Contributor, Author) commented Feb 27, 2024

looks like we have some collect() to temporaries in there that might be removable in a follow up

For the sake of a paper trail, we might be able to shave a mem-copy as well. Currently, on the fetch path, we have the following copies from rocksdb to Vec<Entry>:

    0): RocksDB Memtable
        - Technically might be on disk in SST
    1): LedgerColumn::multi_get_bytes() -> Vec<Vec<u8>>
        - Ownership transferred via Shred::new_from_serialized_shred
            - Some modification may happen here with resize / truncate
    2): Shredder::deshred() -> Vec<u8>
        - All shreds payloads copied into common buffer
    3): bincode::deserialize() -> Vec<Entry>

Namely, I think we could save the copy in step 1). Instead, we could have multi_get_ref() -> Vec<&[u8]>, partially deserialize what we need from those, and then copy the payloads directly into the deshred buffer (without making the intermediate copy). This would require a rework of Shred; there are definitely instances where we want Shred to have owned memory, so we'd have to make it work with either owned or referenced memory.
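
As a sketch of what "either owned or referenced memory" could look like, a Cow-backed payload is one option; this is hypothetical, and the real Shred rework would be more involved.

```rust
use std::borrow::Cow;

// Illustrative only: a payload that can either borrow bytes returned by a
// hypothetical multi_get_ref() -> Vec<&[u8]> lookup or own its buffer when
// mutation is required.
struct ShredPayload<'a> {
    payload: Cow<'a, [u8]>,
}

impl<'a> ShredPayload<'a> {
    fn borrowed(bytes: &'a [u8]) -> Self {
        Self { payload: Cow::Borrowed(bytes) }
    }

    fn owned(bytes: Vec<u8>) -> Self {
        Self { payload: Cow::Owned(bytes) }
    }

    // Copy the payload straight into the deshred buffer; borrowed data gets
    // copied exactly once here rather than once at fetch time and again here.
    fn append_to(&self, deshred_buffer: &mut Vec<u8>) {
        deshred_buffer.extend_from_slice(&self.payload);
    }

    // Mutation clones into owned memory on demand, covering the
    // resize/truncate cases in step 1) above.
    fn truncate(&mut self, len: usize) {
        self.payload.to_mut().truncate(len);
    }
}
```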

@t-nelson (Contributor) commented:

so we'd have to make it work with either owned or referenced memory

🐄

jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Feb 27, 2024
Remove the Blockstore thread pool used for fetching Entries (solana-labs#34768)

There are several cases for fetching entries from the Blockstore:
- Fetching entries for block replay
- Fetching entries for CompletedDataSetService
- Fetching entries to service RPC getBlock requests

All of these operations occur in different calling threads. However,
the current implementation utilizes a shared thread-pool within the
Blockstore function. There are several problems with this:
- The thread pool is shared between all of the listed cases, despite
  block replay being the most critical. These other services shouldn't
  be able to interfere with block replay
- The thread pool is overprovisioned for the average use; thread
  utilization on both regular validators and RPC nodes shows that many
  of the threads see very little activity. But these threads existing
  introduces "accounting" overhead
- rocksdb exposes an API to fetch multiple items at once, potentially
  with some parallelization under the hood. Using parallelization in
  both our API and the underlying rocksdb is overkill, and we're doing
  more damage than good.

This change removes that threadpool completely, and instead fetches
all of the desired entries in a single call. This has been observed
to have a minor degradation on the time spent within the Blockstore
get_slot_entries_with_shred_info() function. Namely, some buffer
copying and deserialization that previously occurred in parallel now
occur serially.

However, the metric that tracks the amount of time spent replaying
blocks (inclusive of fetch) is unchanged. Thus, despite spending
marginally more time to fetch/copy/deserialize with only a single
thread, the gains from not thrashing everything else with the pool
keep us at parity.