
Remove the Blockstore thread pool used for fetching Entries #34768

Merged: 6 commits into solana-labs:master on Feb 27, 2024

Conversation

@steviez (Contributor) commented Jan 12, 2024

Problem

Retrieving entries from the Blockstore is necessary for several use cases:

  1. Feeding entries to ReplayStage
  2. Feeding entries to CompletedDataSetService
  3. Feeding entries to RPC calls such as getBlock

While these operations all occur in different threads, the implementation of the Blockstore fetch function utilizes a thread pool to parallelize fetching / deshredding / deserializing shreds into entries. That thread pool is currently sized to scale with the number of cores on the machine. There are several problems with this (a rough sketch of the old pattern follows the list below):

  • The thread pool is shared between all of these cases, but replay is the most critical and should not be subject to jitter/latency because another operation is using the pool
  • The thread pool is heavily overprovisioned; CPU utilization of these threads on both a validator and an RPC node in the public pool indicates that many of the threads in the pool see little to no activity
    • Looking at CPU utilization per thread, some of the threads in the pool are seeing < 0.1% usage
  • In steady-state operation, blocks are streamed to the validator and get replayed immediately. ReplayStage is greedy and tries to replay entries as soon as possible; this means the block is fetched in many small chunks rather than one call that fetches the entire block at once. Many small chunks spread over time mean there is little need for heavy parallelization
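
For illustration, here is a rough sketch of the pre-change pattern described above. This is not the actual solana-ledger code: Entry, fetch_range, and the pool construction details are stand-ins, with only the core-count sizing and one-completed-range-per-task behavior taken from the description.

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Placeholder types/helpers so the sketch stands alone.
struct Entry;

fn fetch_range(_range: &(u64, u64)) -> Vec<Entry> {
    // Read the shreds in the range, deshred, and bincode-deserialize them.
    Vec::new()
}

// Rough shape of the old approach: a pool sized to the machine's core count,
// with each completed range handed to its own rayon task.
fn fetch_entries_parallel(completed_ranges: &[(u64, u64)]) -> Vec<Entry> {
    let num_threads = std::thread::available_parallelism().map_or(1, |n| n.get());
    let pool = ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .thread_name(|i| format!("solBstore{i:02}"))
        .build()
        .unwrap();

    pool.install(|| {
        completed_ranges
            .par_iter()            // one completed range per rayon task
            .flat_map(fetch_range) // fetch + deshred + deserialize per range
            .collect()
    })
}
```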

Contributes to anza-xyz#35

Summary of Changes

Remove the thread pool that the Entry fetch method had been using. This method previously parallelized over completed ranges (a completed range is a range of shreds that should be deserialized into a Vec<Entry> together), giving one completed range to one rayon thread. Now, the method looks up all completed ranges it might want to fetch with a single call to rocksdb multi_get().
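
To contrast, here is a correspondingly rough sketch of the post-change shape of the fetch. The real code goes through the Blockstore column abstraction (LedgerColumn::multi_get_bytes(), per the comments further down); the sketch below calls rust-rocksdb's DB::multi_get directly so it stands alone, and the key encoding and the deshred/deserialize steps are simplified or elided.

```rust
use rocksdb::{Error, DB};

// Sketch only: pull every shred payload needed for a set of completed ranges
// with a single batched rocksdb lookup, then process the results serially on
// the calling thread (no rayon pool involved).
fn fetch_shred_payloads(
    db: &DB,
    slot: u64,
    completed_ranges: &[(u32, u32)],
) -> Result<Vec<Vec<u8>>, Error> {
    // Build the full key list up front: one key per shred index in every range.
    let keys: Vec<Vec<u8>> = completed_ranges
        .iter()
        .flat_map(|&(start, end)| (start..=end).map(move |index| shred_key(slot, index)))
        .collect();

    // One multi_get covering all ranges, instead of one lookup per rayon task.
    let mut payloads = Vec::with_capacity(keys.len());
    for result in db.multi_get(&keys) {
        if let Some(bytes) = result? {
            // The real code turns these payloads into Shreds, deshreds each
            // completed range into one buffer, and bincode-deserializes that
            // buffer into a Vec<Entry>; elided here.
            payloads.push(bytes);
        }
    }
    Ok(payloads)
}

// Hypothetical key encoding: (slot, shred index) as big-endian bytes.
fn shred_key(slot: u64, index: u32) -> Vec<u8> {
    let mut key = Vec::with_capacity(16);
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(&u64::from(index).to_be_bytes());
    key
}
```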

Performance Impact

Background on several metrics of interest, all reported on a per-slot basis (a simplified sketch of how they relate follows the list):

  • replay-slot-stats.fetch_entries: The total amount of time spent fetching entries
  • replay-slot-stats.num_execute_batches: The total number of calls ReplayStage makes to blockstore_processor::confirm_slot() that result in transactions getting executed. This is also the number of times the Blockstore method to fetch entries is called
  • replay-slot-stats.confirmation_time_us: The total amount of time spent within blockstore_processor::confirm_slot(); this is inclusive of both fetch entry time and everything else (i.e., actual tx execution)
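
To make the relationship between these three numbers concrete, here is a hypothetical, highly simplified accumulator; the field and function names are illustrative, not the actual ReplaySlotStats code.

```rust
// Illustrative only: per-slot counters mirroring the metrics described above.
#[derive(Default, Debug)]
struct SlotStats {
    fetch_entries_us: u64,     // total time spent fetching entries for the slot
    num_execute_batches: u64,  // confirm_slot() calls that executed transactions
    confirmation_time_us: u64, // total time inside confirm_slot(), fetch included
}

impl SlotStats {
    // Every confirm_slot()-style call adds one batch, its fetch time, and its
    // total wallclock time to the slot's running totals; num_execute_batches
    // therefore equals the number of entry-fetch calls made for the slot.
    fn record_confirm_call(&mut self, fetch_us: u64, total_us: u64) {
        self.fetch_entries_us += fetch_us;
        self.num_execute_batches += 1;
        self.confirmation_time_us += total_us;
    }
}

fn main() {
    let mut stats = SlotStats::default();
    // e.g. a slot replayed in ~70 small chunks as its shreds stream in
    for _ in 0..70 {
        stats.record_confirm_call(/* fetch_us */ 80, /* total_us */ 1_000);
    }
    println!("{stats:?}");
}
```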

Summarizing some key points from some of the comments below:

  • The removal of this thread pool does increase replay-slot-stats.fetch_entries. Average numbers on my node would suggest an increase from ~3.5 ms to ~5.5 ms
  • replay-slot-stats.num_execute_batches has been observed to have an average value of ~70 over the past two weeks
  • Noting that we're spending ~2 ms more to fetch (on average), that is ~2 ms / 70 ~= 29 us per call to fetch entries
  • The ability to complete replay of a slot is dependent on all the shreds having been received. Once the last shred has been received, there will be exactly 1 more call to confirm_slot() (and thus to fetch entries). So, in isolation, it might appear that we've "delayed" being able to complete a block by ~29 us (on average)
  • However, replay-slot-stats.confirmation_time_us looks pretty consistent for my node before and after making the change to remove the thread-pool. This value staying consistent would suggest that while fetch time is growing marginally, we are at least breaking even by avoiding the thread-pool

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 86a3b03 to 164a27d on January 12, 2024 19:07
@steviez (Contributor, Author) commented Jan 12, 2024

Here are some general notes about the graphs I'll share:

  • The metric being examined here is replay-slot-stats.fetch_entries time; the unit of this metric is microseconds
  • The purple trace is my test node 83TTSHyHsrM9XHNSVAy5C3uXwifAxXMa1GwrHCVRNv77
    • All ledger components on one disk + --enable-rpc-transaction-history
  • The cyan trace is 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 (bv1) running v1.17.13
    • Snapshots + OS on one drive, rest of ledger on another
  • The orange trace is Ninja1spj6n9t5hVYgF3PdnYz2PLnkt7rvaw3firmjs, a 3rd-party node running v1.17.13
    • Zan mentioned having three NVMe drives; one for rocksdb / one for accounts / one for snapshots

Here is the last 12 hours of runtime with no averaging; the change was made on my test node around 08:15:
[chart: replay-slot-stats.fetch_entries, last 12 hours, no averaging]

  • All nodes show some large spikes, although my node and bv1 show both more frequent and more severe spikes than Ninja

Here is the same chart, but with a little averaging (10s bin):
[chart: replay-slot-stats.fetch_entries, 10s averaging]

  • Hard to put definite numbers on it, but my node (purple) looks to have a higher floor than the other two

And finally the same chart with even more averaging (60 min bin):
[chart: replay-slot-stats.fetch_entries, 60 min averaging]

  • The higher level of averaging shows that 7Np and Ninja are pretty close on average, around 3.5 ms, whereas my node is hovering around 4.5 ms, give or take

Lastly, here is the last 24 hours; my node was previously running the tip of master prior to 08:00:
[chart: replay-slot-stats.fetch_entries, last 24 hours]

  • Throwing out the 07:00 and 08:00 datapoints, as these are likely skewed by the node shutdown / restart, we can see that my node was performing in line with (or maybe even slightly better than) the others prior to making the change, and does show a definite step up after the change.

codecov bot commented Jan 12, 2024

Codecov Report

Attention: Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.7%. Comparing base (8143fc3) to head (164a27d).
Report is 1 commit behind head on master.

❗ Current head 164a27d differs from the pull request's most recent head c89bdaa. Consider uploading reports for the commit c89bdaa to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34768     +/-   ##
=========================================
  Coverage    81.7%    81.7%             
=========================================
  Files         834      825      -9     
  Lines      224232   223106   -1126     
=========================================
- Hits       183351   182498    -853     
+ Misses      40881    40608    -273     

@steviez (Contributor, Author) commented Jan 12, 2024

After thinking about this a little more, I think measuring the effect of this PR is more nuanced than the analysis in my first comment. The metric I was examining, replay-slot-stats.fetch_entries, does seem to degrade from ~3.5ms to ~4.5ms with this change. However, a few points about this:

  • The degradation doesn't necessarily mean that replaying a slot is taking >=~1 ms more. That ~1 ms is split up across all of the fetches that occur in the slot. Over the last 2 days on mnb, I see a floor of about ~20 fetches per slot via the replay-slot-stats.num_execute_batches metric. When the last shred arrives (which is out of the control of the Blockstore), there is one fetch remaining for a slot. Assuming evenly distributed time per fetch, that is ~1 ms / 20 fetches = 50 us / fetch difference.
  • This change is reducing the number of threads; we have previously observed and noted negatives with over-provisioned rayon thread pools (a bunch of threads are woken up and try to steal work from each other when there is only enough work for a couple of threads in the first place)
  • While the fetch metric does grow by a small amount in absolute time, the more important metric is arguably the net wallclock time for replaying a slot, replay-slot-stats.replay_total_elapsed. That is, even if we're spending marginally more time fetching, if the total amount of time to replay the block is unchanged, then we're holding performance equal while using fewer resources, which would seemingly be a win. This value staying flat might suggest that while fetch time is growing marginally, we could be breaking even with fewer context switches / cache disruptions / etc.

This graph shows replay-slot-stats.replay_total_elapsed; my node is still purple and bv1 is cyan:
[chart: replay-slot-stats.replay_total_elapsed]
My node was previously doing marginally worse than bv1 but has been on par for the past 12+ hours.

@steviez (Contributor, Author) commented Jan 15, 2024

Marco from Triton chimed in with utilization of several nodes in the public mnb RPC pool:

$ ps -T -p 16256 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      1.0
solBstore01      1.0
solBstore02      0.9
solBstore03      0.8
solBstore04      0.8
solBstore05      0.7
solBstore06      0.7
solBstore07      0.6
solBstore08      0.6
solBstore09      0.5
solBstore10      0.5
solBstore11      0.5
solBstore12      0.4
solBstore13      0.4
solBstore14      0.3
solBstore15      0.3
solBstore16      0.3
solBstore17      0.3
solBstore18      0.2
solBstore19      0.2
solBstore20      0.2
solBstore21      0.2
solBstore22      0.2
solBstore23      0.1
solBstore24      0.1
solBstore25      0.1
solBstore26      0.1
solBstore27      0.1
solBstore28      0.1
solBstore29      0.1
solBstore30      0.1
solBstore31      0.0
...
solBstore55      0.0

AND

$ ps -T -p 16256 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      0.3
solBstore01      0.2
solBstore02      0.2
solBstore03      0.2
solBstore04      0.2
solBstore05      0.2
solBstore06      0.2
solBstore07      0.2
solBstore08      0.2
solBstore09      0.1
solBstore10      0.1
solBstore11      0.1
solBstore12      0.1
solBstore13      0.1
solBstore14      0.1
solBstore15      0.1
solBstore16      0.1
solBstore17      0.0
...
solBstore55      0.0

These numbers support my earlier comment about the thread pool being over-provisioned, even for an RPC node. If we don't rip this thread pool out completely, it certainly seems that putting a limit on the size of the pool, instead of allowing it to scale with the number of threads on the machine, would yield a win.
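
For reference, capping the pool rather than removing it would look roughly like the sketch below, using rayon's builder API. The cap of 8 is an arbitrary placeholder, and this is not what the PR ultimately does (the PR removes the pool entirely).

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

// Sketch: cap the entry-fetch pool at a fixed size instead of letting it
// scale with the machine's thread count (e.g. the 56 solBstore threads above).
fn build_entry_fetch_pool() -> ThreadPool {
    const MAX_ENTRY_FETCH_THREADS: usize = 8; // illustrative cap, not a tuned value

    ThreadPoolBuilder::new()
        .num_threads(MAX_ENTRY_FETCH_THREADS)
        .thread_name(|i| format!("solBstore{i:02}"))
        .build()
        .expect("failed to build entry-fetch thread pool")
}
```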

@stakeconomy (Contributor) commented Jan 15, 2024

Another one, from a busy RPC node in our shared pool:

$ ps -T -p 15263 -o comm,pcpu | grep -E 'solBstore[0-9]'
solBstore00      0.8
solBstore01      0.7
solBstore02      0.7
solBstore03      0.6
solBstore04      0.6
solBstore05      0.5
solBstore06      0.5
solBstore07      0.4
solBstore08      0.4
solBstore09      0.4
solBstore10      0.4
solBstore11      0.3
solBstore12      0.3
solBstore13      0.3
solBstore14      0.3
solBstore15      0.2
solBstore16      0.2
solBstore17      0.2
solBstore18      0.2
solBstore19      0.2
solBstore20      0.1
solBstore21      0.1
solBstore22      0.1
solBstore23      0.1
solBstore24      0.1
solBstore25      0.1
solBstore26      0.0
---
solBstore55      0.0

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 164a27d to ab078cd on January 25, 2024 18:17
@github-actions github-actions bot added the stale label ([bot only] auto-close after a week) on Feb 9, 2024
@steviez steviez removed the stale label on Feb 11, 2024
@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from ab078cd to 25971c5 on February 17, 2024 19:28
@steviez steviez marked this pull request as ready for review February 26, 2024 18:49
@steviez steviez requested a review from t-nelson February 26, 2024 20:12
@steviez (Contributor, Author) commented Feb 26, 2024

Ok @t-nelson - I've let this one run long enough that I feel pretty good about it so formally marked it ready for review + requested review from ya.

@steviez steviez requested a review from bw-solana February 26, 2024 20:20
@t-nelson (Contributor) left a comment:

hell yeah!

@steviez steviez force-pushed the bstore_no_entry_fetch_tpool branch from 25971c5 to 6ad51f0 on February 26, 2024 21:28
@bw-solana (Contributor) left a comment:

Looks pretty good to me. A couple of nits and one more important question about what happens in the error case.

@t-nelson (Contributor) left a comment:

nice! :shipit:


looks like we have some collect() to temporaries in there that might be removable in a follow up

@steviez steviez merged commit 09925a1 into solana-labs:master Feb 27, 2024
35 checks passed
@steviez steviez deleted the bstore_no_entry_fetch_tpool branch February 27, 2024 02:27
@steviez (Contributor, Author) commented Feb 27, 2024

looks like we have some collect() to temporaries in there that might be removable in a follow up

For the sake of a paper trail, we might be able to shave a mem-copy as well. Currently, on the fetch path, we have the following copies from rocksdb to Vec<Entry>:

    0): RocksDB Memtable
        - Technically might be on disk in SST
    1): LedgerColumn::multi_get_bytes() -> Vec<Vec<u8>>
        - Ownership transferred via Shred::new_from_serialized_shred
            - Some modification may happen here with resize / truncate
    2): Shredder::deshred() -> Vec<u8>
        - All shreds payloads copied into common buffer
    3): bincode::deserialize() -> Vec<Entry>

Namely, I think we could save the copy in step 1). Instead, we could have multi_get_ref() -> Vec<&[u8]>, partially deserialize what we need from those, and then copy the payloads directly into the deshred buffer (without making the intermediate copy). This would require a rework of Shred; there are definitely instances where we want Shred to have owned memory, so we'd have to make it work with either owned or referenced memory.
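
As a sketch of what "either owned or referenced memory" could look like, a Cow-backed payload is one option; this is hypothetical, and the real Shred rework would be more involved.

```rust
use std::borrow::Cow;

// Illustrative only: a payload that can either borrow bytes returned by a
// hypothetical multi_get_ref() -> Vec<&[u8]> lookup or own its buffer when
// mutation is required.
struct ShredPayload<'a> {
    payload: Cow<'a, [u8]>,
}

impl<'a> ShredPayload<'a> {
    fn borrowed(bytes: &'a [u8]) -> Self {
        Self { payload: Cow::Borrowed(bytes) }
    }

    fn owned(bytes: Vec<u8>) -> Self {
        Self { payload: Cow::Owned(bytes) }
    }

    // Copy the payload straight into the deshred buffer; borrowed data gets
    // copied exactly once here rather than once at fetch time and again here.
    fn append_to(&self, deshred_buffer: &mut Vec<u8>) {
        deshred_buffer.extend_from_slice(&self.payload);
    }

    // Mutation clones into owned memory on demand, covering the
    // resize/truncate cases in step 1) above.
    fn truncate(&mut self, len: usize) {
        self.payload.to_mut().truncate(len);
    }
}
```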

@t-nelson (Contributor) commented:

so we'd have to make it work with either owned or referenced memory

🐄

jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Feb 27, 2024
Remove the Blockstore thread pool used for fetching Entries (solana-labs#34768)

There are several cases for fetching entries from the Blockstore:
- Fetching entries for block replay
- Fetching entries for CompletedDataSetService
- Fetching entries to service RPC getBlock requests

All of these operations occur in different calling threads. However,
the current implementation utilizes a shared thread-pool within the
Blockstore function. There are several problems with this:
- The thread pool is shared between all of the listed cases, despite
  block replay being the most critical. These other services shouldn't
  be able to interfere with block replay
- The thread pool is overprovisioned for the average use; thread
  utilization on both regular validators and RPC nodes shows that many
  of the threads see very little activity. But these threads existing
  introduces "accounting" overhead
- rocksdb exposes an API to fetch multiple items at once, potentially
  with some parallelization under the hood. Using parallelization in
  both our API and the underlying rocksdb is overkill, and we're doing
  more damage than good.

This change removes that threadpool completely, and instead fetches
all of the desired entries in a single call. This has been observed
to have a minor degradation on the time spent within the Blockstore
get_slot_entries_with_shred_info() function. Namely, some buffer
copying and deserialization that previously occurred in parallel now
occur serially.

However, the metric that tracks the amount of time spent replaying
blocks (inclusive of fetch) is unchanged. Thus, despite spending
marginally more time to fetch/copy/deserialize with only a single
thread, the gains from not thrashing everything else with the pool
keep us at parity.