Adjust the default transaction replay thread pool size #25

Closed
wants to merge 1 commit into master from reduce_tx_exec_tpool

Conversation

@steviez steviez commented Feb 29, 2024

Problem

The thread pool that is used to perform entry (transaction) verification and transaction execution is currently sized to match the number of virtual cores on the machine. For example, a 24-core / 48-thread machine will put 48 threads into this pool.

This thread pool is over-provisioned, and the extra threads actually cause more harm than good. When work is sent to the pool, every thread is woken up, even if there is only work for one or two of them. This "thundering herd" effect causes a lot of general system disruption, and can easily be mitigated by bounding the thread pool size to more accurately fit the workload we throw at it.

Part of work for #35

Summary of Changes

  • Limit the default thread pool size to a maximum of X threads
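
A minimal sketch of this kind of bound, assuming a rayon pool; MAX_REPLAY_THREADS and the helper names are hypothetical, and the actual default chosen by this PR may differ:

use std::num::NonZeroUsize;
use rayon::ThreadPoolBuilder;

// Hypothetical cap on the replay/execution pool; the real default may differ.
const MAX_REPLAY_THREADS: usize = 16;

// Use the machine's available parallelism, but never more than the cap.
fn replay_thread_count() -> usize {
    std::thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
        .min(MAX_REPLAY_THREADS)
}

fn build_replay_pool() -> rayon::ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(replay_thread_count())
        .thread_name(|i| format!("solBstoreProc{i:02}"))
        .build()
        .expect("failed to build replay thread pool")
}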

@steviez steviez added the noCI Suppress CI on this Pull Request label Feb 29, 2024
steviez commented Feb 29, 2024

I ran some nodes against testnet and mainnet to gather some information about the batches getting passed to the thread pool. The collection method was very simple: every time blockstore_processor::execute_batches_internal() was called (this is the function that invokes the thread pool), I logged the length of the batches slice that came in. Each batch is dispatched to one thread, so 3 batches would imply that we have work for 3 threads. Num datapoints refers to the number of times this function was called over the duration:

Testnet (24 hours)
Num datapoints: 4,978,846
Mean:            10.560318796765355
Median:           7.0
Standard dev:     8.819310902536532
Max:             24

Mainnet (25.5 hours)
Num datapoints: 10,070,306
Mean:             5.5567747395163565
Median:           4.0
Standard dev:     4.900140288271035
Max:             24

Granted this is only one day of runtime on one node for each cluster, but I think it is telling. The data on testnet is noisier, but on mainnet:

mean + 0 std. dev:  5.55
mean + 1 std. dev: 10.45
mean + 2 std. dev: 15.35
mean + 3 std. dev: 20.25
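
For reference, the collection described above amounts to roughly the following; this is a hypothetical sketch, not the real blockstore_processor signature:

use log::info;

// Stand-in for the real transaction batch type.
struct TransactionBatch;

// Each batch is dispatched to one thread, so batches.len() is the number of
// threads that actually have work for this call.
fn execute_batches_internal(batches: &[TransactionBatch]) {
    info!("execute_batches_internal: num_batches={}", batches.len());
    // ... hand one batch to each thread in the replay pool ...
}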

steviez commented Feb 29, 2024

Aside from gut feeling, this initial datapoint also suggests that the thread pool is over-provisioned.

  • The first column is obviously the thread name
  • The second column is accumulated CPU utilization over process lifetime, expressed as a percentage
    • For example, solBstoreProc00 utilized 24.8% of a core. Or, it was scheduled 24.8% of the time
  • The third column is the cumulative CPU utilization of that thread plus all previous rows, divided by the CPU utilization of the entire pool
    • For example, the 29.67% in row 2 is saying that Proc00 and Proc01 accounted for 29.67% of the total CPU utilization from all 48 threads
solBstoreProc00	24.8	14.98%
solBstoreProc01	24.3	29.67%
solBstoreProc02	20.2	41.87%
solBstoreProc03	16.1	51.60%
solBstoreProc04	13.1	59.52%
solBstoreProc05	11.4	66.40%
solBstoreProc06	 9.1	71.90%
solBstoreProc07	 7.4	76.37%
solBstoreProc08	 6.2	80.12%
solBstoreProc09	 5.3	83.32%
solBstoreProc10	 4.5	86.04%
solBstoreProc11	 4	88.46%
solBstoreProc12	 3.4	90.51%
solBstoreProc13	 3	92.33%
solBstoreProc14	 2.6	93.90%
solBstoreProc15	 2.2	95.23%
solBstoreProc16	 1.8	96.31%
solBstoreProc17	 1.4	97.16%
solBstoreProc18	 1.1	97.82%
solBstoreProc19	 0.9	98.37%
solBstoreProc20	 0.7	98.79%
solBstoreProc21	 0.6	99.15%
solBstoreProc22	 0.4	99.40%
solBstoreProc23	 0.3	99.58%
solBstoreProc24	 0.2	99.70%
solBstoreProc25	 0.2	99.82%
solBstoreProc26	 0.1	99.88%
solBstoreProc27	 0.1	99.94%
solBstoreProc28	 0.1	100.00%
solBstoreProc29	 0	100.00%
...
solBstoreProc47	 0	100.00%

The numbers show a pretty steep drop-off. I probably lost some precision, but the first 28 threads are doing 99.94% of the work, the first 24 do 99.58%, and the first 16 do 95.23%. These points also suggest that the extra threads are rarely utilized and add overhead for little to no gain (and are potentially doing more harm than good once that general overhead is considered).
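
For clarity, the cumulative third column can be reproduced from the per-thread numbers like this (a sketch using only the first few rows quoted above; the real denominator is the sum over all 48 threads, so the printed shares differ from the table):

fn main() {
    // Per-thread CPU utilization (% of a core), from the first rows of the table.
    let per_thread_util = [24.8_f64, 24.3, 20.2, 16.1, 13.1, 11.4];

    let total: f64 = per_thread_util.iter().sum();
    let mut cumulative = 0.0;
    for (i, util) in per_thread_util.iter().enumerate() {
        cumulative += util;
        // Share of the whole pool's CPU time accounted for by this thread
        // and all of the preceding ones.
        println!("solBstoreProc{i:02}\t{util:>5.1}\t{:.2}%", 100.0 * cumulative / total);
    }
}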

@yihau yihau force-pushed the master branch 2 times, most recently from 91aae97 to 3f9a7a5 Compare March 3, 2024 04:31
@steviez steviez removed the noCI Suppress CI on this Pull Request label Mar 3, 2024
@steviez steviez force-pushed the reduce_tx_exec_tpool branch from 8eeb91e to 8dac066 Compare March 3, 2024 04:55
codecov-commenter commented Mar 3, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (10d0677) to head (cd5eefc).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #25   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         841      841           
  Lines      228307   228307           
=======================================
+ Hits       186941   186973   +32     
+ Misses      41366    41334   -32     

@steviez steviez force-pushed the reduce_tx_exec_tpool branch 2 times, most recently from 0f88a6c to 50dcfa8 Compare March 5, 2024 04:03
steviez commented Mar 5, 2024

For the sake of experimenting, I single-threaded tx replay with the included patch. The node was unable to catch up; this is somewhat expected from looking at the following two metrics:

  • replay-slot-stats.execute_batches_us
  • replay-slot-stats.execute_us

execute_batches_us is the wall-clock time spent executing transactions, while execute_us is the execution time accumulated across threads. For example, if we had 4 threads spend 100 ms each to replay a block, execute_batches_us would be ~100,000 µs (100 ms) and execute_us would be ~400,000 µs (400 ms).

Mainnet metrics for my test nodes show an average of ~450-500 ms for execute_us and an average of ~100-150 ms for execute_batches_us. Single-threading makes execute_batches_us == execute_us, so the single-threaded replay node was processing blocks about as fast as they were being produced, but NOT catching up.
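
A minimal sketch of the difference between the two measurements, using rayon and std timers; the names simply mirror the metrics above and the per-batch work is elided:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Accumulated per-thread execution time, analogous to execute_us.
static EXECUTE_US: AtomicU64 = AtomicU64::new(0);

fn execute_batch(_batch_id: usize) {
    let start = Instant::now();
    // ... replay the transactions in this batch ...
    EXECUTE_US.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
}

fn main() {
    let wall_start = Instant::now();
    // One batch per thread; with N batches running in parallel, the wall-clock
    // time stays near a single batch's duration while the accumulated time
    // grows roughly N times as fast.
    rayon::scope(|s| {
        for batch_id in 0..4 {
            s.spawn(move |_| execute_batch(batch_id));
        }
    });
    // Wall-clock time for the whole parallel section, analogous to execute_batches_us.
    let execute_batches_us = wall_start.elapsed().as_micros();
    println!(
        "execute_batches_us={execute_batches_us} execute_us={}",
        EXECUTE_US.load(Ordering::Relaxed)
    );
}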

steviez commented Mar 6, 2024

This comment shows some graphs for two nodes:

  • The nodes have identical HW specs / in same DC
    • 24 physical cores
  • Prior to 05:00 UTC, the nodes were running the same commit
  • After 05:00 UTC, the blue node was updated to put only 8 threads in this pool (instead of 48); the purple node was also restarted at this time
  • Performance for all the plots I include below was comparable before 05:00 UTC

This graph shows mean execute_batches_us:

  • After 05:00 UTC, the wall-clock time for executing transactions grew by 5-10 ms for the size-8 node (blue)

This graph shows max execute_batches_us:

  • After 05:00 UTC, the size-8 node (blue) shows a significant drop in max time for execute_batches_us

This graph shows mean execute_us for these same two nodes during the same period:

  • After 05:00 UTC, the size-8 node (blue) is spending ~40 ms less in accumulated execute time

Takeaways:

  • The size-48 thread pool performing better for mean execute_batches_us tells us that the larger pool gets through the work faster
  • The size-8 thread pool performing better for max execute_batches_us could indicate that fewer threads competing is reducing the jitter. As more thread pools are tuned, I would expect to see more decreases in jitter across various metrics that utilize threadpools.
  • The size-8 thread pool performing better for mean execute_us tells us that the smaller pool gets through the work more efficiently.
  • Given that we want to keep block time as low as possible, the degradation from the size-8 pool for execute_batches_us is probably not acceptable. I'm restarting with 16 threads, which I'm hopeful will hit a sweet spot for improvements on all 3 of the above metrics

@steviez steviez force-pushed the reduce_tx_exec_tpool branch from 50dcfa8 to cd5eefc Compare March 27, 2024 21:45
@steviez steviez changed the title wip: Limit the transaction replay thread pool size Adjust the default transaction replay thread pool size Mar 27, 2024
@kubanemil kubanemil left a comment


seems ok, why is it not merged yet?

@steviez
Copy link
Author

steviez commented Jul 30, 2024

Enough time has passed - going to close + re-open this PR

@steviez steviez closed this Jul 30, 2024
@steviez steviez deleted the reduce_tx_exec_tpool branch July 30, 2024 22:22