Adjust the default transaction replay thread pool size #25

Closed
wants to merge 1 commit into master from reduce_tx_exec_tpool

Conversation

@steviez steviez commented Feb 29, 2024

Problem

The thread pool that is used to perform entry (transaction) verification and transaction execution is currently sized to match the number of virtual cores on the machine. For example, a 24-core / 48-thread machine will put 48 threads into this pool.

This thread pool is over-provisioned, and the extra threads actually cause more harm than good. When work is sent to the pool, every thread is woken up, even if there is only work for one or two of them. This "thundering herd" effect causes a lot of general system disruption, and can easily be mitigated by bounding the thread pool size to more accurately fit the workload we throw at it.

Part of work for #35

Summary of Changes

  • Limit the default thread pool size to a maximum of X threads
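
A minimal sketch of this kind of bound, assuming a rayon pool; MAX_REPLAY_THREADS and the helper names are hypothetical, and the actual default chosen by this PR may differ:

use std::num::NonZeroUsize;
use rayon::ThreadPoolBuilder;

// Hypothetical cap on the replay/execution pool; the real default may differ.
const MAX_REPLAY_THREADS: usize = 16;

// Use the machine's available parallelism, but never more than the cap.
fn replay_thread_count() -> usize {
    std::thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
        .min(MAX_REPLAY_THREADS)
}

fn build_replay_pool() -> rayon::ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(replay_thread_count())
        .thread_name(|i| format!("solBstoreProc{i:02}"))
        .build()
        .expect("failed to build replay thread pool")
}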

@steviez steviez added the noCI Suppress CI on this Pull Request label Feb 29, 2024
steviez commented Feb 29, 2024

I ran some nodes against testnet and mainnet to gather some information about the batches getting passed to the thread pool. The collection method was very simple: every time blockstore_processor::execute_batches_internal() was called (this is the function that invokes the thread pool), I logged the length of the batches slice that came in. Each batch is dispatched to one thread, so 3 batches would imply that we have work for 3 threads. Num datapoints refers to the number of times this function was called over the duration:

Testnet (24 hours)
Num datapoints: 4,978,846
Mean:            10.560318796765355
Median:           7.0
Standard dev:     8.819310902536532
Max:             24

Mainnet (25.5 hours)
Num datapoints: 10,070,306
Mean:             5.5567747395163565
Median:           4.0
Standard dev:     4.900140288271035
Max:             24

Granted this is only one day of runtime on one node for each cluster, but I think it is telling. The data on testnet is noisier, but on mainnet:

mean + 0 std. dev:  5.55
mean + 1 std. dev: 10.45
mean + 2 std. dev: 15.35
mean + 3 std. dev: 20.25
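
For reference, the collection described above amounts to roughly the following; this is a hypothetical sketch, not the real blockstore_processor signature:

use log::info;

// Stand-in for the real transaction batch type.
struct TransactionBatch;

// Each batch is dispatched to one thread, so batches.len() is the number of
// threads that actually have work for this call.
fn execute_batches_internal(batches: &[TransactionBatch]) {
    info!("execute_batches_internal: num_batches={}", batches.len());
    // ... hand one batch to each thread in the replay pool ...
}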

steviez commented Feb 29, 2024

Aside from gut feeling, this initial datapoint also suggests that the thread pool is over-provisioned.

  • The first column is obviously the thread name
  • The second column is accumulated CPU utilization over process lifetime, expressed as a percentage
    • For example, solBstoreProc00 utilized 24.8% of a core. Or, it was scheduled 24.8% of the time
  • The third column is the cumulative CPU utilization of that thread plus all previous rows, divided by the CPU utilization of the entire pool
    • For example, the 29.67% in row 2 is saying that Proc00 and Proc01 accounted for 29.67% of the total CPU utilization from all 48 threads
solBstoreProc00	24.8	14.98%
solBstoreProc01	24.3	29.67%
solBstoreProc02	20.2	41.87%
solBstoreProc03	16.1	51.60%
solBstoreProc04	13.1	59.52%
solBstoreProc05	11.4	66.40%
solBstoreProc06	 9.1	71.90%
solBstoreProc07	 7.4	76.37%
solBstoreProc08	 6.2	80.12%
solBstoreProc09	 5.3	83.32%
solBstoreProc10	 4.5	86.04%
solBstoreProc11	 4	88.46%
solBstoreProc12	 3.4	90.51%
solBstoreProc13	 3	92.33%
solBstoreProc14	 2.6	93.90%
solBstoreProc15	 2.2	95.23%
solBstoreProc16	 1.8	96.31%
solBstoreProc17	 1.4	97.16%
solBstoreProc18	 1.1	97.82%
solBstoreProc19	 0.9	98.37%
solBstoreProc20	 0.7	98.79%
solBstoreProc21	 0.6	99.15%
solBstoreProc22	 0.4	99.40%
solBstoreProc23	 0.3	99.58%
solBstoreProc24	 0.2	99.70%
solBstoreProc25	 0.2	99.82%
solBstoreProc26	 0.1	99.88%
solBstoreProc27	 0.1	99.94%
solBstoreProc28	 0.1	100.00%
solBstoreProc29	 0	100.00%
...
solBstoreProc47	 0	100.00%

The numbers show a pretty steep drop-off. I probably lost some precision, but the first 28 threads are doing 99.94% of the work, the first 24 do 99.58%, and the first 16 do 95.23%. These points also suggest that the extra threads are rarely utilized and add overhead for little to no gain (and are potentially doing more harm than good once that general overhead is considered).
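
For clarity, the cumulative third column can be reproduced from the per-thread numbers like this (a sketch using only the first few rows quoted above; the real denominator is the sum over all 48 threads, so the printed shares differ from the table):

fn main() {
    // Per-thread CPU utilization (% of a core), from the first rows of the table.
    let per_thread_util = [24.8_f64, 24.3, 20.2, 16.1, 13.1, 11.4];

    let total: f64 = per_thread_util.iter().sum();
    let mut cumulative = 0.0;
    for (i, util) in per_thread_util.iter().enumerate() {
        cumulative += util;
        // Share of the whole pool's CPU time accounted for by this thread
        // and all of the preceding ones.
        println!("solBstoreProc{i:02}\t{util:>5.1}\t{:.2}%", 100.0 * cumulative / total);
    }
}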

@yihau yihau force-pushed the master branch 2 times, most recently from 91aae97 to 3f9a7a5 Compare March 3, 2024 04:31
@steviez steviez removed the noCI Suppress CI on this Pull Request label Mar 3, 2024
@steviez steviez force-pushed the reduce_tx_exec_tpool branch from 8eeb91e to 8dac066 Compare March 3, 2024 04:55
codecov-commenter commented Mar 3, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (10d0677) to head (cd5eefc).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #25   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         841      841           
  Lines      228307   228307           
=======================================
+ Hits       186941   186973   +32     
+ Misses      41366    41334   -32     

@steviez steviez force-pushed the reduce_tx_exec_tpool branch 2 times, most recently from 0f88a6c to 50dcfa8 Compare March 5, 2024 04:03
steviez commented Mar 5, 2024

For the sake of experimenting, I single-threaded tx replay with the included patch. The node was unable to catch up; this is somewhat expected from looking at the following two metrics:

  • replay-slot-stats.execute_batches_us
  • replay-slot-stats.execute_us

execute_batches_us is the wall-clock time spent executing transactions, while execute_us is the execution time accumulated across threads. For example, if we had 4 threads spend 100 ms each to replay a block, execute_batches_us would be ~100,000 µs (100 ms) and execute_us would be ~400,000 µs (400 ms).

Mainnet metrics for my test nodes show an average of ~450-500 ms for execute_us and an average of ~100-150 ms for execute_batches_us. Single-threading makes execute_batches_us == execute_us, so the single-threaded replay node was processing blocks about as fast as they were being produced, but NOT catching up.
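
A minimal sketch of the difference between the two measurements, using rayon and std timers; the names simply mirror the metrics above and the per-batch work is elided:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Accumulated per-thread execution time, analogous to execute_us.
static EXECUTE_US: AtomicU64 = AtomicU64::new(0);

fn execute_batch(_batch_id: usize) {
    let start = Instant::now();
    // ... replay the transactions in this batch ...
    EXECUTE_US.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
}

fn main() {
    let wall_start = Instant::now();
    // One batch per thread; with N batches running in parallel, the wall-clock
    // time stays near a single batch's duration while the accumulated time
    // grows roughly N times as fast.
    rayon::scope(|s| {
        for batch_id in 0..4 {
            s.spawn(move |_| execute_batch(batch_id));
        }
    });
    // Wall-clock time for the whole parallel section, analogous to execute_batches_us.
    let execute_batches_us = wall_start.elapsed().as_micros();
    println!(
        "execute_batches_us={execute_batches_us} execute_us={}",
        EXECUTE_US.load(Ordering::Relaxed)
    );
}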

steviez commented Mar 6, 2024

This comment shows some graphs for two nodes:

  • The nodes have identical HW specs / in same DC
    • 24 physical cores
  • Prior to 05:00 UTC, the nodes were running the same commit
  • After 05:00 UTC, the blue node was updated to put only 8 threads in this pool (instead of 48); the purple node was also restarted at this time
  • Performance for all the plots I include below was comparable before 05:00 UTC

This graph shows mean execute_batches_us:

  • After 05:00 UTC, the wall-clock time for executing transactions grew by 5-10 ms for the size-8 node (blue)

This graph shows max execute_batches_us:

  • After 05:00 UTC, the size-8 node (blue) shows a significant drop in max time for execute_batches_us

This graph shows mean execute_us for these same two nodes during the same period:

  • After 05:00 UTC, the size-8 node (blue) is spending ~40 ms less in accumulated execute time

Takeaways:

  • The size-48 thread pool performing better for mean execute_batches_us tells us that the larger pool gets through the work faster
  • The size-8 thread pool performing better for max execute_batches_us could indicate that fewer threads competing is reducing the jitter. As more thread pools are tuned, I would expect to see more decreases in jitter across various metrics that utilize threadpools.
  • The size-8 thread pool performing better for mean execute_us tells us that the smaller pool gets through the work more efficiently.
  • Given that we want to keep block time as low as possible, the degradation from the size-8 pool for execute_batches_us is probably not acceptable. I'm restarting with 16 threads, which I'm hopeful will hit a sweet spot for improvements on all 3 of the above metrics

@steviez steviez force-pushed the reduce_tx_exec_tpool branch from 50dcfa8 to cd5eefc Compare March 27, 2024 21:45
@steviez steviez changed the title wip: Limit the transaction replay thread pool size Adjust the default transaction replay thread pool size Mar 27, 2024
@kubanemil kubanemil left a comment


seems ok, why is it not merged yet?

@steviez
Copy link
Author

steviez commented Jul 30, 2024

Enough time has passed - going to close + re-open this PR

@steviez steviez closed this Jul 30, 2024
@steviez steviez deleted the reduce_tx_exec_tpool branch July 30, 2024 22:22