Share the threadpool for tx execution and entry verifification #216

steviez · 2024-03-12T22:52:43Z

Problem

Previously, entry verification had a dedicated threadpool used to verify PoH hashes as well as some basic transaction verification via Bank::verify_transaction(). It should also be noted that the entry verification code provides logic to offload to a GPU if one is present.

Regardless of whether a GPU is present or not, some of the verification must be done on a CPU. Moreso, the CPU verification of entries and transaction execution are serial operations; entry verification finishes first before moving onto transaction execution.

Summary of Changes

So, tx execution and entry verification are not competing for CPU cycles at the same time and can use the same pool.

One exception to the above statement is that if someone is using the feature to replay forks in parallel, then hypothetically, different forks may end up competing for the same resources at the same time. However, that is already true given that we had pools that were shared between replay of multiple forks. So, this change doesn't really change much for that case, but will reduce overhead in the single fork case which is the vast majority of the time.

This PR part of work for #35

Performance / Testing

In order to test, I got two identical nodes running against mnb (same DC / hardware / software). I then updated one node to pull in the extra commit to use same thread pool while other node stayed constant. This allows us to control for variations in timing data that naturally ebbs and flows with network activity. The nodes were running the same commit until about ~07:00 on March 21. At that point, one node was updated but both were restarted for the sake of keeping experiment as controlled as possible.

A couple relevant metrics that I'll highlight:

replay-slot-stats.entry_poh_verification_time: time spent doing PoH verification
replay-slot-stats.entry_transaction_verification_time: time spent verifying tx's
replay-slot-stats.confirmation_time_us: time spent within blockstore_processor::confirm() slot, which is inclusive of above two numbers as well as everything else (fetching shreds, tx execution, etc)

Lastly, the cyan/light blue trace is the node with this branch whereas the purple trace is the control node.

Figure 1: Mean PoH Verification Time

Figure 2: Mean Tx Verification Time

Figure 3: Mean Confirmation Time

Looking at these, the traces are very similar both before and after the update. This is pretty expected; we're still using the same number of threads for each operation. However, we're decreasing general system overhead by having fewer threads. But, if we turn up the averaging (via group by) to a larger value, we can that there does appear to be a very small performance improvement as well.

Figure 4: Mean PoH Verification Time (6 hour buckets for group by)

Figure 5: Mean Tx Verification Time (6 hour buckets for group by)

Figure 6: Mean Confirmation Time (6 hour buckets for group by)

Figure 4 and 5 show that the gap increases between the traces. However, it should be called out that the difference is pretty small, maybe less than half a millisecond per each. The same trend is there for Figure 6; however, confirmation time is a larger chunk of time overall so these marginal differences are barely visible on the graph.

steviez · 2024-03-25T20:56:37Z

I updated a bunch of data in place above. For the sake of fairness, it does appear that one of the nodes (light blue) was slightly better than the other (purple). I'll swap the control/experimental nodes and confirm that my observed equality / equal performance follows the software, and not a lucky pick of hardware.

I will however re-iterate that given that we're reducing overhead with this PR, maintaining performance on the specific metrics is sufficient in my opinion. That is, if we maintain performance while using less resources (threads), that is still a win

entry/src/entry.rs

steviez · 2024-03-26T20:36:44Z

Had to force push to resolve a conflict that appeared in master

For the sake of fairness, it does appear that one of the nodes (light blue) was slightly better than the other (purple). I'll swap the control/experimental nodes and confirm that my observed equality / equal performance follows the software, and not a lucky pick of hardware.

I did this experiment and confirmed that the after swapping which machine was running which branch, the slightly better performance followed the machine with this branch.

codecov-commenter · 2024-03-26T22:02:22Z

Codecov Report

Attention: Patch coverage is 70.09346% with 32 lines in your changes are missing coverage. Please review.

Project coverage is 81.8%. Comparing base (80d3200) to head (d3df7da).

Additional details and impacted files

@@            Coverage Diff            @@
##           master     #216     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         841      841             
  Lines      228242   228265     +23     
=========================================
- Hits       186923   186904     -19     
- Misses      41319    41361     +42

entry/src/entry.rs

Previously, entry verification had a dedicated threadpool used to verify PoH hashes as well as some basic transaction verification via Bank::verify_transaction(). It should also be noted that the entry verification code provides logic to offload to a GPU if one is present. Regardless of whether a GPU is present or not, some of the verification must be done on a CPU. Moreso, the CPU verification of entries and transaction execution are serial operations; entry verification finishes first before moving onto transaction execution. So, tx execution and entry verification are not competing for CPU cycles at the same time and can use the same pool. One exception to the above statement is that if someone is using the feature to replay forks in parallel, then hypothetically, different forks may end up competing for the same resources at the same time. However, that is already true given that we had pools that were shared between replay of multiple forks. So, this change doesn't really change much for that case, but will reduce overhead in the single fork case which is the vast majority of the time.

bw-solana

LGTM. I'll be interested to see the performance results after flip-flopping the HW, but it seems like we have enough evidence to say there's no performance regression

steviez · 2024-03-27T16:45:08Z

LGTM. I'll be interested to see the performance results after flip-flopping the HW, but it seems like we have enough evidence to say there's no performance regression

Definitely better to leave a paper-trail here as opposed to "trust me, I checked it" haha. I switched the nodes over around 21:00 on March 25:

Orange is control node before the swap / experimental node after teh swap
Pink is the experimental node before swap
Blue is the control node after the swap

Figure 1: Mean PoH Verification Time

Figure 2: Mean Tx Verification Time

Figure 3: Mean Confirmation Time

Graphs look pretty much the same with experimental edging out control slightly

steviez added the noCI Suppress CI on this Pull Request label Mar 21, 2024

steviez force-pushed the rm_entry_tpool branch 3 times, most recently from 58f3e04 to 87e6a2c Compare March 25, 2024 17:30

steviez removed the noCI Suppress CI on this Pull Request label Mar 25, 2024

steviez force-pushed the rm_entry_tpool branch 2 times, most recently from a759e58 to a7ef7da Compare March 25, 2024 20:03

steviez changed the title ~~wip: Share the threadpool for tx execution and entry verifification~~ Share the threadpool for tx execution and entry verifification Mar 25, 2024

steviez force-pushed the rm_entry_tpool branch 2 times, most recently from a5569e1 to 5f650df Compare March 25, 2024 20:30

steviez marked this pull request as ready for review March 25, 2024 20:49

steviez commented Mar 25, 2024

View reviewed changes

entry/src/entry.rs Show resolved Hide resolved

steviez requested review from t-nelson and bw-solana March 25, 2024 21:31

steviez force-pushed the rm_entry_tpool branch from 5f650df to 697f7ee Compare March 26, 2024 20:34

bw-solana reviewed Mar 27, 2024

View reviewed changes

entry/src/entry.rs Outdated Show resolved Hide resolved

steviez added 5 commits March 26, 2024 22:15

Update unit tests

6985455

Fixup benches

53cb346

Rename entry_thread_pool_for_tests --> thread_pool_for_tests

75d5c5d

Comment to explain thread count in test pool

d3df7da

steviez force-pushed the rm_entry_tpool branch from 697f7ee to d3df7da Compare March 27, 2024 03:15

bw-solana approved these changes Mar 27, 2024

View reviewed changes

steviez merged commit 10d0677 into anza-xyz:master Mar 27, 2024
48 checks passed

steviez deleted the rm_entry_tpool branch March 27, 2024 21:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Share the threadpool for tx execution and entry verifification #216

Share the threadpool for tx execution and entry verifification #216

steviez commented Mar 12, 2024 •

edited

Loading

steviez commented Mar 25, 2024

steviez commented Mar 26, 2024

codecov-commenter commented Mar 26, 2024 •

edited

Loading

bw-solana left a comment •

edited

Loading

steviez commented Mar 27, 2024

Share the threadpool for tx execution and entry verifification #216

Share the threadpool for tx execution and entry verifification #216

Conversation

steviez commented Mar 12, 2024 • edited Loading

Problem

Summary of Changes

Performance / Testing

steviez commented Mar 25, 2024

steviez commented Mar 26, 2024

codecov-commenter commented Mar 26, 2024 • edited Loading

Codecov Report

bw-solana left a comment • edited Loading

Choose a reason for hiding this comment

steviez commented Mar 27, 2024

steviez commented Mar 12, 2024 •

edited

Loading

codecov-commenter commented Mar 26, 2024 •

edited

Loading

bw-solana left a comment •

edited

Loading