Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share the threadpool for tx execution and entry verifification #216

Merged
merged 5 commits into from
Mar 27, 2024

Conversation

steviez
Copy link

@steviez steviez commented Mar 12, 2024

Problem

Previously, entry verification had a dedicated threadpool used to verify PoH hashes as well as some basic transaction verification via Bank::verify_transaction(). It should also be noted that the entry verification code provides logic to offload to a GPU if one is present.

Regardless of whether a GPU is present or not, some of the verification must be done on a CPU. Moreso, the CPU verification of entries and transaction execution are serial operations; entry verification finishes first before moving onto transaction execution.

Summary of Changes

So, tx execution and entry verification are not competing for CPU cycles at the same time and can use the same pool.

One exception to the above statement is that if someone is using the feature to replay forks in parallel, then hypothetically, different forks may end up competing for the same resources at the same time. However, that is already true given that we had pools that were shared between replay of multiple forks. So, this change doesn't really change much for that case, but will reduce overhead in the single fork case which is the vast majority of the time.

This PR part of work for #35

Performance / Testing

In order to test, I got two identical nodes running against mnb (same DC / hardware / software). I then updated one node to pull in the extra commit to use same thread pool while other node stayed constant. This allows us to control for variations in timing data that naturally ebbs and flows with network activity. The nodes were running the same commit until about ~07:00 on March 21. At that point, one node was updated but both were restarted for the sake of keeping experiment as controlled as possible.

A couple relevant metrics that I'll highlight:

  • replay-slot-stats.entry_poh_verification_time: time spent doing PoH verification
  • replay-slot-stats.entry_transaction_verification_time: time spent verifying tx's
  • replay-slot-stats.confirmation_time_us: time spent within blockstore_processor::confirm() slot, which is inclusive of above two numbers as well as everything else (fetching shreds, tx execution, etc)

Lastly, the cyan/light blue trace is the node with this branch whereas the purple trace is the control node.

Figure 1: Mean PoH Verification Time
image

Figure 2: Mean Tx Verification Time
image

Figure 3: Mean Confirmation Time
image

Looking at these, the traces are very similar both before and after the update. This is pretty expected; we're still using the same number of threads for each operation. However, we're decreasing general system overhead by having fewer threads. But, if we turn up the averaging (via group by) to a larger value, we can that there does appear to be a very small performance improvement as well.

Figure 4: Mean PoH Verification Time (6 hour buckets for group by)
image

Figure 5: Mean Tx Verification Time (6 hour buckets for group by)
image

Figure 6: Mean Confirmation Time (6 hour buckets for group by)
image

Figure 4 and 5 show that the gap increases between the traces. However, it should be called out that the difference is pretty small, maybe less than half a millisecond per each. The same trend is there for Figure 6; however, confirmation time is a larger chunk of time overall so these marginal differences are barely visible on the graph.

@steviez steviez added the noCI Suppress CI on this Pull Request label Mar 21, 2024
@steviez steviez force-pushed the rm_entry_tpool branch 3 times, most recently from 58f3e04 to 87e6a2c Compare March 25, 2024 17:30
@steviez steviez removed the noCI Suppress CI on this Pull Request label Mar 25, 2024
@steviez steviez force-pushed the rm_entry_tpool branch 2 times, most recently from a759e58 to a7ef7da Compare March 25, 2024 20:03
@steviez steviez changed the title wip: Share the threadpool for tx execution and entry verifification Share the threadpool for tx execution and entry verifification Mar 25, 2024
@steviez steviez force-pushed the rm_entry_tpool branch 2 times, most recently from a5569e1 to 5f650df Compare March 25, 2024 20:30
@steviez steviez marked this pull request as ready for review March 25, 2024 20:49
@steviez
Copy link
Author

steviez commented Mar 25, 2024

I updated a bunch of data in place above. For the sake of fairness, it does appear that one of the nodes (light blue) was slightly better than the other (purple). I'll swap the control/experimental nodes and confirm that my observed equality / equal performance follows the software, and not a lucky pick of hardware.

I will however re-iterate that given that we're reducing overhead with this PR, maintaining performance on the specific metrics is sufficient in my opinion. That is, if we maintain performance while using less resources (threads), that is still a win

@steviez
Copy link
Author

steviez commented Mar 26, 2024

Had to force push to resolve a conflict that appeared in master

For the sake of fairness, it does appear that one of the nodes (light blue) was slightly better than the other (purple). I'll swap the control/experimental nodes and confirm that my observed equality / equal performance follows the software, and not a lucky pick of hardware.

I did this experiment and confirmed that the after swapping which machine was running which branch, the slightly better performance followed the machine with this branch.

@codecov-commenter
Copy link

codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 70.09346% with 32 lines in your changes are missing coverage. Please review.

Project coverage is 81.8%. Comparing base (80d3200) to head (d3df7da).

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #216     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         841      841             
  Lines      228242   228265     +23     
=========================================
- Hits       186923   186904     -19     
- Misses      41319    41361     +42     

entry/src/entry.rs Outdated Show resolved Hide resolved
steviez added 5 commits March 26, 2024 22:15
Previously, entry verification had a dedicated threadpool used to verify
PoH hashes as well as some basic transaction verification via
Bank::verify_transaction(). It should also be noted that the entry
verification code provides logic to offload to a GPU if one is present.

Regardless of whether a GPU is present or not, some of the verification
must be done on a CPU. Moreso, the CPU verification of entries and
transaction execution are serial operations; entry verification finishes
first before moving onto transaction execution.

So, tx execution and entry verification are not competing for CPU cycles
at the same time and can use the same pool.

One exception to the above statement is that if someone is using the
feature to replay forks in parallel, then hypothetically, different
forks may end up competing for the same resources at the same time.
However, that is already true given that we had pools that were shared
between replay of multiple forks. So, this change doesn't really change
much for that case, but will reduce overhead in the single fork case
which is the vast majority of the time.
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll be interested to see the performance results after flip-flopping the HW, but it seems like we have enough evidence to say there's no performance regression

@steviez
Copy link
Author

steviez commented Mar 27, 2024

LGTM. I'll be interested to see the performance results after flip-flopping the HW, but it seems like we have enough evidence to say there's no performance regression

Definitely better to leave a paper-trail here as opposed to "trust me, I checked it" haha. I switched the nodes over around 21:00 on March 25:

  • Orange is control node before the swap / experimental node after teh swap
  • Pink is the experimental node before swap
  • Blue is the control node after the swap

Figure 1: Mean PoH Verification Time
image

Figure 2: Mean Tx Verification Time
image

Figure 3: Mean Confirmation Time
image

Graphs look pretty much the same with experimental edging out control slightly

@steviez steviez merged commit 10d0677 into anza-xyz:master Mar 27, 2024
48 checks passed
@steviez steviez deleted the rm_entry_tpool branch March 27, 2024 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

3 participants