Share the threadpool for tx execution and entry verification #216
I updated a bunch of data in place above. For the sake of fairness, it does appear that one of the nodes (light blue) was slightly better than the other (purple). I'll swap the control/experimental nodes and confirm that the observed equal performance follows the software, and not a lucky pick of hardware. I will, however, reiterate that given that we're reducing overhead with this PR, maintaining performance on the specific metrics is sufficient in my opinion. That is, if we maintain performance while using fewer resources (threads), that is still a win.
Had to force push to resolve a conflict that appeared in master
I did this experiment and confirmed that, after swapping which machine was running which branch, the slightly better performance followed the machine with this branch.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #216 +/- ##
=========================================
- Coverage 81.8% 81.8% -0.1%
=========================================
Files 841 841
Lines 228242 228265 +23
=========================================
- Hits 186923 186904 -19
- Misses 41319 41361 +42
LGTM. I'll be interested to see the performance results after flip-flopping the HW, but it seems like we have enough evidence to say there's no performance regression
Problem
Previously, entry verification had a dedicated threadpool used to verify PoH hashes as well as some basic transaction verification via Bank::verify_transaction(). It should also be noted that the entry verification code provides logic to offload to a GPU if one is present. Regardless of whether a GPU is present or not, some of the verification must be done on a CPU. Moreover, the CPU verification of entries and transaction execution are serial operations; entry verification finishes first before moving on to transaction execution.
Summary of Changes
So, tx execution and entry verification are not competing for CPU cycles at the same time and can use the same pool.
One exception to the above statement is that if someone is using the feature to replay forks in parallel, then hypothetically, different forks may end up competing for the same resources at the same time. However, that is already true given that we had pools that were shared between replay of multiple forks. So, this change doesn't really change much for that case, but will reduce overhead in the single fork case which is the vast majority of the time.
This PR is part of the work for #35
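As a rough illustration of the idea (not the code in this PR), the sketch below runs two serial phases on one pool. `SharedPool`, `verify_entry`, `execute_tx`, and `replay` are hypothetical stand-ins built only on the standard library; the real validator code uses its own pool and real PoH / transaction logic.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical minimal thread pool; a stand-in for the validator's real pool.
struct SharedPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl SharedPool {
    fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is released at the end of this statement,
                    // so it is not held while the job runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Apply `f` to every item on the pool and block until all results are in.
    /// Results are returned in input order.
    fn map<T, R, F>(&self, items: Vec<T>, f: F) -> Vec<R>
    where
        T: Send + 'static,
        R: Send + 'static,
        F: Fn(T) -> R + Send + Sync + 'static,
    {
        let f = Arc::new(f);
        let (tx, rx) = mpsc::channel();
        let count = items.len();
        for (index, item) in items.into_iter().enumerate() {
            let (f, tx) = (Arc::clone(&f), tx.clone());
            let job: Job = Box::new(move || {
                let _ = tx.send((index, f(item)));
            });
            self.sender.as_ref().unwrap().send(job).unwrap();
        }
        drop(tx); // only the queued jobs hold senders now
        let mut results: Vec<(usize, R)> = rx.iter().take(count).collect();
        results.sort_by_key(|(index, _)| *index);
        results.into_iter().map(|(_, r)| r).collect()
    }
}

impl Drop for SharedPool {
    fn drop(&mut self) {
        drop(self.sender.take()); // closing the channel stops the workers
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Stand-in for entry verification (an arbitrary check, not real PoH).
fn verify_entry(entry: u64) -> bool {
    entry % 2 == 0
}

/// Stand-in for transaction execution.
fn execute_tx(tx: u64) -> u64 {
    tx * 10
}

/// Phase 1 (verification) finishes before phase 2 (execution) starts,
/// so the two phases never compete for the pool's threads.
fn replay(pool: &SharedPool, entries: Vec<u64>) -> Option<Vec<u64>> {
    let all_valid = pool.map(entries.clone(), verify_entry).into_iter().all(|ok| ok);
    if !all_valid {
        return None;
    }
    Some(pool.map(entries, execute_tx))
}

fn main() {
    let pool = SharedPool::new(4);
    println!("{:?}", replay(&pool, vec![2, 4, 6]));
}
```

Because `map` blocks until every result has arrived, the pool is idle when the second phase begins, which is the property that makes a second dedicated pool unnecessary in the single fork case.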
Performance / Testing
In order to test, I got two identical nodes running against mnb (same DC / hardware / software). I then updated one node to pull in the extra commit to use the same thread pool while the other node stayed constant. This allows us to control for variations in timing data, which naturally ebbs and flows with network activity. The nodes were running the same commit until about 07:00 on March 21. At that point, one node was updated but both were restarted for the sake of keeping the experiment as controlled as possible.
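One way to read this setup numerically is a difference-in-differences: compare the gap between the two nodes after the update against the gap they already had before it. This is only a sketch of that arithmetic, not how the graphs in this PR were produced; `NodeSamples`, `diff_in_diff`, and the sample values are all made up for illustration.

```rust
/// Arithmetic mean; returns None for an empty slice.
fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Timing samples for one node, split at the moment the experiment node was updated.
/// (Hypothetical type for this sketch.)
struct NodeSamples {
    before: Vec<f64>,
    after: Vec<f64>,
}

/// (experiment - control) after the update, minus the same gap before the update.
/// A negative result means the experiment node got faster relative to the control,
/// beyond whatever hardware gap already existed.
fn diff_in_diff(control: &NodeSamples, experiment: &NodeSamples) -> Option<f64> {
    let baseline_gap = mean(&experiment.before)? - mean(&control.before)?;
    let post_gap = mean(&experiment.after)? - mean(&control.after)?;
    Some(post_gap - baseline_gap)
}

fn main() {
    // Made-up numbers in arbitrary time units, not measurements from this PR.
    let control = NodeSamples { before: vec![100.0, 102.0], after: vec![110.0, 112.0] };
    let experiment = NodeSamples { before: vec![101.0, 103.0], after: vec![110.0, 111.0] };
    println!("{:?}", diff_in_diff(&control, &experiment));
}
```

Subtracting the baseline gap removes both the shared network-driven drift and any fixed hardware difference between the two machines.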
A couple relevant metrics that I'll highlight:
- replay-slot-stats.entry_poh_verification_time: time spent doing PoH verification
- replay-slot-stats.entry_transaction_verification_time: time spent verifying tx's
- replay-slot-stats.confirmation_time_us: time spent within blockstore_processor::confirm() slot, which is inclusive of above two numbers as well as everything else (fetching shreds, tx execution, etc)
Lastly, the cyan/light blue trace is the node with this branch whereas the purple trace is the control node.
Figure 1: Mean PoH Verification Time
Figure 2: Mean Tx Verification Time
Figure 3: Mean Confirmation Time
Looking at these, the traces are very similar both before and after the update. This is pretty expected; we're still using the same number of threads for each operation. However, we're decreasing general system overhead by having fewer threads. But, if we turn up the averaging (via group by) to a larger value, we can see that there does appear to be a very small performance improvement as well.
Figure 4: Mean PoH Verification Time (6 hour buckets for group by)
Figure 5: Mean Tx Verification Time (6 hour buckets for group by)
Figure 6: Mean Confirmation Time (6 hour buckets for group by)
Figures 4 and 5 show that the gap increases between the traces. However, it should be called out that the difference is pretty small, maybe less than half a millisecond each. The same trend is there for Figure 6; however, confirmation time is a larger chunk of time overall so these marginal differences are barely visible on the graph.
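For readers unfamiliar with what a larger group by window does, here is a minimal sketch of fixed-window averaging. It assumes samples arrive as (unix seconds, value) pairs; `bucketed_means` is a hypothetical helper, not part of the metrics tooling used for the graphs above.

```rust
use std::collections::BTreeMap;

/// Group (unix_seconds, value) samples into fixed-width windows and return
/// (window_start, mean) pairs ordered by window start.
fn bucketed_means(samples: &[(u64, f64)], bucket_secs: u64) -> Vec<(u64, f64)> {
    assert!(bucket_secs > 0, "bucket width must be non-zero");
    // window start -> (running sum, sample count)
    let mut buckets: BTreeMap<u64, (f64, u64)> = BTreeMap::new();
    for &(timestamp, value) in samples {
        let start = timestamp - timestamp % bucket_secs;
        let entry = buckets.entry(start).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    buckets
        .into_iter()
        .map(|(start, (sum, count))| (start, sum / count as f64))
        .collect()
}

fn main() {
    const SIX_HOURS: u64 = 6 * 60 * 60;
    // Made-up samples, not data from this PR.
    let samples = [(0, 1.0), (3_600, 3.0), (21_600, 10.0), (25_000, 20.0)];
    for (start, mean) in bucketed_means(&samples, SIX_HOURS) {
        println!("window starting at {start}s: mean {mean}");
    }
}
```

Wider windows average away short-lived swings from network activity, which is why a small but consistent gap between two traces becomes easier to see.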