
check: set lastBlockTime in PrepareRequest handler, fix #55 (#56)

Merged 1 commit into master from move-lastblock-time on Jan 10, 2025

Conversation

roman-khimov
Member

It's the earliest point; if a view change happens the timer is reset (because
the new view comes with a new set of transactions, potentially picking up ones
received between views).

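A minimal sketch of the idea (illustrative Go only; the type, field and method names below are assumptions, not the actual dbft API): remember the time of the PrepareRequest as the start of the current block interval, so the timer for the next block counts from the earliest possible point, and drop that reference on a view change because the new view brings its own PrepareRequest.

```go
package consensus

import "time"

// context is a hypothetical consensus context used only to illustrate the
// change described above.
type context struct {
	lastBlockTime time.Time     // start of the current block interval
	blockInterval time.Duration // target time between blocks
}

// onPrepareRequest is invoked once a valid PrepareRequest is accepted.
// This is the earliest point at which the block being built is known,
// so the next block's timer can be measured from here.
func (c *context) onPrepareRequest(now time.Time) {
	c.lastBlockTime = now
}

// onChangeView clears the reference point: the new view comes with a new
// PrepareRequest (potentially picking up transactions received between
// views), and that request will set lastBlockTime again.
func (c *context) onChangeView() {
	c.lastBlockTime = time.Time{}
}

// nextBlockDelay returns how long to wait before starting the next block.
func (c *context) nextBlockDelay(now time.Time) time.Duration {
	if c.lastBlockTime.IsZero() {
		return c.blockInterval
	}
	if elapsed := now.Sub(c.lastBlockTime); elapsed < c.blockInterval {
		return c.blockInterval - elapsed
	}
	return 0
}
```
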
@roman-khimov
Member Author

Four nodes, 1000 TPS:
[chart: ms_per_block_single]
Same with 200ms network delay:
[chart: ms_per_block_single]

@roman-khimov
Member Author

Max performance for four nodes with 5s block time:
[chart: tps_single]
[chart: ms_per_block_single]

With 200ms delay:
[chart: tps_single]
[chart: ms_per_block_single]

I'm not entirely sure why, but it does affect TPS, and not in a good way (hence this is a draft).

@roman-khimov
Member Author

Two-second block time with a 128K memory pool:
[chart: tps_single]
[chart: ms_per_block_single]

@roman-khimov
Member Author

Rebased.


codecov bot commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.31%. Comparing base (b2ba0cd) to head (b3c1c3b).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
+ Coverage   58.21%   58.31%   +0.09%     
==========================================
  Files          32       32              
  Lines        2257     2262       +5     
==========================================
+ Hits         1314     1319       +5     
  Misses        859      859              
  Partials       84       84              


@roman-khimov
Member Author

As we remember from #56 (comment), this patch works perfectly in the case of constant low/moderate load; the difference is rather obvious and for the better. Let's concentrate on the less positive side of it that was left as "unknown" three years ago.

All tests below use a 50K memory pool (and are limited by it in many cases), BoltDB as the DB, and 30 workers pushing as many transactions as they can.

Starting with the simple case, a single node:
[chart: ms_per_block_single_1s]

It's obvious that we have zero communication overhead in this case, but one can notice that the patched version still shaves some milliseconds off the block time, leading to slightly better average TPS results (even if some blocks drop below the line, which can happen from test to test irrespective of this patch):
[chart: tps_single_1s]

Moving on to four nodes, 5s block time and no delays:
[chart: ms_per_block_4_nodes_5s_no_delay]
[chart: tps_4_nodes_5s_no_delay]
This network is obviously limited by the memory pool TPS-wise, and the patch effect is hardly noticeable: 5s is plenty of time, the network is perfectly synchronized wrt block contents and always pushes 50K transactions in each block:
[chart: tpb_4_nodes_5s_no_delay]

To make things more interesting, we can drop the block time to 1s:
[chart: ms_per_block_4_nodes_1s_no_delay]

In this case nodes aren't perfectly synchronized and can't consume 50K transactions in one second, which leads to a lower number of transactions in each block:
[chart: tpb_4_nodes_1s_no_delay]

But notice that the block time is more stable here and TPS is actually better:
[chart: tps_4_nodes_1s_no_delay]

This is connected to the more stable block time when the memory pool has some spare slots. Now to more complex things: we add a 200ms delay and run with 5s block time:
[chart: ms_per_block_4_nodes_5s_delay_200ms]

It's rather obvious that block times are much better here (as expected, even if we forget about CV), but the TPS picture is similar to the one in #56 (comment):
[chart: tps_4_nodes_5s_delay_200ms]

There is a degradation, and we can clearly see the tick-tock pattern in TPS, which directly follows from the tick-tock in TPB:
[chart: tpb_4_nodes_5s_delay_200ms]

And now I can finally explain this. This network has problems with transaction propagation and consensus. In general it can consume ~50K transactions in 5s, but accepting a block takes time, so once a 50K block is flushed the nodes have empty memory pools and are adjusting their timers for the consensus delay, leaving very little time for the next block. When the time comes to make that block, the pools don't hold many transactions, so we end up with a nearly empty block; the next one then has enough overall time to fill the memory pool up again, it's flushed with 50K transactions and the cycle repeats. It's not the first time we see this pattern in various cases, but here the longer block time of the non-patched version simply allows collecting more transactions. So we have a clear trade-off between average TPS and latency (since blocks are delayed for longer without this patch).

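To illustrate the cycle, here's a toy model (all numbers are assumptions picked for illustration, not measurements): the pool refills at a constant rate, a full 50K block eats most of the following interval in consensus/acceptance overhead, and a small block doesn't.

```go
package main

import "fmt"

func main() {
	const (
		poolCap  = 50_000   // mempool capacity
		inflow   = 10_000.0 // assumed transactions arriving per second
		interval = 5.0      // target block time, seconds
		overhead = 4.5      // assumed time to accept/propagate a full 50K block, seconds
	)

	timeToFill := interval // time left to refill the pool before the next block
	for i := 1; i <= 6; i++ {
		pool := inflow * timeToFill
		if pool > poolCap {
			pool = poolCap
		}
		fmt.Printf("block %d: %6.0f transactions\n", i, pool)
		if pool >= poolCap {
			// A full block takes long to accept; the adjusted timer leaves
			// little time to refill the pool before the next proposal.
			timeToFill = interval - overhead
		} else {
			timeToFill = interval
		}
	}
}
```

Running it prints alternating full (50000) and nearly empty (5000) blocks, which is exactly the tick-tock shape of the TPB chart above.
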
But this is a memory pool limit effect again; to avoid it we can drop the block time to 1s (with the same 200ms delay):
[chart: ms_per_block_4_nodes_1s_delay_200ms]
Consensus takes a lot of time here because nodes are not well-synchronized: they're actively exchanging transactions while a new block is being proposed, and some nodes have to ask others about some transactions. Both master and the patched version are far from the 1s target time. But here is what happens to average TPS:
[chart: tps_4_nodes_1s_delay_200ms]

It's a bit better and more stable. And that happens because this time it's master that is doing the tick-tock:
[chart: tpb_4_nodes_1s_delay_200ms]

So overall I can say that:

  • this patch is safe
  • it is beneficial for block time in all cases, especially when we're dealing with well-synchronized networks (constant low/medium transaction load)
  • it's good TPS-wise for networks where the mempool is not constantly saturated
  • it can degrade TPS in some specific cases of mempool saturation, but the same degradation happens to the unpatched version in different scenarios, and it's not catastrophic either (on the order of 10%)
  • given that most of our real usage scenarios don't involve constantly saturated memory pools, it's a good change

@AnnaShaleva (Member) left a comment:


Otherwise this change looks legit to me.

check.go: review comment resolved
@AnnaShaleva modified the milestone: v0.3.2 (Jan 10, 2025)
It's the earliest point; if a view change happens the timer is reset (because
the new view comes with a new set of transactions, potentially picking up ones
received between views).

Signed-off-by: Roman Khimov <[email protected]>
@AnnaShaleva merged commit 27db04c into master on Jan 10, 2025
12 checks passed
@AnnaShaleva deleted the move-lastblock-time branch on January 10, 2025 09:09