Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Private blockchain stalling when less than 1/3 of validators are down - BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block #6613

Open
kkeaton-rcp opened this issue Feb 26, 2024 · 13 comments
Assignees
Labels
bug Something isn't working non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc))

Comments

@kkeaton-rcp
Copy link

kkeaton-rcp commented Feb 26, 2024

Description

I have a private blockchain with 7 validators, I take down 2 validators and the chain seems to freeze and is unable to produce blocks.

Acceptance Criteria

  • Blockchain is able to handle up to 1/3 of the validators going down (should be able to handle 2 validators going offline)

Steps to Reproduce (Bug)

I have tested this a few times and it is inconsistent. Sometimes the chain does not stall, other times the chain does stall.

Shut down two validators and then trigger events to happen on the blockchain.

Expected behavior: [What you expect to happen]
Blockchain works with 2 validators down.

Actual behavior: [What actually happens]
The blockchain will stall and gives BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block errors along with EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.

Frequency: [What percentage of the time does it occur?]
This happens about half the times I have run testing.

Logs (if a bug)

2024-02-23 20:03:53.753+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthPeer | handleDisconnect - EthPeer PeerId 0x9ee..., reputation PeerReputation 109, validated? true, disconnected? true, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 131.... with peer 0x9ee... inboundInitiated false initAt 1708716292348], enode enode://9ee...
2024-02-23 20:03:53.753+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthPeers | Disconnected EthPeer PeerId 0x9ee..., reputation PeerReputation 109, validated? true, disconnected? true, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 1311626227 with peer 0x9ee... inboundInitiated false initAt 170...], enode enode://9e...
2024-02-23 20:03:53.754+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthProtocolManager | Disconnect - Inbound - 0x08 CLIENT_QUITTING - 0x9e... - 9 peers left
9 EthPeers {
PeerId 0x0ff..., reputation PeerReputation 101, validated? true, disconnected? false, client: besu/v23.4.4/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -1461... with peer 0x0ff... inboundInitiated true initAt 170...], enode enode://0ff...?discport=0,
PeerId 0x95..., reputation PeerReputation 101, validated? true, disconnected? false, client: besu/v23.4.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 134... with peer 0x95... inboundInitiated true initAt 170...], enode enode://95...?discport=0,

PeerId 0x0c7..., reputation PeerReputation 102, validated? true, disconnected? false, client: besu/v23.4.4/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -166... with peer 0x0c7... inboundInitiated true initAt 170...], enode enode://0c7...?discport=0,

PeerId 0x31..., reputation PeerReputation 103, validated? true, disconnected? false, client: besu/v23.10.1/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 577... with peer 0x31... inboundInitiated true initAt 170...], enode enode://31...?discport=0,
PeerId 0xa5..., reputation PeerReputation 106, validated? true, disconnected? false, client: besu/v23.10.2/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 392... with peer 0xa5... inboundInitiated true initAt 170...], enode enode://a5...?discport=0,
PeerId 0x2f..., reputation PeerReputation 107, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 114... with peer 0x2f... inboundInitiated false initAt 170...], enode enode://2f...,
PeerId 0xbe..., reputation PeerReputation 107, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -133... with peer 0xbe... inboundInitiated false initAt 170...], enode enode://be...,
PeerId 0x035..., reputation PeerReputation 111, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 202... with peer 0x035... inboundInitiated false initAt 170...], enode enode://035...,
PeerId 0x5a..., reputation PeerReputation 126, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 317... with peer 0x5a... inboundInitiated false initAt 170...], enode enode://5a}
2024-02-23 20:03:54.353+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftRound | Importing proposed block to chain. round=ConsensusRoundIdentifier{Sequence=129..., Round=1}, hash=0xcfc....
2024-02-23 20:03:54.565+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,735 / 4 tx / 0 pending / 21,993,755 (0.0%) gas / (0xcfc...)
2024-02-23 20:03:54.656+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,736 / 0 tx / 0 pending / 0 (0.0%) gas / (0xc9b...)
2024-02-23 20:03:57.950+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:02.950+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:04.595+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftRound | Importing proposed block to chain. round=ConsensusRoundIdentifier{Sequence=129..., Round=1}, hash=0xd47...
2024-02-23 20:04:04.734+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,737 / 4 tx / 0 pending / 12,986,827 (0.0%) gas / (0xd47...)
2024-02-23 20:04:04.735+00:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Block 129... (0xd47...) is already imported
2024-02-23 20:04:07.951+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:12.951+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:17.465+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:04:17.952+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:22.952+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:27.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:32.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:33.592+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:04:37.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:42.954+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:47.954+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:52.955+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:57.955+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:02.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:05.711+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:05:07.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:12.957+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:17.957+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected

Versions (Add all that apply)

  • Software version: [besu --version]
  • besu/v23.10.0/linux-x86_64/openjdk-java-17
  • Java version: [java -version]
  • openjdk 17.0.9 2023-10-17
  • OS Name & Version: [cat /etc/*release]
  • Ubuntu 22.04.2 LTS
  • Kernel Version: [uname -a]
  • Linux 22.04.1-Ubuntu SMP Wed Jan 10 22:54:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud VM, type, size: [Amazon Web Services I3-large]
  • AWS T2-Large
@non-fungible-nelson non-fungible-nelson added the non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT label Feb 27, 2024
@non-fungible-nelson
Copy link
Contributor

@matthew1001 any insight here?

@non-fungible-nelson non-fungible-nelson added bug Something isn't working P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc)) labels Feb 27, 2024
@matthew1001
Copy link
Contributor

Can you confirm a couple of details please?

  • It looks like you have 9 peers currently, so 10 nodes in total that are available. I'm assuming some of those are non-validators (i.e. regular JSON/RPC nodes)?
  • You've got a variety of Besu versions from the looks of it. Is it possible to try running with a more recent version across all the nodes?
    • 23.10.2 for all nodes would be a useful check if it's possible
  • Can you confirm that the 7 nodes you think are validators are in fact showing up when you issue qbft_getValidatorsByBlockNumber("latest") to some of the currently active nodes?
  • I'm assuming you haven't submitted any qbft_proposeValidatorVote calls around the time of stopping/starting the nodes?

@kkeaton-rcp
Copy link
Author

kkeaton-rcp commented Feb 29, 2024

@matthew1001

  1. Correct - 7 are validators and the rest are regular JSON/RPC nodes

  2. I confirmed this, all 7 are showing up

  3. Correct

  4. I upgraded all nodes to 23.10.2 - I'm getting a bit different behavior so there is a new exception I noticed in the logs (this is when the testing runs successfully)

java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
        at org.hyperledger.besu.ethereum.blockcreation.txselection.BlockTransactionSelector.timeLimitedSelection(BlockTransactionSelector.java:181)
        at org.hyperledger.besu.ethereum.blockcreation.txselection.BlockTransactionSelector.buildTransactionListForBlock(BlockTransactionSelector.java:162)
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.selectTransactions(AbstractBlockCreator.java:372)
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:206)
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:154)
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:140)
        at org.hyperledger.besu.consensus.qbft.statemachine.QbftRound.createAndSendProposalMessage(QbftRound.java:130)
        at org.hyperledger.besu.consensus.qbft.statemachine.QbftBlockHeightManager.handleBlockTimerExpiry(QbftBlockHeightManager.java:136)
        at org.hyperledger.besu.consensus.common.bft.statemachine.BaseBftController.handleBlockTimerExpiry(BaseBftController.java:167)
        at org.hyperledger.besu.consensus.common.bft.EventMultiplexer.handleBftEvent(EventMultiplexer.java:65)
        at java.base/java.util.Optional.ifPresent(Optional.java:178)
        at org.hyperledger.besu.consensus.common.bft.BftProcessor.run(BftProcessor.java:65)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

but I still had it fail, but I am no longer getting 2024-02-23 20:04:17.465+00:00 | BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block it is just all 2024-02-23 20:05:07.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.

@matthew1001
Copy link
Contributor

Sorry for the slow reply @kkeaton-rcp . I haven't been able to look into it in detail yet, but another contributor @Brindrajsinh-Chauhan might have some time to do so. I'll assign it over to him for now.

@matthew1001 matthew1001 removed their assignment Mar 8, 2024
@Brindrajsinh-Chauhan
Copy link
Contributor

Brindrajsinh-Chauhan commented Mar 8, 2024

Github does not seem to let me assign it to myself, would be grateful if @non-fungible-nelson you could take that action. Will start investigating on this piece.

@matthew1001
Copy link
Contributor

Looks like I can @Brindrajsinh-Chauhan - just assigned it to you

@kkeaton-rcp
Copy link
Author

Just wondering if there was any update on this?

@Brindrajsinh-Chauhan
Copy link
Contributor

Hey sorry, have not been able to recreate this based on the scenarios you provided.

@kkeaton-rcp
Copy link
Author

Did you have a smart contract that took in some data and triggered that while the nodes were down?

@Brindrajsinh-Chauhan
Copy link
Contributor

yes send a bunch of transactions to a smart contract with a few nodes down and then bringing them back up

@kkeaton-rcp
Copy link
Author

Could a large amount of data going onto a smart contract cause this error? This error has now happened without any validator nodes being down. @Brindrajsinh-Chauhan @matthew1001

@kkeaton-rcp
Copy link
Author

Or are there any known issues that deploying a smart contract with truffle (that is now deprecated) could cause? Would love to hear others opinions on this..

@wonawunmi
Copy link

@kkeaton-rcp, I suggested a solution on a similar issue #6732 .

In your case, I suspect the surviving validators are unable to produce blocks for your transactions within the set value for requesttimeoutseconds. Default recommendation for requesttimeoutseconds is 2x your blockperiodseconds, however the hardware resources of your remaining validators play a role in ensuring prompt block creation, especially, disk IO. I suggest you review your disk IO for performance. Try increasing your requesttimeoutseconds to something higher can help reduce the occurrence of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc))
Projects
None yet
Development

No branches or pull requests

5 participants