CometBFT crashes with panic in fendermint during BeginBlock when fendermint is catching up (replaying) from CometBFT #1196

Open
karlem opened this issue Nov 6, 2024 · 0 comments · May be fixed by #1197
Labels
bug Something isn't working


Description:

We encountered an issue where cometbft crashes after fendermint panics in BeginBlock. In BeginBlock, fendermint resolves the CometBFT validator ID to a public key, which requires a connection to the cometbft RPC API. When fendermint’s data folder is deleted and fendermint is restarted, cometbft starts replaying blocks to it, but at that point cometbft is not yet ready to accept the RPC connection that fendermint needs. The lookup fails with "Connection refused", fendermint panics, and cometbft stops because its ABCI connection is terminated.

Steps to Reproduce:

  1. Run both cometbft and fendermint.
  2. Wait until a few blocks have been produced.
  3. Stop fendermint and delete its data folder.
  4. Restart fendermint.

Observed Errors:

cometbft Logs Before Crash:

I[2024-11-06|15:13:38.920] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=5 stateHeight=5
I[2024-11-06|15:13:51.760] Applying block module=consensus height=1
E[2024-11-06|15:13:51.762] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2024-11-06|15:13:51.762] service stop module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2024-11-06|15:13:51.762] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"

fendermint Panic:

2024-11-06T14:13:51.762219Z ERROR fendermint/abci/src/application.rs:212: failed to execute ABCI request: Error { msg: "HTTP error", source: "error trying to connect: tcp connect error: Connection refused (os error 61)", }
thread 'tokio-runtime-worker' panicked at /Users/alexei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-abci-0.7.0/src/v037/server.rs:145:70:
called Result::unwrap() on an Err value: HTTP error

Caused by:
    error trying to connect: tcp connect error: Connection refused (os error 61)

Location:
    /Users/alexei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9

Caused by:
    0: HTTP error
    1: error trying to connect: tcp connect error: Connection refused (os error 61)
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
2024-11-06T14:13:51.995565Z ERROR fendermint/app/src/main.rs:24: panicking stacktrace=" 0: std::backtrace_rs::backtrace::libunwind::trace\n

Cause:

The issue seems to be due to this line in validators.rs, where fendermint tries to resolve the validator ID to a public key by connecting to the cometbft RPC API during BeginBlock. If cometbft is not fully ready (due to replay or a fresh start with deleted data), this connection fails, causing fendermint to panic and terminate.
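
To make the failure mode concrete, here is a minimal sketch of one way the lookup could tolerate an unreachable RPC endpoint instead of panicking. It is not the fix from the linked pull request and does not use fendermint's real types: `ValidatorId`, `PublicKey`, `LookupError`, `ValidatorCache`, and the `rpc_lookup` closure are all hypothetical stand-ins. It also assumes that block execution can safely proceed without the public key, which may not hold in fendermint.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types; not fendermint's actual API.
type ValidatorId = [u8; 20];
type PublicKey = Vec<u8>;

#[derive(Debug, PartialEq)]
enum LookupError {
    /// The RPC endpoint could not be reached (e.g. "Connection refused").
    Unavailable(String),
}

/// Caches resolved keys and treats an unreachable RPC endpoint as
/// "unknown for now" rather than as a fatal error.
struct ValidatorCache {
    known: HashMap<ValidatorId, PublicKey>,
}

impl ValidatorCache {
    fn new() -> Self {
        Self { known: HashMap::new() }
    }

    /// Returns the cached key if present; otherwise calls `rpc_lookup`.
    /// A failed lookup is logged and reported as `None`, so the caller
    /// decides how to continue instead of unwrapping an `Err`.
    fn resolve<F>(&mut self, id: ValidatorId, rpc_lookup: F) -> Option<PublicKey>
    where
        F: FnOnce(&ValidatorId) -> Result<PublicKey, LookupError>,
    {
        if let Some(key) = self.known.get(&id) {
            return Some(key.clone());
        }
        match rpc_lookup(&id) {
            Ok(key) => {
                self.known.insert(id, key.clone());
                Some(key)
            }
            Err(LookupError::Unavailable(reason)) => {
                eprintln!("validator lookup skipped, RPC unavailable: {}", reason);
                None
            }
        }
    }
}
```

Simply retrying the connection inside BeginBlock is unlikely to help if cometbft only opens its RPC endpoint after replay has finished, which is why this sketch returns `None` rather than waiting for the endpoint.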
