Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression bug: Hermes is unable to re-establish monitor connection after node goes down #1026

Closed
5 tasks done
adizere opened this issue Jun 1, 2021 · 2 comments · Fixed by #1027
Closed
5 tasks done
Assignees
Labels
A: bug Admin: something isn't working
Milestone

Comments

@adizere
Copy link
Member

adizere commented Jun 1, 2021

Crate

ibc-relayer primarily

Summary of Bug

The fault-tolerance mechanism introduced in #895 (and the follow-up PR #903) ensures that
Hermes can cope with a full node's websocket endpoint becoming unreachable, and continues
to function unaffected once the endpoint is reachable again.
We introduced a regression bug (possibly with the batching
feature, as Romain suggested) because this mechanism no longer works.

Version

e4a6543

Steps to Reproduce

  1. run scripts/dev-env with two chains, e.g., ibc-0 and ibc-1
  2. run hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer and wait for it to finish
  3. run hermes start in one terminal, make sure your logging level is at least debug
  4. kill one of the gaia instances and watch the hermes output
    • kill -9 GAIA_PID
    • some error should appear in the log
  5. so far so good
  6. run scripts/dev-env again with the same configuration as before
  7. run hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer again
  8. run hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000

The problem is that Hermes (the one we started at step 3) should pick up the connection to the two gaia instances and relay the packet. But this does not happen. Instead, Hermes does connect via websocket to the chains, but it does not receive any events from either of the two chains.

What should happen

If using version 20d8fff of Hermes, and running the same recipe as above, at steps 7 and 8 we would see Hermes workers starting and doing active work (relaying). For example, the output should be:

Jun 01 16:55:31.073 DEBUG ibc_relayer::supervisor: chain ibc-0 sent events:
	UpdateClientEv(h: 0-27, cs_h: 07-tendermint-0(1-16))
 for object Client(Client { dst_chain_id: ChainId { id: "ibc-0", version: 0 }, dst_client_id: ClientId("07-tendermint-0"), src_chain_id: ChainId { id: "ibc-1", version: 1 } })

Note that 20d8fff will requite that we use command hermes start-multi at step 2 (instead of hermes start) and the configuration .toml file should have strategy = 'naive'.

Acceptance Criteria

  • Hermes should behave again as it did at version 20d8fff

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir
Copy link
Collaborator

A few thoughts (writing them here but we'll need separate issues)

  • yes we should fix the reconnect as a first step (Enable logging in tests and fix error log flooding from mock chain runtime #1017)
  • since it is possible that events have been missed, workers should "reset", e.g. unipath should clear packets, client recheck updates, etc
  • it is also possible that events that would have spawned new workers were missed
  • we need to make an analysis on handling websocket vs rpc vs grpc failures. These services can be configured to use different nodes but we know that for the passive relayer the events are vital. So maybe just stopping all workers involved and rescanning state at reconnect time would be ok (similar to remove+add chain)

Testing

  • the test with dev-env is ok to test some basic things but not enough.
  • this test simulates chains stopping and restarting (e.g. chain upgrade with reset height). Normally this should happen with a higher version number (so for hermes these are new chains, e.g. ibc-0 node will be upgraded and restarted as ibc-1)
  • if the gaiad restart happens fast enough we are left with some workers from the old instance for channels/ clients that don't exist anymore.
  • instead we should use gaia manager to setup a test environment for these scenarios (we would need two full nodes, one to bring down and used for events/queries etc by hermes start and one to be used by another hermes instance that makes IBC state changes)

Other issues I noticed while debugging this:

  • logs are very inconsistent in looks, see below, we should enforce a consistent span on all logs.
    • DEBUG ibc_relayer::worker::client: client 'ibc-1 -> ibc-0:07-tendermint-0'
    • DEBUG ibc_relayer::foreign_client: [ibc-0 -> ibc-1:07-tendermint-0]
    • INFO ibc_relayer::worker: [channel-0/transfer:ibc-0->ibc-1]
  • packet workers retry and exit, client workers do not
  • channel worker is not yet integrated with the telemetry
  • channel worker does not have an outer retry loop (it retries for individual commands)

@romac
Copy link
Member

romac commented Jun 2, 2021

Further work on this tracked in #1035

romac added a commit that referenced this issue Jun 2, 2021
romac added a commit that referenced this issue Jun 3, 2021
* Added details about the help command in the guide

* Bump version to 0.4.0

* Update guide to account for `start-multi` being promoted to `start`

* Fix changelog

* Document telemetry section of config file

* Fixup documentation for global section of configuration file

* Document type of each config option

* Remove unsused config default method

* Guide update for the query clients method.

* Typo fix

* Re-add Cargo.lock for proto-compiler crate

* Document addition of `host` param to telemetry config

* Document telemetry service

* Update changelog with telemetry

* Add changelog entry for #1026

* Channel worker updates

* Add missing files

* Update feature matrix

* Update mdbook to v0.4.7

* Update mdbook to v0.4.9

* Add cosmos-sdk versions supported

* Higlight compat info

* Write summary of 0.4.0 release

Co-authored-by: Romain Ruetschi <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022
* Added details about the help command in the guide

* Bump version to 0.4.0

* Update guide to account for `start-multi` being promoted to `start`

* Fix changelog

* Document telemetry section of config file

* Fixup documentation for global section of configuration file

* Document type of each config option

* Remove unsused config default method

* Guide update for the query clients method.

* Typo fix

* Re-add Cargo.lock for proto-compiler crate

* Document addition of `host` param to telemetry config

* Document telemetry service

* Update changelog with telemetry

* Add changelog entry for informalsystems#1026

* Channel worker updates

* Add missing files

* Update feature matrix

* Update mdbook to v0.4.7

* Update mdbook to v0.4.9

* Add cosmos-sdk versions supported

* Higlight compat info

* Write summary of 0.4.0 release

Co-authored-by: Romain Ruetschi <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: bug Admin: something isn't working
Projects
None yet
3 participants