Improve substrate-relay connection problems handling #3028

Open
1 task
bkontur opened this issue Jul 15, 2024 · 2 comments

@bkontur
Contributor

bkontur commented Jul 15, 2024

Investigate/check

  • Do we restart loops correctly for all kinds of connection errors? E.g., does RestartNeeded stop the loop or restart it? Or is the only solution to restart substrate-relay?
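To make the question concrete, here is a minimal sketch of the policy we probably want: reconnect and keep looping on a RestartNeeded-style transport error, and stop only on other errors or after too many consecutive reconnects. `LoopError`, `LoopExit`, `run_loop` and the `max_consecutive_restarts` limit are hypothetical names for illustration, loosely modeled on the `RestartNeeded(Transport(..))` errors in the logs below; they are not the actual substrate-relay or jsonrpsee types.

```rust
/// Hypothetical loop error, loosely modeled on the `RestartNeeded(Transport(..))`
/// errors in the relay logs. NOT the real substrate-relay/jsonrpsee type.
#[derive(Debug, Clone, PartialEq)]
pub enum LoopError {
    /// Transport is gone; the client must reconnect before retrying.
    RestartNeeded(String),
    /// Any other error; treated as fatal for the loop in this sketch.
    Fatal(String),
}

/// Why the loop stopped.
#[derive(Debug, PartialEq)]
pub enum LoopExit {
    /// The iteration reported there is no more work.
    Finished { restarts: u32 },
    /// A non-restartable error stopped the loop.
    Failed(LoopError),
    /// Too many consecutive reconnects without a successful iteration.
    GaveUp { restarts: u32 },
}

/// Runs `iteration` until it returns `Ok(false)` (done). On `RestartNeeded`
/// it calls `reconnect` and tries again instead of exiting the loop.
pub fn run_loop<I, R>(max_consecutive_restarts: u32, mut iteration: I, mut reconnect: R) -> LoopExit
where
    I: FnMut() -> Result<bool, LoopError>,
    R: FnMut(),
{
    let mut restarts = 0;
    let mut consecutive = 0;
    loop {
        match iteration() {
            // Progress was made: reset the consecutive-failure counter.
            Ok(true) => consecutive = 0,
            Ok(false) => return LoopExit::Finished { restarts },
            Err(LoopError::RestartNeeded(_)) => {
                if consecutive >= max_consecutive_restarts {
                    return LoopExit::GaveUp { restarts };
                }
                consecutive += 1;
                restarts += 1;
                // e.g. rebuild the RPC client and re-subscribe to headers
                reconnect();
            }
            Err(e) => return LoopExit::Failed(e),
        }
    }
}
```

Whether `GaveUp` should exit the whole process (so an external supervisor restarts substrate-relay) or keep retrying forever is exactly the open question above.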

Possible improvement 1:

Currently we connect to exactly one node URI per chain, e.g.:

  --source-uri wss://rococo-rpc.polkadot.io \
  --target-uri wss://bridge-hub-westend-rpc.dwellir.com \

If that node is down or has some problem, we could instead configure a list of URIs, so that on RestartNeeded we rotate and try another URI, e.g.:

  --source-uri wss://rococo-rpc.polkadot.io \
  --source-uri wss://rococo-xyz1-rpc.polkadot.io \
  --source-uri wss://rococo-xyz2-rpc.polkadot.io \
  --target-uri wss://bridge-hub-westend-rpc.dwellir.com \
  --target-uri wss://bridge-hub-westend-xyz2-rpc.dwellir.com \
  --target-uri wss://bridge-hub-westend-xyz2-rpc.luckyfriday.com

So, if one node is overloaded, we just try another one.
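A minimal sketch of improvement 1, assuming the repeated `--source-uri` / `--target-uri` flags proposed above (substrate-relay does not accept repeated flags today as far as this issue states). `collect_flag` gathers every value of a repeatable flag, and `UriRotation` is a hypothetical round-robin holder that advances to the next URI when the caller hits RestartNeeded.

```rust
/// Collects every value of a repeatable flag such as `--source-uri`.
/// Hypothetical helper; a real CLI would use its argument parser for this.
pub fn collect_flag(args: &[&str], flag: &str) -> Vec<String> {
    args.windows(2)
        .filter(|w| w[0] == flag)
        .map(|w| w[1].to_string())
        .collect()
}

/// Hypothetical round-robin list of endpoints for one chain.
pub struct UriRotation {
    uris: Vec<String>,
    current: usize,
}

impl UriRotation {
    /// Returns `None` if no URI was configured.
    pub fn new(uris: Vec<String>) -> Option<Self> {
        if uris.is_empty() {
            None
        } else {
            Some(Self { uris, current: 0 })
        }
    }

    /// URI the client should currently be connected to.
    pub fn current(&self) -> &str {
        &self.uris[self.current]
    }

    /// Advance to the next URI, wrapping around; call this on RestartNeeded
    /// before reconnecting.
    pub fn rotate(&mut self) -> &str {
        self.current = (self.current + 1) % self.uris.len();
        self.current()
    }
}
```

Plain round-robin keeps the sketch simple; a real implementation would probably also add a backoff so that a list of all-down nodes is not hammered in a tight loop.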

Possible improvement 2: connect substrate-relay to some "load balancer"

This "load balancer" would route requests to a live, non-overloaded node, instead of us handling this in our own code.
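For illustration, the routing decision such a load balancer makes could look like the sketch below: skip nodes whose last health check failed and pick the one with the fewest in-flight requests ("least connections"). `Upstream` and `pick_upstream` are hypothetical; this is one common balancing strategy, not a description of any specific load-balancer product.

```rust
/// Hypothetical view of one upstream RPC node as a load balancer tracks it.
#[derive(Debug, Clone)]
pub struct Upstream {
    pub uri: String,
    /// Result of the last health probe (e.g. a `system_health` RPC call).
    pub healthy: bool,
    /// Requests currently being served through this node.
    pub in_flight: u32,
}

/// Picks the healthy upstream with the fewest in-flight requests.
/// On ties the first such node wins; returns `None` if every node is down.
pub fn pick_upstream(nodes: &[Upstream]) -> Option<&Upstream> {
    nodes
        .iter()
        .filter(|n| n.healthy)
        .min_by_key(|n| n.in_flight)
}
```

Note that long-lived subscriptions (such as the best/finalized headers subscriptions in the logs below) stay pinned to whichever node they were opened on, so the relay would still need to resubscribe when that node drops.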

Some logs from 2024-07-12 to 2024-07-15:

https://matrix.to/#/!FqmgUhjOliBGoncGwm:parity.io/$OjKXcX4aO9lkzM46fRLKXTMi-mf9vcpdJN_RDMgIn6o?via=parity.io

e.g.:

Polkadot client has failed to return its sync status: FailedToGetSystemHealth { chain: "Polkadot", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-15T08:05:06Z {} [Polkadot_to_BridgeHubKusama_Parachains_1002] 2024-07-15 08:05:06 +00 WARN bridge Failed to read best Polkadot block: ChannelError("Background task of BridgeHubKusama client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for BridgeHubKusama has finished\"))")
2024-07-15T03:17:36Z {} [Polkadot_to_BridgeHubKusama_Parachains_1002] 2024-07-15 03:17:36 +00 WARN bridge Failed to read head of Polkadot parachain ParaId(1002) at BridgeHubKusama: FailedToReadStorageValue { chain: "BridgeHubKusama", hash: "0x181d…2a58", key: StorageKey([243, 240, 56, 234, 7, 239, 168, 105, 144, 9, 71, 27, 60, 48, 159, 184, 100, 28, 243, 91, 238, 116, 177, 147, 83, 37, 172, 214, 89, 235, 25, 203, 127, 32, 114, 84, 61, 57, 196, 82, 229, 51, 84, 40, 99, 135, 86, 81, 234, 3, 0, 0]), error: RpcError(RestartNeeded(Transport(connection closed
2024-07-15T00:47:50Z {} [Polkadot_to_BridgeHubKusama_Parachains_1002] 2024-07-15 00:47:50 +00 WARN bridge Polkadot client has failed to return its sync status: FailedToGetSystemHealth { chain: "Polkadot", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-14T23:18:10Z {} 2024-07-14 23:18:10 +00 ERROR bridge [BridgeHubKusama-to-BridgeHubPolkadot-on-demand-parachain] Failed to read relay data from BridgeHubPolkadot client: ChannelError("Background task of BridgeHubPolkadot client has exited with result: Err(ChannelError(\"Finalized headers subscription for BridgeHubPolkadot has finished\"))")
2024-07-14T23:04:57Z {} [Polkadot_to_BridgeHubKusama_Sync] 2024-07-14 23:04:57 +00 INFO bridge Call of PolkadotFinalityApi_free_headers_interval at BridgeHubKusama has failed with an error: FailedStateCall { chain: "BridgeHubKusama", hash: "0x8551…5ec9", method: "PolkadotFinalityApi_free_headers_interval", arguments: Bytes([]), error: RpcError(Call(ErrorObject { code: ServerError(4003), message: "Client error: Execution failed: Other: Exported method PolkadotFinalityApi_free_headers_interval is not found", data: None })) }. Treating as `None`
2024-07-14T23:04:57Z {} [Polkadot_to_BridgeHubKusama_Sync] 2024-07-14 23:04:57 +00 ERROR bridge Finality sync loop iteration has failed with error: Target(FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-14T23:04:57Z {} 2024-07-14 23:04:57 +00 ERROR bridge [Polkadot-to-BridgeHubKusama-on-demand-headers] Failed to read best finalized source header from target: FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-14T20:27:45Z {} 2024-07-14 20:27:45 +00 WARN bridge [Polkadot-to-BridgeHubKusama-on-demand-headers] Failed to scan mandatory Polkadot headers range ((21644741, 21647633)): FailedToReadHeaderHashByNumber { chain: "Polkadot", number: "21647633", error: RpcError(RestartNeeded(Transport(i/o error: Connection reset by peer (os error 104)
2024-07-14T00:35:31Z {} [Kusama_to_BridgeHubPolkadot_Parachains_1002] 2024-07-14 00:35:31 +00 WARN bridge Kusama client has failed to return its sync status: FailedToGetSystemHealth { chain: "Kusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-12T22:50:53Z {} [BridgeHubPolkadot_to_BridgeHubKusama_MessageLane_00000001] 2024-07-12 22:50:53 +00 ERROR bridge Error retrieving state from BridgeHubKusama node: FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-12T22:50:53Z {} [BridgeHubKusama_to_BridgeHubPolkadot_MessageLane_00000001] 2024-07-12 22:50:53 +00 ERROR bridge Error retrieving state from BridgeHubPolkadot node: FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-12T22:42:56Z {} [Polkadot_to_BridgeHubKusama_Parachains_1002] 2024-07-12 22:42:56 +00 WARN bridge Polkadot client has failed to return its sync status: FailedToGetSystemHealth { chain: "Polkadot", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-12T21:03:49Z {} [Kusama_to_BridgeHubPolkadot_Parachains_1002] 2024-07-12 21:03:49 +00 WARN bridge Kusama client has failed to return its sync status: FailedToGetSystemHealth { chain: "Kusama", error: RpcError(RestartNeeded(Transport(connection closed
2024-07-12T20:37:38Z {} [Kusama_to_BridgeHubPolkadot_Parachains_1002] 2024-07-12 20:37:38 +00 WARN bridge Failed to read best Kusama block: ChannelError("Background task of BridgeHubPolkadot client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for BridgeHubPolkadot has finished\"))")
2024-07-12T20:13:04Z {} [Polkadot_to_BridgeHubKusama_Parachains_1002] 2024-07-12 20:13:04 +00 WARN bridge Failed to read best Polkadot block: ChannelError("Background task of BridgeHubKusama client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for BridgeHubKusama has finished\"))")
2024-07-12T19:58:39Z {} 2024-07-12 19:58:39 +00 ERROR bridge [Polkadot-to-BridgeHubKusama-on-demand-headers] Failed to read best finalized source header from source: ChannelError("Background task of Polkadot client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for Polkadot has finished\"))")
@bkontur
Contributor Author

bkontur commented Jul 16, 2024

Yes, we do caching, but I would also like to check RPC/runtime call (and subscription) monitoring, to see which RPC/runtime calls we make and how often, and whether there is any room for optimization.

Also, the separate 6-relayer setup might help by itself.
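As a starting point for the RPC/runtime call monitoring mentioned above, a per-method counter like the hypothetical `RpcCallStats` below would show which calls dominate and are worth caching. It is a minimal in-process sketch, assuming every outgoing RPC (including `state_call` runtime API calls) is routed through `record`; a production setup would more likely export these counts as metrics.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-method counter of RPC/runtime calls.
#[derive(Default)]
pub struct RpcCallStats {
    counts: BTreeMap<String, u64>,
}

impl RpcCallStats {
    /// Record one call to `method` (e.g. "system_health", "state_call").
    pub fn record(&mut self, method: &str) {
        *self.counts.entry(method.to_string()).or_insert(0) += 1;
    }

    /// Number of recorded calls for `method` (0 if never called).
    pub fn count(&self, method: &str) -> u64 {
        self.counts.get(method).copied().unwrap_or(0)
    }

    /// Methods sorted by descending call count, ties broken by name:
    /// the top entries are the first candidates for caching.
    pub fn top(&self) -> Vec<(&str, u64)> {
        let mut v: Vec<(&str, u64)> =
            self.counts.iter().map(|(m, c)| (m.as_str(), *c)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
        v
    }
}
```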

@bkontur
Contributor Author

bkontur commented Jul 24, 2024

Again, these errors stop finality relaying:

[BridgeHubPolkadot_to_BridgeHubKusama_MessageLane_00000001] 2024-07-23 14:47:28 +00 ERROR bridge Error retrieving state from BridgeHubKusama node: FailedToGetSystemHealth { chain: "BridgeHubPolkadot", error: RpcError(RestartNeeded(Transport(connection closed
[Kusama_to_BridgeHubPolkadot_Sync] 2024-07-23 14:47:27 +00 ERROR bridge Finality sync loop iteration has failed with error: Source(ChannelError("Background task of Kusama client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for Kusama has finished\"))"))
[Polkadot_to_BridgeHubKusama_Sync] 2024-07-23 14:47:23 +00 ERROR bridge Finality sync loop iteration has failed with error: Target(FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed

Projects
Status: Todo