Go Substrate Client - Connection Failover Implementation #1020

sameh-farouk · 2024-11-15T00:20:30Z

Describe the bug

The current substrate client implementation in impl.go has several critical issues:

No real connection pooling
Thread-safety issues
No proactive health checking
No failover mechanism

Current behavior:

Although the client can be initialized with multiple URLs, it uses only one of them to establish a single active connection. During operation, it has no failover mechanism to switch to another URL if that connection goes down. Instead, it keeps trying the same URL and reconnects only when that node comes back online. If the initially selected RPC node becomes unavailable, the client cannot send transactions until that specific node is restored.
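The missing behavior can be sketched as a dial loop that rotates through the configured URLs instead of retrying only the one that failed. This is a minimal illustration, not the client's actual code: `failoverDial`, `dialFunc`, and `errRefused` are hypothetical names, and the dial step is simulated rather than opening a real WebSocket.

```go
package main

import (
	"errors"
	"fmt"
)

// errRefused is a hypothetical sentinel standing in for a real dial error.
var errRefused = errors.New("connection refused")

// dialFunc is a hypothetical stand-in for whatever opens a connection to one RPC URL.
type dialFunc func(url string) error

// failoverDial tries every URL once, starting at index start and wrapping
// around. It returns the index of the first URL that connects, or an error
// that joins the individual failures if none of them does.
func failoverDial(urls []string, start int, dial dialFunc) (int, error) {
	n := len(urls)
	if n == 0 {
		return -1, errors.New("no URLs configured")
	}
	start = ((start % n) + n) % n // normalize so any integer is a valid start
	var errs []error
	for i := 0; i < n; i++ {
		idx := (start + i) % n
		if err := dial(urls[idx]); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", urls[idx], err))
			continue
		}
		return idx, nil
	}
	return -1, errors.Join(errs...)
}

func main() {
	urls := []string{"ws://127.0.0.1:9944", "ws://127.0.0.1:9945", "ws://127.0.0.1:9946"}
	// Simulate the node on port 9945 having been stopped.
	down := map[string]bool{"ws://127.0.0.1:9945": true}
	dial := func(url string) error {
		if down[url] {
			return errRefused
		}
		return nil
	}
	// Start at the URL that just failed; the loop moves on to the next one.
	idx, err := failoverDial(urls, 1, dial)
	if err != nil {
		fmt.Println("all endpoints failed:", err)
		return
	}
	fmt.Println("connected to", urls[idx])
}
```

Starting the rotation at the failed index still gives that URL one more chance, but the client is no longer blocked on it if another node is reachable.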

To Reproduce

I’ve included steps here to replicate a recent issue we encountered where some RPC nodes went offline, leading to missed uptime reports by many ZOS nodes. These steps can also be used to verify the issue once the connection pooling feature has been implemented.

Steps to reproduce the behavior:

Start multiple local RPC nodes:

docker run --name node1 -d -p 9944:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external
docker run --name node2 -d -p 9945:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external
docker run --name node3 -d -p 9946:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external

Navigate to the client directory:

cd /Projects/tfchain/clients

Create a new main.go and paste the following code:

package main

import (
	"log"
	"time"

	tfchain "github.com/threefoldtech/tfchain/clients/tfchain-client-go"
)

func main() {
	urls := []string{
		"ws://127.0.0.1:9944",
		"ws://127.0.0.1:9945",
		"ws://127.0.0.1:9946",
	}
	manager := tfchain.NewManager(urls...)
	substrateConnection, err := manager.Substrate()
	if err != nil {
		panic(err)
	}
	defer substrateConnection.Close()

	// Query the chain time every 10 seconds (21 iterations) so a node can be stopped mid-run.
	count := 0
	log.Println("starting loop")
	for {
		count++
		ti, err := substrateConnection.Time()
		if err != nil {
			log.Println("error getting time", err)
		} else {
			log.Println("time", ti, "count", count)
		}
		time.Sleep(time.Second * 10)
		if count > 20 {
			break
		}
	}
}

Initialize the module and run the code:

go mod init example.com/test
go mod tidy
go run main.go

Check the "connecting to..." log message to identify the chosen URL/port.
Stop the node using docker stop {name} and observe that the client fails to switch to a different URL.

{"level":"debug","url":"ws://127.0.0.1:9945","time":"2024-11-15T01:22:56+02:00","message":"connecting"}
2024/11/15 01:22:56 Connecting to ws://127.0.0.1:9945...
2024/11/15 01:22:56 starting loop
2024/11/15 01:22:56 time 2024-11-15 01:22:55 +0200 EET count 1
2024/11/15 01:23:06 time 2024-11-15 01:23:06 +0200 EET count 2
2024/11/15 01:23:16 time 2024-11-15 01:23:12 +0200 EET count 3
2024/11/15 01:23:26 time 2024-11-15 01:23:24 +0200 EET count 4
2024/11/15 01:23:36 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:23:46 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:23:56 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:06 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:16 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:26 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused

Restart the node with docker start {name} and notice that the client reconnects only after the original node is online again.


sameh-farouk · 2024-12-01T13:19:12Z

Verified
This feature was deployed to Devnet as part of ZOS. The operations team stopped one RPC node on Devnet at our request, and failover occurred: after the connection was closed, the client switched to another URL on the first use.
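The "switch on the first use" behavior reported above can be pictured as lazy failover: nothing happens when a connection drops, and the next caller that asks for a connection triggers a dial to the next URL. The sketch below is one way to get that behavior, assuming a mutex-guarded manager. The `manager`, `conn`, `get`, and `invalidate` names are hypothetical and do not reflect the API merged in #1022.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable is a hypothetical sentinel returned when no URL can be dialed.
var errUnavailable = errors.New("no endpoint available")

// conn is a hypothetical minimal connection; a real client type carries more state.
type conn struct {
	url string
}

// manager hands out one shared connection and replaces it lazily.
type manager struct {
	mu   sync.Mutex
	urls []string
	next int   // index of the URL to try next
	cur  *conn // nil when there is no usable connection
	dial func(url string) (*conn, error)
}

// get returns the current connection if there is one. Otherwise it dials the
// URLs in rotation and keeps the first that succeeds. The mutex serializes
// callers so only one goroutine reconnects at a time.
func (m *manager) get() (*conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.cur != nil {
		return m.cur, nil
	}
	for i := 0; i < len(m.urls); i++ {
		url := m.urls[m.next]
		m.next = (m.next + 1) % len(m.urls)
		if c, err := m.dial(url); err == nil {
			m.cur = c
			return c, nil
		}
	}
	return nil, errUnavailable
}

// invalidate drops the current connection, e.g. after an I/O error, so the
// next call to get fails over to another URL.
func (m *manager) invalidate() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cur = nil
}

func main() {
	down := map[string]bool{} // simulated set of stopped nodes
	m := &manager{
		urls: []string{"ws://127.0.0.1:9944", "ws://127.0.0.1:9945", "ws://127.0.0.1:9946"},
		dial: func(url string) (*conn, error) {
			if down[url] {
				return nil, errUnavailable
			}
			return &conn{url: url}, nil
		},
	}
	if c, err := m.get(); err == nil {
		fmt.Println("using", c.url)
	}
	down["ws://127.0.0.1:9944"] = true // the node is stopped
	m.invalidate()                     // the closed connection is detected
	if c, err := m.get(); err == nil {
		fmt.Println("failed over to", c.url)
	}
}
```

A lazy design like this needs no background health checker, at the cost of the first call after an outage paying the reconnect latency.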

sameh-farouk self-assigned this Nov 15, 2024

sameh-farouk added the type_feature New feature or request label Nov 15, 2024

sameh-farouk added this to 3.15.x Nov 15, 2024

sameh-farouk moved this to Accepted in 3.15.x Nov 15, 2024

sameh-farouk moved this from Accepted to In Progress in 3.15.x Nov 15, 2024

sameh-farouk added this to the 2.9.x milestone Nov 15, 2024

xmonader mentioned this issue Nov 17, 2024

Update the tfchain client for substrate manager threefoldtech/zos#2492

Closed

sameh-farouk changed the title ~~Go Substrate Client - Connection Pool Implementation~~ Go Substrate Client - Connection Failover Implementation Nov 25, 2024

sameh-farouk mentioned this issue Nov 25, 2024

Feat: Go Substrate Client - Connection Failover Implementation #1022

Merged

2 tasks

xmonader mentioned this issue Nov 26, 2024

Nodes left in a bad state after 3.15 update threefoldtech/zos#2485

Closed

sameh-farouk moved this from In Progress to Pending Review in 3.15.x Nov 27, 2024

sameh-farouk moved this from Pending Review to Pending Deployment in 3.15.x Nov 27, 2024

sameh-farouk moved this from Pending Deployment to In Verification in 3.15.x Dec 1, 2024

sameh-farouk closed this as completed Dec 1, 2024

github-project-automation bot moved this from In Verification to Done in 3.15.x Dec 1, 2024
