Go Substrate Client - Connection Failover Implementation #1020

Closed
sameh-farouk opened this issue Nov 15, 2024 · 1 comment

Describe the bug

The current substrate client implementation in impl.go has several critical issues:

  1. No real connection pooling
  2. Thread-safety issues
  3. No proactive health checking
  4. No failover mechanism
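
To illustrate what addressing items 2 and 3 could look like, here is a minimal sketch, not the actual impl.go code. It assumes a hypothetical `Conn` interface with a `Ping` method, a hypothetical `guarded` wrapper, and a `fakeConn` test double; none of these names come from tfchain-client-go.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// Conn is a hypothetical stand-in for the client's RPC connection.
type Conn interface {
	Ping() error // hypothetical liveness probe
}

// fakeConn is a test double; setting down makes Ping fail.
type fakeConn struct{ down bool }

func (f *fakeConn) Ping() error {
	if f.down {
		return errors.New("connection refused")
	}
	return nil
}

// guarded keeps the active connection behind a mutex so concurrent
// callers never observe a half-replaced connection.
type guarded struct {
	mu      sync.RWMutex
	conn    Conn
	gen     uint64 // incremented on every replacement
	healthy bool
}

// set swaps in a new connection and marks it healthy.
func (g *guarded) set(c Conn) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.conn, g.healthy = c, true
	g.gen++
}

// get returns the active connection and its last recorded health.
func (g *guarded) get() (Conn, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.conn, g.healthy
}

// check pings once without holding the lock, then records the result
// unless the connection was replaced in the meantime.
func (g *guarded) check() {
	g.mu.RLock()
	c, gen := g.conn, g.gen
	g.mu.RUnlock()
	if c == nil {
		return
	}
	ok := c.Ping() == nil
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.gen == gen {
		g.healthy = ok
	}
}

// watch runs check on a fixed interval until ctx is cancelled.
func (g *guarded) watch(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			g.check()
		}
	}
}
```

A caller would start `go g.watch(ctx, 10*time.Second)` once and consult `g.get()` before each request. The generation counter is one way to avoid overwriting the health of a freshly replaced connection with a stale ping result.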

Current behavior:

Although the client accepts multiple URLs at initialization, it picks only one of them and keeps a single active connection. It has no failover logic during operation: if the connection drops, it retries the same URL only and never tries the others. As a result, if the initially selected RPC node becomes unavailable, the client cannot send transactions until that specific node is restored.
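
The missing behavior can be sketched roughly as follows. This is an illustration under assumptions, not the fix that was merged: `dialFunc`, `failover`, and `downDialer` are hypothetical names, and a real client would open a websocket where `dial` is called.

```go
package main

import (
	"errors"
	"fmt"
)

// dialFunc is a hypothetical dialer; a real client would open a
// websocket connection to url here.
type dialFunc func(url string) error

// failover tries every URL once, starting with the one after the
// failed index, and returns the index of the first URL that dials
// successfully. The failed URL itself is retried last.
// It assumes 0 <= failed < len(urls).
func failover(urls []string, failed int, dial dialFunc) (int, error) {
	if len(urls) == 0 {
		return -1, errors.New("no URLs configured")
	}
	var errs []error
	for i := 1; i <= len(urls); i++ {
		next := (failed + i) % len(urls)
		if err := dial(urls[next]); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", urls[next], err))
			continue
		}
		return next, nil
	}
	// Every URL failed; report all dial errors together.
	return -1, errors.Join(errs...)
}

// downDialer returns a dialFunc that fails for the listed URLs.
// It exists only to demonstrate failover without a network.
func downDialer(down ...string) dialFunc {
	return func(url string) error {
		for _, d := range down {
			if d == url {
				return errors.New("connection refused")
			}
		}
		return nil
	}
}
```

With the three URLs used in the reproduction steps and the node on port 9945 stopped, `failover(urls, 1, downDialer("ws://127.0.0.1:9945"))` selects index 2, the node on port 9946. Trying the failed URL last means a briefly restarted node is still picked up when all others are down.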

To Reproduce

The steps below replicate a recent incident in which some RPC nodes went offline, causing many ZOS nodes to miss their uptime reports. The same steps can be used to verify the fix once the connection failover feature has been implemented.

Steps to reproduce the behavior:

  1. Start multiple local RPC nodes:
docker run --name node1 -d -p 9944:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external
docker run --name node2 -d -p 9945:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external
docker run --name node3 -d -p 9946:9944 ghcr.io/threefoldtech/tfchain:2.9.2 --port 30333 --rpc-port 9944 --dev --rpc-external
  2. Navigate to the client directory:
cd /Projects/tfchain/clients
  3. Create a new main.go file and paste the following code:
package main

import (
	"log"
	"time"

	tfchain "github.com/threefoldtech/tfchain/clients/tfchain-client-go"
)

func main() {
	urls := []string{
		"ws://127.0.0.1:9944",
		"ws://127.0.0.1:9945",
		"ws://127.0.0.1:9946",
	}
	manager := tfchain.NewManager(urls...)
	substrateConnection, err := manager.Substrate()
	if err != nil {
		panic(err)
	}
	defer substrateConnection.Close()

	// Do something with the connection
	count := 0
	log.Println("starting loop")
	for {
		count++
		ti, err := substrateConnection.Time()
		if err != nil {
			log.Println("error getting time", err)
		} else {
			log.Println("time", ti, "count", count)
		}
		time.Sleep(time.Second * 10)
		if count > 20 {
			break
		}
	}
}
  4. Initialize the module and run the code:
go mod init example.com/test
go mod tidy
go run main.go
  5. Check the "Connecting to..." log message to identify the chosen URL/port.

  6. Stop the corresponding node using docker stop {name} and observe that the client fails to switch to a different URL.

{"level":"debug","url":"ws://127.0.0.1:9945","time":"2024-11-15T01:22:56+02:00","message":"connecting"}
2024/11/15 01:22:56 Connecting to ws://127.0.0.1:9945...
2024/11/15 01:22:56 starting loop
2024/11/15 01:22:56 time 2024-11-15 01:22:55 +0200 EET count 1
2024/11/15 01:23:06 time 2024-11-15 01:23:06 +0200 EET count 2
2024/11/15 01:23:16 time 2024-11-15 01:23:12 +0200 EET count 3
2024/11/15 01:23:26 time 2024-11-15 01:23:24 +0200 EET count 4
2024/11/15 01:23:36 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:23:46 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:23:56 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:06 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:16 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
2024/11/15 01:24:26 error getting time failed to lookup entity: dial tcp 127.0.0.1:9945: connect: connection refused
  7. Restart the node with docker start {name} and notice that the client reconnects only after the original node is online again.
@sameh-farouk sameh-farouk self-assigned this Nov 15, 2024
@sameh-farouk sameh-farouk added the type_feature New feature or request label Nov 15, 2024
@sameh-farouk sameh-farouk moved this to Accepted in 3.15.x Nov 15, 2024
@sameh-farouk sameh-farouk moved this from Accepted to In Progress in 3.15.x Nov 15, 2024
@sameh-farouk sameh-farouk added this to the 2.9.x milestone Nov 15, 2024
@sameh-farouk sameh-farouk changed the title Go Substrate Client - Connection Pool Implementation Go Substrate Client - Connection Failover Implementation Nov 25, 2024
@sameh-farouk sameh-farouk moved this from In Progress to Pending Review in 3.15.x Nov 27, 2024
@sameh-farouk sameh-farouk moved this from Pending Review to Pending Deployment in 3.15.x Nov 27, 2024
@sameh-farouk sameh-farouk moved this from Pending Deployment to In Verification in 3.15.x Dec 1, 2024

sameh-farouk commented Dec 1, 2024

Verified
This feature was deployed to Devnet as part of ZOS. At our request, the operations team stopped one RPC node on Devnet, and a failover occurred: after the connection was closed, the client switched to another URL on first use.

[Screenshot: photo_2024-12-01_15-12-17]

@github-project-automation github-project-automation bot moved this from In Verification to Done in 3.15.x Dec 1, 2024