fix(go-client): update config once replica server failed and forward to primary meta server if it was changed #1916

lengyuexuexuan · 2024-02-22T06:18:53Z

What problem does this PR solve?

What is changed and how does it work?

As for #1856.
When the go client is writing to a partition and the replica node core dumps, the request finishes after a timeout without the client updating its configuration. In this case, the problem can only be solved by restarting the go client.

In this PR, the client updates the table configuration automatically when a replica core dumps.
After testing, we found that the replica error is "context.DeadlineExceeded" when the replica core dumps.

incubator-pegasus/go-client/pegasus/table_connector.go

Lines 705 to 706 in 41141c1

    
    case context.DeadlineExceeded:
    	confUpdate = true

Therefore, when the client hits this error, it updates the configuration automatically.
The request itself is not retried, because the configuration update is only triggered by the timeout. A retry issued before the update completes would still fail, and there is also a risk of infinite retries.
It is therefore better to return the request error directly to the user and let the user try again.
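
The behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in table_connector.go: the `table` type, `triggerConfUpdate`, `handleReplicaError`, `writeToDeadReplica`, and `timedOutWrite` are hypothetical names introduced only for this example. It shows a timeout requesting a configuration refresh while the error is still returned to the caller without a retry.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// table is a hypothetical stand-in for the go-client table connector.
type table struct {
	confUpdates int // number of configuration refreshes requested so far
}

// triggerConfUpdate is a hypothetical hook that only counts requests here.
func (t *table) triggerConfUpdate() { t.confUpdates++ }

// handleReplicaError requests a configuration refresh when the replica call
// timed out, then returns the error unchanged. It never retries the request.
func (t *table) handleReplicaError(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		t.triggerConfUpdate()
	}
	return err
}

// writeToDeadReplica simulates a write to a replica that never answers.
func writeToDeadReplica(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// timedOutWrite issues one simulated write with a short timeout. It reports
// whether a refresh was requested and whether the caller saw a timeout error.
func timedOutWrite(t *table) (refreshed bool, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	before := t.confUpdates
	err := t.handleReplicaError(writeToDeadReplica(ctx))
	return t.confUpdates == before+1, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	t := &table{}
	refreshed, timedOut := timedOutWrite(t)
	fmt.Println("refresh requested:", refreshed, "error returned to caller:", timedOut)
}
```

With this shape, the caller decides when to try again, which avoids the infinite-retry risk mentioned above.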

As for #1880
When the client sends the RPC message "RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX" to a meta server that is not the primary, the meta server returns a response that forwards the client to the primary meta server.

Given this behaviour, even if the client does not have the primary meta server configured, it can still connect to the primary meta server in this way.

In this PR, we implement this function through the following steps.

First, parse the response and check whether its errno is ERR_FORWARD_TO_OTHERS. If it is, parse the response to get the primary meta server address.

incubator-pegasus/go-client/session/meta_call.go

Lines 166 to 177 in 41141c1

    
    func (c *metaCall) getMetaServiceForwardAddress(resp metaResponse) *base.RPCAddress {
    	rep, ok := resp.(*replication.QueryCfgResponse)
    	if !ok || rep.GetErr().Errno != base.ERR_FORWARD_TO_OTHERS.String() {
    		return nil
    	} else if rep.GetPartitions() == nil || len(rep.GetPartitions()) == 0 {
    		return nil
    	} else {
    		return rep.Partitions[0].Primary
    	}
    }

Second, check whether the address is already in the client configuration. If it is, do not add it again. Otherwise, establish a connection and pull the configuration directly from the primary meta server.

incubator-pegasus/go-client/session/meta_call.go

Lines 118 to 138 in 41141c1

    
    if forwardAddr == nil {
    	return false
    }
    addr := forwardAddr.GetAddress()
    found := false
    for i := range c.metaIPAddrs {
    	if addr == c.metaIPAddrs[i] {
    		found = true
    		break
    	}
    }
    if !found {
    	c.metaIPAddrs = append(c.metaIPAddrs, addr)
    	c.metas = append(c.metas, &metaSession{
    		NodeSession: newNodeSession(addr, NodeTypeMeta),
    		logger:      pegalog.GetLogger(),
    	})
    	curLeader = len(c.metas) - 1
    	c.metas[curLeader].logger.Printf("add forward address %s as meta server", addr)
    }
    resp, err = c.callFunc(ctx, c.metas[curLeader])
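
To make the "add only if unknown" step easier to follow in isolation, here is a simplified sketch. It is an illustration under assumptions rather than the PR's code: `metaList` and `addForward` are hypothetical names, and sessions are modelled as plain strings instead of `*metaSession`. As in the snippet above, a known address leaves the current leader index untouched, while a new address is appended and becomes the session to call.

```go
package main

import "fmt"

// metaList is a hypothetical, simplified stand-in for the metaCall fields.
type metaList struct {
	addrs    []string // known meta server addresses
	sessions []string // one entry per session (a string here, not *metaSession)
}

// addForward appends addr when it is not known yet. It returns the index of
// the session to call next and whether a new session was added.
func (m *metaList) addForward(addr string, curLeader int) (int, bool) {
	for _, known := range m.addrs {
		if known == addr {
			// Already configured: keep using the current leader index.
			return curLeader, false
		}
	}
	m.addrs = append(m.addrs, addr)
	m.sessions = append(m.sessions, "session:"+addr)
	return len(m.sessions) - 1, true
}

func main() {
	m := &metaList{
		addrs:    []string{"0.0.0.0:34601"},
		sessions: []string{"session:0.0.0.0:34601"},
	}
	idx, added := m.addForward("0.0.0.0:34602", 0)
	fmt.Println(idx, added, m.sessions[idx])
}
```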

Note that IP addresses and sessions do not have a one-to-one correspondence, because an IP address may be unavailable.
This is why, even when the primary meta server is already configured in the client, curLeader cannot be used as an index into the metaIPAddrs array.

incubator-pegasus/go-client/session/meta_call.go

Lines 123 to 128 in 41141c1

    
    for i := range c.metaIPAddrs {
    	if addr == c.metaIPAddrs[i] {
    		found = true
    		break
    	}
    }

Tests

Unit test
Manual test (add detailed scripts or steps below)

Start onebox, and the primary meta server is not added to the go client configuration.
The go client writes data to a certain partition and then kills the replica process.


acelyc111

Thanks for the contribution!

go-client/session/meta_session_test.go

acelyc111 · 2024-02-28T02:41:08Z


 		// This a trick for testing. If metaCall issue to other meta, not only to the leader, this nil channel will cause panic.
 		call.backupCh = nil
 		metaResp, err := call.Run(context.Background())
 		assert.Nil(t, err)
 		assert.Equal(t, metaResp.GetErr().Errno, base.ERR_OK.String())
 	}
 }
+
+// This case mocks the case that the server primary meta is not in the client metalist.


The meta servers in the test are 0.0.0.0:3460{1..3}. Which one is "not in the client metalist"?

When onebox starts, the primary meta server is randomized. Therefore, a loop is used, and only one meta server is passed to the go client each time. This ensures that redirection is required twice in the loop.
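
The reasoning in this reply can be checked with a small simulation. This is a hedged sketch, not the test in meta_session_test.go: `fakeCluster`, `query`, `resolve`, and `countForwards` are hypothetical names, and forwarding is modelled as a plain return value instead of an ERR_FORWARD_TO_OTHERS response. It shows that, with three meta servers and one of them primary, passing a single meta server per iteration leads to exactly two forwards, whichever server happens to be primary.

```go
package main

import "fmt"

// fakeCluster is a hypothetical stand-in for the onebox meta servers.
type fakeCluster struct {
	primary string // address of the randomly chosen primary meta server
}

// query answers directly on the primary. On any other meta server it returns
// the primary address, mimicking a forward response.
func (c fakeCluster) query(addr string) (forwardTo string, ok bool) {
	if addr == c.primary {
		return "", true
	}
	return c.primary, false
}

// resolve starts from one configured meta server and follows at most one
// forward. It returns the server that answered and whether a forward happened.
func resolve(c fakeCluster, configured string) (served string, forwarded bool) {
	forwardTo, ok := c.query(configured)
	if ok {
		return configured, false
	}
	return forwardTo, true
}

// countForwards passes a single meta server per iteration, as in the reply
// above, and counts how many iterations needed a forward.
func countForwards(c fakeCluster, metas []string) int {
	n := 0
	for _, m := range metas {
		if _, forwarded := resolve(c, m); forwarded {
			n++
		}
	}
	return n
}

func main() {
	metas := []string{"0.0.0.0:34601", "0.0.0.0:34602", "0.0.0.0:34603"}
	for _, p := range metas {
		fmt.Println("primary", p, "forwards", countForwards(fakeCluster{primary: p}, metas))
	}
}
```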


lengyuexuexuan added 2 commits February 19, 2024 10:59

Fix the bug that the go client wouldn't update conf when one replica …

41141c1

…coredump. Add one feature that the client would forward to the priamry when the metalist of client don't contain the primary.

fix go client bug

54b478a

github-actions bot added the go-client label Feb 22, 2024

Solve the problem of error "File is not goimports-ed (goimports)" i…

90caeba

…n the CI process

acelyc111 reviewed Feb 28, 2024


lengyuexuexuan and others added 5 commits March 4, 2024 19:36

Fix go client format issues

0b39c96

Fix go client format issues

1bdcd07

fix data race

de72a05

Merge branch 'apache:master' into fix_go_client

792a990

Merge branch 'apache:master' into fix_go_client

23b16e9

acelyc111 approved these changes Apr 8, 2024


lengyuexuexuan requested a review from acelyc111 April 16, 2024 08:38

lengyuexuexuan mentioned this pull request May 29, 2024

chore(go-client): add generation thrift files of go-client #1917

Merged

empiredan approved these changes Jun 20, 2024


empiredan merged commit 2ab5be6 into apache:master Jun 20, 2024
16 checks passed
