Protobuf and etcd upgrade #2997

atoulme · 2021-10-26T15:57:35Z

Type of change

Improvement (improvement to code, performance, etc)

Description

Update protobuf and etcd to their latest versions.

Additional details

This is a reprise of the work on updating protobuf in #2185.

Related issues

FAB-18363

atoulme · 2021-10-28T19:08:56Z

The etcd upgrade introduced a few regressions, and they are interesting ones. Don't mind me, I'll keep digging.

common/cauthdsl/policy_test.go

orderer/consensus/etcdraft/chain_test.go

Param-S · 2022-01-20T07:45:40Z

@atoulme, can you update your branch and the PR?

atoulme · 2022-01-20T23:54:01Z

Done - tests running.

Param-S · 2022-01-21T08:07:57Z

There are failures related to the protobuf changes:
https://dev.azure.com/Hyperledger/Fabric/_build/results?buildId=46926&view=logs&j=6b58850f-3858-5a05-33e2-5e41cbf03c4e&t=bddec1cf-ba37-5883-9c3e-fd1e8608f9a1&l=3471
Please check core/chaincode/lifecycle/serializer_test.go.

https://dev.azure.com/Hyperledger/Fabric/_build/results?buildId=46926&view=logs&j=6b58850f-3858-5a05-33e2-5e41cbf03c4e&t=bddec1cf-ba37-5883-9c3e-fd1e8608f9a1&l=3726
core/endorser/endorser_test.go:997

This test case fails with a panic. I am seeing the same issue in my local environment and am currently debugging it.
https://dev.azure.com/Hyperledger/Fabric/_build/results?buildId=46926&view=logs&j=6b58850f-3858-5a05-33e2-5e41cbf03c4e&t=bddec1cf-ba37-5883-9c3e-fd1e8608f9a1&l=5373

Param-S · 2022-01-24T18:57:55Z

@atoulme, the following change should address the panic mentioned in my last comment. Please check by updating this PR.

git diff chain_test.go
diff --git a/orderer/consensus/etcdraft/chain_test.go b/orderer/consensus/etcdraft/chain_test.go
index c716fa225..826396287 100644
--- a/orderer/consensus/etcdraft/chain_test.go
+++ b/orderer/consensus/etcdraft/chain_test.go
@@ -212,7 +212,10 @@ var _ = Describe("Chain", func() {
                }

                JustBeforeEach(func() {
-                       chain, err = etcdraft.NewChain(support, opts, configurator, nil, cryptoProvider, noOpBlockPuller, fakeHaltCallbacker.HaltCallback, observeC)
+                       rpc := &mocks.FakeRPC{}
+                       chain, err = etcdraft.NewChain(support, opts, configurator, rpc, cryptoProvider, noOpBlockPuller, fakeHaltCallbacker.HaltCallback, observeC)

orderer/consensus/etcdraft/node.go

atoulme · 2022-01-24T23:45:15Z

Thanks Param, I have applied your changes.

atoulme · 2022-01-25T01:13:15Z

Looks like we have 2 errors left, but both pass locally on my laptop. Any ideas?

Param-S · 2022-01-25T09:00:23Z

I reran the unit-tests job on the assumption that the problem is timing-related, and the failed test cases passed on the second run. It needs a bit of investigation to find out what went wrong in the first run.

There is a separate failure that seems to be related to the gRPC message update. We need to update the test case to match the latest message.

https://dev.azure.com/Hyperledger/Fabric/_build/results?buildId=47086&view=logs&j=e306c17a-d139-54bf-a475-f5a11259cee7&t=1e3023a5-584f-52f3-49bc-66bd27d27b6d&l=130
zap_test.go:310:
Error Trace: zap_test.go:310
Error: Not equal:
expected: "grpc DEBUG TestGRPCLogger message\n"
actual : "grpc DEBUG callWrapper message\n"

atoulme · 2022-01-26T01:51:16Z

Yes, I was wondering about this. I fixed the problem now by increasing the caller skip level when zap looks for the caller.
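
For readers unfamiliar with caller skipping, here is a minimal standard-library sketch of the effect seen in the failing assertion above. It is not zap's implementation and not the code from this PR; zap exposes the same idea through its `AddCallerSkip` option. The names `callerName`, `logf`, `callWrapper`, and `businessCode` are hypothetical and exist only for this illustration. When a new wrapper frame sits between the logger and the real call site, the reported caller becomes the wrapper unless one more frame is skipped.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// callerName returns the short name of the function that called
// callerName's own caller, moved up by `skip` additional frames.
//
//go:noinline
func callerName(skip int) string {
	pcs := make([]uintptr, 1)
	// Frame 0 is runtime.Callers, 1 is callerName, 2 is its caller (logf).
	if runtime.Callers(skip+3, pcs) == 0 {
		return "unknown"
	}
	frame, _ := runtime.CallersFrames(pcs).Next()
	return frame.Function[strings.LastIndex(frame.Function, ".")+1:]
}

// logf is a hypothetical logging entry point that annotates each
// message with the name of the function it believes called it.
//
//go:noinline
func logf(skip int, msg string) string {
	return fmt.Sprintf("grpc DEBUG %s %s", callerName(skip), msg)
}

// callWrapper stands in for an extra layer that a newer dependency
// inserts between the real call site and the logger.
//
//go:noinline
func callWrapper(skip int, msg string) string {
	return logf(skip, msg)
}

// businessCode is the call site we actually want to see in the log.
//
//go:noinline
func businessCode(skip int) string {
	return callWrapper(skip, "message")
}

func main() {
	fmt.Println(businessCode(0)) // the wrapper is reported as the caller
	fmt.Println(businessCode(1)) // one more skipped frame reports businessCode
}
```

The `//go:noinline` directives keep the frame layout deterministic for the demonstration; a real logger would normally rely on `runtime.CallersFrames` to account for inlining instead.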

gossip/service/gossip_service_test.go

gossip/util/logging.go

gossip/service/gossip_service_test.go

protoutil/txutils_test.go

yacovm · 2022-01-26T20:42:59Z

What is also missing is an explanation of the nature of the changes.

Can you also explain the following?

How Jason's comment and the changes to the Raft chain (in the commits) are tied. Can you explain how the new Raft version behaves and what changes were done to accommodate it?
Whether running mixed Raft clusters (suppose, a cluster of 4) is possible after this change or not.

orderer/consensus/etcdraft/node.go

atoulme · 2022-01-27T01:19:46Z

@yacovm at the risk of disappointing you, this is just a straight-up upgrade of etcd and protobuf to newer versions. There are some API changes, especially in the way a node starts and is configured with existing peers. There are some differences in the errors thrown by protobuf. There are no functional changes besides this.

I have no clue at all as to whether those changes make this code incompatible with previous revisions, i.e., I have not tried to form a cluster with the code before and after, and I am certainly not the best equipped for that.

yacovm · 2022-01-27T11:40:53Z

Sure, I am not blaming you or implying that the work you did is not valuable.
However, I am concerned about this claim:

"There are no functional changes besides this."

I understood that etcd made a functional change in how it handles snapshots and reconfiguration (specifically, removal of nodes). As per @guoger's comment:

"If the snapshot sent was taken before adding new node, then it certainly does not contain new node, and would fail the check aforementioned. etcd server itself handles this by injecting the actual ConfState to snapshot message (produced by raft) before sending them out."

Now, clearly this means that the PR already contains a functional change (in the dependencies).

It is up to Fabric to make sure that the functional change in dependencies doesn't translate to a functional change in operations.

I'm not sure that the trick that Jay pointed out that etcd is doing (which you also attempt to perform) is done correctly, because:

It only takes place if a config change was applied:

			state := n.confState.Load()
			if state != nil {
				msg.Snapshot.Metadata.ConfState.Voters = state.(*raftpb.ConfState).Voters
			}

What if this node was restarted, so that confState is nil? In that case, won't the voters be the ones from before the config change?

What if, since we do the reconfiguration in two stages, the second entry from Raft is yet to be applied? Won't we then still send the old voters?

"I have no clue at all as to whether those changes make this code incompatible with previous revisions, i.e., I have not tried to form a cluster with the code before and after, and I am certainly not the best equipped for that."

I understand, but I think we need to test it before merging this PR so we will know how to advise users.

yacovm · 2022-04-18T12:07:34Z

@Param-S I think it works for you just because the latest config block was still relatively "fresh" and as a result it's still in the WAL and hasn't been garbage collected by snapshot.

If you put many transactions between the last config and the restart of the node, you will see that ApplyConfChange will not be called.

I made a small test where I put 100 transactions to enforce a snapshot that prunes the WAL:

			By("performing operation with orderer1")
			for i := 1; i < 100; i++ {
				env := CreateBroadcastEnvelope(network, o1, network.SystemChannel.Name, make([]byte, 1024 * 10))
				resp, err := ordererclient.Broadcast(network, o1, env)
				Expect(err).NotTo(HaveOccurred())
				Expect(resp.Status).To(Equal(common.Status_SUCCESS))

				block := FetchBlock(network, o1, uint64(i), network.SystemChannel.Name)
				Expect(block).NotTo(BeNil())
			}

and added a print to ApplyConfChange:

func (n *node) ApplyConfChange(cc raftpb.ConfChange) *raftpb.ConfState {
	debug.PrintStack()
	return n.Node.ApplyConfChange(cc)
}

and it doesn't print it after a restart.

Param-S · 2022-05-12T18:47:39Z

@yacovm The current implementation of NewChain reads the ConfState of the latest snapshot and stores it in the chain object.

fabric/orderer/consensus/etcdraft/chain.go

Line 245 in ccfa8a4

cc = s.Metadata.ConfState

I have now updated the same flow to also store this value in the node's confState attribute, so that it can be used later in the flow.
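
As a rough illustration of the idea discussed in this thread (not the actual Fabric or etcd code), the sketch below uses simplified stand-in types instead of the real `raftpb` ones. The names `ConfState`, `Snapshot`, `newNode`, `applyConfChange`, and `prepareSnapshot` are assumptions made for this example. It shows why a node whose confState is only set when a config change is applied leaves outgoing snapshots untouched after a restart, and how seeding the value from the latest snapshot at initialization avoids that.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ConfState and Snapshot are simplified stand-ins for the raftpb types.
type ConfState struct{ Voters []uint64 }

type Snapshot struct{ ConfState ConfState }

// node keeps the most recent ConfState it knows about.
type node struct {
	confState atomic.Value // holds *ConfState; empty until first Store
}

// newNode optionally seeds confState from the latest snapshot found at
// startup. Passing nil models a node that only learns the ConfState
// when a config change is applied.
func newNode(latest *Snapshot) *node {
	n := &node{}
	if latest != nil {
		cs := latest.ConfState
		n.confState.Store(&cs)
	}
	return n
}

// applyConfChange records the ConfState produced by a config change.
func (n *node) applyConfChange(cs ConfState) {
	n.confState.Store(&cs)
}

// prepareSnapshot injects the latest known voters into an outgoing
// snapshot. If nothing has been stored yet, the snapshot is sent as is.
func (n *node) prepareSnapshot(s Snapshot) Snapshot {
	if state := n.confState.Load(); state != nil {
		s.ConfState.Voters = state.(*ConfState).Voters
	}
	return s
}

func main() {
	outgoing := Snapshot{ConfState: ConfState{Voters: []uint64{1, 2, 3}}}
	latest := &Snapshot{ConfState: ConfState{Voters: []uint64{1, 2, 3, 4}}}

	unseeded := newNode(nil) // restarted, no config change applied since
	fmt.Println(unseeded.prepareSnapshot(outgoing).ConfState.Voters)

	seeded := newNode(latest) // restarted, seeded from the latest snapshot
	fmt.Println(seeded.prepareSnapshot(outgoing).ConfState.Voters)
}
```

In this toy model the unseeded node forwards the outgoing voters unchanged, while the seeded node replaces them with the voters from the latest snapshot. Whether this covers the two-stage reconfiguration case raised earlier is not something the sketch can show.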

Param-S · 2022-05-14T13:39:32Z

@yacovm The ConfState is now picked up from the latest snapshot at node initialization time, which should address the issue. Could you check and confirm?

yacovm · 2022-05-16T09:10:00Z

Looks good, let's get some more eyes on this.

atoulme · 2022-05-24T22:24:39Z

How many eyes needed? Can this be merged please?

yacovm · 2022-05-25T11:05:08Z

At least one more pair of eyes besides mine :-)

atoulme · 2022-05-25T13:49:20Z

Do you have someone in mind?

yacovm · 2022-05-25T15:01:26Z

Probably @guoger is best, afterwards maybe @C0rWin or @manish-sethi

C0rWin · 2022-05-26T07:08:37Z

I can do it over the weekend

atoulme · 2022-06-01T15:49:08Z

@C0rWin any update?

atoulme · 2022-06-03T16:44:14Z

@C0rWin any update? @yacovm anyone else available or can we please merge?

C0rWin · 2022-06-03T16:52:38Z

This is quite a colossal PR to review. Yes, it takes some time, as I'm running several tests and checking their output... I know you have been at it for a long while, but please have patience, and we will merge it.

atoulme · 2022-06-04T06:30:14Z

Sure, any ETA?

atoulme · 2022-06-17T18:42:07Z

@C0rWin any update please? Do the tests not cover this change sufficiently? What can we do here?

C0rWin

@atoulme sorry, it took me some time to review and run locally. The PR seems fine, though it is a bit bigger than I would like for a single merge :)

But given there is no other way, LGTM, and thanks.

atoulme · 2022-06-19T13:21:55Z

Great, please merge?

SamYuan1990 · 2022-06-21T10:58:07Z

Hi guys,

So does merging this PR mean that the protobuf upgrade is complete, and that PRs that were blocked by it, such as #3202, can be reopened for review?

thanks and regards
Sam

denyeart · 2022-06-21T13:54:32Z

@SamYuan1990 Yes, please proceed now!

SamYuan1990 · 2022-06-21T13:57:46Z

ok, @denyeart , please take a look at #3498

atoulme requested a review from a team as a code owner October 26, 2021 15:57

atoulme force-pushed the protobuf branch 2 times, most recently from 5a16af1 to 2f6c72d Compare October 26, 2021 22:35

atoulme force-pushed the protobuf branch 3 times, most recently from 7d8a93c to f6fa718 Compare November 6, 2021 01:13

atoulme force-pushed the protobuf branch from 9a598e1 to 538d533 Compare December 1, 2021 18:11

atoulme force-pushed the protobuf branch from 538d533 to 3c5f908 Compare January 11, 2022 01:03

Param-S reviewed Jan 12, 2022

common/cauthdsl/policy_test.go (outdated)

Param-S reviewed Jan 16, 2022

orderer/consensus/etcdraft/chain_test.go (outdated)

atoulme force-pushed the protobuf branch from 9e55e70 to 024fca1 Compare January 20, 2022 23:53

atoulme force-pushed the protobuf branch from 024fca1 to a8ca6a7 Compare January 21, 2022 00:03

Param-S reviewed Jan 24, 2022

orderer/consensus/etcdraft/node.go