
Protobuf and etcd upgrade #2997

Merged
merged 40 commits into hyperledger:main on Jun 20, 2022
Conversation

atoulme (Contributor) commented Oct 26, 2021

Type of change

  • Improvement (improvement to code, performance, etc)

Description

Update protobuf and etcd to their latest versions.
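For context, a bump like this is mostly a go.mod change plus regenerated and adjusted code. A rough sketch of what the dependency section might look like, assuming the post-split etcd 3.5 module layout and the google.golang.org/protobuf API; the version numbers are illustrative placeholders, not necessarily the ones in this PR:

require (
	github.com/golang/protobuf v1.5.2 // kept only as a shim over google.golang.org/protobuf
	google.golang.org/protobuf v1.28.0
	go.etcd.io/etcd/raft/v3 v3.5.4
	go.etcd.io/etcd/server/v3 v3.5.4
)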

Additional details

This is a reprise of the work on updating protobuf started in #2185.

Related issues

FAB-18363

atoulme requested a review from a team as a code owner on October 26, 2021 15:57
atoulme force-pushed the protobuf branch 2 times, most recently from 5a16af1 to 2f6c72d on October 26, 2021 22:35
atoulme (Contributor, Author) commented Oct 28, 2021

A few regressions from the etcd upgrade, very interesting issues to look into. Don't mind me, I'll keep digging.

Param-S (Contributor) commented Jan 20, 2022

@atoulme, can you update your branch and refresh the PR?

atoulme (Contributor, Author) commented Jan 20, 2022

Done - tests running.

Param-S (Contributor) commented Jan 21, 2022

Param-S (Contributor) commented Jan 24, 2022

@atoulme, the following change should address the panic (mentioned in the last comment). Please check by updating this PR.

git diff chain_test.go
diff --git a/orderer/consensus/etcdraft/chain_test.go b/orderer/consensus/etcdraft/chain_test.go
index c716fa225..826396287 100644
--- a/orderer/consensus/etcdraft/chain_test.go
+++ b/orderer/consensus/etcdraft/chain_test.go
@@ -212,7 +212,10 @@ var _ = Describe("Chain", func() {
                }

                JustBeforeEach(func() {
-                       chain, err = etcdraft.NewChain(support, opts, configurator, nil, cryptoProvider, noOpBlockPuller, fakeHaltCallbacker.HaltCallback, observeC)
+                       rpc := &mocks.FakeRPC{}
+                       chain, err = etcdraft.NewChain(support, opts, configurator, rpc, cryptoProvider, noOpBlockPuller, fakeHaltCallbacker.HaltCallback, observeC)

atoulme (Contributor, Author) commented Jan 24, 2022

Thanks Param, I have applied your changes.

atoulme (Contributor, Author) commented Jan 25, 2022

Looks like we have 2 failing tests left, but both pass locally on my laptop. Any ideas?

Param-S (Contributor) commented Jan 25, 2022

I reran the unit-tests job (on the assumption that the problem is timing related), and the failed test cases pass in the second run. It needs a bit of investigation to work out what the issue was in the first run.

There is a separate failure which seems to me to be related to the gRPC message update. We need to update the test case with respect to the latest message.

https://dev.azure.com/Hyperledger/Fabric/_build/results?buildId=47086&view=logs&j=e306c17a-d139-54bf-a475-f5a11259cee7&t=1e3023a5-584f-52f3-49bc-66bd27d27b6d&l=130
zap_test.go:310:
Error Trace: zap_test.go:310
Error: Not equal:
expected: "grpc DEBUG TestGRPCLogger message\n"
actual : "grpc DEBUG callWrapper message\n"

atoulme (Contributor, Author) commented Jan 26, 2022

Yes, I was wondering about this. I have fixed the problem now by increasing the caller skip level zap uses when it looks up the caller.
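For readers unfamiliar with the mechanism: zap resolves the reported caller by skipping a fixed number of stack frames, so adding a wrapper function between the caller and the logger shifts which frame gets reported, which is what the failing assertion above tripped over. A minimal sketch of the general fix, using zap's public AddCallerSkip option (the callWrapper name below is just a hypothetical stand-in echoing the test output, not Fabric's actual wiring):

package main

import "go.uber.org/zap"

// callWrapper adds one level of indirection in front of the logger, the way a
// gRPC logging adapter does.
func callWrapper(l *zap.SugaredLogger, msg string) {
	l.Debug(msg)
}

func main() {
	base, _ := zap.NewDevelopment()
	defer base.Sync()

	// Without the extra skip, zap reports callWrapper as the caller.
	// AddCallerSkip(1) skips one more frame so the function that called
	// callWrapper is reported instead.
	adjusted := base.WithOptions(zap.AddCallerSkip(1)).Sugar()
	callWrapper(adjusted, "message")
}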

Review threads on gossip/util/logging.go and protoutil/txutils_test.go were marked outdated and resolved.
yacovm (Contributor) commented Jan 26, 2022

What is also missing is an explanation of the nature of the changes.

Can you also explain the following?

  1. How Jason's comment and the changes to the Raft chain (in the commits) are tied. Can you explain how the new Raft version behaves and what changes were done to accommodate it?
  2. Whether running mixed-version Raft clusters (say, a cluster of 4) is possible after this change or not.

atoulme (Contributor, Author) commented Jan 27, 2022

@yacovm at the risk of disappointing you, this is just a straight-up upgrade of etcd and protobuf to newer versions. There are some API changes, especially in the way a node starts and is configured with existing peers. There are some differences in the errors thrown by protobuf. There are no functional changes besides this.

I have no clue at all as to whether those changes make this code incompatible with previous revisions, i.e. I have not tried to form a cluster with the code before and after, and I am certainly not the best equipped for that.
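To illustrate the start/configure API difference referred to above, here is a sketch against the go.etcd.io/etcd/raft/v3 package (not the literal Fabric code): a brand-new member is bootstrapped with an explicit peer list via StartNode, while a member that already has persisted state is brought back with RestartNode and recovers its peers from storage.

package example

import "go.etcd.io/etcd/raft/v3"

func startOrRestart(id uint64, storage *raft.MemoryStorage, peers []raft.Peer, fresh bool) raft.Node {
	c := &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	if fresh {
		// New cluster member: bootstrap with the initial peer set.
		return raft.StartNode(c, peers)
	}
	// Existing member: restart from persisted state; the peer set comes
	// from the snapshot and entries already in storage.
	return raft.RestartNode(c)
}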

yacovm (Contributor) commented Jan 27, 2022

> @yacovm at the risk of disappointing you, this is just a straight-up upgrade of etcd and protobuf to newer versions. There are some API changes, especially in the way a node starts and is configured with existing peers. There are some differences in the errors thrown by protobuf. There are no functional changes besides this.
>
> I have no clue at all as to whether those changes make this code incompatible with previous revisions, i.e. I have not tried to form a cluster with the code before and after, and I am certainly not the best equipped for that.

Sure, I am not blaming you or implying the work you did is not valuable.
However, I am concerned about:

> There are no functional changes besides this.

I understood that etcd made a functional change in how they handle snapshots and reconfiguration (specifically, removal of nodes). As per @guoger's comment:

> If the snapshot sent was taken before adding the new node, then it certainly does not contain the new node, and would fail the check aforementioned. The etcd server itself handles this by injecting the actual ConfState into the snapshot message (produced by raft) before sending it out.

Now, clearly this means that the PR already contains a functional change (in the dependencies).

It is up to Fabric to make sure that the functional change in dependencies doesn't translate to a functional change in operations.

I'm not sure that the trick Jay pointed out that etcd does (which you also attempt to perform) is done correctly, because:

  1. It only takes place if a config change was applied:
			state := n.confState.Load()
			if state != nil {
				msg.Snapshot.Metadata.ConfState.Voters = state.(*raftpb.ConfState).Voters
			}

What if this node was restarted, so that confState is nil? In that case, won't the snapshot carry the voters from before the config change?

  2. What if, since we do the reconfig in two stages, the second entry from Raft is yet to be applied, and we still send the old voters?

> I have no clue at all as to whether those changes make this code incompatible with previous revisions, i.e. I have not tried to form a cluster with the code before and after, and I am certainly not the best equipped for that.

I understand, but I think we need to test it before merging this PR so we will know how to advise users.

yacovm (Contributor) commented Apr 18, 2022

@Param-S I think it works for you just because the latest config block was still relatively "fresh", and as a result it is still in the WAL and hasn't been garbage collected by a snapshot.

If you put many transactions between the last config and the restart of the node, you will see that ApplyConfChange will not be called.

I made a small test where I put 100 transactions to enforce a snapshot that prunes the WAL:

			By("performing operation with orderer1")
			for i := 1; i < 100; i++ {
				env := CreateBroadcastEnvelope(network, o1, network.SystemChannel.Name, make([]byte, 1024 * 10))
				resp, err := ordererclient.Broadcast(network, o1, env)
				Expect(err).NotTo(HaveOccurred())
				Expect(resp.Status).To(Equal(common.Status_SUCCESS))

				block := FetchBlock(network, o1, uint64(i), network.SystemChannel.Name)
				Expect(block).NotTo(BeNil())
			}

and added a print to ApplyConfChange:

func (n *node) ApplyConfChange(cc raftpb.ConfChange) *raftpb.ConfState {
	debug.PrintStack()
	return n.Node.ApplyConfChange(cc)
}

and it does not print the stack after a restart.

Param-S (Contributor) commented May 12, 2022

@yacovm The current implementation of NewChain reads the ConfState of the latest snapshot and stores it in the chain object.

cc = s.Metadata.ConfState

Now I have updated the same flow to set the same value on the node's confState attribute, so it can be used later in the flow.
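Roughly, the idea (a sketch only, using the same atomic.Value pattern as the snippet quoted earlier in this thread; the names are illustrative, not the exact Fabric identifiers) is to seed the node's cached ConfState from the latest snapshot at initialization, so snapshot messages sent right after a restart carry the correct voter set even before any new config change is applied:

package example

import (
	"sync/atomic"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

type node struct {
	confState atomic.Value // holds *raftpb.ConfState
}

// seedConfState caches the ConfState recorded in the latest snapshot at
// chain/node initialization time.
func (n *node) seedConfState(snap raftpb.Snapshot) {
	if raft.IsEmptySnap(snap) {
		return
	}
	cs := snap.Metadata.ConfState
	n.confState.Store(&cs)
}

// patchSnapshotMsg mirrors the send path quoted earlier: outgoing snapshot
// messages get the cached voter set injected before being sent out.
func (n *node) patchSnapshotMsg(msg *raftpb.Message) {
	if state := n.confState.Load(); state != nil {
		msg.Snapshot.Metadata.ConfState.Voters = state.(*raftpb.ConfState).Voters
	}
}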

Param-S (Contributor) commented May 14, 2022

@yacovm Now the ConfState is picked up from the latest snapshot at node initialization time itself, which should address the issue. Could you check and confirm?

yacovm (Contributor) commented May 16, 2022

Looks good, let's get some more eyes on this.

atoulme (Contributor, Author) commented May 24, 2022

How many eyes needed? Can this be merged please?

yacovm (Contributor) commented May 25, 2022

> How many eyes needed? Can this be merged please?

At least one more pair of eyes besides mine :-)

atoulme (Contributor, Author) commented May 25, 2022

Do you have someone in mind?

yacovm (Contributor) commented May 25, 2022

> Do you have someone in mind?

Probably @guoger is best, afterwards maybe @C0rWin or @manish-sethi

C0rWin (Contributor) commented May 26, 2022

> Do you have someone in mind?
>
> Probably @guoger is best, afterwards maybe @C0rWin or @manish-sethi

I can do it over the weekend

atoulme (Contributor, Author) commented Jun 1, 2022

@C0rWin any update?

atoulme (Contributor, Author) commented Jun 3, 2022

@C0rWin any update? @yacovm anyone else available or can we please merge?

C0rWin (Contributor) commented Jun 3, 2022

> @C0rWin any update? @yacovm anyone else available or can we please merge?

This is quite a colossal PR to review. Yes, it takes some time, as I'm running several tests to check the output... I know you have been working on it for a long while, but please have patience, and we will merge it.

atoulme (Contributor, Author) commented Jun 4, 2022

Sure, any ETA?

atoulme (Contributor, Author) commented Jun 17, 2022

@C0rWin any update please? Are the tests not covering this change well enough? What can we do here?

C0rWin (Contributor) left a review comment

@atoulme sorry, it took me some time to review and run locally, and it seems the PR is fine, though a little bit bigger than I would normally merge :)

but given there is no other way, LGTM and thanks

atoulme (Contributor, Author) commented Jun 19, 2022

Great, please merge?

@yacovm yacovm merged commit 3dfc0b3 into hyperledger:main Jun 20, 2022
SamYuan1990 (Contributor) commented
Hi guys,

So does this PR being merged mean the protobuf upgrade is complete? And does that mean PRs that were blocked by protobuf, such as #3202, can be reopened in the PR review process?

Thanks and regards,
Sam

denyeart (Contributor) commented
@SamYuan1990 Yes, please proceed now!

SamYuan1990 (Contributor) commented
> @SamYuan1990 Yes, please proceed now!

OK @denyeart, please take a look at #3498.
