
Error in upgrading orderer from 2.4.2 to 2.5.0 (wal: max entry size limit exceeded) #4290

Closed
sarojsh1 opened this issue Jun 20, 2023 · 17 comments

@sarojsh1

sarojsh1 commented Jun 20, 2023

Description

When upgrading the orderer from 2.4.2 to 2.5.0, it gives the error below. Starting the orderer container with image tag 2.4.2 works fine, but not with 2.5.0.

[orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded

Steps to reproduce

Upgrade Hyperledger Fabric components from 2.4.2 to 2.5.0. The setup has 1 orderer and 2 peers. Binaries and other config files were updated as per the 2.5.0 release.
Followed the steps below to reproduce:

  1. Backed up the orderer container volume.
  2. Stopped the orderer container.
  3. Removed the orderer container.
  4. Updated the orderer image tag from 2.4.2 to 2.5.0.
  5. Tried to start a new orderer container with the new image (2.5.0).
  6. The orderer did not start and threw the error below.

[orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded

@sarojsh1 sarojsh1 added the bug label Jun 20, 2023
@yacovm
Contributor

yacovm commented Jun 20, 2023

Do you have transactions that are extremely big?

@sarojsh1
Author

Do you have transactions that are extremely big?

Yes, I have a few transactions with sizes like 31180 KB, 9404 KB, 3453 KB, etc.

@sarojsh1
Author

Do you have transactions that are extremely big?

Yes, I have a few transactions with sizes like 31180 KB, 9404 KB, 3453 KB, etc.

Any suggestions or a solution to fix this? Are these transactions a showstopper for upgrading from 2.4.2 to 2.5.0?

@yacovm
Contributor

yacovm commented Jun 26, 2023

You only have 1 orderer, so this is not a production environment, I hope?

@sarojsh1
Author

You only have 1 orderer, so this is not a production environment, I hope?

Yes, this is not a production environment. But I have the concerns below and would appreciate any suggestions or help with them.

  1. Is a big transaction size an issue?
  2. If the same situation comes up in a production environment, how do we fix it (I understand that production will not have only 1 orderer, but still assuming the use case)?
  3. In a multi-orderer environment, is this not an issue?
  4. Is upgrading and syncing an orderer from the other orderers in the channel (not using the existing files) different from upgrading and starting the orderer from the existing files?

@yacovm
Contributor

yacovm commented Jun 27, 2023

You only have 1 orderer, so this is not a production environment, I hope?

Yes, this is not a production environment. But I have the concerns below and would appreciate any suggestions or help with them.

1. Is a big transaction size an issue?

2. If the same situation comes up in a production environment, how do we fix it (I understand that production will not have only 1 orderer, but still assuming the use case)?

3. In a multi-orderer environment, is this not an issue?

4. Is upgrading and syncing an orderer from the other orderers in the channel (not using the existing files) different from upgrading and starting the orderer from the existing files?

  1. As you can see from the code of etcd, it is (a sketch of the relevant check follows below).
  2. Probably by removing the node from the channel and re-adding it.
  3. It is an issue if you need to restart from a WAL that contains such a record. The WAL is periodically pruned, so it really depends on luck, which is not a good thing.
  4. I am not sure you can just bring a WAL from another orderer.
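
For context on point 1, the cap lives in etcd's WAL decoder. Below is a minimal sketch of the guard as I recall it from the etcd v3.5.x server/wal code (paraphrased from memory, not a verbatim excerpt; the exact constant name and the 10 MiB value may differ slightly between releases).

package walsketch

import "errors"

// Paraphrased from etcd v3.5.x server/wal/decoder.go (from memory, not verbatim):
// while replaying the log, any single record larger than roughly 10 MiB is rejected.
const maxWALEntrySizeLimit = int64(10 * 1024 * 1024)

// This error string is the one surfacing in the orderer log above.
var ErrMaxWALEntrySizeLimitExceeded = errors.New("wal: max entry size limit exceeded")

// checkRecordSize mirrors the guard applied to every record read back from the WAL.
// A raft entry carrying a ~30 MB Fabric transaction trips it on restart.
func checkRecordSize(recBytes int64) error {
	if recBytes > maxWALEntrySizeLimit {
		return ErrMaxWALEntrySizeLimitExceeded
	}
	return nil
}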

@denyeart
Contributor

denyeart commented Jul 8, 2023

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

@yacovm
Contributor

yacovm commented Jul 10, 2023

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

Based on the git blame when I looked at the code, I think it's related to some update, but I'm not sure about the version.

But what are you trying to say? Do you want to put in a release note message or something?

@sarojsh1
Author

sarojsh1 commented Jul 12, 2023

Adding more details about the error when trying to upgrade with a single orderer:

orderer.test.com | 2023-07-12 10:38:49.855 UTC 04b9 DEBU [blkstorage] nextBlockBytesAndPlacementInfo -> blockbytes [5858] read from file [1]
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04ba DEBU [orderer.commmon.multichannel] newBlockWriter -> [channel: testchannel] Creating block writer for tip of chain (blockNumber=149, lastConfigBlockNum=2, lastConfigSeq=2)
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04bb INFO [orderer.consensus.etcdraft] HandleChain -> EvictionSuspicion not set, defaulting to 10m0s
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04bc INFO [orderer.consensus.etcdraft] HandleChain -> Without system channel: after eviction Registrar.SwitchToFollower will be called
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04bd DEBU [orderer.consensus.etcdraft] CreateStorage -> Loaded snapshot at Term 2 and Index 44, Nodes: [1] channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04be INFO [orderer.consensus.etcdraft] createOrReadWAL -> Found WAL data at path '/var/hyperledger/production/orderer/etcdraft/wal/testchannel', replaying it channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04bf DEBU [orderer.consensus.etcdraft] createOrReadWAL -> Loading WAL at Term 2 and Index 44 channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:50.015 UTC 04c0 WARN [orderer.consensus.etcdraft] createOrReadWAL -> Failed to read WAL: wal: max entry size limit exceeded channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:50.034 UTC 04c1 PANI [orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded
orderer.test.com | panic: Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded
orderer.test.com |
orderer.test.com | goroutine 1 [running]:
orderer.test.com | go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0008c6000, {0x0, 0x0, 0x0})
orderer.test.com | /vendor/go.uber.org/zap/zapcore/entry.go:232 +0x44c
orderer.test.com | go.uber.org/zap.(*SugaredLogger).log(0xc000123cd8, 0x4, {0x10d10d9?, 0x1222e00?}, {0xc0007184e8?, 0x20?, 0x195a760?}, {0x0, 0x0, 0x0})
orderer.test.com | /vendor/go.uber.org/zap/sugar.go:227 +0xee
orderer.test.com | go.uber.org/zap.(*SugaredLogger).Panicf(...)
orderer.test.com | /vendor/go.uber.org/zap/sugar.go:159
orderer.test.com | github.com/hyperledger/fabric/common/flogging.(*FabricLogger).Panicf(...)
orderer.test.com | /common/flogging/zap.go:74
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).initAppChannels(0xc000205180, {0xc000385520, 0x2, 0x10?}, 0x10?)
orderer.test.com | /orderer/common/multichannel/registrar.go:345 +0xac6
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).init(0xc000205180, 0x44d931?)
orderer.test.com | /orderer/common/multichannel/registrar.go:168 +0xa5
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).Initialize(0xc000205180, 0xc00038c420?)
orderer.test.com | /orderer/common/multichannel/registrar.go:144 +0x33
orderer.test.com | github.com/hyperledger/fabric/orderer/common/server.initializeMultichannelRegistrar(_, _, _, {0x0, {0x0, {0xc0001d7880, 0x37b, 0x37c}, {0xc0002a3000, 0xf1, ...}, ...}, ...}, ...)
orderer.test.com | /orderer/common/server/main.go:828 +0x5f1
orderer.test.com | github.com/hyperledger/fabric/orderer/common/server.Main()
orderer.test.com | /orderer/common/server/main.go:245 +0x15c5
orderer.test.com | main.main()
orderer.test.com | /cmd/orderer/main.go:15 +0x17
orderer.test.com exited with code 2


Tried the steps below to solve the issue, but no luck:

  1. Scaled up to 3 orderers (added 2 more orderers with v2.4.2).
  2. Added the new orderers to the channel consenters; everything is working fine.
  3. Removed and re-added the existing orderer (deleted the existing volume).
  4. The orderer metrics show the details below, which as per my understanding are correct.

  consensus_etcdraft_active_nodes{channel="testchannel2"} 3
  consensus_etcdraft_active_nodes{channel="testchannel"} 3

  consensus_etcdraft_cluster_size{channel="testchannel2"} 3
  consensus_etcdraft_cluster_size{channel="testchannel"} 3

  consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
  consensus_etcdraft_committed_block_number{channel="testchannel"} 157

  5. Tried to upgrade one orderer from 2.4.2 to 2.5.0. It started without any error, but its metrics now show different values than they did with 2.4.2. The channel with the big transactions is the one producing this difference:
    a. consensus_etcdraft_active_nodes is 0.
    b. consensus_etcdraft_committed_block_number is different.

  consensus_etcdraft_active_nodes{channel="testchannel2"} 3
  consensus_etcdraft_active_nodes{channel="testchannel"} 0

  consensus_etcdraft_cluster_size{channel="testchannel2"} 3
  consensus_etcdraft_cluster_size{channel="testchannel"} 3

  consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
  consensus_etcdraft_committed_block_number{channel="testchannel"} 150

The other orderer nodes show the details below:

consensus_etcdraft_active_nodes{channel="testchannel2"} 3
consensus_etcdraft_active_nodes{channel="testchannel"} 2

consensus_etcdraft_cluster_size{channel="testchannel2"} 3
consensus_etcdraft_cluster_size{channel="testchannel"} 3

consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
consensus_etcdraft_committed_block_number{channel="testchannel"} 157
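
For reference, numbers like the above come from the orderer's operations service, which serves Prometheus metrics at /metrics when the Prometheus metrics provider is enabled. A minimal Go sketch for pulling just the etcdraft lines; the host and port are example values, not the actual deployment's.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Fetch the orderer's Prometheus metrics and print only the etcdraft consensus ones.
// Replace the address with whatever ORDERER_OPERATIONS_LISTENADDRESS points to.
func main() {
	resp, err := http.Get("http://orderer.test.com:8443/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "consensus_etcdraft_") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}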

@denyeart
Contributor

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

Based on the git blame when I looked at the code, I think it's related to some update, but I'm not sure about the version.

But what are you trying to say? Do you want to put in a release note message or something?

@yacovm I'm just trying to assess for now... is this a legitimate new problem that can impact production environments? If so, I'm assuming we should investigate further and either fix it, put a mitigation in place, or mention it in the release notes and docs.

@yacovm
Contributor

yacovm commented Jul 19, 2023

Maybe we can ask the etcd community to turn this constant into a variable and add a method that changes it?

@sarojsh1 can you produce a sample file system that reproduces this error, so we can test the above hypothesis?

@denyeart
Contributor

denyeart commented Jul 26, 2023

@yacovm @tock-ibm

A similar problem was reported at etcd-io/etcd#14025.

etcd-io/etcd#14114 was used to fix it in etcd raft v3.5.5 (Fabric is still on v3.5.1).

Here is the commit with the fix:
etcd-io/etcd@621cd7b

So moving up to the latest etcd v3.5.9 may help to resolve it. Although it looks like there may still be an upper limit due to the write-ahead log file size, which appears to be 64 MB.

// The length of current WAL entry must be less than the remaining file size.

https://pkg.go.dev/go.etcd.io/etcd/server/v3/wal#pkg-overview

https://github.com/etcd-io/etcd/blob/release-3.5/server/wal/wal.go#L55

Am I reading it correctly that we may still have an issue with very large transactions? WDYT?
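
To illustrate the idea under discussion, here is a rough sketch of the post-fix behaviour (an illustration of the concept, not etcd's actual code): after the change, a record is bounded by the space remaining in the current segment file rather than by a fixed 10 MiB cap, so with the default 64 MB segment size the practical ceiling for a single entry becomes the remaining segment space.

package walsketch

import "errors"

// SegmentSizeBytes is etcd's default WAL segment (file) size, 64 MB.
const SegmentSizeBytes int64 = 64 * 1000 * 1000

// errBadRecordSize is an illustrative error name, not etcd's.
var errBadRecordSize = errors.New("wal: record length larger than remaining file size")

// recordFits sketches the relaxed check: a record whose declared length does not
// fit in the bytes remaining in the current segment file is treated as corruption,
// but anything that does fit (up to nearly 64 MB) can be replayed.
func recordFits(recLen, offset, fileSize int64) error {
	if recLen > fileSize-offset {
		return errBadRecordSize
	}
	return nil
}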

@yacovm
Contributor

yacovm commented Jul 27, 2023

So moving up to the latest etcd v3.5.9 may help to resolve it. Although it looks like there may still be an upper limit due to the write-ahead log file size, which appears to be 64 MB.

Possible. I think we should consider restricting Raft transactions that are too big.

Well, they clearly say:

// The length of current WAL entry must be less than the remaining file size.

So a WAL entry that is not corrupted is always less than the remaining file size, by definition, no? Since the remaining file size contains the "header" of the entry and the data itself.

But I would still like someone to write a test that reproduces this problem on our current latest code base, so we can be sure the problem is only reproduced at startup, and then see whether the upgrade to the latest Raft version fixes it or not.

@semil do you want to take a look at this?

@semil
Contributor

semil commented Jul 31, 2023

Please look at #4339 for the bug reproduction.
#4340 includes a shorter test and the upgrade of etcd raft from 3.5.1 to 3.5.9; it solved the issue as expected.
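
For anyone who wants the gist without reading the PRs, a standalone sketch of this kind of reproduction against the etcd wal package could look like the following (this is not the actual test added in #4339/#4340). With etcd v3.5.1 the ReadAll fails with "wal: max entry size limit exceeded"; with v3.5.9 it is expected to succeed.

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"go.etcd.io/etcd/raft/v3/raftpb"
	"go.etcd.io/etcd/server/v3/wal"
	"go.etcd.io/etcd/server/v3/wal/walpb"
	"go.uber.org/zap"
)

func main() {
	lg := zap.NewExample()

	tmp, err := os.MkdirTemp("", "walrepro")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)
	walDir := filepath.Join(tmp, "wal")

	// Write a single raft entry carrying a ~30 MB payload, comparable to the
	// largest transaction reported above, then close the WAL.
	w, err := wal.Create(lg, walDir, nil)
	if err != nil {
		panic(err)
	}
	big := make([]byte, 30*1024*1024)
	ents := []raftpb.Entry{{Term: 1, Index: 1, Data: big}}
	if err := w.Save(raftpb.HardState{Term: 1, Commit: 1}, ents); err != nil {
		panic(err)
	}
	w.Close()

	// Re-open and replay the WAL the way an orderer does on restart.
	w2, err := wal.Open(lg, walDir, walpb.Snapshot{})
	if err != nil {
		panic(err)
	}
	defer w2.Close()
	_, _, _, err = w2.ReadAll()
	fmt.Println("ReadAll error:", err)
}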

@yacovm
Contributor

yacovm commented Jul 31, 2023

Thanks @semil !

@denyeart
Contributor

denyeart commented Aug 1, 2023

Thanks @semil, this fix will be included in Fabric v2.5.4.

@denyeart denyeart closed this as completed Aug 1, 2023
@sarojsh1
Author

Thank you all! :)
I tried Fabric v2.5.4 and it worked well! The error no longer occurs.
