
Error in upgrading orderer from 2.4.2 to 2.5.0 (wal: max entry size limit exceeded) #4290

Closed
sarojsh1 opened this issue Jun 20, 2023 · 17 comments

@sarojsh1

sarojsh1 commented Jun 20, 2023

Description

When upgrading the orderer from 2.4.2 to 2.5.0, it gives the error below. Starting the orderer container with image tag 2.4.2 works fine, but not with 2.5.0.

[orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded

Steps to reproduce

Upgrade Hyperledger Fabric components from 2.4.2 to 2.5.0. The setup has 1 orderer and 2 peers. Binaries and other config files were updated as per the 2.5.0 release.
Followed the steps below to reproduce:

  1. Backed up the orderer container volume.
  2. Stopped the orderer container.
  3. Removed the orderer container.
  4. Updated the orderer image tag from 2.4.2 to 2.5.0.
  5. Tried to start a new orderer container with the new image (2.5.0).
  6. The orderer did not start and threw the error below.

[orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded

@sarojsh1 sarojsh1 added the bug label Jun 20, 2023
@yacovm
Contributor

yacovm commented Jun 20, 2023

Do you have transactions that are extremely big?

@sarojsh1
Author

Do you have transactions that are extremely big?

Yes, I have a few transactions with sizes like 31180 KB, 9404 KB, 3453 KB, etc.

@sarojsh1
Author

Do you have transactions that are extremely big?

Yes, I have a few transactions with sizes like 31180 KB, 9404 KB, 3453 KB, etc.

Any suggestions or a solution to fix this? Are these transactions a showstopper for upgrading from 2.4.2 to 2.5.0?

@yacovm
Contributor

yacovm commented Jun 26, 2023

You only have 1 orderer, so this is not a production environment, I hope?

@sarojsh1
Author

You only have 1 orderer, so this is not a production environment, I hope?

Yes, this is not a production environment. But I have the concerns below and would appreciate any suggestions or help with them.

  1. Is a big transaction size an issue?
  2. If the same situation comes up in a production environment, how do we fix it (I understand that production will not have only 1 orderer, but still assuming the use case)?
  3. In a multi-orderer environment, is this not an issue?
  4. Is upgrading and syncing an orderer from the other orderers in the channel (not using the existing files) different from upgrading and starting the orderer from the existing files?

@yacovm
Contributor

yacovm commented Jun 27, 2023

You only have 1 orderer, so this is not a production environment, I hope?

Yes, this is not a production environment. But I have the concerns below and would appreciate any suggestions or help with them.

1. Is a big transaction size an issue?

2. If the same situation comes up in a production environment, how do we fix it (I understand that production will not have only 1 orderer, but still assuming the use case)?

3. In a multi-orderer environment, is this not an issue?

4. Is upgrading and syncing an orderer from the other orderers in the channel (not using the existing files) different from upgrading and starting the orderer from the existing files?

  1. As you can see from the code of etcd, it is (a sketch of the relevant check follows below).
  2. Probably by removing the node from the channel and re-adding it.
  3. It is an issue if you need to restart from a WAL that contains such a record. The WAL is periodically pruned, so it really depends on luck, which is not a good thing.
  4. I am not sure you can just bring a WAL from another orderer.
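
For context on point 1, the cap lives in etcd's WAL decoder. Below is a minimal sketch of the guard as I recall it from the etcd v3.5.x server/wal code (paraphrased from memory, not a verbatim excerpt; the exact constant name and the 10 MiB value may differ slightly between releases).

package walsketch

import "errors"

// Paraphrased from etcd v3.5.x server/wal/decoder.go (from memory, not verbatim):
// while replaying the log, any single record larger than roughly 10 MiB is rejected.
const maxWALEntrySizeLimit = int64(10 * 1024 * 1024)

// This error string is the one surfacing in the orderer log above.
var ErrMaxWALEntrySizeLimitExceeded = errors.New("wal: max entry size limit exceeded")

// checkRecordSize mirrors the guard applied to every record read back from the WAL.
// A raft entry carrying a ~30 MB Fabric transaction trips it on restart.
func checkRecordSize(recBytes int64) error {
	if recBytes > maxWALEntrySizeLimit {
		return ErrMaxWALEntrySizeLimitExceeded
	}
	return nil
}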

@denyeart
Contributor

denyeart commented Jul 8, 2023

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

@yacovm
Contributor

yacovm commented Jul 10, 2023

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

Based on the git blame when I looked at the code, I think it's related to some update, but I'm not sure about the version.

But what are you trying to say? Do you want to put in a release note message or something?

@sarojsh1
Author

sarojsh1 commented Jul 12, 2023

Adding more details about the error when trying to upgrade with a single orderer:

orderer.test.com | 2023-07-12 10:38:49.855 UTC 04b9 DEBU [blkstorage] nextBlockBytesAndPlacementInfo -> blockbytes [5858] read from file [1]
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04ba DEBU [orderer.commmon.multichannel] newBlockWriter -> [channel: testchannel] Creating block writer for tip of chain (blockNumber=149, lastConfigBlockNum=2, lastConfigSeq=2)
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04bb INFO [orderer.consensus.etcdraft] HandleChain -> EvictionSuspicion not set, defaulting to 10m0s
orderer.test.com | 2023-07-12 10:38:49.855 UTC 04bc INFO [orderer.consensus.etcdraft] HandleChain -> Without system channel: after eviction Registrar.SwitchToFollower will be called
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04bd DEBU [orderer.consensus.etcdraft] CreateStorage -> Loaded snapshot at Term 2 and Index 44, Nodes: [1] channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04be INFO [orderer.consensus.etcdraft] createOrReadWAL -> Found WAL data at path '/var/hyperledger/production/orderer/etcdraft/wal/testchannel', replaying it channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:49.973 UTC 04bf DEBU [orderer.consensus.etcdraft] createOrReadWAL -> Loading WAL at Term 2 and Index 44 channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:50.015 UTC 04c0 WARN [orderer.consensus.etcdraft] createOrReadWAL -> Failed to read WAL: wal: max entry size limit exceeded channel=testchannel node=1
orderer.test.com | 2023-07-12 10:38:50.034 UTC 04c1 PANI [orderer.commmon.multichannel] initAppChannels -> Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded
orderer.test.com | panic: Failed to create chain support for channel 'testchannel', error: error creating consenter for channel: testchannel: failed to restore persisted raft data: failed to create or read WAL: failed to read WAL and cannot repair: wal: max entry size limit exceeded
orderer.test.com |
orderer.test.com | goroutine 1 [running]:
orderer.test.com | go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0008c6000, {0x0, 0x0, 0x0})
orderer.test.com | /vendor/go.uber.org/zap/zapcore/entry.go:232 +0x44c
orderer.test.com | go.uber.org/zap.(*SugaredLogger).log(0xc000123cd8, 0x4, {0x10d10d9?, 0x1222e00?}, {0xc0007184e8?, 0x20?, 0x195a760?}, {0x0, 0x0, 0x0})
orderer.test.com | /vendor/go.uber.org/zap/sugar.go:227 +0xee
orderer.test.com | go.uber.org/zap.(*SugaredLogger).Panicf(...)
orderer.test.com | /vendor/go.uber.org/zap/sugar.go:159
orderer.test.com | github.com/hyperledger/fabric/common/flogging.(*FabricLogger).Panicf(...)
orderer.test.com | /common/flogging/zap.go:74
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).initAppChannels(0xc000205180, {0xc000385520, 0x2, 0x10?}, 0x10?)
orderer.test.com | /orderer/common/multichannel/registrar.go:345 +0xac6
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).init(0xc000205180, 0x44d931?)
orderer.test.com | /orderer/common/multichannel/registrar.go:168 +0xa5
orderer.test.com | github.com/hyperledger/fabric/orderer/common/multichannel.(*Registrar).Initialize(0xc000205180, 0xc00038c420?)
orderer.test.com | /orderer/common/multichannel/registrar.go:144 +0x33
orderer.test.com | github.com/hyperledger/fabric/orderer/common/server.initializeMultichannelRegistrar(_, _, _, {0x0, {0x0, {0xc0001d7880, 0x37b, 0x37c}, {0xc0002a3000, 0xf1, ...}, ...}, ...}, ...)
orderer.test.com | /orderer/common/server/main.go:828 +0x5f1
orderer.test.com | github.com/hyperledger/fabric/orderer/common/server.Main()
orderer.test.com | /orderer/common/server/main.go:245 +0x15c5
orderer.test.com | main.main()
orderer.test.com | /cmd/orderer/main.go:15 +0x17
orderer.test.com exited with code 2


Tried the steps below to solve the issue, but no luck:

  1. Scaled up to 3 orderers (added 2 more orderers with v2.4.2).
  2. Added the new orderers to the channel consenters; everything is working fine.
  3. Removed and re-added the existing orderer (deleted the existing volume).
  4. The orderer metrics show the details below, which as per my understanding are correct.

  consensus_etcdraft_active_nodes{channel="testchannel2"} 3
  consensus_etcdraft_active_nodes{channel="testchannel"} 3

  consensus_etcdraft_cluster_size{channel="testchannel2"} 3
  consensus_etcdraft_cluster_size{channel="testchannel"} 3

  consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
  consensus_etcdraft_committed_block_number{channel="testchannel"} 157

  5. Tried to upgrade one orderer from 2.4.2 to 2.5.0. It started without any error, but its metrics now show different values than they did with 2.4.2. The channel with the big transactions is the one producing this difference:
    a. consensus_etcdraft_active_nodes is 0.
    b. consensus_etcdraft_committed_block_number is different.

  consensus_etcdraft_active_nodes{channel="testchannel2"} 3
  consensus_etcdraft_active_nodes{channel="testchannel"} 0

  consensus_etcdraft_cluster_size{channel="testchannel2"} 3
  consensus_etcdraft_cluster_size{channel="testchannel"} 3

  consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
  consensus_etcdraft_committed_block_number{channel="testchannel"} 150

The other orderer nodes show the details below:

consensus_etcdraft_active_nodes{channel="testchannel2"} 3
consensus_etcdraft_active_nodes{channel="testchannel"} 2

consensus_etcdraft_cluster_size{channel="testchannel2"} 3
consensus_etcdraft_cluster_size{channel="testchannel"} 3

consensus_etcdraft_committed_block_number{channel="testchannel2"} 49
consensus_etcdraft_committed_block_number{channel="testchannel"} 157
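
For reference, numbers like the above come from the orderer's operations service, which serves Prometheus metrics at /metrics when the Prometheus metrics provider is enabled. A minimal Go sketch for pulling just the etcdraft lines; the host and port are example values, not the actual deployment's.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Fetch the orderer's Prometheus metrics and print only the etcdraft consensus ones.
// Replace the address with whatever ORDERER_OPERATIONS_LISTENADDRESS points to.
func main() {
	resp, err := http.Get("http://orderer.test.com:8443/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "consensus_etcdraft_") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}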

@denyeart
Contributor

@yacovm @tock-ibm Do you think this is a new issue in v2.5, perhaps due to the etcd update?

Based on the git blame when I looked at the code, I think it's related to some update, but I'm not sure about the version.

But what are you trying to say? Do you want to put in a release note message or something?

@yacovm I'm just trying to assess for now... is this a legitimate new problem that can impact production environments? If so, I'm assuming we should investigate further and either fix it, put a mitigation in place, or mention it in the release notes and docs.

@yacovm
Contributor

yacovm commented Jul 19, 2023

Maybe we can ask the etcd community to turn this constant into a variable and add a method that changes it?

@sarojsh1 can you produce a sample file system that reproduces this error, so we can test the above hypothesis?

@denyeart
Contributor

denyeart commented Jul 26, 2023

@yacovm @tock-ibm

A similar problem was reported at etcd-io/etcd#14025.

etcd-io/etcd#14114 was used to fix it in etcd raft v3.5.5 (Fabric is still on v3.5.1).

Here is the commit with the fix:
etcd-io/etcd@621cd7b

So moving up to the latest etcd v3.5.9 may help to resolve it. Although it looks like there may still be an upper limit due to the write-ahead log file size, which appears to be 64 MB.

// The length of current WAL entry must be less than the remaining file size.

https://pkg.go.dev/go.etcd.io/etcd/server/v3/wal#pkg-overview

https://github.com/etcd-io/etcd/blob/release-3.5/server/wal/wal.go#L55

Am I reading it correctly that we may still have an issue with very large transactions? WDYT?
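
To illustrate the idea under discussion, here is a rough sketch of the post-fix behaviour (an illustration of the concept, not etcd's actual code): after the change, a record is bounded by the space remaining in the current segment file rather than by a fixed 10 MiB cap, so with the default 64 MB segment size the practical ceiling for a single entry becomes the remaining segment space.

package walsketch

import "errors"

// SegmentSizeBytes is etcd's default WAL segment (file) size, 64 MB.
const SegmentSizeBytes int64 = 64 * 1000 * 1000

// errBadRecordSize is an illustrative error name, not etcd's.
var errBadRecordSize = errors.New("wal: record length larger than remaining file size")

// recordFits sketches the relaxed check: a record whose declared length does not
// fit in the bytes remaining in the current segment file is treated as corruption,
// but anything that does fit (up to nearly 64 MB) can be replayed.
func recordFits(recLen, offset, fileSize int64) error {
	if recLen > fileSize-offset {
		return errBadRecordSize
	}
	return nil
}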

@yacovm
Contributor

yacovm commented Jul 27, 2023

So moving up to the latest etcd v3.5.9 may help to resolve it. Although it looks like there may still be an upper limit due to the write-ahead log file size, which appears to be 64 MB.

Possible. I think we should consider restricting Raft transactions that are too big.

Well, they clearly say:

// The length of current WAL entry must be less than the remaining file size.

So a WAL entry that is not corrupted is always less than the remaining file size, by definition, no? Since the remaining file size contains the "header" of the entry and the data itself.

But I would still like someone to write a test that reproduces this problem on our current latest code base, so we can be sure the problem is only reproduced at startup, and then see whether the upgrade to the latest Raft version fixes it or not.

@semil do you want to take a look at this?

@semil
Contributor

semil commented Jul 31, 2023

Please look at #4339 for the bug reproduction.
#4340 includes a shorter test and the upgrade of etcd raft from 3.5.1 to 3.5.9; it solved the issue as expected.
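
For anyone who wants the gist without reading the PRs, a standalone sketch of this kind of reproduction against the etcd wal package could look like the following (this is not the actual test added in #4339/#4340). With etcd v3.5.1 the ReadAll fails with "wal: max entry size limit exceeded"; with v3.5.9 it is expected to succeed.

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"go.etcd.io/etcd/raft/v3/raftpb"
	"go.etcd.io/etcd/server/v3/wal"
	"go.etcd.io/etcd/server/v3/wal/walpb"
	"go.uber.org/zap"
)

func main() {
	lg := zap.NewExample()

	tmp, err := os.MkdirTemp("", "walrepro")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)
	walDir := filepath.Join(tmp, "wal")

	// Write a single raft entry carrying a ~30 MB payload, comparable to the
	// largest transaction reported above, then close the WAL.
	w, err := wal.Create(lg, walDir, nil)
	if err != nil {
		panic(err)
	}
	big := make([]byte, 30*1024*1024)
	ents := []raftpb.Entry{{Term: 1, Index: 1, Data: big}}
	if err := w.Save(raftpb.HardState{Term: 1, Commit: 1}, ents); err != nil {
		panic(err)
	}
	w.Close()

	// Re-open and replay the WAL the way an orderer does on restart.
	w2, err := wal.Open(lg, walDir, walpb.Snapshot{})
	if err != nil {
		panic(err)
	}
	defer w2.Close()
	_, _, _, err = w2.ReadAll()
	fmt.Println("ReadAll error:", err)
}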

@yacovm
Contributor

yacovm commented Jul 31, 2023

Thanks @semil !

@denyeart
Contributor

denyeart commented Aug 1, 2023

Thanks @semil, this fix will be included in Fabric v2.5.4.

@denyeart denyeart closed this as completed Aug 1, 2023
@sarojsh1
Author

Thank you all! :)
I tried Fabric v2.5.4 and it worked well! The error no longer occurs.
