ZOOKEEPER-2994 Tool required to recover log and snapshot entries with CRC errors #487

Closed
wants to merge 6 commits into from

Conversation

anmolnar
Contributor

@anmolnar anmolnar commented Mar 10, 2018

https://issues.apache.org/jira/browse/ZOOKEEPER-2994

In the event that a ZooKeeper transaction log becomes corrupted and fails CRC checks (preventing startup), we should have a mechanism to get the cluster running again.

Previously we achieved this by loading the broken transaction log with a modified version of ZK that had the CRC check disabled and forcing it to write new txn log files.

Experience has shown that once you end up with a corrupt txn log, there is no way to recover except by manually modifying the CRC check. That's basically why the tool is needed.
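
To make the failure mode concrete, here is a minimal Python sketch of a per-entry checksum check and of a fix that recomputes the checksum. The entry layout (8-byte checksum, 4-byte length, payload, one 0x42 end-of-record byte), the use of Adler-32, and the helper names pack_entry, check_entry and fix_entry are assumptions made for illustration; this is not the code in this PR.

```python
import struct
import zlib
from typing import Optional, Tuple

EOR = 0x42  # assumed end-of-record marker byte


def pack_entry(payload: bytes, checksum: Optional[int] = None) -> bytes:
    """Serialize one hypothetical entry: checksum, length, payload, EOR."""
    if checksum is None:
        checksum = zlib.adler32(payload)
    return struct.pack(">qi", checksum, len(payload)) + payload + bytes([EOR])


def check_entry(entry: bytes) -> Tuple[bool, bytes]:
    """Return (checksum_ok, payload) for one serialized entry."""
    stored, length = struct.unpack_from(">qi", entry, 0)
    payload = entry[12:12 + length]  # header is 8 + 4 = 12 bytes
    return stored == zlib.adler32(payload), payload


def fix_entry(entry: bytes) -> bytes:
    """Rewrite the entry with a checksum recomputed from its payload."""
    _, payload = check_entry(entry)
    return pack_entry(payload)
```

In this sketch, recomputing the checksum only makes the entry pass the check again; it does not restore payload bytes that were damaged, so the content of a "fixed" entry still needs to be reviewed.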

The tool is called TxnLogToolkit, a new console application similar to LogFormatter and SnapshotFormatter. It is intentionally kept separate to preserve backward compatibility in the existing tools.

This PR contains the txn log tool only.

You will probably also notice a refactoring that extracts the file padding logic from FileTxnLog so it can be reused in the new tool. The related code changes can be reviewed on their own in a separate commit if preferred.
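
For readers unfamiliar with file padding, the idea can be sketched as follows: when the write position gets close to the end of the preallocated file, the file is grown by another preallocation block. The function name padded_size, the 4096-byte margin, and the exact growth rule are assumptions made for illustration and are not taken from this PR.

```python
def padded_size(position: int, file_size: int, prealloc: int,
                margin: int = 4096) -> int:
    """Return the file size after padding, or file_size if none is needed.

    position  -- current write offset in the file
    file_size -- current (already padded) size of the file
    prealloc  -- size of one preallocation block; 0 disables padding
    margin    -- hypothetical threshold: pad when this close to the end
    """
    if prealloc > 0 and position + margin >= file_size:
        if position > file_size:
            # Writer is already past the end: round up from the position,
            # then align down to a multiple of the block size.
            file_size = position + prealloc
            file_size -= file_size % prealloc
        else:
            # Still inside the file: add exactly one more block.
            file_size += prealloc
    return file_size
```

Keeping this calculation in its own function, separate from the log writer, is what allows a second tool to pad its output files the same way.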

@phunt
Contributor

phunt commented Apr 13, 2018

Looks promising - doesn't seem very useful (and potentially dangerous) without docs - perhaps add a troubleshooting or recovery section here?
http://zookeeper.apache.org/doc/r3.4.11/zookeeperAdmin.html#sc_dataFileManagement
The jira was originally for 3.5+, but I think this would be great to get into 3.4+.

@nkalmar
Contributor

nkalmar commented Apr 18, 2018

Useful addition, +1

As @phunt pointed out, docs in zookeeperAdmin.xml could be updated.

@nkalmar
Contributor

nkalmar commented Apr 18, 2018

I used your updated documentation, and managed to recover a corrupted log file:

bin/zkTxnLogToolkit.sh -d ~/workspace/zookeeper/standalone/version-2/log.1
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
4/9/18 3:13:19 PM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x1 createSession 30000
4/9/18 3:15:21 PM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x2 closeSession null
4/9/18 3:17:41 PM CEST session 0x10000ebe13a0001 cxid 0x0 zxid 0x3 createSession 30000
4/9/18 3:18:13 PM CEST session 0x10000ebe13a0001 cxid 0x0 zxid 0x4 closeSession null
EOF reached after 4 txns.

Corrupted log.1 file

bin/zkTxnLogToolkit.sh -d ~/workspace/zookeeper/standalone/version-2/log.1
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
CRC ERROR - 4/10/18 5:12:11 AM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x1 createSession 30000
4/10/18 5:12:11 AM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x1 createSession 30000
4/9/18 3:15:21 PM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x2 closeSession null
CRC ERROR - 4/9/18 3:17:41 PM CEST session 0x10044aa44aaaaaa cxid 0x0 zxid 0x3 createSession 30000
4/9/18 3:17:41 PM CEST session 0x10044aa44aaaaaa cxid 0x0 zxid 0x3 createSession 30000
4/9/18 3:18:13 PM CEST session 0x10000ebe13a0001 cxid 0x0 zxid 0x4 closeSession null
EOF reached after 4 txns.

bin/zkTxnLogToolkit.sh -r ~/workspace/zookeeper/standalone/version-2/log.1
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
CRC ERROR - 4/10/18 5:12:11 AM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x1 createSession 30000
Would you like to fix it (Yes/No/Abort) ? Y
EOF reached after 4 txns.
Recovery file /Users/.../zookeeper/standalone/version-2/log.1.fixed has been written with 1 fixed CRC error(s)

bin/zkTxnLogToolkit.sh -d ~/workspace/zookeeper/standalone/version-2/log.1.fixed
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
4/9/18 3:13:19 PM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x1 createSession 30000
4/9/18 3:15:21 PM CEST session 0x10000ebe13a0000 cxid 0x0 zxid 0x2 closeSession null
4/9/18 3:17:41 PM CEST session 0x10000ebe13a0001 cxid 0x0 zxid 0x3 createSession 30000
4/9/18 3:18:13 PM CEST session 0x10000ebe13a0001 cxid 0x0 zxid 0x4 closeSession null
EOF reached after 4 txns.

LGTM!
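
The dump output shown above can also be checked mechanically. The following Python sketch scans such output for "CRC ERROR" lines and the final transaction count. It relies only on the line shapes visible in this comment; the function name summarize_dump is hypothetical, and the output format of other tool versions may differ.

```python
import re

CRC_PREFIX = "CRC ERROR - "
ZXID_RE = re.compile(r"\bzxid (0x[0-9a-fA-F]+)")
EOF_RE = re.compile(r"^EOF reached after (\d+) txns\.")


def summarize_dump(text):
    """Return (zxids of entries flagged with a CRC error, txn count or None)."""
    bad_zxids = []
    txn_count = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(CRC_PREFIX):
            match = ZXID_RE.search(line)
            bad_zxids.append(match.group(1) if match else None)
        else:
            match = EOF_RE.match(line)
            if match:
                txn_count = int(match.group(1))
    return bad_zxids, txn_count
```

An empty list of flagged zxids for the .fixed file, together with an unchanged transaction count, is a quick sanity check before putting the recovered file back in place.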

@anmolnar
Contributor Author

@phunt I added documentation to ZookeeperAdmin docs.
@nkalmar thanks!

Contributor

@nkalmar nkalmar left a comment

LGTM, tested, works fine!

@asfgit asfgit closed this in 154f9c5 Apr 23, 2018
asfgit pushed a commit that referenced this pull request Apr 23, 2018
…h CRC errors

Author: Andor Molnar <[email protected]>

Reviewers: [email protected]

Closes #487 from anmolnar/ZOOKEEPER-2994 and squashes the following commits:

221760c [Andor Molnar] ZOOKEEPER-2994. Added documentation and startup scripts
a69d729 [Andor Molnar] ZOOKEEPER-2994. Fix findbugs warning
0b95efe [Andor Molnar] ZOOKEEPER-2994. Fix for unit test
15fa45c [Andor Molnar] ZOOKEEPER-2994. Added padding, tool renamed to TxnLogToolkit, interactive mode, etc.
6a1ad0e [Andor Molnar] ZOOKEEPER-2994. Refactor FileTxnLog's padding logic to separate class for reusability
0d089cc [Andor Molnar] ZOOKEEPER-2994. Added new tool TxnLogTool for txn log file recovery

Change-Id: I7560362633a7bc919ae6d3ca7e3588e196a1919c
(cherry picked from commit 154f9c5)
Signed-off-by: Patrick Hunt <[email protected]>
@phunt
Contributor

phunt commented Apr 23, 2018

+1 Thanks @anmolnar this looks good. Please consider backporting to 3.4 (separate jira).

Also, in the future please don't include any changed files from the top-level docs directory (HTML/PDF files), as these are regenerated during commit.

nkalmar pushed a commit to nkalmar/zookeeper that referenced this pull request Apr 24, 2018
…h CRC errors

@anmolnar anmolnar deleted the ZOOKEEPER-2994 branch April 24, 2018 14:14
anmolnar added a commit to anmolnar/zookeeper that referenced this pull request Apr 24, 2018
…h CRC errors

anmolnar added a commit to anmolnar/zookeeper that referenced this pull request Apr 24, 2018
…h CRC errors

asfgit pushed a commit that referenced this pull request Apr 25, 2018
…h CRC errors (3.4)

This is the 3.4 version of #487
phunt I've just realized that the patch must introduce a new dependency: commons-cli.
Not sure if you're willing to merge it in this case.

Author: Andor Molnar <[email protected]>

Reviewers: [email protected]

Closes #508 from anmolnar/ZOOKEEPER-2994_34 and squashes the following commits:

357ab2b [Andor Molnar] ZOOKEEPER-2994. Removed dependency of commons.cli. Use custom impl instead.
3bc2e5f [Andor Molnar] ZOOKEEPER-2994: Tool required to recover log and snapshot entries with CRC errors

Change-Id: I7def29dc338726c3eccb0a4fd4530a1ffb0f3932
lvfangmin pushed a commit to lvfangmin/zookeeper that referenced this pull request Jun 17, 2018
…h CRC errors

RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
…h CRC errors
