Add a notice to man borg-check about defective hardware? #5753

Closed
martinvonwittich opened this issue Apr 2, 2021 · 0 comments · Fixed by #5855

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Neither, it's a suggestion to improve the documentation.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.9 (from Debian buster, borgbackup=1.1.9-2+deb10u1), both client and server

Operating system (distribution) and version.

Debian 10 (buster) (both client and server)

Hardware / network configuration, and filesystems used.

irrelevant

How much data is handled by borg?

irrelevant

Full borg command line that led to the problem (leave out excludes and passwords)

borg check --progress --verbose /path/to/repo

Describe the problem you're observing.

Yesterday I encountered my first failing borg check that was caused not by a damaged repository but by defective memory. This is actually documented in the FAQ:

Checking memory
Intermittent issues, such as borg check finding errors inconsistently between runs, are frequently caused by bad memory.
Run memtest86+ (or an equivalent memory tester) to verify that the memory subsystem is operating correctly.

But because I had several damaged repositories in the past (due to power failures) that I was always able to repair with borg check --repair, I didn't bother to look through the FAQ and went straight to borg check --repair. The timeline looked something like this:

  • A manual attempt to run the backup failed with File failed integrity check: /path/to/repo/cache/9ce198549ec83582155f288b853891e6cb1d33f41547d52748d4e2c9fb5ada1d/chunks
  • borg check found errors (Segment entry checksum mismatch), so I figured that the repository was damaged.
  • borg check --repair suddenly no longer found any errors.
  • Another borg check again found errors.
  • Another borg check --repair again didn't find any errors.
  • At this point I was very confused and wondered whether I was doing something wrong or had hit a very strange bug, because it looked like repository corruption that was visible to borg check but not to borg check --repair. I asked for help on IRC and was advised to try the most recent borg version, since the one provided by Debian is rather old.
  • I downloaded a static build of the most recent borg release (1.1.16) and ran a check with it. This check didn't find any errors either.
  • I ran borg 1.1.9 check again to see whether the old version still saw the corruption, and it again reported Segment entry checksum mismatch.

Only at this point did I compare the output of my multiple borg check runs and notice that the segment numbers were different every time:

Starting repository check
Data integrity error: Segment entry checksum mismatch [segment 2061, offset 448940545]
Starting repository check
Data integrity error: Segment entry checksum mismatch [segment 1479, offset 393404125]
Starting repository check
Data integrity error: Segment entry checksum mismatch [segment 1007, offset 37742236]
Data integrity error: Segment entry checksum mismatch [segment 2129, offset 116527082]

It quickly became clear to me that the hardware might be the actual cause, and a memtester run confirmed it:

server ~ # memtester 6G
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 6144MB (6442450944 bytes)
got  6144MB (6442450944 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : testing   4FAILURE: 0x404040404040404 != 0x404040404040400 at offset 0x4a7f0fb0.
  Checkerboard        : ok
  Bit Spread          : testing   2FAILURE: 0x00000014 != 0x00000010 at offset 0x4a7f0fb0.
  Bit Flip            : testing  17FAILURE: 0x00000004 != 0x00000000 at offset 0x4a7f0fb0.
  Walking Ones        : ok
  Walking Zeroes      : testing 125FAILURE: 0x00000004 != 0x00000000 at offset 0x4a7f0fb0.
  8-bit Writes        : ok
  16-bit Writes       : |FAILURE: 0x77cea9b8efef8c2c != 0x77cea9b8efef8c28 at offset 0x4a7f0fb0.
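
As an aside, the failure lines above can be summarized mechanically. The following is a minimal Python sketch, not part of memtester or borg: the function name `summarize_failures` and the regular expression are illustrative assumptions based only on the output quoted above. It collects the reported offsets and XORs the two mismatching values to see which bits differ.

```python
import re

# Matches the failure lines in the memtester output quoted above.
# The pattern is an assumption derived from that sample, not from memtester docs.
FAILURE = re.compile(
    r"FAILURE: (0x[0-9a-fA-F]+) != (0x[0-9a-fA-F]+) at offset (0x[0-9a-fA-F]+)"
)


def summarize_failures(output):
    """Return (set of failing offsets, set of XOR bit masks) for a memtester log."""
    offsets, masks = set(), set()
    for left, right, offset in FAILURE.findall(output):
        offsets.add(int(offset, 16))
        # XOR of the two values shows exactly which bits disagree.
        masks.add(int(left, 16) ^ int(right, 16))
    return offsets, masks


# Failure lines copied from the memtester run above.
SAMPLE = """\
  Block Sequential    : testing   4FAILURE: 0x404040404040404 != 0x404040404040400 at offset 0x4a7f0fb0.
  Bit Spread          : testing   2FAILURE: 0x00000014 != 0x00000010 at offset 0x4a7f0fb0.
  Bit Flip            : testing  17FAILURE: 0x00000004 != 0x00000000 at offset 0x4a7f0fb0.
  Walking Zeroes      : testing 125FAILURE: 0x00000004 != 0x00000000 at offset 0x4a7f0fb0.
  16-bit Writes       : |FAILURE: 0x77cea9b8efef8c2c != 0x77cea9b8efef8c28 at offset 0x4a7f0fb0.
"""

offsets, masks = summarize_failures(SAMPLE)
print([hex(o) for o in sorted(offsets)])  # ['0x4a7f0fb0']
print([hex(m) for m in sorted(masks)])    # ['0x4']
```

For this run, every failure is at the same offset and differs in the same single bit, which would be consistent with one faulty memory location rather than anything borg-related.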

I think mentioning this potential problem in man borg-check would be worthwhile, so that people who use borg check are more likely to be aware of it, for example with the following passage:

Note that borg check can report spurious errors when running on defective hardware. If you're seeing errors during borg check but not in a subsequent borg check --repair, run multiple checks and compare the defective segment IDs. If the defective segment IDs vary between checks, check your hardware e.g. with memtest86+ or memtester.
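
The comparison suggested in that passage could be scripted roughly as follows. This is a minimal Python sketch, assuming the output of each borg check run has been captured as text; the helper names `failing_entries` and `looks_intermittent` are hypothetical, and the regular expression is based only on the borg 1.1.9 messages quoted in this issue.

```python
import re
from typing import List, Set, Tuple

# Pattern taken from the borg 1.1.9 messages quoted in this issue (an assumption
# about the message format; other borg versions may word it differently).
MISMATCH = re.compile(
    r"Segment entry checksum mismatch \[segment (\d+), offset (\d+)\]"
)


def failing_entries(log_text: str) -> Set[Tuple[int, int]]:
    """Extract (segment, offset) pairs reported as mismatching in one run."""
    return {(int(seg), int(off)) for seg, off in MISMATCH.findall(log_text)}


def looks_intermittent(runs: List[str]) -> bool:
    """True if the runs do not all report the same set of failing entries."""
    distinct = {frozenset(failing_entries(run)) for run in runs}
    return len(distinct) > 1


# The three runs quoted earlier in this issue.
RUNS = [
    "Starting repository check\n"
    "Data integrity error: Segment entry checksum mismatch [segment 2061, offset 448940545]\n",
    "Starting repository check\n"
    "Data integrity error: Segment entry checksum mismatch [segment 1479, offset 393404125]\n",
    "Starting repository check\n"
    "Data integrity error: Segment entry checksum mismatch [segment 1007, offset 37742236]\n"
    "Data integrity error: Segment entry checksum mismatch [segment 2129, offset 116527082]\n",
]

print(looks_intermittent(RUNS))  # True
```

A repository that is really damaged would be expected to fail at the same segments on every run, so identical sets across runs point at the data, while varying sets point at the hardware.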

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

irrelevant

Include any warning/errors/backtraces from the system logs

irrelevant

ThomasWaldmann added a commit that referenced this issue Jun 18, 2021
add notice about defective hardware to check documentation (#5753)
ThomasWaldmann added a commit that referenced this issue Jun 19, 2021
add notice about defective hardware to check documentation (#5753)
@ghost ghost mentioned this issue Aug 26, 2021