create: better handling "file changed" warnings, mitigation by retries and/or reflinks #6622

Closed
fff7d1bc opened this issue Apr 18, 2022 · 12 comments
@fff7d1bc

fff7d1bc commented Apr 18, 2022

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

System information. For client/server mode post info for both machines.

Gentoo, ~amd64.

Your borg version (borg -V).

borg 1.2.0

Operating system (distribution) and version.

Hardware / network configuration, and filesystems used.

How much data is handled by borg?

~20 GB

Full borg commandline that led to the problem (leave away excludes and passwords)

borg create --compression zstd,12 -v --progress --stats --one-file-system --exclude-if-present .borg-exclude-dir /mnt/dropzone/backups/borgbackup/home::{hostname}home_piotr{now:%Y-%m-%d_%H:%M:%S} /home/piotr

Describe the problem you're observing.

For quite a while I have been facing issues with 'borg create' exiting with a non-zero exit code when it fails to back up some .sqlite or .sqlite-wal files in ~/.mozilla/firefox because they change while being read. Currently the create command has no retry feature, which means I need to run another full-tree backup, repeatedly, until it finally manages to back up those files.

A quick solution would be to add a feature to the 'create' command that retries up to N times, with a backoff period of X between retries, when a file cannot be backed up because it changed. For example, '--file-changed-retry 5,1' would retry up to 5 times with a 1 s sleep in between.
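A minimal Python sketch of what such a per-file retry could look like. This is not borg's code: `read_stable` and its parameters are hypothetical names, and the sketch assumes that "file changed" can be detected by comparing `fstat` results taken before and after the read, which can miss changes that leave inode, size and mtime untouched.

```python
import os
import time


def _signature(st):
    # Fields assumed to change when the file content changes.
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def read_stable(path, retries=5, delay=1.0):
    """Read `path`, retrying while it changes during the read.

    Returns (data, stable). `stable` is False if every attempt,
    the first read plus `retries` retries, saw the file change.
    """
    data = b""
    for attempt in range(retries + 1):
        with open(path, "rb") as f:
            before = _signature(os.fstat(f.fileno()))
            data = f.read()
            after = _signature(os.fstat(f.fileno()))
        if before == after:
            return data, True
        if attempt < retries:
            time.sleep(delay)  # backoff before the next attempt
    return data, False
```

With `retries=5, delay=1.0` this corresponds to the proposed '5,1' setting. A file under constant write load can still exhaust all attempts, so the caller has to handle `stable == False`.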

Another idea is to use reflinks.

Reflinks are becoming more and more popular; the latest versions of coreutils and Midnight Commander actually use them by default. In short, reflinks are file-level zero-copy snapshots. Currently XFS, BTRFS and ZFS support them.

It would be a great feature if borg, in case of 'file changed during read', tried to read from a reflink copy instead for the duration of the backup and then just closed the FD.

Years ago, Linux gained the O_TMPFILE flag for open(2), which gives you an unnamed file in a given directory (one that you can later linkat() once you have filled it with data). Borg could copy_file_range(2) the source into that file descriptor and then read the fd as the backup source. Since kernel 4.5, copy_file_range() actually uses a reflink if possible, which means borg could back up a file that changes during the backup, because the fd that borg holds is a zero-copy clone.

It would be equal to doing something like this manually (pseudocode):

if ! borg create ... /path/to/file.sqlite; then
    cp --reflink=always /path/to/file.sqlite /path/to/file.sqlite.snapshot
    borg create ... /path/to/file.sqlite.snapshot
    rm -f /path/to/file.sqlite.snapshot
fi
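The O_TMPFILE plus copy_file_range() idea described above could be sketched in Python roughly as follows. This is an illustration only, not borg code: `snapshot_fd` and the helper names are hypothetical, and the fallbacks to a regular temporary file and a plain read/write copy are my assumptions for platforms or filesystems where O_TMPFILE or `os.copy_file_range` (Linux, Python 3.8+) is unavailable.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB buffer for the plain-copy fallback


def _open_unnamed(directory):
    """Open an unnamed read/write file in `directory`."""
    try:
        return os.open(directory, os.O_TMPFILE | os.O_RDWR, 0o600)
    except (AttributeError, OSError):
        # No O_TMPFILE here: fall back to an already-unlinked temp file.
        with tempfile.TemporaryFile(dir=directory) as tmp:
            return os.dup(tmp.fileno())


def _plain_copy(src, dst):
    """Ordinary read/write copy, restarting from offset 0."""
    os.lseek(src, 0, os.SEEK_SET)
    os.lseek(dst, 0, os.SEEK_SET)
    os.ftruncate(dst, 0)
    while True:
        chunk = os.read(src, CHUNK)
        if not chunk:
            break
        view = memoryview(chunk)
        while view:
            view = view[os.write(dst, view):]


def snapshot_fd(path):
    """Return an fd positioned at 0 that holds a private copy of `path`."""
    src = os.open(path, os.O_RDONLY)
    dst = None
    try:
        dst = _open_unnamed(os.path.dirname(os.path.abspath(path)))
        remaining = os.fstat(src).st_size
        try:
            while remaining > 0:
                # The kernel may satisfy this with a reflink if the fs can.
                copied = os.copy_file_range(src, dst, remaining)
                if copied == 0:
                    break
                remaining -= copied
        except (AttributeError, OSError):
            _plain_copy(src, dst)
        os.lseek(dst, 0, os.SEEK_SET)
        return dst
    except BaseException:
        if dst is not None:
            os.close(dst)
        raise
    finally:
        os.close(src)
```

The caller reads from the returned fd and closes it, at which point the unnamed file disappears. On filesystems without reflink support the copy is a normal data copy, so it costs time and space and the source can still change while it is being copied.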

I use reflinks to make crash-consistent backups of my virtual machines, which are hundreds of gigabytes in size, by reflinking the images, backing up the reflinked copies and then removing them. Creating a reflink is instant, and it does not matter how long it takes to read the reflinked file, since it will never change; afterwards it is purged. It would be perfect to have reflink support in borg, but I understand it would require tier 3 hackery with CPython and would still be limited to supported file systems (xfs, btrfs, zfs), only on Linux, and only if the borg process has write access to the directory that contains the file being backed up (open with O_TMPFILE requires it).

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Half of my backups fail.

Include any warning/errors/backtraces from the system logs

@ThomasWaldmann
Member

reflink: Guess we could only use that as an optional feature and fall back to normal open if it is not present.

But I am not really sure whether it could be implemented like that / how much effort that would be. We use OS-level FDs, Python file objects and for some stuff also "by filename" access to files. If somebody wants to try, just make a PR.

@fff7d1bc
Author

Would you be more keen to accept a PR with retry on file changed? If so I could look into adding it.

@ThomasWaldmann
Member

Well, all methods have their pros and cons.

Doing nothing is the easiest, people wanting to have stable files can still use fs or block level snapshotting (IF they have it).

reflink has the platform and fs support issue, people could use it if we support that and IF they have it.

Retrying might work, but we can't be sure (if there are many ongoing writes, it can well be that all the retries fail).

If a PR would introduce a rather simple change that works for most cases, I guess it could be acceptable.

@ThomasWaldmann
Member

BTW, we must not create multiple borg metadata stream items for the same fs item when "retrying".
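One way to respect this constraint, sketched with hypothetical names (`backup_file`, `read_attempt`, and the item dict layout are illustrations, not borg's internal API): each attempt produces a fresh chunk list, failed attempts are discarded, and exactly one item is appended after the retry loop ends.

```python
def backup_file(path, read_attempt, items, retries=3):
    """Append exactly one item for `path` to `items`, however many
    read attempts were needed.

    `read_attempt(path)` is a hypothetical callable returning
    (chunks, changed) for one complete pass over the file.
    Returns True if a stable read was obtained.
    """
    chunks, changed = [], True
    for _ in range(retries + 1):
        # Each attempt starts over; earlier chunk lists are dropped.
        chunks, changed = read_attempt(path)
        if not changed:
            break
    # Emit the item once, after the loop, never inside it.
    items.append({
        "path": path,
        "chunks": chunks,
        "changed_while_reading": changed,
    })
    return not changed
```

If all attempts fail, the last attempt is still recorded, flagged as changed, so the caller can raise a warning for it.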

@ThomasWaldmann
Member

Well, the world is not black and white. A lot of software only differentiates between all-good and error.

borg has all-good (0), definitely an error (2) and warning (1). The latter means that you have to look at the logs.

it might be something harmless or something in a file you do not really care about, but it might also be your important sqlite db which has changed while we backed it up.
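For scripts that wrap borg, the three-way distinction could be handled along these lines. `classify_borg_rc` is a hypothetical helper; treating every return code other than 0 and 1 as an error (including a process killed by a signal) is my assumption, based only on the codes named in this thread.

```python
def classify_borg_rc(rc):
    """Map a borg return code to 'ok', 'warning' or 'error'."""
    if rc == 0:
        return "ok"
    if rc == 1:
        # Warning: the archive was written, but the logs need a look,
        # for example for "file changed while we backed it up".
        return "warning"
    # 2, or anything else such as termination by a signal.
    return "error"
```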

@mikabytes

mikabytes commented Jun 15, 2022

I'd just like to pitch in on this feature request. I have some ~200TiB of data that is backed up every 5 minutes. 2 out of 3 attempts result in RC 1. I want to be sure that I got the whole file, so I set it up to retry the borg create command on any warning. Every backup takes roughly 1 minute, meaning I'm spending 3 out of 5 minutes doing backups.

No biggie. But it is some extra load on the system. Having a retry-if-changed on the file level would be great to solve this.
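The whole-command retry workaround described here might look like the following sketch. `run_until_clean` and its parameters are hypothetical, and `run` is injectable only so the loop can be exercised without borg installed.

```python
import subprocess
import time


def run_until_clean(cmd, max_runs=3, delay=0.0, run=subprocess.call):
    """Run `cmd` until it returns something other than 1 (warning),
    at most `max_runs` times. Returns the last return code."""
    rc = 2  # reported if max_runs is smaller than 1
    for attempt in range(max_runs):
        rc = run(cmd)
        if rc != 1:
            break  # 0 is all good, 2 is an error: stop either way
        if attempt + 1 < max_runs:
            time.sleep(delay)
    return rc
```

Each rerun rescans the full tree and creates another archive, so the archive name needs something like a `{now}` placeholder to stay unique. That repeated work is the extra load a per-file retry would avoid.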

This is CephFS. I don't think there's a "reflink" concept.

ThomasWaldmann added this to the 2.0.0b5 milestone Feb 10, 2023
ThomasWaldmann changed the title from 'create: better handling "file changed" errors, mitigation by retries and/or reflinks' to 'create: better handling "file changed" warnings, mitigation by retries and/or reflinks' Feb 11, 2023
@ThomasWaldmann
Member

ThomasWaldmann commented Feb 11, 2023

https://gitlab.com/rubdos/pyreflink separate pyreflink library, but seems not much maintained recently.

python/cpython#81338 reflink support ticket in Python's issue tracker.

os.copy_file_range (Linux only) in Python 3.8+. See comment there: python/cpython#81338 (comment)

BTW, we have this consistency/snapshot problem not only within the file contents, but also in the file metadata (like xattrs, ACLs, etc.).

borgbackup deleted a comment from fff7d1bc Feb 11, 2023
@ThomasWaldmann
Member

Guess pyreflink is a bit problematic:

  • only some fs support it
  • no support for it in CPython yet
  • not sure if pyreflink is still alive

Can we close this in favour of #7346?

ThomasWaldmann self-assigned this Feb 19, 2023
@gellnerm

Until we have something else, can we have a simple flag "--no-file-changed-warning" that exits with code 0 instead of a warning? Because now scripts like

borg create || handle_error

will not work anymore.

@ThomasWaldmann
Member

@gellnerm if you want to handle errors, you need to check for rc 2. rc 1 is warnings, rc 0 is "all ok".

@gellnerm

Thanks, now I'm using

borg create ...
test $? -eq 2 && handle_err

@ThomasWaldmann
Member

Superseded by #7346.

tux2linux added a commit to Blunix-GmbH/role-borgbackup-client that referenced this issue Feb 16, 2024
4 participants