
Interrupting Sync with ctrl+c leads to expiring everything #169

Open
t-oster opened this issue May 6, 2021 · 13 comments

t-oster commented May 6, 2021

Hi,

I have to investigate this further but here is my log:

root@dwd2:/home/vbox# DEBUG=1 zrep -S rpool/USERDATA/vms/WS2016_UNICORN
overiding stale lock on rpool/USERDATA/vms/WS2016_UNICORN from pid 721881
DEBUG: zrep_lock_fs: set lock on rpool/USERDATA/vms/WS2016_UNICORN
sending rpool/USERDATA/vms/WS2016_UNICORN@zrep_000020 to dwd-alt:hddpool/vms/WS2016_UNICORN
^Ccannot open 'hddpool/vms/WS2016_UNICORN@zrep_000020': dataset does not exist
WARNING: setting zrep:sent failed on dwd-alt:hddpool/vms/WS2016_UNICORN@zrep_000020
Using fallback methods. You should go patch hddpool/vms/WS2016_UNICORN to have newer ZFS version
Expiring zrep snaps on rpool/USERDATA/vms/WS2016_UNICORN
DEBUG: expiring rpool/USERDATA/vms/WS2016_UNICORN@zrep_000002
DEBUG: expiring rpool/USERDATA/vms/WS2016_UNICORN@zrep_000003
DEBUG: expiring rpool/USERDATA/vms/WS2016_UNICORN@zrep_000004
DEBUG: expiring rpool/USERDATA/vms/WS2016_UNICORN@zrep_000005
DEBUG: expiring rpool/USERDATA/vms/WS2016_UNICORN@zrep_000006
...

As you can see, I hit ^C after a while (line 5 of the log), which cancels the sending process. However, zrep seems to think the send completed successfully and continues with expiring snapshots instead of aborting the whole script.
This is bad because it expired all common snapshots, and now I have to delete my backup FS and do a full sync!
If I have time, I will try to reproduce this and see where to check the exit state or similar...

@ppbrown ppbrown self-assigned this May 9, 2021
@ppbrown ppbrown added the bug label Mar 17, 2022

ppbrown commented Mar 17, 2022

Hello,
Sorry this one was left without a response for a long time.
I think I was holding off on replying to this, because you mentioned you would check back.
I marked this as a bug, because zrep should never delete ALL common snapshots. At minimum, expire should only go up to, but not including, the last successful sync.

If you are still there, could you please tell me if you did any further investigation?

@crispyduck00

Uhhh, just saw this old bug and tested it. This one is really bad: interrupting the send, e.g. by Ctrl-C, really does expire the last successfully sent snapshot as well, and marks the canceled/failed one as sent locally.

@crispyduck00

Played a while with this. It seems that for the Ctrl-C problem, `trap "exit" INT` needs to be added to the script.

But shouldn't there also be a check that the receive was successful?
E.g. by adding:

	if [[ $? -ne 0 ]] ; then
		zrep_errquit Error: zfs send/receive failed.
	fi

after all the send/recv calls?
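The two safeguards proposed above can be sketched as a minimal, self-contained shell script. This is illustrative only, not zrep's actual code; `sync_one` and its arguments are hypothetical stand-ins for the real send/receive pipeline.

```shell
#!/bin/sh
# Minimal sketch (not zrep's actual code) of the two safeguards proposed
# above: abort on Ctrl-C, and check the send/recv exit status.

# 1. Make Ctrl-C abort the whole script instead of only the child pipeline:
trap 'echo "interrupted, aborting" >&2; exit 130' INT

# 2. Check the exit status before marking anything as sent or expiring:
sync_one() {
    # "$@" stands in for the real "zfs send | ssh | zfs recv" pipeline
    "$@"
    if [ $? -ne 0 ]; then
        echo "Error: send/receive failed, not expiring" >&2
        return 1
    fi
    echo "sync ok, safe to mark sent and expire"
}
```

Only if `sync_one` returns success would the script go on to set the sent property and run expire.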

@crispyduck00

Is there anything against adding `trap "exit" INT` to exit the whole script on Ctrl-C?

And for the second part: is there already a check somewhere else that send/receive was successful, for the case where it fails for reasons other than Ctrl-C? If not, then after a failed send/receive the script will still set the sent property and call expire, which will destroy the "real" last successfully synced snapshot.
That is bad, because the hosts will then never get in sync again without sending everything over from scratch.


ppbrown commented Mar 20, 2022 via email


ppbrown commented Mar 27, 2022

This still isn't making sense to me.
I need to be shown debug output like the original message: with DEBUG=1, showing the last sequence actually sent, and then the sequence numbers expired on either side.

I don't have any output in this ticket supporting the claim,
"This is bad because it expired all common snapshots"

The output here shows that it sent sequence 000020, but only expired up to 000006.


ppbrown commented Mar 27, 2022

I don't even know whether it is mis-expiring on the local side or the remote side.

@crispyduck00

Oh, it seems I forgot to send my comment last time; here it is:

Tested it once again:

**precondition sync:**
  host1#
  tank/prodfs with tank/prodfs@zrep_000000
  host2#
  tank/prodfs with tank/prodfs@zrep_000000

**doing sync and interrupt by Ctrl-C:**
  host1# zrep -S -f tank/prodfs
  sending tank/prodfs@zrep_000001 to host2:tank/prodfs
  host1# Ctrl-C
  ^Cmbuffer: warning: error during output to <stdout>: canceled
  cannot open 'tank/prodfs@zrep_000001': dataset does not exist
  WARNING: setting zrep:sent failed on host2:tank/prodfs@zrep_000001
  Using fallback methods. You should go patch tank/prodfs to have newer ZFS version
  Expiring zrep snaps on tank/prodfs
  Also running expire on host2:tank/prodfs now...
  Expiring zrep snaps on tank/prodfs

**results in:**
  host1#
  tank/prodfs with tank/prodfs@zrep_000001
  host2#
  tank/prodfs with tank/prodfs@zrep_000000

So the last successfully sent snapshot, zrep_000000, was expired locally.

The `trap "exit" INT` idea also turns out not to be a real working solution.

@crispyduck00

I don't know why it thinks send/recv was successful. But since it thinks it was, it tries to set `sent` on the desthost snapshot; as that snapshot does not exist, it uses the fallback and afterwards sets `sent` on the local one. That is the bad part that later causes the "real" last-sent snapshot to be expired.

As a quick and dirty fix (until you come up with something better) I simply changed the fallback to:

	#Even if we are "old mode", other side may not be.
	# So try newer way first.
	zrep_ssh $desthost zfs set $senttimeprop $destfs@$snapname
	if [[ $? -ne 0 ]] ; then
	#####crispyduck: if setting senttimeprop fails, assume the snap does not exist. Quit!
		zrep_errquit Problem setting sent prop on dest host.
	##### (the original fallback below is now unreachable)
		print WARNING: setting ${ZREPTAG}:sent failed on $desthost:$destfs@$snapname
		print Using fallback methods. You should go patch $destfs to have newer ZFS version
		zrep_ssh $desthost zfs set ${ZREPTAG}:lastsent=${newsnap} $destfs
		zrep_ssh $desthost zfs set ${ZREPTAG}:lastsenttime=${timeinsec} $destfs
	fi

As my hosts all support props on snapshots, I assume that if zrep can't set the prop on the desthost snapshot, the snapshot probably does not exist, so zrep_errquit fires before `sent` can be set locally.

So far this worked in all my tests.
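If quitting outright is too aggressive, an alternative would be to probe whether the snapshot actually exists on the destination before trusting the fallback. A sketch (using zrep's names `$desthost`, `$destfs`, `$snapname`, `zrep_ssh`, and `zrep_errquit` as assumptions; this is not zrep's actual code):

```shell
# Sketch only: decide between the legacy fallback and quitting, based on
# whether the destination snapshot really exists. The zrep helpers and
# variables used here are assumptions, not verified against the script.
safe_mark_sent() {
    if zrep_ssh $desthost zfs list -t snapshot "$destfs@$snapname" >/dev/null 2>&1
    then
        # Snapshot exists; an older ZFS merely cannot set props on
        # snapshots, so the filesystem-level fallback is safe.
        zrep_ssh $desthost zfs set ${ZREPTAG}:lastsent=${newsnap} $destfs
        zrep_ssh $desthost zfs set ${ZREPTAG}:lastsenttime=${timeinsec} $destfs
    else
        # Snapshot missing: the receive never completed, so quit before
        # zrep:sent is set locally and expire destroys the last common snap.
        zrep_errquit "receive appears to have failed:" "$destfs@$snapname missing on $desthost"
    fi
}
```

The idea: an old ZFS fails to *set properties* on snapshots, but `zfs list` of the snapshot should still succeed there; if even the listing fails, the receive almost certainly never completed.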


ppbrown commented Mar 31, 2022

It seems like half the problem is with the "attempt fallback methods for older ZFS" code paths.
It bothers me that I still don't understand the real cause and effect for this bug.
That being said, especially since this is a very rare and user-triggered bug,
maybe the "fix" is to focus on zrep v2 removing the old-version fallbacks.

I haven't fully committed to that, but it's the direction I'm considering.


ppbrown commented Apr 1, 2022

FYI: the thing that is confusing me is that the ACTUAL send has fail checks. So if the send failed, it should just exit.

That suggests to me that the problem arises when other things get interrupted.

@crispyduck00

I know; I don't really understand why the fail check is not working here.
I tested this 100 times: if it is interrupted while sending, it still results in an exit status (`$?`) of 0.

Tested with Ctrl-C, but also when running as a systemd service and stopping the service during a send.
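One plausible mechanism for the `$? == 0` result, though not verified against zrep's code: in POSIX shells, `$?` after a pipeline reflects only the *last* command in the pipe, so a killed sender can still leave `$?` at 0 if the tail of the pipe exits cleanly. `set -o pipefail` (available in bash and ksh93) changes that:

```shell
#!/bin/sh
# Demonstrates POSIX pipeline exit-status semantics, a possible (unverified)
# explanation for the observation above.

# The first command in the pipe fails, yet the pipeline "succeeds",
# because its status is taken from the last command (cat):
if false | cat; then
    echo "plain pipeline: reported success"
fi

# With pipefail (bash assumed installed here), any failing member
# makes the whole pipeline fail:
if ! bash -c 'set -o pipefail; false | cat'; then
    echo "pipefail: reported failure"
fi
```

If zrep's send/recv check only looks at the plain pipeline status, an interrupted `zfs send` whose downstream consumer exits cleanly would pass the check.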


ppbrown commented Apr 4, 2022

I now vaguely recall that zfs send/receive sometimes jams up a system a bit longer than would be expected.
Maybe you could do some experiments:
use 'kill' to explicitly kill the zfs process, on either the sending or the receiving side, and see if that makes any difference.
And/or experiment with killing the network halfway through.
