
restore after a node failover #273

Closed
dr-zeta opened this issue Feb 15, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@dr-zeta

dr-zeta commented Feb 15, 2020

Hi everybody.
I found a restore issue after a failover test on a worker node that hosts the operator pods.
The operator cannot restore the last backup because the backup file doesn't exist.
This is the error from the log:

2020-02-14T10:06:38.255Z	INFO	controller-jenkins	backuprestore/backuprestore.go:147	Restoring backup '3634'	{"cr": "master"}
2020-02-14T10:06:38.390Z	WARN	controller-jenkins	jenkins/jenkins_controller.go:154	Reconcile loop failed 10 times with the same error, giving up: pod exec error operation on stream: stdout 'Running restore backup
' stderr 'tar: can't open '/backup/3634.tar.gz': No such file or directory

In this case you have to manually set the recoveryOnce value to 3633 (the last backup number minus 1) to recover the last good backup; see the sketch below.
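
For reference, this is roughly the manual workaround (the CR name master comes from the log above, and I'm assuming the field lives under spec.restore.recoveryOnce):

# Point the operator at the last backup that actually exists on disk (3633 here),
# so the next reconcile restores that archive instead of the missing one.
kubectl patch jenkins master --type merge \
  -p '{"spec":{"restore":{"recoveryOnce":3633}}}'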

I think the operator increments the last backup number before the backup has actually been stored correctly.
Is that possible?

Thanks

@tomaszsek

Hi @dr-zeta,

Sorry for the delay. I don't know how I could have missed this issue 😞. The backup mechanism first makes the backup and only updates status.lastBackup once it has succeeded. I think the backup file wasn't stored correctly because of the node failover.

Cheers

@tomaszsek tomaszsek added the bug Something isn't working label Apr 2, 2020
@tomaszsek

The backup mechanism will be extended with a third command which will return the latest backup number.
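
Roughly, that command could look something like this (just a sketch; the /backup path and the NNNN.tar.gz naming are taken from the error above, not from the final implementation):

#!/bin/bash
# Hypothetical get-latest.sh: print the highest backup number that actually
# exists on disk, so the operator only restores an archive that is really there.
backup_dir=${BACKUP_DIR:-/backup}
ls "${backup_dir}" | grep -E '^[0-9]+\.tar\.gz$' | sed 's/\.tar\.gz$//' | sort -n | tail -n1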

@pniederlag

I can at least confirm we have seen this issue multiple times as well.

Until now I didn't report the issue, as I wasn't sure whether it is tied to our setup using custom backup/restore scripts (only slightly adapted from the shipped versions).

Trying to nail things down, we added some debugging to our backups and are noticing a behaviour that really puzzles us: the same backup number is sometimes reused. Unfortunately there is no obvious pattern to when this happens.

Could you maybe check whether this is happening for you as well?

# get the name of your operator pod; adapt the grep pattern, context and/or namespace to your setup
OPERATOR_ID=$(kubectl get pods -o name | grep jenkins-operator)
# count the occurrences of each backup number
kubectl logs ${OPERATOR_ID} | grep 'Performing backup' | cut -f5 | uniq -c

Sample output on our side:

      1 Performing backup '6507'
      1 Performing backup '6508'
      1 Performing backup '6509'
      2 Performing backup '6510'
      1 Performing backup '6511'
      1 Performing backup '6512'
      1 Performing backup '6513'
      1 Performing backup '6514'
      2 Performing backup '6515'
      1 Performing backup '6516'

As we can see, 6510 and 6515 have been used twice.

Is this just happening on our side?

@pniederlag

Well, actually there is some kind of pattern involved.

We have a backup schedule of 900 seconds.

Whenever we see a second occurrence of the same backup number, it is triggered off schedule, namely exactly after our backup has finished.

See the timestamps in our debug.log: whenever a backup is triggered with the same number a second time, it happens off schedule, right after the previous backup job has finished.

2020-05-02 06:06:46 UTC Custom backup 6507 START
2020-05-02 06:12:53 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:21:46 UTC Custom backup 6508 START
2020-05-02 06:27:49 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:36:46 UTC Custom backup 6509 START
2020-05-02 06:42:57 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:51:46 UTC Custom backup 6510 START
2020-05-02 06:57:51 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:57:53 UTC Custom backup 6510 START
6510 already existing, SHOULD NOT HAPPEN
2020-05-02 07:04:03 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:06:46 UTC Custom backup 6511 START
2020-05-02 07:12:54 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:21:46 UTC Custom backup 6512 START
2020-05-02 07:27:53 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:36:46 UTC Custom backup 6513 START
2020-05-02 07:42:56 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:51:46 UTC Custom backup 6514 START
2020-05-02 07:57:55 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:06:46 UTC Custom backup 6515 START
2020-05-02 08:13:00 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:13:02 UTC Custom backup 6515 START
6515 already existing, SHOULD NOT HAPPEN
2020-05-02 08:19:13 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:21:46 UTC Custom backup 6516 START
2020-05-02 08:27:54 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:36:46 UTC Custom backup 6517 START

My guess is that this issue is related to the restore issue; sorry if this is wrong.

@tomaszsek

@pniederlag Thanks for the detailed logs. I'm surprised that the backup number is used twice 😵. We have to narrow it down and fix it in the operator, but as a quick workaround, I can suggest skipping the backup if the backup file already exists. We will implement it in the PVC backup for sure.
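
As a sketch, that workaround could be a guard at the top of a custom backup.sh along these lines (assuming the backup number is passed as the first argument and the archives live under /backup; note the exit 0 so the skipped run counts as a success):

#!/bin/bash
set -eo pipefail
backup_number=$1
backup_dir=${BACKUP_DIR:-/backup}
# If the archive for this number is already there, skip the run and report success,
# so the operator does not fail the reconcile loop.
if [ -f "${backup_dir}/${backup_number}.tar.gz" ]; then
    echo "Backup ${backup_number} already exists, skipping"
    exit 0
fi
# ... regular backup logic continues here ...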

@pniederlag

"skipping backup in case of duplicated backup number". Well, been there, done that, made it even worse for us.

I had an "exit 1" in the backup script in case of hitting an existing backup number. For some weird reasons in this scenario the backup number never incremented anymore. So we ended up running a couple of days without backups until we spotted the problem.

First of all it would be important to verify that the issue of duplicated backup numbers is a general problem and not something that is specific on our side.

@tomaszsek can you verify this oddity on the backup number?

I just ran into the issue reported by the author of this ticket again. What I can see:

  • backup number 6560 was duplicated
  • during backup number 6560, pods were terminated (the cause is out of scope for this ticket)
  • the size/content of backup number 6560 is "borked"
  • when the restore is triggered after the pods are recreated, it uses backup number 6561 (which never ran)

Quite strange to me.

How is a backup handled in case it is still ongoing/running at the time the pod is recreated?

@tomaszsek

I had an "exit 1" in the backup script in case of hitting an existing backup number.

If you exit with 0 it will work, but the backup shouldn't be triggered twice with the same number in the first place.

How is backup handled in case it is ongoing/running at the time the pod is recreated?

If spec.backup.makeBackupBeforePodDeletion is set, then a backup will be performed on the terminating pod.
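
For reference, enabling it could look like this (a sketch only; the CR name example is just a placeholder, adapt namespace/context as needed):

# enable a backup of the terminating master pod before it is deleted
kubectl patch jenkins example --type merge \
  -p '{"spec":{"backup":{"makeBackupBeforePodDeletion":true}}}'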

@tomaszsek can you verify this oddity on the backup number?

I will try to reproduce the issue.

@tomaszsek

I confirm that the backup is called twice with the same number on pod termination when spec.backup.makeBackupBeforePodDeletion is set:

2020-05-03T15:18:00.471+0200	INFO	controller-jenkins	backuprestore/backuprestore.go:172	Performing backup '66'	{"cr": "example"}
2020-05-03T15:18:00.821+0200	INFO	controller-jenkins	backuprestore/backuprestore.go:172	Performing backup '66'	{"cr": "example"}

The PVC backup provider still works because the backup simply gets overwritten.

akram pushed a commit to akram/jenkins-operator that referenced this issue May 6, 2020
@pniederlag

Thanks. As our problem occurs not only in relation to pod deletion but also during regular operation, my colleague has now moved our issue to a separate ticket (#396). Thanks for your efforts and work!
