
restore after a node failover #273

Closed
dr-zeta opened this issue Feb 15, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@dr-zeta

dr-zeta commented Feb 15, 2020

Hi everybody.
I found a restore issue after a failover test on a worker node that hosts the operator pods.
The operator cannot restore the last backup because the backup file doesn't exist.
This is the error from the log:

2020-02-14T10:06:38.255Z	INFO	controller-jenkins	backuprestore/backuprestore.go:147	Restoring backup '3634'	{"cr": "master"}
2020-02-14T10:06:38.390Z	WARN	controller-jenkins	jenkins/jenkins_controller.go:154	Reconcile loop failed 10 times with the same error, giving up: pod exec error operation on stream: stdout 'Running restore backup
' stderr 'tar: can't open '/backup/3634.tar.gz': No such file or directory

In this case you have to manually set the recoveryOnce value to 3633 (the last backup number minus 1) to recover the last good backup; see the sketch below.
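
For reference, this is roughly the manual workaround (the CR name master comes from the log above, and I'm assuming the field lives under spec.restore.recoveryOnce):

# Point the operator at the last backup that actually exists on disk (3633 here),
# so the next reconcile restores that archive instead of the missing one.
kubectl patch jenkins master --type merge \
  -p '{"spec":{"restore":{"recoveryOnce":3633}}}'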

I think the operator increments the last backup number before the backup has actually been stored correctly.
Is that possible?

Thanks

@tomaszsek

Hi @dr-zeta,

Sorry for the delay. I don't know how I could have missed this issue 😞. The backup mechanism first makes the backup and only updates status.lastBackup once it has succeeded. I think the backup file wasn't stored correctly because of the node failover.

Cheers

@tomaszsek tomaszsek added the bug Something isn't working label Apr 2, 2020
@tomaszsek

The backup mechanism will be extended with a third command which will return the latest backup number.
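
Roughly, that command could look something like this (just a sketch; the /backup path and the NNNN.tar.gz naming are taken from the error above, not from the final implementation):

#!/bin/bash
# Hypothetical get-latest.sh: print the highest backup number that actually
# exists on disk, so the operator only restores an archive that is really there.
backup_dir=${BACKUP_DIR:-/backup}
ls "${backup_dir}" | grep -E '^[0-9]+\.tar\.gz$' | sed 's/\.tar\.gz$//' | sort -n | tail -n1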

@pniederlag

I can at least confirm we have seen this issue multiple times as well.

Until now I didn't report the issue, as I wasn't sure whether it is tied to our setup using custom backup/restore scripts (only slightly adapted from the shipped versions).

Trying to nail things down, we added some debugging to our backups and are noticing a behaviour that really puzzles us: the same backup number is sometimes reused. Unfortunately there is no obvious pattern to when this happens.

Could you maybe check whether this is happening for you as well?

# get the name of your operator pod; adapt the grep pattern, context and/or namespace to your setup
OPERATOR_ID=$(kubectl get pods -o name | grep jenkins-operator)
# count the occurrences of each backup number
kubectl logs ${OPERATOR_ID} | grep 'Performing backup' | cut -f5 | uniq -c

Sample output on our side:

      1 Performing backup '6507'
      1 Performing backup '6508'
      1 Performing backup '6509'
      2 Performing backup '6510'
      1 Performing backup '6511'
      1 Performing backup '6512'
      1 Performing backup '6513'
      1 Performing backup '6514'
      2 Performing backup '6515'
      1 Performing backup '6516'

As we can see, 6510 and 6515 have been used twice.

Is this just happening on our side?

@pniederlag

Well, actually there is some kind of pattern involved.

We have a backup schedule of 900 seconds.

Whenever we see a second occurrence of the same backup number, it is triggered off schedule, namely exactly after our backup has finished.

See the timestamps in our debug.log: whenever a backup is triggered with the same number a second time, it happens off schedule, right after the previous backup job has finished.

2020-05-02 06:06:46 UTC Custom backup 6507 START
2020-05-02 06:12:53 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:21:46 UTC Custom backup 6508 START
2020-05-02 06:27:49 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:36:46 UTC Custom backup 6509 START
2020-05-02 06:42:57 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:51:46 UTC Custom backup 6510 START
2020-05-02 06:57:51 UTC Custom backup DONE
----------------------------------------
2020-05-02 06:57:53 UTC Custom backup 6510 START
6510 already existing, SHOULD NOT HAPPEN
2020-05-02 07:04:03 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:06:46 UTC Custom backup 6511 START
2020-05-02 07:12:54 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:21:46 UTC Custom backup 6512 START
2020-05-02 07:27:53 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:36:46 UTC Custom backup 6513 START
2020-05-02 07:42:56 UTC Custom backup DONE
----------------------------------------
2020-05-02 07:51:46 UTC Custom backup 6514 START
2020-05-02 07:57:55 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:06:46 UTC Custom backup 6515 START
2020-05-02 08:13:00 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:13:02 UTC Custom backup 6515 START
6515 already existing, SHOULD NOT HAPPEN
2020-05-02 08:19:13 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:21:46 UTC Custom backup 6516 START
2020-05-02 08:27:54 UTC Custom backup DONE
----------------------------------------
2020-05-02 08:36:46 UTC Custom backup 6517 START

My guess is that this issue is related to the restore issue; sorry if this is wrong.

@tomaszsek

@pniederlag Thanks for the detailed logs. I'm surprised that the backup number is used twice 😵. We have to narrow it down and fix it in the operator, but as a quick workaround, I can suggest skipping the backup if the backup file already exists. We will implement it in the PVC backup for sure.
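
As a sketch, that workaround could be a guard at the top of a custom backup.sh along these lines (assuming the backup number is passed as the first argument and the archives live under /backup; note the exit 0 so the skipped run counts as a success):

#!/bin/bash
set -eo pipefail
backup_number=$1
backup_dir=${BACKUP_DIR:-/backup}
# If the archive for this number is already there, skip the run and report success,
# so the operator does not fail the reconcile loop.
if [ -f "${backup_dir}/${backup_number}.tar.gz" ]; then
    echo "Backup ${backup_number} already exists, skipping"
    exit 0
fi
# ... regular backup logic continues here ...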

@pniederlag

"skipping backup in case of duplicated backup number". Well, been there, done that, made it even worse for us.

I had an "exit 1" in the backup script in case of hitting an existing backup number. For some weird reasons in this scenario the backup number never incremented anymore. So we ended up running a couple of days without backups until we spotted the problem.

First of all it would be important to verify that the issue of duplicated backup numbers is a general problem and not something that is specific on our side.

@tomaszsek can you verify this oddity on the backup number?

I just ran into the issue reported by the author of this ticket again. What I can see:

  • backup number 6560 was duplicated
  • during backup number 6560, pods were terminated (the cause is out of scope for this ticket)
  • the size/content of backup number 6560 is "borked"
  • when the restore is triggered after the pods are recreated, it uses backup number 6561 (which never ran)

Quite strange to me.

How is a backup handled in case it is still ongoing/running at the time the pod is recreated?

@tomaszsek

I had an "exit 1" in the backup script in case of hitting an existing backup number.

If you exit with 0 it will work, but the backup shouldn't be triggered twice with the same number in the first place.

How is backup handled in case it is ongoing/running at the time the pod is recreated?

If spec.backup.makeBackupBeforePodDeletion is set, then a backup will be performed on the terminating pod.
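
For reference, enabling it could look like this (a sketch only; the CR name example is just a placeholder, adapt namespace/context as needed):

# enable a backup of the terminating master pod before it is deleted
kubectl patch jenkins example --type merge \
  -p '{"spec":{"backup":{"makeBackupBeforePodDeletion":true}}}'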

@tomaszsek can you verify this oddity on the backup number?

I will try to reproduce the issue.

@tomaszsek

I confirm that the backup is called twice with the same number on pod termination when spec.backup.makeBackupBeforePodDeletion is set:

2020-05-03T15:18:00.471+0200	INFO	controller-jenkins	backuprestore/backuprestore.go:172	Performing backup '66'	{"cr": "example"}
2020-05-03T15:18:00.821+0200	INFO	controller-jenkins	backuprestore/backuprestore.go:172	Performing backup '66'	{"cr": "example"}

The PVC backup provider still works because the backup simply gets overwritten.

akram pushed a commit to akram/jenkins-operator that referenced this issue May 6, 2020
@pniederlag

Thanks. As our problem occurs not only in relation to pod deletion but also during regular operation, my colleague has now moved our issue to a separate ticket (#396). Thanks for your efforts and work!
