[BUG] Restoring backup fails due to event revision mismatch #287

Closed
Gerrit91 opened this issue Nov 24, 2020 · 3 comments
Labels: kind/bug, lifecycle/rotten, status/closed

@Gerrit91

Describe the bug:

We encountered an issue where backup restoration was failing because of "revision mismatches". In this cluster we are running an installation of Rook.

For some reason the expected revision 69263102 seems to be present in the first delta snapshot (which is only 1 kB in size). The increment contains only a single entry:

[{"etcdEvent":{"kv":{"key":"L3JlZ2lzdHJ5L2NvbmZpZ21hcHMva3ViZS1zeXN0ZW0vY2VydC1tYW5hZ2VyLWNhaW5qZWN0b3ItbGVhZGVyLWVsZWN0aW9uLWNvcmU=","create_revision":42004,"mod_revision":69263102,"version":4511635,"value":"azhzAAoPCgJ2MRIJQ29uZmlnTWFwEo8ECowECixjZXJ0LW1hbmFnZXItY2FpbmplY3Rvci1sZWFkZXItZWxlY3Rpb24tY29yZRIAGgtrdWJlLXN5c3RlbSIAKiQxNGQ5NWYzMy1lZGQyLTQzMzUtOWUyMi02YjM4ZDgwYThmOTUyADgAQggIzOy0+QUQAGKHAgooY29udHJvbC1wbGFuZS5hbHBoYS5rdWJlcm5ldGVzLmlvL2xlYWRlchLaAXsiaG9sZGVySWRlbnRpdHkiOiJjZXJ0LW1hbmFnZXItY2FpbmplY3Rvci1jZmM0OTViYzUtZzVoczZfNWY5ZDk3MDctNjE2MS00ZTNkLTljNjktYWNhZDliNTEyZDlhIiwibGVhc2VEdXJhdGlvblNlY29uZHMiOjE1LCJhY3F1aXJlVGltZSI6IjIwMjAtMTEtMjNUMDg6NDA6MDJaIiwicmVuZXdUaW1lIjoiMjAyMC0xMS0yM1QwODo0MTozNVoiLCJsZWFkZXJUcmFuc2l0aW9ucyI6ODh9egCKAYkBCgpjYWluamVjdG9yEgZVcGRhdGUaAnYxIggIv+vt/QUQADIIRmllbGRzVjE6WwpZeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6Y29udHJvbC1wbGFuZS5hbHBoYS5rdWJlcm5ldGVzLmlvL2xlYWRlciI6e319fX0aACIA"}},"time":"2020-11-23T08:43:23.717449355Z"}]

We first encountered an issue where the backup restoration did not work, producing output like this:

etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=info msg="Removing directory(/var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part)."                                                                                            
etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to delete previous temporary data directory: failed to remove directory /
var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part
.part.part.part.part.part with err: unlinkat /var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part: file name too long" actor=backup-restore-server

This issue then occurred when trying out the latest version of the backup-restore image, v0.11.1.
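For context on why the temporary directory name in the first error grew so long: each .part component adds 5 bytes, and a single path component on Linux is limited to NAME_MAX (usually 255 bytes), so unlinkat eventually fails with "file name too long". The Go snippet below is only an illustrative sketch of that kind of suffix accumulation, not the project's actual code (see #259 for the actual fix):

package main

import "fmt"

// Illustration only: if each restoration attempt derives its temporary
// directory name by appending ".part" to the previous attempt's name
// instead of to the fixed base name, the component grows until it
// exceeds NAME_MAX and system calls such as unlinkat fail with
// "file name too long".
func main() {
	const nameMax = 255 // typical per-component limit on Linux filesystems
	name := "new.etcd"
	for attempt := 1; ; attempt++ {
		name += ".part" // buggy accumulation across attempts
		if len(name) > nameMax {
			fmt.Printf("attempt %d: component is %d bytes, exceeding NAME_MAX\n", attempt, len(name))
			break
		}
	}
	fmt.Println("non-accumulating form:", "new.etcd.part") // suffix applied to the base name only
}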

Expected behavior:

Restoration succeeds.

How To Reproduce (as minimally and precisely as possible):

I don't really know. 😕

We can potentially provide a backup that produces the described behavior.

Logs:

etcd-main-0 backup-restore {"level":"info","ts":1606215948.1601083,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore {"level":"info","ts":"2020-11-24T11:05:48.202Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore {"level":"info","ts":"2020-11-24T11:05:48.215Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore time="2020-11-24T11:05:48Z" level=error msg="Failed initialization: error while restoring corrupt data: Failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103 " actor=backup-restore-server

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID: v3.3.17
  • Etcd-backup-restore version/commit ID: v0.11.1
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

Gerrit91 added the kind/bug label Nov 24, 2020
@amshuman-kr (Collaborator) commented Nov 27, 2020

The following part of the issue is already solved by #259, released in v0.11.1.

etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=info msg="Removing directory(/var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part)."                                                                                            
etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to delete previous temporary data directory: failed to remove directory /
var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part
.part.part.part.part.part with err: unlinkat /var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part: file name too long" actor=backup-restore-server

@shreyas-s-rao @abdasgupta Do you think #282 would address the following part? I think this can only be solved by solving #283.

mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103

Update: I wrongly referred to #228 earlier. I have corrected the reference to #283.

@amshuman-kr (Collaborator) commented Dec 8, 2020

@Gerrit91 Regarding the second part of the error, mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103: the related #283 was closed because the right thing is already being done when handling transactions during restoration, as the revision is matched while applying each incremental snapshot.
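For readers following along, the check being referred to can be pictured as follows: while applying a delta snapshot the restorer tracks the last applied revision and rejects an event whose revision does not continue the sequence. The Go sketch below is conceptual only, not the project's exact implementation (the real code also has to handle transactions, which is what #283 is about), and the names are illustrative:

package main

import "fmt"

// event carries just the revision of a single change from a delta snapshot.
type event struct {
	ModRevision int64
}

// applyDeltaEvents checks that every event's revision is exactly one past the
// previously applied revision before it would be applied to the embedded etcd.
func applyDeltaEvents(lastRevision int64, events []event) (int64, error) {
	for _, e := range events {
		expected := lastRevision + 1
		if e.ModRevision != expected {
			return lastRevision, fmt.Errorf(
				"mismatched event revision while applying delta snapshot, expected %d but applied %d",
				expected, e.ModRevision)
		}
		// ... the event would be applied here ...
		lastRevision = e.ModRevision
	}
	return lastRevision, nil
}

func main() {
	// With the last applied revision at 69263101, an incoming event at
	// 69263103 instead of 69263102 reproduces the shape of the error above.
	_, err := applyDeltaEvents(69263101, []event{{ModRevision: 69263103}})
	fmt.Println(err)
}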

We can potentially provide a backup that produces the described behavior.

It is possible that there is an edge case. Is this issue reproducible? Or would it be possible to share the backup somehow so that we can try to restore it and reproduce the problem?

gardener-robot added the lifecycle/stale label Sep 22, 2021
gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 24, 2022
abdasgupta self-assigned this Jan 5, 2023
@abdasgupta (Contributor)

/close We already addressed the concern in #263 and #283. Please reopen if the issue occurs again.

gardener-robot added the status/closed label Jan 10, 2023