[BUG] Restoring backup fails due to event revision mismatch #287

Closed
Gerrit91 opened this issue Nov 24, 2020 · 3 comments
Labels: kind/bug, lifecycle/rotten, status/closed

@Gerrit91

Describe the bug:

We encountered an issue where backup restoration was failing because of "revision mismatches". In this cluster we are running an installation of Rook.

For some reason the expected revision 69263102 seems to be present in the first delta snapshot (which is only 1 kB in size). The increment contains only a single entry:

[{"etcdEvent":{"kv":{"key":"L3JlZ2lzdHJ5L2NvbmZpZ21hcHMva3ViZS1zeXN0ZW0vY2VydC1tYW5hZ2VyLWNhaW5qZWN0b3ItbGVhZGVyLWVsZWN0aW9uLWNvcmU=","create_revision":42004,"mod_revision":69263102,"version":4511635,"value":"azhzAAoPCgJ2MRIJQ29uZmlnTWFwEo8ECowECixjZXJ0LW1hbmFnZXItY2FpbmplY3Rvci1sZWFkZXItZWxlY3Rpb24tY29yZRIAGgtrdWJlLXN5c3RlbSIAKiQxNGQ5NWYzMy1lZGQyLTQzMzUtOWUyMi02YjM4ZDgwYThmOTUyADgAQggIzOy0+QUQAGKHAgooY29udHJvbC1wbGFuZS5hbHBoYS5rdWJlcm5ldGVzLmlvL2xlYWRlchLaAXsiaG9sZGVySWRlbnRpdHkiOiJjZXJ0LW1hbmFnZXItY2FpbmplY3Rvci1jZmM0OTViYzUtZzVoczZfNWY5ZDk3MDctNjE2MS00ZTNkLTljNjktYWNhZDliNTEyZDlhIiwibGVhc2VEdXJhdGlvblNlY29uZHMiOjE1LCJhY3F1aXJlVGltZSI6IjIwMjAtMTEtMjNUMDg6NDA6MDJaIiwicmVuZXdUaW1lIjoiMjAyMC0xMS0yM1QwODo0MTozNVoiLCJsZWFkZXJUcmFuc2l0aW9ucyI6ODh9egCKAYkBCgpjYWluamVjdG9yEgZVcGRhdGUaAnYxIggIv+vt/QUQADIIRmllbGRzVjE6WwpZeyJmOm1ldGFkYXRhIjp7ImY6YW5ub3RhdGlvbnMiOnsiLiI6e30sImY6Y29udHJvbC1wbGFuZS5hbHBoYS5rdWJlcm5ldGVzLmlvL2xlYWRlciI6e319fX0aACIA"}},"time":"2020-11-23T08:43:23.717449355Z"}]

We first encountered an issue where the backup restoration did not work, producing output like this:

etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=info msg="Removing directory(/var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part)."                                                                                            
etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to delete previous temporary data directory: failed to remove directory /
var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part
.part.part.part.part.part with err: unlinkat /var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part: file name too long" actor=backup-restore-server

This issue then occurred when trying out the latest version of the backup-restore image, v0.11.1.
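For context on why the temporary directory name in the first error grew so long: each .part component adds 5 bytes, and a single path component on Linux is limited to NAME_MAX (usually 255 bytes), so unlinkat eventually fails with "file name too long". The Go snippet below is only an illustrative sketch of that kind of suffix accumulation, not the project's actual code (see #259 for the actual fix):

package main

import "fmt"

// Illustration only: if each restoration attempt derives its temporary
// directory name by appending ".part" to the previous attempt's name
// instead of to the fixed base name, the component grows until it
// exceeds NAME_MAX and system calls such as unlinkat fail with
// "file name too long".
func main() {
	const nameMax = 255 // typical per-component limit on Linux filesystems
	name := "new.etcd"
	for attempt := 1; ; attempt++ {
		name += ".part" // buggy accumulation across attempts
		if len(name) > nameMax {
			fmt.Printf("attempt %d: component is %d bytes, exceeding NAME_MAX\n", attempt, len(name))
			break
		}
	}
	fmt.Println("non-accumulating form:", "new.etcd.part") // suffix applied to the base name only
}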

Expected behavior:

Restoration succeeds.

How To Reproduce (as minimally and precisely as possible):

I don't really know. 😕

We can potentially provide a backup that produces the described behavior.

Logs:

etcd-main-0 backup-restore {"level":"info","ts":1606215948.1601083,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore {"level":"info","ts":"2020-11-24T11:05:48.202Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore {"level":"info","ts":"2020-11-24T11:05:48.215Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":69260799}
etcd-main-0 backup-restore time="2020-11-24T11:05:48Z" level=error msg="Failed initialization: error while restoring corrupt data: Failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103 " actor=backup-restore-server

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID: v3.3.17
  • Etcd-backup-restore version/commit ID: v0.11.1
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

Gerrit91 added the kind/bug label Nov 24, 2020
@amshuman-kr (Collaborator) commented Nov 27, 2020

The following part of the issue is already solved by #259, released in v0.11.1.

etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=info msg="Removing directory(/var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part)."                                                                                            
etcd-main-0 backup-restore time="2020-11-24T09:56:24Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to delete previous temporary data directory: failed to remove directory /
var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part
.part.part.part.part.part with err: unlinkat /var/etcd/data/new.etcd.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.pa
rt.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.part.par
t.part.part.part.part.part.part.part.part.part.part.part.part.part.part: file name too long" actor=backup-restore-server

@shreyas-s-rao @abdasgupta Do you think #282 would address the following part? I think this can only be solved by solving #283.

mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103

Update: I wrongly referred to #228 earlier. I have corrected the reference to #283.

@amshuman-kr (Collaborator) commented Dec 8, 2020

@Gerrit91 Regarding the second part of the error, mismatched event revision while applying delta snapshot, expected 69263102 but applied 69263103: the related #283 was closed because the right thing is already being done when handling transactions during restoration, as the revision is matched while applying each incremental snapshot.
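For readers following along, the check being referred to can be pictured as follows: while applying a delta snapshot the restorer tracks the last applied revision and rejects an event whose revision does not continue the sequence. The Go sketch below is conceptual only, not the project's exact implementation (the real code also has to handle transactions, which is what #283 is about), and the names are illustrative:

package main

import "fmt"

// event carries just the revision of a single change from a delta snapshot.
type event struct {
	ModRevision int64
}

// applyDeltaEvents checks that every event's revision is exactly one past the
// previously applied revision before it would be applied to the embedded etcd.
func applyDeltaEvents(lastRevision int64, events []event) (int64, error) {
	for _, e := range events {
		expected := lastRevision + 1
		if e.ModRevision != expected {
			return lastRevision, fmt.Errorf(
				"mismatched event revision while applying delta snapshot, expected %d but applied %d",
				expected, e.ModRevision)
		}
		// ... the event would be applied here ...
		lastRevision = e.ModRevision
	}
	return lastRevision, nil
}

func main() {
	// With the last applied revision at 69263101, an incoming event at
	// 69263103 instead of 69263102 reproduces the shape of the error above.
	_, err := applyDeltaEvents(69263101, []event{{ModRevision: 69263103}})
	fmt.Println(err)
}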

We can potentially provide a backup that produces the described behavior.

It is possible that there is an edge case. Is this issue reproducible? Or would it be possible to share the backup somehow so that we can try to restore it and reproduce the problem?

gardener-robot added the lifecycle/stale label Sep 22, 2021
gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 24, 2022
abdasgupta self-assigned this Jan 5, 2023
@abdasgupta (Contributor)

/close We already addressed the concern in #263 and #283. Please reopen if the issue occurs again.

gardener-robot added the status/closed label Jan 10, 2023