[BUG] Snapshot restoration failed due to ETCD max request bytes size #263

Closed
Gerrit91 opened this issue Sep 29, 2020 · 12 comments · Fixed by #282
Labels: area/backup, component/etcd-backup-restore, kind/bug, status/accepted

Comments

@Gerrit91

Gerrit91 commented Sep 29, 2020

Describe the bug:

In our production environment, restoration of an ETCD backup can fail with the error message "trying to send message larger than max (2541915 vs. 2097152)". The restoration process is then stuck in an endless loop.

Expected behavior:

Restoration succeeds.

How To Reproduce (as minimally and precisely as possible):

I don't know. :(

But I can provide a 4GB backup that is not restorable.

Logs:

...
etcd-main-0 backup-restore time="2020-09-29T11:09:55Z" level=info msg="Applying delta snapshot Backup-1601328720/Incr-28989619-29005178-1601360349" actor=restorer                                                 
etcd-main-0 backup-restore 2020-09-29 11:10:03.367348 I | wal: segmented wal file /var/etcd/data/new.etcd.part/member/wal/0000000000000012-000000000009d057.wal is created                                         
etcd-main-0 backup-restore {"level":"warn","ts":"2020-09-29T11:10:15.219Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df1b049-bb7d-4a2e-8a67
-1384108e55a5/127.0.0.1:44391","attempt":0,"error":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2541915 vs. 2097152)"}                                                      
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=info msg="Cleanup complete" actor=restorer                                                                                                            
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=error msg="Restoration failed." actor=restorer                                                                                                        
etcd-main-0 backup-restore 2020-09-29 11:10:15.314956 I | etcdserver: skipped leadership transfer for single member cluster
etcd-main-0 backup-restore 2020-09-29 11:10:15.349119 I | etcdserver: skipped leadership transfer for single member cluster
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=error msg="Failed initialization: error while restoring corrupt data: Failed to restore snapshot: failed to apply events to etcd for delta snapshot In
cr-28989619-29005178-1601360349 : rpc error: code = ResourceExhausted desc = trying to send message larger than max (2541915 vs. 2097152)" actor=backup-restore-server
...

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID: 3.3.17 (client in go.mod has 3.3.15?)
  • Etcd-backup-restore version/commit ID: v0.9.1
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]: GCS

Anything else we need to know?:

We experimented with resolving this error using a self-built image of etcd-backup-restore. We tried increasing the embedded ETCD values for MaxRequestBytes and MaxTxnOps as well as the client value for MaxCallSendMsgSize. However, this only led to another error:

etcd-main-0 backup-restore {"level":"warn","ts":"2020-09-29T13:42:51.482Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-29dfe419-fed5-46ea-96c5
-5fd69783d014/127.0.0.1:33497","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: too many operations in txn request"}

Maybe the restore loop which builds the transactions needs to chunk them into at most 128 operations per transaction (the default MaxTxnOps)?
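
Something along these lines, purely as a sketch of the chunking idea (hypothetical helper, not actual ETCDBR code; the clientv3 import path depends on the etcd version in use):

package restorer

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// applyOpsChunked commits the accumulated ops in batches of at most maxTxnOps,
// so a single transaction never exceeds the server's per-transaction limit.
// Note: splitting one original transaction into several changes how the
// revision advances in the restored store.
func applyOpsChunked(ctx context.Context, client *clientv3.Client, ops []clientv3.Op, maxTxnOps int) error {
	for start := 0; start < len(ops); start += maxTxnOps {
		end := start + maxTxnOps
		if end > len(ops) {
			end = len(ops)
		}
		if _, err := client.Txn(ctx).Then(ops[start:end]...).Commit(); err != nil {
			return err
		}
	}
	return nil
}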

@Gerrit91 Gerrit91 added the kind/bug Bug label Sep 29, 2020
@vlerenc vlerenc added area/backup Backup related component/etcd-backup-restore ETCD Backup & Restore labels Sep 30, 2020
@vlerenc
Member

vlerenc commented Sep 30, 2020

@shreyas-s-rao @amshuman-kr According to @majst01, the ETCD request size limit wasn't changed and they use the ETCD as provisioned by etcd-druid.

@shreyas-s-rao
Collaborator

Thanks @Gerrit91 for reporting this issue. This is something that we haven't encountered before. From the error logs, what I gather is that there is a delta snapshot whose events sum up to more than 2MiB, and the transaction built from it fails to be written to etcd. This surprises me, since we have never hit this before, even with delta snapshots as large as 100MiB; I'm unsure why. I would like to try reproducing the bug locally with your 4GB backup. Could you please provide it?

This also motivates us to expose a lot of the etcd flags in the charts and copy them over to the embedded etcd config too @amshuman-kr

@amshuman-kr
Collaborator

> This also motivates us to expose a lot of the etcd flags in the charts and copy them over to the embedded etcd config too @amshuman-kr

Yes. I think it is generally a good idea to pass the same configuration as the main etcd container to the embedded etcd during restoration. With etcd-druid we have the place and opportunity to do that.

@majst01

majst01 commented Oct 2, 2020

We found what caused this!
We wanted to move our etcd from standard storage to SSD (gardener.fast) storage. Before we migrated the etcd, we measured the performance on the standard storage with:

ETCDCTL_API=3 etcdctl check perf --load="s"

This simulates a small load on the etcd and reports the IOPS, the slowest op, and the standard deviation.

Once you have done this, the incremental snapshot created after this run can no longer be restored; it fails with the above errors.

@Gerrit91
Author

Gerrit91 commented Oct 2, 2020

Yep. This way you should be able to create your own "unrestorable" backup and we do not have to share GBs of files. :)

@shreyas-s-rao
Collaborator

@Gerrit91 @majst01 Ah, ok. This is something I'll try to reproduce locally. Thanks.

@gardener-robot gardener-robot added the status/accepted Issue was accepted as something we need to work on label Oct 22, 2020
@abdasgupta
Contributor

/assign

@abdasgupta
Contributor

Hi @Gerrit91, I could reproduce your case. To solve it, I modified ETCDBR to set the MaxRequestBytes and MaxTxnOps variables to 10 * 1024 * 1024 and 10 * 1024 respectively in pkg/snapshot/restorer/restorer.go. Along with that, I had to create the client with MaxCallSendMsgSize: 10 * 1024 * 1024 in the same file. I will send a PR after I remove the hardcoding and replace it with options in ETCDBR, with the defaults set to the values I hardcoded above. Any suggestions?
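
For reference, the change amounts to roughly the following (a sketch only, not the exact restorer code; the field names come from etcd's embed and clientv3 packages, while the helper and package names are made up):

package restorer

import (
	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/embed"
)

const (
	maxRequestBytes = 10 * 1024 * 1024 // 10 MiB
	maxTxnOps       = 10 * 1024
)

// restoreConfigs builds the embedded-etcd config and the client config used
// during restoration, both with raised limits.
func restoreConfigs(endpoint string) (*embed.Config, clientv3.Config) {
	cfg := embed.NewConfig()
	cfg.MaxRequestBytes = maxRequestBytes // server-side request size limit
	cfg.MaxTxnOps = maxTxnOps             // server-side ops-per-transaction limit

	cliCfg := clientv3.Config{
		Endpoints:          []string{endpoint},
		MaxCallSendMsgSize: maxRequestBytes, // client-side gRPC send limit
	}
	return cfg, cliCfg
}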

@Gerrit91
Author

Gerrit91 commented Nov 4, 2020

Sounds great, @abdasgupta. I guess it's a good idea that a user can increase the value for those rare moments when this error pops up. Did you also try this with a larger performance --load?

@abdasgupta
Contributor

@Gerrit91 Yes, I tried with --load="xl". It restored properly.

abdasgupta added a commit to abdasgupta/etcd-backup-restore that referenced this issue Nov 4, 2020
@abdasgupta
Contributor

I raised PR #282, please review it. One thing I noticed while checking with bigger data sizes is that write operations in my setup are very slow, so etcdctl check perf fails most of the time. Along with that, restoration gives a "request timed out" error, though sometimes restoration passes for bigger data as well.

@abdasgupta
Contributor

I spent some time looking into how etcdbr restores from the delta snapshots. I also had to take a look at etcdctl check perf to understand what was going wrong.
The question that grabbed my attention in this bug was: why do we need to increase the MaxTxnOps parameter while restoring? check perf does not increase MaxTxnOps while writing, so why should restoration need it?
A delta snapshot is basically a recording of the watch events from ETCD (https://etcd.io/docs/v3.4.0/dev-guide/interacting_v3/#watch-historical-changes-of-keys). Watch responses also include the revision number of the datastore. If only a single operation (PUT/DELETE) is committed, the revision number increases by one, and the watch lists the event with the store revision at that point. If multiple operations are bundled in a single transaction, the revision number increases by one only once the transaction commits after executing all the operations; the watch then lists all the operations as a list of events, but each event carries the same store revision number. Initially, I thought this scenario was probably not handled well here:

// Flush the ops accumulated so far as a single transaction only when the
// event revision advances; events sharing a revision stay in the same txn.
if lastRev != 0 && nextRev > lastRev {
	if _, err := client.Txn(ctx).Then(ops...).Commit(); err != nil {
		return err
	}
	ops = []clientv3.Op{}
}
lastRev = nextRev

So I thought etcdctl check perf must be doing some big transactions! That seemed quite absurd, though, because I hadn't tweaked MaxTxnOps for the ETCD process. Actually, check perf does not do any transactions at all, it only executes the operations one by one: https://github.com/etcd-io/etcd/blob/01844fd2856016c488fd0f8974252a0070f277ae/etcdctl/ctlv3/command/check.go#L185

Then I found what is actually happening. The check perf program also deletes all the keys that it created, using a Delete operation. But this Delete operation takes a prefix: https://github.com/etcd-io/etcd/blob/01844fd2856016c488fd0f8974252a0070f277ae/etcdctl/ctlv3/command/check.go#L216

Unlike a transaction, a delete operation with a prefix is not sent to the ETCD server as a bunch of operations bundled together in one request. Instead, only the key prefix is sent, and the server deletes all keys in the database matching that prefix. One Delete operation can delete many keys that match the prefix, but the revision number is increased by only one after the Delete operation, just as for a transaction.
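
For illustration, with a clientv3 client (the key prefix here is made up):

package main

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// deleteWithPrefix shows that a single prefixed Delete removes many keys while
// the store revision advances only once, so a later Watch reports many DELETE
// events that all carry the same revision.
func deleteWithPrefix(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Delete(ctx, "some-prefix/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	// resp.Deleted may be in the thousands; resp.Header.Revision grew by only 1.
	fmt.Printf("deleted %d keys at revision %d\n", resp.Deleted, resp.Header.Revision)
	return nil
}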

Thus, the watch lists multiple Delete events even though it was actually a single Delete operation with a prefix. Earlier I pointed out how this scenario, where multiple operations are performed but the revision number increases by only one, is handled. And so

_, err := client.Txn(ctx).Then(ops...).Commit()
was bundling the delete operations that were actually deleting all the 9000 keys generated during ETCDCTL_API=3 bin/etcdctl check perf --load="s". As all 9000 keys were sent as operations bundled in a single transaction, the default limit of MaxTxnOps (128) was being exceeded.

I removed the Delete operation from etcdctl check perf and ETCDBR restored the data fine.

As a solution, instead of a transaction, the restorer could issue a Delete operation when a prefix was used. The restorer could also try to implement an algorithm to infer the prefix from the keys of the Delete events, but the revision numbers may be affected in that case. So I don't see any foolproof way to counter this problem just by looking at the watch events during restoration. The only option in our hands seems to be to tweak MaxTxnOps (or to remove the revision check).
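
Just to make the prefix-inference idea concrete, a sketch of the kind of helper it would need (hypothetical, and as said above not foolproof):

package restorer

import "strings"

// commonPrefix returns the longest common prefix of the keys from a group of
// DELETE events that share one revision; an empty result means no usable prefix.
func commonPrefix(keys []string) string {
	if len(keys) == 0 {
		return ""
	}
	prefix := keys[0]
	for _, k := range keys[1:] {
		for !strings.HasPrefix(k, prefix) {
			prefix = prefix[:len(prefix)-1]
			if prefix == "" {
				return ""
			}
		}
	}
	return prefix
}

The restorer could then issue a single prefixed Delete for that group instead of one large transaction, subject to the revision caveat mentioned above.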
