[BUG] Snapshot restoration failed due to ETCD max request bytes size #263

Closed
Gerrit91 opened this issue Sep 29, 2020 · 12 comments · Fixed by #282
Labels: area/backup, component/etcd-backup-restore, kind/bug, status/accepted

Comments

@Gerrit91

Gerrit91 commented Sep 29, 2020

Describe the bug:

In our production environment, restoration of an ETCD backup can fail with the error message "trying to send message larger than max (2541915 vs. 2097152)". The restoration process is then stuck in an endless loop.

Expected behavior:

Restoration succeeds.

How To Reproduce (as minimally and precisely as possible):

I don't know. :(

But I can provide a 4GB backup that is not restorable.

Logs:

...
etcd-main-0 backup-restore time="2020-09-29T11:09:55Z" level=info msg="Applying delta snapshot Backup-1601328720/Incr-28989619-29005178-1601360349" actor=restorer                                                 
etcd-main-0 backup-restore 2020-09-29 11:10:03.367348 I | wal: segmented wal file /var/etcd/data/new.etcd.part/member/wal/0000000000000012-000000000009d057.wal is created                                         
etcd-main-0 backup-restore {"level":"warn","ts":"2020-09-29T11:10:15.219Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8df1b049-bb7d-4a2e-8a67
-1384108e55a5/127.0.0.1:44391","attempt":0,"error":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2541915 vs. 2097152)"}                                                      
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=info msg="Cleanup complete" actor=restorer                                                                                                            
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=error msg="Restoration failed." actor=restorer                                                                                                        
etcd-main-0 backup-restore 2020-09-29 11:10:15.314956 I | etcdserver: skipped leadership transfer for single member cluster
etcd-main-0 backup-restore 2020-09-29 11:10:15.349119 I | etcdserver: skipped leadership transfer for single member cluster
etcd-main-0 backup-restore time="2020-09-29T11:10:15Z" level=error msg="Failed initialization: error while restoring corrupt data: Failed to restore snapshot: failed to apply events to etcd for delta snapshot In
cr-28989619-29005178-1601360349 : rpc error: code = ResourceExhausted desc = trying to send message larger than max (2541915 vs. 2097152)" actor=backup-restore-server
...

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID: 3.3.17 (client in go.mod has 3.3.15?)
  • Etcd-backup-restore version/commit ID: v0.9.1
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]: GCS

Anything else we need to know?:

We experimented with resolving this error using a self-built image of etcd-backup-restore. We tried increasing the embedded ETCD values for MaxRequestBytes and MaxTxnOps as well as the client value for MaxCallSendMsgSize. However, this only led to another error:

etcd-main-0 backup-restore {"level":"warn","ts":"2020-09-29T13:42:51.482Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-29dfe419-fed5-46ea-96c5
-5fd69783d014/127.0.0.1:33497","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: too many operations in txn request"}

Maybe the restore loop which builds the transactions needs to chunk them into at most 128 operations per transaction (the default MaxTxnOps)?
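
Something along these lines, purely as a sketch of the chunking idea (hypothetical helper, not actual ETCDBR code; the clientv3 import path depends on the etcd version in use):

package restorer

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// applyOpsChunked commits the accumulated ops in batches of at most maxTxnOps,
// so a single transaction never exceeds the server's per-transaction limit.
// Note: splitting one original transaction into several changes how the
// revision advances in the restored store.
func applyOpsChunked(ctx context.Context, client *clientv3.Client, ops []clientv3.Op, maxTxnOps int) error {
	for start := 0; start < len(ops); start += maxTxnOps {
		end := start + maxTxnOps
		if end > len(ops) {
			end = len(ops)
		}
		if _, err := client.Txn(ctx).Then(ops[start:end]...).Commit(); err != nil {
			return err
		}
	}
	return nil
}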

@Gerrit91 Gerrit91 added the kind/bug Bug label Sep 29, 2020
@vlerenc vlerenc added area/backup Backup related component/etcd-backup-restore ETCD Backup & Restore labels Sep 30, 2020
@vlerenc
Member

vlerenc commented Sep 30, 2020

@shreyas-s-rao @amshuman-kr According to @majst01, the ETCD request size limit wasn't changed and they use the ETCD as provisioned by etcd-druid.

@shreyas-s-rao
Collaborator

Thanks @Gerrit91 for reporting this issue. This is something that we haven't encountered before. From the error logs, what I gather is that there is a delta snapshot whose events sum up to more than 2MiB, and the transaction built from it fails to be written to etcd. This surprises me, since we have never hit this before, even with delta snapshots as large as 100MiB; I'm unsure why. I would like to try reproducing the bug locally with your 4GB backup. Could you please provide it?

This also motivates us to expose a lot of the etcd flags in the charts and copy them over to the embedded etcd config too @amshuman-kr

@amshuman-kr
Collaborator

> This also motivates us to expose a lot of the etcd flags in the charts and copy them over to the embedded etcd config too @amshuman-kr

Yes. I think it is generally a good idea to pass the same configuration as the main etcd container to the embedded etcd during restoration. With etcd-druid we have the place and opportunity to do that.

@majst01

majst01 commented Oct 2, 2020

We found what caused this!
We wanted to move our etcd from standard storage to SSD (gardener.fast) storage. Before we migrated the etcd, we measured the performance on the standard storage with:

ETCDCTL_API=3 etcdctl check perf --load="s"

This simulates a small load on the etcd and reports the IOPS, the slowest op, and the standard deviation.

Once you have done this, the incremental snapshot created after this run can no longer be restored; it fails with the above errors.

@Gerrit91
Author

Gerrit91 commented Oct 2, 2020

Yep. This way you should be able to create your own "unrestorable" backup and we do not have to share GBs of files. :)

@shreyas-s-rao
Collaborator

@Gerrit91 @majst01 Ah, ok. This is something I'll try to reproduce locally. Thanks.

@gardener-robot gardener-robot added the status/accepted Issue was accepted as something we need to work on label Oct 22, 2020
@abdasgupta
Contributor

/assign

@abdasgupta
Contributor

Hi @Gerrit91, I could reproduce your case. To solve it, I modified ETCDBR to set the MaxRequestBytes and MaxTxnOps variables to 10 * 1024 * 1024 and 10 * 1024 respectively in pkg/snapshot/restorer/restorer.go. Along with that, I had to create the client with MaxCallSendMsgSize: 10 * 1024 * 1024 in the same file. I will send a PR after I remove the hardcoding and replace it with options in ETCDBR, with the defaults set to the values I hardcoded above. Any suggestions?
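
For reference, the change amounts to roughly the following (a sketch only, not the exact restorer code; the field names come from etcd's embed and clientv3 packages, while the helper and package names are made up):

package restorer

import (
	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/embed"
)

const (
	maxRequestBytes = 10 * 1024 * 1024 // 10 MiB
	maxTxnOps       = 10 * 1024
)

// restoreConfigs builds the embedded-etcd config and the client config used
// during restoration, both with raised limits.
func restoreConfigs(endpoint string) (*embed.Config, clientv3.Config) {
	cfg := embed.NewConfig()
	cfg.MaxRequestBytes = maxRequestBytes // server-side request size limit
	cfg.MaxTxnOps = maxTxnOps             // server-side ops-per-transaction limit

	cliCfg := clientv3.Config{
		Endpoints:          []string{endpoint},
		MaxCallSendMsgSize: maxRequestBytes, // client-side gRPC send limit
	}
	return cfg, cliCfg
}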

@Gerrit91
Author

Gerrit91 commented Nov 4, 2020

Sounds great, @abdasgupta. I guess it's a good idea that a user can increase the value for those rare moments when this error pops up. Did you also try this with a larger performance --load?

@abdasgupta
Contributor

@Gerrit91 Yes, I tried with --load="xl". It restored properly.

abdasgupta added a commit to abdasgupta/etcd-backup-restore that referenced this issue Nov 4, 2020
@abdasgupta
Contributor

I raised PR #282, please review it. One thing I noticed while checking with bigger data sizes is that write operations in my setup are very slow, so etcdctl check perf fails most of the time. Along with that, restoration gives a "request timed out" error, though sometimes restoration passes for bigger data as well.

@abdasgupta
Contributor

I spent some time looking into how etcdbr restores from the delta snapshots. I also had to take a look at etcdctl check perf to understand what was going wrong.
The question that grabbed my attention in this bug was: why do we need to increase the MaxTxnOps parameter while restoring? check perf does not increase MaxTxnOps while writing, so why should restoration need it?
A delta snapshot is basically a recording of the watch events from ETCD (https://etcd.io/docs/v3.4.0/dev-guide/interacting_v3/#watch-historical-changes-of-keys). Watch responses also include the revision number of the datastore. If only a single operation (PUT/DELETE) is committed, the revision number increases by one, and the watch lists the event with the store revision at that point. If multiple operations are bundled in a single transaction, the revision number increases by one only once the transaction commits after executing all the operations; the watch then lists all the operations as a list of events, but each event carries the same store revision number. Initially, I thought this scenario was probably not handled well here:

// Flush the ops accumulated so far as a single transaction only when the
// event revision advances; events sharing a revision stay in the same txn.
if lastRev != 0 && nextRev > lastRev {
	if _, err := client.Txn(ctx).Then(ops...).Commit(); err != nil {
		return err
	}
	ops = []clientv3.Op{}
}
lastRev = nextRev

So I thought etcdctl check perf must be doing some big transactions! That seemed quite absurd, though, because I hadn't tweaked MaxTxnOps for the ETCD process. Actually, check perf does not do any transactions at all, it only executes the operations one by one: https://github.com/etcd-io/etcd/blob/01844fd2856016c488fd0f8974252a0070f277ae/etcdctl/ctlv3/command/check.go#L185

Then I found what is actually happening. The check perf program also deletes all the keys that it created, using a Delete operation. But this Delete operation takes a prefix: https://github.com/etcd-io/etcd/blob/01844fd2856016c488fd0f8974252a0070f277ae/etcdctl/ctlv3/command/check.go#L216

Unlike a transaction, a delete operation with a prefix is not sent to the ETCD server as a bunch of operations bundled together in one request. Instead, only the key prefix is sent, and the server deletes all keys in the database matching that prefix. One Delete operation can delete many keys that match the prefix, but the revision number is increased by only one after the Delete operation, just as for a transaction.
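
For illustration, with a clientv3 client (the key prefix here is made up):

package main

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// deleteWithPrefix shows that a single prefixed Delete removes many keys while
// the store revision advances only once, so a later Watch reports many DELETE
// events that all carry the same revision.
func deleteWithPrefix(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Delete(ctx, "some-prefix/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	// resp.Deleted may be in the thousands; resp.Header.Revision grew by only 1.
	fmt.Printf("deleted %d keys at revision %d\n", resp.Deleted, resp.Header.Revision)
	return nil
}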

Thus, the watch lists multiple Delete events even though it was actually a single Delete operation with a prefix. Earlier I pointed out how this scenario, where multiple operations are performed but the revision number increases by only one, is handled. And so

_, err := client.Txn(ctx).Then(ops...).Commit()
was bundling the delete operations that were actually deleting all the 9000 keys generated during ETCDCTL_API=3 bin/etcdctl check perf --load="s". As all 9000 keys were sent as operations bundled in a single transaction, the default limit of MaxTxnOps (128) was being exceeded.

I removed the Delete operation from etcdctl check perf and ETCDBR restored the data fine.

As a solution, instead of a transaction, the restorer could issue a Delete operation when a prefix was used. The restorer could also try to implement an algorithm to infer the prefix from the keys of the Delete events, but the revision numbers may be affected in that case. So I don't see any foolproof way to counter this problem just by looking at the watch events during restoration. The only option in our hands seems to be to tweak MaxTxnOps (or to remove the revision check).
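
Just to make the prefix-inference idea concrete, a sketch of the kind of helper it would need (hypothetical, and as said above not foolproof):

package restorer

import "strings"

// commonPrefix returns the longest common prefix of the keys from a group of
// DELETE events that share one revision; an empty result means no usable prefix.
func commonPrefix(keys []string) string {
	if len(keys) == 0 {
		return ""
	}
	prefix := keys[0]
	for _, k := range keys[1:] {
		for !strings.HasPrefix(k, prefix) {
			prefix = prefix[:len(prefix)-1]
			if prefix == "" {
				return ""
			}
		}
	}
	return prefix
}

The restorer could then issue a single prefixed Delete for that group instead of one large transaction, subject to the revision caveat mentioned above.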
