[BUG] Snapshot restoration failed due to ETCD max request bytes size #263
Comments
@shreyas-s-rao @amshuman-kr According to @majst01, the ETCD request size limit wasn't changed and they use the ETCD as provisioned by druid.
Thanks @Gerrit91 for reporting this issue. This is something that we haven't encountered before. From the error logs, what I gather is that there is a delta snapshot whose events sum up to more than 2MiB, and the transaction built from it fails to be written to etcd. This surprises me, since we never ran into this issue before, even with delta snapshots as large as 100MiB, and I'm unsure why. I would like to try reproducing the bug locally with your 4GB backup. Could you please provide it? This also motivates us to expose more of the etcd flags in the charts and to copy them over to the embedded etcd config too. @amshuman-kr
Yes. I think it is generally a good idea to pass the same configuration as the main etcd container to the embedded etcd during restoration.
We found what caused this!
We ran `ETCDCTL_API=3 etcdctl check perf --load=s`, which simulates a small load on the etcd and reports IOPS, the slowest op and the stddev. Once you have done this, the incremental snapshot created after this run cannot be restored anymore and fails with the above errors.
Yep. This way you should be able to create your own "unrestorable" backup and we do not have to share GBs of files. :)
/assign
Hi @Gerrit91, I could reproduce your issue. To solve it, I modified etcdbr so that the limits of the embedded etcd used for restoration can be increased.
Sounds great, @abdasgupta. I guess it's a good idea that a user can increase the value for those rare moments where this error pops up. Did you also try this with a larger performance test load?
@Gerrit91 Yes, I tried it with a larger load as well.
I raised PR #282, please review it. One thing I noticed while checking with bigger data sizes is that write operations in my setup are very slow.
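For illustration, here is a minimal sketch of the kind of knobs involved when restoring into an embedded etcd, using etcd's `embed` and `clientv3` packages. The field names (`MaxTxnOps`, `MaxRequestBytes`, `MaxCallSendMsgSize`) come from those packages; the surrounding wiring, the values and the import paths are assumptions and are not taken from PR #282 or the etcd-backup-restore code:

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/embed"
)

func main() {
	// Embedded etcd used as the restoration target: raise the server-side
	// limits that large restore transactions can run into.
	cfg := embed.NewConfig()
	cfg.Dir = "restore.etcd"               // hypothetical data directory
	cfg.MaxTxnOps = 1024                   // etcd default is 128
	cfg.MaxRequestBytes = 10 * 1024 * 1024 // etcd default is ~1.5 MiB

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
	case <-time.After(30 * time.Second):
		log.Fatal("embedded etcd took too long to start")
	}

	// Client used to apply the snapshot events: raise the matching
	// client-side gRPC send limit (default is 2 MiB, the 2097152 seen
	// in the error message above).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{cfg.ACUrls[0].String()},
		MaxCallSendMsgSize: 10 * 1024 * 1024,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	_ = cli // the restore loop would use this client
}
```

Raising only the server-side limits is not enough: the 2 MiB in the reported error is the client-side send limit, so both sides have to agree.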
I spent some time looking at how etcdbr restores from the delta snapshots. I also had a look at etcd-backup-restore/pkg/snapshot/restorer/restorer.go, lines 645 to 651 at f68413f.
So I thought `etcdctl check perf` must be doing some big transactions! Though that seemed quite absurd, because I hadn't tweaked `MaxTxnOps` for the ETCD process. Actually, `check perf` is not doing any transactions at all! It only executes the operations one by one: https://github.com/etcd-io/etcd/blob/01844fd2856016c488fd0f8974252a0070f277ae/etcdctl/ctlv3/command/check.go#L185
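To make that concrete, a rough clientv3 sketch of the difference (not etcdctl's actual code; the endpoint, key names, value and the 9000 count are only illustrative): independent requests never approach `MaxTxnOps`, while a single transaction carrying the same operations does.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// What `check perf` effectively does: one independent request per key,
	// so neither MaxTxnOps nor the request size limit is ever hit.
	for i := 0; i < 9000; i++ {
		if _, err := cli.Put(ctx, fmt.Sprintf("perf-key-%d", i), "value"); err != nil {
			log.Fatal(err)
		}
	}

	// What would hit the limits: bundling all operations into one transaction.
	ops := make([]clientv3.Op, 0, 9000)
	for i := 0; i < 9000; i++ {
		ops = append(ops, clientv3.OpPut(fmt.Sprintf("perf-key-%d", i), "value"))
	}
	if _, err := cli.Txn(ctx).Then(ops...).Commit(); err != nil {
		// fails: 9000 ops exceed the default MaxTxnOps of 128 (and, for
		// larger payloads, the gRPC message size limit as well)
		log.Fatal(err)
	}
}
```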
Then I found what is actually happening. Unlike a transaction, a delete operation with a prefix is not sent to the ETCD server as a bunch of operations bundled together in one request. Instead, only the key prefix is sent, and the server deletes all keys matching that prefix from the database. One Delete operation can therefore delete many keys, but the revision number is increased by only one, just as with a transaction. Thus Watch lists multiple Delete events even though it was actually a single Delete operation with a prefix. Earlier I pointed out how this scenario, where multiple operations are performed but the revision number is increased by only one, is handled: the restorer bundles all events of one revision into a single transaction.
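A small clientv3 sketch of this behaviour, assuming a locally running etcd (the endpoint, prefix and key count are made up): a single prefix delete bumps the revision by one, yet the watcher reports one DELETE event per deleted key, all at the same revision, so a restorer replaying those events as one transaction ends up with as many operations as there were keys.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Write a handful of keys under a common prefix.
	for i := 0; i < 5; i++ {
		if _, err := cli.Put(ctx, fmt.Sprintf("demo/key-%d", i), "v"); err != nil {
			log.Fatal(err)
		}
	}

	// Watch the prefix from the current revision onwards.
	wch := cli.Watch(ctx, "demo/", clientv3.WithPrefix())

	// A single prefix delete: one request, one new revision...
	dresp, err := cli.Delete(ctx, "demo/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("deleted %d keys at revision %d\n", dresp.Deleted, dresp.Header.Revision)

	// ...but the watcher delivers one DELETE event per key, all sharing
	// the same ModRevision.
	resp := <-wch
	for _, ev := range resp.Events {
		fmt.Printf("%s %q modRev=%d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
	}
}
```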
This is exactly what happened with `ETCDCTL_API=3 bin/etcdctl check perf --load="s"`: all 9000 keys were sent as operations bundled in a single transaction during restoration, so the default limit of `MaxTxnOps` (128) was getting crossed.
I confirmed this by removing the Delete operation from the test run. As a solution, instead of a transaction, the restorer could issue a Delete operation when a prefix is used. The restorer could also try to implement an algorithm to derive the prefix from the keys of the Delete events, but the revision numbers may be affected in that case. So I don't see any foolproof way to counter this problem just by looking at the watch events during restoration. The only option in our hands seems to be to tweak `MaxTxnOps` and the related request size limits.
Describe the bug:
In our production environment, restoration of an ETCD backup potentially fails with the error message `trying to send message larger than max (2541915 vs. 2097152)`. The restoration process then gets stuck in an endless loop.

Expected behavior:
Restoration succeeds.
How To Reproduce (as minimally and precisely as possible):
I don't know. :(
But I can provide a 4GB backup that is not restorable.
Logs:
Screenshots (if applicable):
Environment (please complete the following information):

- Etcd version: (go.mod has 3.3.15?)

Anything else we need to know?:
We were experimenting to resolve this error with a self-built image of etcd-backup-restore. We tried to increase the embedded ETCD values for `MaxRequestBytes` and `MaxTxnOps` and the client value for `MaxCallSendMsgSize`. However, this only led to another error. Maybe the restore loop which builds the transactions needs to be chunked into at most 128 transaction objects (the default)?
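A minimal sketch of that chunking idea, assuming the restore loop has already collected the events of one delta-snapshot revision as a slice of clientv3 operations (the function and parameter names below are made up, not taken from etcd-backup-restore):

```go
package restore

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// applyChunked applies the operations of one restored revision in batches of
// at most maxTxnOps, so that a single transaction never exceeds the embedded
// etcd's --max-txn-ops or request size limits.
func applyChunked(ctx context.Context, cli *clientv3.Client, ops []clientv3.Op, maxTxnOps int) error {
	for start := 0; start < len(ops); start += maxTxnOps {
		end := start + maxTxnOps
		if end > len(ops) {
			end = len(ops)
		}
		if _, err := cli.Txn(ctx).Then(ops[start:end]...).Commit(); err != nil {
			return err
		}
	}
	return nil
}
```

Note that each chunk commits as its own transaction and therefore gets its own revision, so the restored revision numbers would diverge from the original cluster, which is essentially the same concern raised above about replaying prefix deletes.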