-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added CLI-flags to enable restoration for delta snapshots with large amount of data #282
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@abdasgupta I have updated the PR title, description and the releases notes. Please check if it makes sense and see anything needs to be changed. You can follow this kind template in the future PRs.
The integration test is failing. @shreyas-s-rao any idea why this is?
|
@abdasgupta Thanks for the well-written PR! Can you please check why the unit tests for restoration are failing? @amshuman-kr The issue was with two pipeline jobs running in parallel and trying to access the same bucket. I have temporarily deleted one of the PR jobs for running tm tests, since the tm tests run on head update anyway. I'll refactor the pipeline definitions today and re-enable on-demand tm tests for PRs later. |
@shreyas-s-rao I have updated the test cases now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very useful. Thanks for this PR. 👍
Hope it's fine that I added a review comment.
Quote from https://etcd.io/docs/v3.4.0/learning/api/#transaction.
But I doubt all the events in a single delta snapshot ever belong to a single transaction while the changes are originally made on etcd. After that, there is this 👇 in the same documentation.
I am not sure we are checking this either which is again another problem.
@abdasgupta I suppose you meant that a revision check after restoration fails with this error. But I don't know if we check after restoration if the restored revision matches the backup latest revision? From the above documentation, it looks like they may not match even based on the current implementation (before this PR)! Anyway, if I understood right, we have the following options. Anything else will be technically incorrect.
Or I have misunderstood completely, which is certainly possible 😉 |
@amshuman-kr I don't particularly feel comfortable giving up on revision checks during restoration, because right now that's our only safety net against restoration inconsistencies, if at all the snapshot was not uploaded correctly, or something went wrong in the bucket or even during the download of the object.
I too remember seeing this in the documentation, but there never seemed to an issue regarding this when we applied the transactions during restoration. Surprising indeed. Since delta snapshots record events (which are either PUT or DELETE), I think we should pursue restoring delta snapshots in the same way, serially applying PUT and DELETE events instead of batching them into transactions. @abdasgupta @amshuman-kr WDYT? |
I have tried to do only PUT and DELETE operations, instead of using transactions. First problem is, each operation would commit as I understood. So, for large Delta events, it should take larger time. Second problem, that also gave error of revision mismatch. The changes I Made:
The error:
Hope this is what you were suggesting? @shreyas-s-rao As a reason for this error, I think that the delta snapshot operations are queued to the ETCD server in a order which is not matching with the order the changes were made actually . |
@abdasgupta Yes, this is what I was thinking. If I'm right, most of the delta snapshots events are getting applied correctly, but that applying that last event is throwing the error right? |
I agree.
I am afraid this is not an option. As @abdasgupta showed above, if the original events were done using transaction then not using transaction while restoring will still lead to mismatch in revisions after restoration. If we do not want to give up on revision check after restoration (and I agree that we should not), then the correct thing is to group the events into transactions only if they share the same Furthermore, the implementation idea for compaction of compacting the delta snapshots asynchronously by periodically restoring and replacing past delta snapshots with full snapshots would work only if revisions are consistent across restorations. So, we don't really have a choice here. As far as I can see, we have to honour original transaction during restoration. The only tricky part for me is how to ensure events from the same original transaction do not get split into multiple delta snapshot files. If that can be ensured, then the rest is manageable, I think. |
I think this can be achieved if etcd watch returns all the events from the same transaction in one watch response. I am reasonably confident that it will. We need to check if this is guaranteed to be true. |
TL;DR 😉 Transactions are an ETCD functionality designed for use by clients of ETCD however they see fit. It clearly has semantics associated with it and not just an optimization hack. We should not use it for any other purpose. Equally, we should not avoid using it for any other purpose either. |
BTW I am ok merging this PR as is with the additional CLI flags without the chunking changes during restoration because the issue with transactions is clearly a pre-existing condition. We can create a separate issue to track it (and address it in a separate PR). |
There is a conflict appearing in this PR due to #275 |
@abdasgupta Yes, can you please rebase the PR? |
@amshuman-kr @shreyas-s-rao Took the liberty of passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abdasgupta Thanks for the PR. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Added CLI-flags to enable restoration for delta snapshots with large amount of data
What this PR does / why we need it:
Added CLI-flags (
max-call-send-message-size
,max-request-bytes
andmax-txn-ops
) to enable restoration for delta snapshots with large amount of data (large number of events or events with large data).Which issue(s) this PR fixes:
Fixes #263
Special notes for your reviewer:
Release note: