Inventory-based backup tool #197
Comments
I assume you mean by the last part that keys deleted from the bucket should not be deleted from the backup. What about if a key is modified — then what should happen to the old version? What does "also be a true backup going back" mean?
I don't know what you're trying to say here (First problem: filename for what?), and as a result I can't make sense of the rest of the source paragraph.
correct!
NB. This whole design is just an idea ATM. So if you see shortcomings or have recommendations -- we can improve!
Good point. Let me state it as a requirement that we should be able to identify/recover any version of any file ever uploaded to the archive, not just the current version. (We might later do pruning of old versions, though.)
I was trying to say that we should adjust the filename for what we download.
To test etc., it would be worth adding a few options:
Overall -- I think the tool should not be DANDI-specific at all, and thus could be of general interest to the public. (That is also why it is worth checking even more whether something like that already exists -- I have failed to find anything so far.)
Please elaborate on exactly what behavior you want.
@dandi/archive-admin Question: I'm looking at s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.json, which was apparently generated on 2024 Nov 6, yet the first CSV file listed in it contains only entries with mtimes on 2022 April 09. Why the discrepancy?
Those files are generated automatically by AWS S3 Inventory. The CSVs together should be an inventory of all objects in that bucket. It's not a diff, but a total reflection of the bucket, generated nightly.
@yarikoptic Given Satra's comment above, should the backup program just process the latest inventory?
Aiming for a generic tool, why not make it flexible -- read whichever inventory is specified.
Let's also support rollback or matching a prior state: add a function which would ensure that the current tree matches a specific inventory.
But while thinking about it, I realized that the overall approach does not cover the case of a key switching between being a file and a directory.
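Just to make the "rollback or match the prior state" idea concrete, here is a minimal sketch (not an agreed-upon design), assuming the chosen inventory has already been parsed into a mapping of object key to ETag, and ignoring the `*.old.*`/`.versions.json` bookkeeping and multipart ETags:

```python
from pathlib import Path
import hashlib

def match_inventory(backup_root: Path, inventory: dict[str, str]) -> None:
    """Make the local tree under backup_root match the given inventory.

    `inventory` maps object keys (relative paths) to their expected ETags.
    Simplified sketch: *.old.* files, .versions.json, and multipart ETags
    are not handled here.
    """
    # Drop local files that are not present in the chosen inventory.
    for path in sorted(backup_root.rglob("*")):
        if path.is_file():
            key = path.relative_to(backup_root).as_posix()
            if key not in inventory:
                path.unlink()
    # Report keys that are missing locally or whose content differs,
    # so they can be re-fetched from S3 (fetching is out of scope here).
    for key, etag in inventory.items():
        path = backup_root / key
        if not path.exists():
            print(f"missing, needs re-download: {key}")
        elif hashlib.md5(path.read_bytes()).hexdigest() != etag.strip('"'):
            print(f"etag mismatch, needs re-download: {key}")
```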
In principle -- yes, it could be coded first so that it just processes the given (e.g. latest) inventory.
That would be rather tedious to implement. Do you need this for the first working version?
Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.
Are you actually saying "no" here?
If there are multiple inventories between the last processed and the most recent, why would we process each one? As Satra said, they're not diffs; each set of inventory files is a complete listing of the bucket.
I am saying "no, it is not enough"
Then you could potentially miss some versions of the files, which would be renamed into `*.old.*` files.
Yes, there is a scalability concern, as we are expecting hundreds of millions of entries (e.g. https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted, though, it might be quite easy, since all files in a folder would then form a sequential batch, and we would process that sequential batch from the inventory together with the files on the drive and in `.versions.json`.
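A minimal sketch of that idea, assuming the inventory CSV rows are already sorted by key and grouping by the key's directory part (the column index is an assumption to be checked against the actual fileSchema):

```python
import csv
import gzip
from itertools import groupby
from os.path import dirname

def folder_batches(inventory_csv_gz: str):
    """Yield (folder, rows) batches from a sorted inventory CSV.

    Assumes the usual S3 Inventory CSV layout where the object key is the
    second column; adjust the index to match the manifest's fileSchema.
    """
    with gzip.open(inventory_csv_gz, "rt", newline="") as fh:
        rows = csv.reader(fh)
        for folder, batch in groupby(rows, key=lambda row: dirname(row[1])):
            # All keys sharing this folder arrive as one contiguous batch,
            # so the local directory and its .versions.json can be
            # reconciled in a single pass and then dropped from memory.
            yield folder, list(batch)
```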
It appears that the inventories include old/non-latest versions of keys along with deleted keys. What's left to miss?
But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.
Oh, I was not aware of that! If I am understanding correctly, every next date contains information about all versions of all keys and, unless GC picked any up, would be a superset of the prior ones. As we now have "trailing delete" enabled for S3, for the initial run we do not even want to go through prior snapshots -- we'd better start with the latest one, since it gives the best chance of having access to all versions of the keys. Great. But it might be useful to start from a few dates back, so we could test correct operation while going to the "next" inventory version, as we would need to do daily (that is the motivation here -- efficient daily backup of new changes, without explicitly listing the entire bucket every day).
Indeed. FWIW, aiming for incremental changes, I think we can minimize the amount of time during which an interruption would require such a full-blown rollback. E.g. if we separate out the analysis from the "fetching" and then the actual "doing" of changes.
WDYT? Maybe there are better ways?
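To make that staged idea more concrete, here is a minimal sketch of my own (not a settled design): the analysis decides what to fetch, everything is downloaded into a temporary staging area, and only at the end are changes applied with cheap renames, so an interruption before the final stage leaves the existing backup untouched. The `fetch` callable is a hypothetical helper:

```python
import shutil
import tempfile
from pathlib import Path

def apply_changes(backup_root: Path, to_fetch: dict[str, str], fetch) -> None:
    """Stage new object versions in a temporary area, then apply them.

    `to_fetch` maps object keys to version IDs; `fetch(key, version_id, dest)`
    is assumed to download that version of the key into `dest`.
    """
    staging = Path(tempfile.mkdtemp(dir=backup_root, prefix=".staging-"))
    try:
        # Fetch everything into the staging area first; an interruption
        # here leaves the real backup tree intact.
        staged = {}
        for key, version_id in to_fetch.items():
            dest = staging / key.replace("/", "_")
            fetch(key, version_id, dest)
            staged[key] = dest
        # Apply with renames only -- this is the short window during which
        # an interruption would leave the tree in a mixed state.
        for key, src in staged.items():
            final = backup_root / key
            final.parent.mkdir(parents=True, exist_ok=True)
            src.replace(final)
    finally:
        shutil.rmtree(staging, ignore_errors=True)
```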
@yarikoptic An alternative: Don't roll back on errors/Ctrl-C, just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by running the command with the date that the backup was previously at). Also, I've written some code for this already. Can a repository be created for me to upload the code to, and for individual discussion issues to be created in?
I have created https://github.com/dandi/s3invsync and made you admin; I do not have better preferences/ideas for a name. Re the alternative: it reads like what I suggested as the "rollback or match the prior state" feature above, and you had a legitimate concern about it -- or did you think back then that I was suggesting it as something to do right upon error/Ctrl-C? Also -- how would you know whether the current state of the backup is "legit" and not some partial one? I think we had better have an explicit annotation for that. In either case, I think that separating out the analysis from the "fetching" and then the actual "doing", as I suggested above, might help greatly in minimizing the time during which we could leave the backup in some incomplete/broken state. Don't you think so?
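One possible shape for such an explicit annotation (purely illustrative; the file name and fields here are made up, not an agreed format) would be a small state file at the root of the backup recording which inventory date is being applied and whether the last run completed:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = ".backup-state.json"  # hypothetical name

def mark_in_progress(backup_root: Path, inventory_date: str) -> None:
    # Written before any change is made, so an interrupted run is detectable.
    state = {
        "inventory_date": inventory_date,
        "status": "in-progress",
        "started": datetime.now(timezone.utc).isoformat(),
    }
    (backup_root / STATE_FILE).write_text(json.dumps(state, indent=2))

def mark_complete(backup_root: Path) -> None:
    path = backup_root / STATE_FILE
    state = json.loads(path.read_text())
    state["status"] = "complete"
    path.write_text(json.dumps(state, indent=2))

def is_partial(backup_root: Path) -> bool:
    path = backup_root / STATE_FILE
    return path.exists() and json.loads(path.read_text())["status"] != "complete"
```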
@yarikoptic Let's continue the discussion about error recovery in dandi/s3invsync#14. |
A solution (or part of it) for
We have the inventory, which gives us a transactional log of what has happened to the bucket. We have it dumped into the bucket as well:
(note: I do not think that the beginning date is the ultimate beginning of the bucket, unfortunately, but that is probably OK), where for each dated folder we have
and the manifest.json actually contains pointers to the listings of items in the bucket,
and those .csv.gz files are the compressed listings:
```
dandi@drogon:/tmp$ aws s3 cp s3://dandiarchive/dandiarchive/dandiarchive/data/15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz .
download: s3://dandiarchive/dandiarchive/dandiarchive/data/15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz to ./15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz
```
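For orientation, a minimal sketch of how a tool might walk from one dated manifest.json to its listed CSVs, assuming the standard S3 Inventory manifest layout with a "files" array (bucket and key taken from the example discussed above):

```python
import csv
import gzip
import io
import json

import boto3

BUCKET = "dandiarchive"
MANIFEST_KEY = "dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.json"

s3 = boto3.client("s3")
manifest = json.loads(
    s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
)
# Each entry in "files" points at one gzipped CSV chunk of the full listing.
for entry in manifest["files"]:
    body = s3.get_object(Bucket=BUCKET, Key=entry["key"])["Body"].read()
    with gzip.open(io.BytesIO(body), "rt", newline="") as fh:
        for row in csv.reader(fh):
            # Columns follow manifest["fileSchema"], typically starting with
            # bucket, key, version id, is-latest, ...
            pass  # process one inventory record here
```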
We need a tool which would efficiently download and then incrementally update a local backup of the bucket based on those logs. Some features to target/keep in mind:
one possible "approach" could be:

- for each `path`, keep a `.versions.json` (or alike) in that directory
- when a key gets a new version, look up the current `{etag}` and `{versionid}` in `.versions.json`, `mv` that `{path}` to `{path}.old.{versionid}.{etag}` (should preserve the original mtime), and remove the entry from `.versions.json`
- this way we could find any prior version of a key either in `.versions.json` or by looking through the `*.old.*` files for the path; the `mtime` (stored within the filesystem) would also let us prune "timed out" `*.old.*` files with a simple `find` command. It would require maintaining a `.versions.json` in each folder though.

There is also that `hive/` prefix, which I do not know yet what it is about.
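A minimal sketch of that per-directory bookkeeping, just to make the idea concrete (the `.versions.json` layout and the helper name are assumptions, not a settled format):

```python
import json
import shutil
from pathlib import Path

def retire_current_version(path: Path) -> None:
    """Move the current copy of `path` aside as {path}.old.{versionid}.{etag}.

    The directory's .versions.json is assumed to map file names to their
    current {"versionid": ..., "etag": ...} entries.
    """
    versions_file = path.parent / ".versions.json"
    versions = json.loads(versions_file.read_text()) if versions_file.exists() else {}
    entry = versions.pop(path.name, None)
    if entry is not None and path.exists():
        old = path.with_name(f"{path.name}.old.{entry['versionid']}.{entry['etag']}")
        # A rename/move keeps the original mtime, which later pruning relies on.
        shutil.move(path, old)
    versions_file.write_text(json.dumps(versions, indent=2))
```

Pruning of the retired copies could then indeed be done with something like `find . -name '*.old.*' -mtime +N -delete`, as suggested above.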
NB: ATM I am running a sync of the inventory under `drogon:/mnt/backup/dandi/dandiarchive-inventory`.