s5cmd sync -- any way for efficient --incremental ? #746
Hi @yarikoptic, it looks like this may have been previously addressed (issues: #441, #447; fix: #483) and tested with use cases of millions of files.
s5cmd doesn't support it at the moment. A straightforward workaround would be manually grouping keys (by directories, if I may say) and running sync for each group (subset) of the bucket.
That would in no way change the total number of keys being listed (e.g. 1 billion), even in the case when none have changed since the last run.
Right, it wouldn't change the number of keys being listed. But it would help avoid sorting such a long list, which is a performance bottleneck. Moreover, if this is only going to be done once, calling s5cmd for each subset of directories seems an acceptable compromise to me. If s5cmd fails for any reason, you'd only need to rerun it for the corresponding subset of directories instead of the full bucket, which will hopefully reduce the total number of list requests. The incrementality would be provided manually. It is also necessary to make list requests, at least once, to find out what is in the source and destination buckets. IIRC every list request returns up to 1,000 keys, and 1,000 List requests cost $0.005, so for a billion objects the listing cost will be about $5 (1 billion keys / 1,000 keys per request = 1,000,000 requests; 1,000,000 / 1,000 × $0.005 = $5).
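For illustration, here is a minimal sketch of that manual grouping, assuming the bucket has a small set of known top-level "directories"; the bucket name, local destination, and prefix names below are made-up placeholders, not taken from this thread:

```bash
#!/usr/bin/env bash
# Sketch of the per-prefix workaround described above.
# Assumption: you know (or have listed by hand) the top-level prefixes;
# all names and paths here are placeholders.
set -euo pipefail

BUCKET="s3://my-bucket"          # placeholder bucket
DEST="/data/my-bucket-mirror"    # placeholder local destination

for prefix in raw derived logs; do   # placeholder prefixes
    mkdir -p "${DEST}/${prefix}"
    # Each invocation lists, sorts, and diffs only this subset of keys,
    # so a failure means re-running just this prefix, not the whole bucket.
    s5cmd sync "${BUCKET}/${prefix}/*" "${DEST}/${prefix}/"
done
```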
s5cmd doesn't store a restorable state/log, so not possible for now.
This might have been easier if the AWS S3 API had a way to send a "modified since" option, but there isn't one AFAICS.
yeap -- "modified since" would have been perfect! Really a shame they didn't provide it.
Well, the idea is to do it once a day or so ;) FWIW, here is my initial attempt on our bucket without doing any fancy manual "splitting" -- I interrupted the `--dry-run sync` after about 8 hours, with the process reaching 76GB of virtual memory utilization:
dandi@drogon:~/proj/s5cmd-dandi$ duct ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
2024-10-28T11:08:51-0400 [INFO ] con-duct: duct is executing '../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/'...
2024-10-28T11:08:51-0400 [INFO ] con-duct: Log files will be written to .duct/logs/2024.10.28T11.08.51-2733714_
2024-10-28T18:33:10-0400 [INFO ] con-duct: Summary:
Exit Code: 137
Command: ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.28T11.08.51-2733714_
Wall Clock Time: 26657.861 sec
Memory Peak Usage (RSS): 64.5 GB
Memory Average Usage (RSS): 27.8 GB
Virtual Memory Peak Usage (VSZ): 76.4 GB
Virtual Memory Average Usage (VSZ): 30.3 GB
Memory Peak Percentage: 95.7%
Memory Average Percentage: 41.20029702206471%
CPU Peak Usage: 100.0%
Average CPU Usage: 52.88295787687072%
PS edit: for the fun of it, I will now run it on a box with 1TB of RAM to see if it ever completes, and how long it takes ;-)
FTR: that dry run finished, listing about 375M keys for the "dry" cp in 225841.831 sec (about 62 hours, i.e. over 2.5 days):
2024-11-02T05:43:28-0400 [INFO ] con-duct: Summary:
Exit Code: 0
Command: ../../s5cmd/s5cmd --dry-run --log debug sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.30T14.59.27-418623_
Wall Clock Time: 225841.831 sec
Memory Peak Usage (RSS): 2.4 GB
Memory Average Usage (RSS): 966.7 MB
Virtual Memory Peak Usage (VSZ): 10.6 GB
Virtual Memory Average Usage (VSZ): 5.8 GB
Memory Peak Percentage: 0.2%
Memory Average Percentage: 0.059080347653249383%
CPU Peak Usage: 667.0%
Average CPU Usage: 413.55872652902883%
[INFO ] == Command exit (modification check follows) =====
run(ok): /home/yoh/proj/dandi/s5cmd-dandi (dataset) [duct ../../s5cmd/s5cmd --dry-run --log d...]
add(ok): .duct/logs/2024.10.30T14.59.27-418623_info.json (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stderr (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stdout (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_usage.json (file)
save(ok): . (dataset)
yoh@typhon:~/proj/dandi/s5cmd-dandi$ wc -l .duct/logs/2024.10.30T14.59.27-418623_stdout
375455768 .duct/logs/2024.10.30T14.59.27-418623_stdout
yoh@typhon:~/proj/dandi/s5cmd-dandi$ duct --version
duct 0.8.0
We have a big bucket (TBs in size, and billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.
Is there some way to make sync efficient by utilizing some S3 API (or an extra AWS service), so that it avoids listing/going through all the keys it saw on the previous run and instead operates on a "log" of the diff since the prior state? (i.e. just get the set of keys which were added or modified, plus delete markers for deleted keys)