
s5cmd sync -- any way for efficient --incremental ? #746

Open
yarikoptic opened this issue Jul 30, 2024 · 7 comments

@yarikoptic

We have a big bucket (TBs of data, but also billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.

Is there some way to make sync efficient by utilizing some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state? (i.e., just get the bunch of keys which were added or modified, plus delete markers for deleted keys)

@kabilar

kabilar commented Oct 25, 2024

Hi @yarikoptic, it looks like this may have been addressed previously (issues: #441, #447; fix: #483) and tested with use cases of millions of files.

cc @puja-trivedi @aaronkanzer

@yarikoptic
Author

  • the code pointed to there implies listing and sorting the full bucket/destination first, not operating on "incremental"s
  • those issues talk about thousands up to a million files, and the concern there is sorting that many keys
  • I am talking about hundreds of millions to a few billion keys (that is where we are in DANDI ATM). Hence any solution requiring a full listing first is doomed to be inefficient and expensive (if someone has to pay for those listing queries).

@kucukaslan
Contributor

s5cmd doesn't support it at the moment.

A straightforward workaround would be to manually group keys (by directory prefix, if I may say) and run sync for each group (subset) of the bucket; see the sketch below.
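
A minimal sketch of that per-prefix approach, assuming the bucket splits cleanly at the top level (the prefix names below are made up; discover the real ones first, e.g. with `s5cmd ls s3://dandiarchive/`):

```sh
#!/bin/sh
# Hypothetical sketch: sync the bucket one top-level prefix at a time,
# so a failure only requires re-running the affected subset.
for prefix in blobs zarr dandisets; do   # placeholder prefix names
    s5cmd sync "s3://dandiarchive/${prefix}/*" "dandiarchive/${prefix}/" ||
        echo "sync failed for ${prefix}; re-run just this subset" >&2
done
```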

@yarikoptic
Author

> A straightforward workaround would be to manually group keys (by directory prefix, if I may say) and run sync for each group (subset) of the bucket.

That would in no way change the total number of keys being listed (e.g. 1 billion), even in the case when none have changed since the last run.

@kucukaslan
Contributor

Right, it wouldn't change the number of keys being listed. But it would help avoid sorting such a long list, which is a performance bottleneck.

Moreover, if this is going to be done once, calling s5cmd for each subset of directories seems an acceptable compromise to me. If s5cmd fails for any reason, you'd only need to re-run it for the corresponding subset of directories instead of the full bucket, which will hopefully reduce the total number of list requests. The incrementality would be provided manually.

It is also necessary to make list requests, at least once, to find out what is in the source & destination buckets. IIRC every list request returns up to 1,000 keys, and 1,000 LIST requests cost $0.005. So for a billion objects the listing cost would be about $5.
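
As a quick sanity check of that estimate (assuming the usual 1,000-keys-per-page LIST limit and $0.005 per 1,000 requests):

```sh
# 10^9 keys / 1,000 keys per request = 10^6 LIST requests
# 10^6 requests * ($0.005 per 1,000 requests) = $5
echo "1000000000 / 1000 / 1000 * 0.005" | bc -l   # -> 5.000
```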

> Is there some way to make sync efficient by utilizing some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state?

s5cmd doesn't store a restorable state/log, so that is not possible for now.
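
Since there is no stored state, one could approximate the incrementality by keeping the previous listing oneself, at the price of a full LIST on every run. A rough sketch (file names are made up; note that `s5cmd ls` output includes date/size, so a modified key simply shows up as a changed line):

```sh
#!/bin/sh
# Still pays for a full LIST of the bucket every run; it only tells you
# which keys to act on, instead of re-checking everything during sync.
s5cmd ls 's3://dandiarchive/*' | sort > listing.new
# lines only in the new listing = added keys, or keys whose size/mtime changed
comm -13 listing.old listing.new > added-or-modified.txt
# lines only in the old listing = deleted keys (or the old record of modified ones)
comm -23 listing.old listing.new > deleted-or-old.txt
mv listing.new listing.old
```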

> just get the bunch of keys which were added or modified, plus delete markers for deleted keys

This might have been easier if the AWS S3 API had a way to send a "modified since" option, but there isn't one AFAICS:
https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#ListObjectsV2Input
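
For illustration: absent such an option, the closest workaround is to list everything and filter on LastModified client-side, which still incurs the full listing cost. A sketch using the AWS CLI's JMESPath `--query` (bucket name and date are placeholders; the filter is applied locally by the CLI, not by S3):

```sh
# The server still returns ALL pages; --query only trims the local output,
# so this saves output volume, not LIST request cost.
aws s3api list-objects-v2 \
    --bucket dandiarchive \
    --query 'Contents[?LastModified>=`2024-10-01`].Key' \
    --output text
```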

@yarikoptic
Author

yarikoptic commented Oct 29, 2024

yeap -- a "modified since" option would have been perfect! A real shame they didn't provide it.

> Moreover, if this is going to be done once

well, the idea is to do it once a day or so ;)

FWIW, on my initial attempt on our bucket, without any fancy manual "splitting", I interrupted the `--dry-run sync` after about 8 hours, with the process reaching 76GB of virtual memory utilization:
```
dandi@drogon:~/proj/s5cmd-dandi$ duct ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
2024-10-28T11:08:51-0400 [INFO    ] con-duct: duct is executing '../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/'...
2024-10-28T11:08:51-0400 [INFO    ] con-duct: Log files will be written to .duct/logs/2024.10.28T11.08.51-2733714_
2024-10-28T18:33:10-0400 [INFO    ] con-duct: Summary:
Exit Code: 137
Command: ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.28T11.08.51-2733714_
Wall Clock Time: 26657.861 sec
Memory Peak Usage (RSS): 64.5 GB
Memory Average Usage (RSS): 27.8 GB
Virtual Memory Peak Usage (VSZ): 76.4 GB
Virtual Memory Average Usage (VSZ): 30.3 GB
Memory Peak Percentage: 95.7%
Memory Average Percentage: 41.20029702206471%
CPU Peak Usage: 100.0%
Average CPU Usage: 52.88295787687072%
```

PS edit: for the fun of it, I will now run it on a box with 1TB of RAM to see if it ever completes -- and how long it would take ;-)

@yarikoptic
Author

FTR: that dry run finished, listing about 375M keys for the "dry" cp in 225841.831 sec (nearly 63 hours, so over 2.5 days):
```
2024-11-02T05:43:28-0400 [INFO    ] con-duct: Summary:
Exit Code: 0
Command: ../../s5cmd/s5cmd --dry-run --log debug sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.30T14.59.27-418623_
Wall Clock Time: 225841.831 sec
Memory Peak Usage (RSS): 2.4 GB
Memory Average Usage (RSS): 966.7 MB
Virtual Memory Peak Usage (VSZ): 10.6 GB
Virtual Memory Average Usage (VSZ): 5.8 GB
Memory Peak Percentage: 0.2%
Memory Average Percentage: 0.059080347653249383%
CPU Peak Usage: 667.0%
Average CPU Usage: 413.55872652902883%

[INFO   ] == Command exit (modification check follows) =====
run(ok): /home/yoh/proj/dandi/s5cmd-dandi (dataset) [duct ../../s5cmd/s5cmd --dry-run --log d...]
add(ok): .duct/logs/2024.10.30T14.59.27-418623_info.json (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stderr (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stdout (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_usage.json (file)
save(ok): . (dataset)
yoh@typhon:~/proj/dandi/s5cmd-dandi$ wc -l .duct/logs/2024.10.30T14.59.27-418623_stdout
375455768 .duct/logs/2024.10.30T14.59.27-418623_stdout
yoh@typhon:~/proj/dandi/s5cmd-dandi$ duct --version
duct 0.8.0
```
