
s5cmd sync -- any way for efficient --incremental ? #746

Open
yarikoptic opened this issue Jul 30, 2024 · 7 comments

@yarikoptic

We have a big bucket (TBs of data, but also billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.

Is there some way to make sync efficient by utilizing some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state? (i.e., just get the bunch of keys which were added or modified, plus delete markers for deleted keys)

@kabilar

kabilar commented Oct 25, 2024

Hi @yarikoptic, it looks like this may have been addressed previously (issues: #441, #447; fix: #483) and tested with use cases of millions of files.

cc @puja-trivedi @aaronkanzer

@yarikoptic
Author

  • the code pointed to there implies listing and sorting the full bucket/destination first, not operating on "incremental"s
  • those issues talk about thousands up to a million files, and the concern there is sorting that many keys
  • I am talking about hundreds of millions to a few billion keys (that is where we are in DANDI ATM). Hence any solution requiring a full listing first is doomed to be inefficient and expensive (if someone has to pay for those listing queries).

@kucukaslan
Contributor

s5cmd doesn't support it at the moment.

A straightforward workaround would be to manually group keys (by directory prefix, if I may say) and run sync for each group (subset) of the bucket; see the sketch below.
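
A minimal sketch of that per-prefix approach, assuming the bucket splits cleanly at the top level (the prefix names below are made up; discover the real ones first, e.g. with `s5cmd ls s3://dandiarchive/`):

```sh
#!/bin/sh
# Hypothetical sketch: sync the bucket one top-level prefix at a time,
# so a failure only requires re-running the affected subset.
for prefix in blobs zarr dandisets; do   # placeholder prefix names
    s5cmd sync "s3://dandiarchive/${prefix}/*" "dandiarchive/${prefix}/" ||
        echo "sync failed for ${prefix}; re-run just this subset" >&2
done
```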

@yarikoptic
Author

> A straightforward workaround would be to manually group keys (by directory prefix, if I may say) and run sync for each group (subset) of the bucket.

That would in no way change the total number of keys being listed (e.g. 1 billion), even in the case when none have changed since the last run.

@kucukaslan
Contributor

Right, it wouldn't change the number of keys being listed. But it would help avoid sorting such a long list, which is a performance bottleneck.

Moreover, if this is going to be done once, calling s5cmd for each subset of directories seems an acceptable compromise to me. If s5cmd fails for any reason, you'd only need to re-run it for the corresponding subset of directories instead of the full bucket, which will hopefully reduce the total number of list requests. The incrementality would be provided manually.

It is also necessary to make list requests, at least once, to find out what is in the source & destination buckets. IIRC every list request returns up to 1,000 keys, and 1,000 LIST requests cost $0.005. So for a billion objects the listing cost would be about $5.
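
As a quick sanity check of that estimate (assuming the usual 1,000-keys-per-page LIST limit and $0.005 per 1,000 requests):

```sh
# 10^9 keys / 1,000 keys per request = 10^6 LIST requests
# 10^6 requests * ($0.005 per 1,000 requests) = $5
echo "1000000000 / 1000 / 1000 * 0.005" | bc -l   # -> 5.000
```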

> Is there some way to make sync efficient by utilizing some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state?

s5cmd doesn't store a restorable state/log, so that is not possible for now.
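
Since there is no stored state, one could approximate the incrementality by keeping the previous listing oneself, at the price of a full LIST on every run. A rough sketch (file names are made up; note that `s5cmd ls` output includes date/size, so a modified key simply shows up as a changed line):

```sh
#!/bin/sh
# Still pays for a full LIST of the bucket every run; it only tells you
# which keys to act on, instead of re-checking everything during sync.
s5cmd ls 's3://dandiarchive/*' | sort > listing.new
# lines only in the new listing = added keys, or keys whose size/mtime changed
comm -13 listing.old listing.new > added-or-modified.txt
# lines only in the old listing = deleted keys (or the old record of modified ones)
comm -23 listing.old listing.new > deleted-or-old.txt
mv listing.new listing.old
```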

> just get the bunch of keys which were added or modified, plus delete markers for deleted keys

This might have been easier if the AWS S3 API had a way to send a "modified since" option, but there isn't one AFAICS:
https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#ListObjectsV2Input
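
For illustration: absent such an option, the closest workaround is to list everything and filter on LastModified client-side, which still incurs the full listing cost. A sketch using the AWS CLI's JMESPath `--query` (bucket name and date are placeholders; the filter is applied locally by the CLI, not by S3):

```sh
# The server still returns ALL pages; --query only trims the local output,
# so this saves output volume, not LIST request cost.
aws s3api list-objects-v2 \
    --bucket dandiarchive \
    --query 'Contents[?LastModified>=`2024-10-01`].Key' \
    --output text
```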

@yarikoptic
Author

yarikoptic commented Oct 29, 2024

yeap -- a "modified since" option would have been perfect! A real shame they didn't provide it.

> Moreover, if this is going to be done once

well, the idea is to do it once a day or so ;)

FWIW, on my initial attempt on our bucket, without any fancy manual "splitting", I interrupted the `--dry-run sync` after about 8 hours, with the process reaching 76GB of virtual memory utilization:
```
dandi@drogon:~/proj/s5cmd-dandi$ duct ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
2024-10-28T11:08:51-0400 [INFO    ] con-duct: duct is executing '../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/'...
2024-10-28T11:08:51-0400 [INFO    ] con-duct: Log files will be written to .duct/logs/2024.10.28T11.08.51-2733714_
2024-10-28T18:33:10-0400 [INFO    ] con-duct: Summary:
Exit Code: 137
Command: ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.28T11.08.51-2733714_
Wall Clock Time: 26657.861 sec
Memory Peak Usage (RSS): 64.5 GB
Memory Average Usage (RSS): 27.8 GB
Virtual Memory Peak Usage (VSZ): 76.4 GB
Virtual Memory Average Usage (VSZ): 30.3 GB
Memory Peak Percentage: 95.7%
Memory Average Percentage: 41.20029702206471%
CPU Peak Usage: 100.0%
Average CPU Usage: 52.88295787687072%
```

PS edit: for the fun of it, I will now run it on a box with 1TB of RAM to see if it ever completes -- and how long it would take ;-)

@yarikoptic
Author

FTR: that dry run finished, listing about 375M keys for the "dry" cp in 225841.831 sec (nearly 63 hours, so over 2.5 days):
```
2024-11-02T05:43:28-0400 [INFO    ] con-duct: Summary:
Exit Code: 0
Command: ../../s5cmd/s5cmd --dry-run --log debug sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.30T14.59.27-418623_
Wall Clock Time: 225841.831 sec
Memory Peak Usage (RSS): 2.4 GB
Memory Average Usage (RSS): 966.7 MB
Virtual Memory Peak Usage (VSZ): 10.6 GB
Virtual Memory Average Usage (VSZ): 5.8 GB
Memory Peak Percentage: 0.2%
Memory Average Percentage: 0.059080347653249383%
CPU Peak Usage: 667.0%
Average CPU Usage: 413.55872652902883%

[INFO   ] == Command exit (modification check follows) =====
run(ok): /home/yoh/proj/dandi/s5cmd-dandi (dataset) [duct ../../s5cmd/s5cmd --dry-run --log d...]
add(ok): .duct/logs/2024.10.30T14.59.27-418623_info.json (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stderr (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stdout (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_usage.json (file)
save(ok): . (dataset)
yoh@typhon:~/proj/dandi/s5cmd-dandi$ wc -l .duct/logs/2024.10.30T14.59.27-418623_stdout
375455768 .duct/logs/2024.10.30T14.59.27-418623_stdout
yoh@typhon:~/proj/dandi/s5cmd-dandi$ duct --version
duct 0.8.0
```
