gc: parallelize garbage collection #5961
@isidentical, is this required for …
@skshetry is #4218 a blocker for this issue? I don't have the whole context, though it seems like something relevant to the used-cache calculation, and what this issue proposes is speeding up the part after that (running …). Another occurrence we hit during my support duty is …
@isidentical, sorry for misleading, no it's not related. I was just trying to push #4218 up for prioritization. Please feel free to work on this. 🙂 Regarding #4218, the …
I have 6M items on s3 storage and wanted to …
Coming from #8549. I have faced similar performance issues (latest version). It gets worse with a larger number of files, but it is already a real pain around the order of 1000. Even for just 100 files, there is too much overhead added:
```
# List
$ aws s3 ls s3://${BUCKET} --recursive

real 0m1.247s
user 0m0.327s
sys 0m0.072s

# Wipe
$ aws s3 rm s3://${BUCKET} --recursive

real 0m3.570s
user 0m0.755s
sys 0m0.128s
```
Note: in the reproduction script below I have tried to minimize the overhead from unrelated operations (i.e. local gc, etc.); you can verify in the attached profile that most of the time is spent on our equivalent of the `rm`:

```
$ time dvc gc -f -w -c
real 2m52.810s
user 0m5.885s
sys 0m0.790s
```

Reproduction script:

`create_data.py`:

```python
import random
from pathlib import Path

data = Path("data")
data.mkdir(exist_ok=True)

for i in range(100):
    n = random.random()
    file = data / str(n)
    file.write_text(str(random.random()))
```

```bash
#!/bin/bash
BUCKET=diglesia-gc-testing
aws s3 rm s3://$BUCKET --recursive
rm -rf tmp
mkdir tmp
cp create_data.py tmp
cd tmp
git init
dvc init
dvc remote add -d myremote s3://$BUCKET
git add .
git commit -m "init"
python create_data.py
dvc add data
# Just because it is faster than push
aws s3 cp .dvc/cache s3://$BUCKET --recursive
# Don't spend time on local gc
rm -rf data
rm -rf .dvc/cache
python create_data.py
dvc add data
git add data.dvc
git commit -m "track data"
time dvc gc -f -w -c
```

dvc doctor output and a Viztracer profile (see https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-viztracer-data) are attached.
Using iterative/dvc-data#244, for 1000 files the overhead is significantly reduced:

```
# Before
$ time dvc gc -f -w -c

real 18m54.253s

# After
$ time dvc gc -f -w -c

real 8m19.650s
```

However, there is still a ridiculous overhead compared to:

```
$ time aws s3 ls s3://${BUCKET} --recursive

real 0m2.702s

$ time aws s3 rm s3://${BUCKET} --recursive

real 0m23.851s
```

For the …
A quick and dirty change of removing … Still need to discuss after vacation how to properly address it in fsspec, but IMO it makes sense to be able to bypass … All the overhead left now comes from our …
I checked the … I am very confused about the usage of `_list_oids_traverse` inside …

- Replacing …
- Not using …

For the same 1000 files setup above:

```
$ time dvc gc -f -w -c
real 0m7.169s
```

I think all the changes mentioned above make sense in general for any remote. Need to properly discuss how to integrate them in …
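For illustration, here is a rough sketch of what a "traverse"-style listing looks like with fsspec: list the whole remote prefix with a single recursive call (the moral equivalent of `aws s3 ls --recursive`) instead of issuing one request per oid. The filesystem, prefix, and two-level path layout below are assumptions for the example, not the actual dvc-data internals behind `_list_oids_traverse`:

```python
import fsspec

fs = fsspec.filesystem("s3")       # hypothetical remote filesystem (needs s3fs)
prefix = "my-bucket/files/md5"     # hypothetical cache prefix

# Per-oid existence checks: one request per object, O(N) round-trips.
# present = {oid for oid in oids if fs.exists(f"{prefix}/{oid[:2]}/{oid[2:]}")}

# "Traverse": a single recursive listing of the prefix, then local set math.
remote_oids = {
    path[len(prefix) + 1 :].replace("/", "")   # "ab/cdef..." -> "abcdef..."
    for path in fs.find(prefix)
}
```

The second form is what brings the listing cost close to a plain `aws s3 ls --recursive`.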
@daavoo Do you know how performance is on other filesystems or clouds?

Edit: From a quick look, the base AsyncFileSystem in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py uses similar logic for …
There are 3 changes:

1. This depends on whether the underlying filesystem implements "batch delete" or not.
2. What you commented on in the edit. It affects all filesystems following the official spec. The impact basically grows with the number of objects passed; it is not only about being expensive in the cloud but also about blocking the async thread.
3. Affects all filesystems, as it's on our side.

Did a quick test for Azure, which doesn't implement bulk delete and just does a plain … For 1000 files:
```
# Before
$ time dvc gc -f -w -c

real 9m48.269s

# After
$ time dvc gc -f -w -c

real 5m02.686s
```

Given that there are no gains from "batch delete", all the overhead was introduced by points 2 and 3.
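To make point 2 more concrete, here is a simplified sketch of the kind of pattern being described (all names are illustrative; this is not fsspec's actual implementation): one delete coroutine is scheduled per object, so the total work grows with the number of paths passed and runs on the event-loop thread, even when the calls are gathered in fixed-size batches.

```python
import asyncio

async def _rm_file(path: str) -> None:
    # stand-in for a single-object delete (one HTTP DELETE per path)
    await asyncio.sleep(0.05)

async def rm_all(paths: list[str], batch_size: int = 128) -> None:
    # one coroutine per object, gathered in fixed-size batches; the total
    # work still scales linearly with the number of objects passed
    for i in range(0, len(paths), batch_size):
        await asyncio.gather(*(_rm_file(p) for p in paths[i : i + batch_size]))

asyncio.run(rm_all([f"files/md5/{i:032x}" for i in range(1000)]))
```

Even with this concurrency, a backend without bulk delete still pays one request per object, which is where change 1 comes in.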
@daavoo Could you compare those results to az cli?
```
# LS
$ az storage blob list -c gc-testing

real 0m2.078s

# RM
$ az storage blob delete-batch -s gc-testing

real 0m17.557s
```
This uses an unmerged patch I have sent upstream, fsspec/adlfs#383:

```
$ dvc gc -f -w -c

real 0m3.909s
```

I think it's actually faster because I might be misusing …
So my takeaway is that bulk delete matters a lot 😄
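As a concrete illustration of why bulk delete matters on S3 (a hedged sketch, not DVC's code; the bucket name and key layout are made up): the DeleteObjects API accepts up to 1,000 keys per request, so a cleanup that would otherwise cost one round-trip per object collapses into a handful of requests.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                                   # hypothetical bucket
keys = [f"files/md5/{i:032x}" for i in range(1000)]    # hypothetical cache keys

# Per-object delete: one API round-trip per key.
# for key in keys:
#     s3.delete_object(Bucket=bucket, Key=key)

# Bulk delete: up to 1000 keys per DeleteObjects request.
for i in range(0, len(keys), 1000):
    batch = keys[i : i + 1000]
    s3.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
    )
```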
It seems like there is nothing blocking this, and for cloud providers, this might mean up to a 16-20x speedup (`dvc gc -c`). Just for some numbers as motivation: removing ~1000 cache files from S3 takes about 20-25 minutes.
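A minimal sketch of the parallelization this issue asks for (not DVC's actual implementation; `fs` is assumed to be an fsspec-style filesystem exposing a per-file `rm_file`, and `jobs` is an illustrative default):

```python
from concurrent.futures import ThreadPoolExecutor

def remove_unused(fs, paths, jobs=16):
    """Remove the given remote paths concurrently instead of one by one."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # consuming the iterator re-raises any exception from rm_file
        list(pool.map(fs.rm_file, paths))
```

The same idea applies per batch when the backend also offers a bulk-delete call.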