
gc: parallelize garbage collection #5961

Closed
isidentical opened this issue May 4, 2021 · 13 comments · Fixed by iterative/dvc-data#244
Labels
A: gc (Related to garbage collection) · performance (improvement over resource / time consuming tasks)

Comments

isidentical (Contributor) commented May 4, 2021

It seems like there is nothing blocking this, and for cloud providers this might mean up to a 16-20x speed-up (dvc gc -c). Just to give some numbers as motivation: removing ~1000 cache files from S3 takes about 20-25 minutes.
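
For context, here is a minimal sketch of what the parallelism could look like, assuming a generic fsspec-style filesystem object; the names here are placeholders, not DVC's actual implementation:

from concurrent.futures import ThreadPoolExecutor

def remove_parallel(fs, paths, jobs=16):
    # rm_file() removes a single object; running the calls in a thread pool
    # overlaps the per-request network latency instead of paying it serially.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(fs.rm_file, paths))

At 20-25 minutes for ~1000 files, each removal costs roughly 1.2-1.5 seconds, which is exactly the kind of per-request latency that 16-20 concurrent workers would hide.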

isidentical added the performance label May 4, 2021
isidentical self-assigned this May 4, 2021
skshetry (Member) commented May 4, 2021

@isidentical, is this required for fsspec? If not, I'd prefer that we prioritize #4218 over the performance work. I'm not saying performance shouldn't be a priority, and I don't want to discourage you, but I feel strongly that we should have a way to remove a file from the cache.

isidentical removed their assignment May 25, 2021
isidentical (Contributor, Author) commented

@skshetry, is #4218 a blocker for this issue? I don't have the whole context, but it seems related to the used-cache calculation, whereas this issue proposes speeding up the part after that (running the fs.remove() calls concurrently).

Another case we hit during my support duty was dvc gc -c being slow (taking 95+ hours): https://discord.com/channels/485586884165107732/563406153334128681/846717782145368134

skshetry (Member) commented

@skshetry is #4218 a blocker for this issue?

@isidentical, sorry for the confusion; no, it's not related. I was just trying to push #4218 up for prioritization. Please feel free to work on this. 🙂

Regarding #4218, gc as a whole has been stuck in discussion for more than a year (see #5928). Internally, it might already be possible to implement it (even more so after #6008 with the filter).

daavoo added the A: gc label May 5, 2022
SolomidHero commented

I have ~6M items in S3 storage and wanted to run dvc gc on them. It never finished; even the querying stage didn't complete.

pmrowla changed the title from "gc: pararellize garbage collection" to "gc: parallelize garbage collection" Nov 12, 2022
daavoo (Contributor) commented Dec 21, 2022

Coming from #8549.

I have faced similar performance issues (on the latest version). It gets worse with a larger number of files, but it is already a real pain at around 1000 files.

Even for just 100 files, there is a lot of added overhead:

  • AWS
# List
$ aws s3 ls s3://${BUCKET} --recursive
real    0m1.247s                                                                                                         
user    0m0.327s                                                                                                         
sys     0m0.072s
# Wipe
$ aws s3 rm s3://${BUCKET} --recursive
real    0m3.570s                                                                                                         
user    0m0.755s      
sys     0m0.128s
  • DVC

Note that in the reproduction script below I have tried to minimize the overhead from unrelated operations (i.e. local gc, etc.); you can verify in the attached profile that most of the time is spent on our equivalents of ls and rm.

$  time dvc gc -f -w -c
real    2m52.810s
user    0m5.885s
sys     0m0.790s
Reproduction script
import random
from pathlib import Path

data = Path("data")
data.mkdir(exist_ok=True)

for i in range(100):
    n = random.random()
    file = data / str(n)
    file.write_text(str(random.random()))
#!/bin/bash
BUCKET=diglesia-gc-testing
aws s3 rm s3://$BUCKET --recursive
rm -rf tmp
mkdir tmp

cp create_data.py tmp
cd tmp

git init
dvc init
dvc remote add -d myremote s3://$BUCKET            
git add .
git commit -m "init"

python create_data.py
dvc add data
# Just because it is faster than push
aws s3 cp .dvc/cache s3://$BUCKET --recursive
# Don't spend time on local gc
rm -rf data
rm -rf .dvc/cache

python create_data.py
dvc add data
git add data.dvc
git commit -m "track data"

time dvc gc -f -w -c
dvc doctor
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.5 on macOS-10.16-x86_64-i386-64bit
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.2.2
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Viztracer profile (see https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-viztracer-data):

viztracer.dvc-20221221_185659.json.zip

daavoo (Contributor) commented Dec 27, 2022

Using iterative/dvc-data#244, for 1000 files the overhead is significantly reduced:

$ time dvc gc -f -w -c
real    18m54.253s
$ time dvc gc -f -w -c
real    8m19.650s

However, there is still ridiculous overhead compared to the aws CLI:

$ time aws s3 ls s3://${BUCKET} --recursive
real    0m2.702s
$ time aws s3 rm s3://${BUCKET} --recursive
real    0m23.851s

For the rm part, checking the profiles shows that all overhead is introduced by _expand_path in https://github.com/fsspec/s3fs/blob/5917684f96226ee855921d1c614d21cc46b3edeb/s3fs/core.py#L1836 (which in this case for us is just a very expensive noop)

daavoo (Contributor) commented Dec 28, 2022

For the rm part, checking the profiles shows that all overhead is introduced by _expand_path in https://github.com/fsspec/s3fs/blob/5917684f96226ee855921d1c614d21cc46b3edeb/s3fs/core.py#L1836 (which in this case for us is just a very expensive noop)

A quick-and-dirty change removing the _expand_path calls (which in our case have no effect) makes our rm operation on par with aws s3 rm, so no overhead is introduced.

We still need to discuss after the vacation how to properly address this in fsspec, but IMO it makes sense to be able to bypass _expand_path if you know you don't need it (as in our use case); a sketch of the idea is below.
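
As an illustration of that idea, here is a minimal sketch (not DVC's or s3fs's actual code) of deleting a list of already-known keys with S3 batch delete and no per-path expansion, using boto3 directly; the bucket name and keys are placeholders:

import boto3

def batch_delete(bucket, keys, chunk_size=1000):
    s3 = boto3.client("s3")
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        # delete_objects accepts up to 1000 keys per request, so the whole
        # removal costs about len(keys) / 1000 requests instead of one per key.
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in chunk], "Quiet": True},
        )

Since gc already knows the exact object keys it wants to remove, nothing needs to be globbed, listed, or existence-checked before the delete requests go out.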

All of the remaining overhead now comes from our ls vs aws s3 ls; I will look into it if I find some time.

daavoo (Contributor) commented Dec 28, 2022

I checked the ls part.

I am very confused about the usage of _list_oids_traverse inside odb.all and why it is so slow. Either something is just wrong with the mix of ThreadPool and async fsspec, or there is some scenario where it's faster but I couldn't find it.

Replacing _list_oids_traverse with _list_oids (the current behavior for "small" remotes) makes the operation much faster for all the remote sizes I have tried (and with different numbers of jobs for _list_oids_traverse), even though it doesn't use multiple threads 🤷

Not using _list_oids_traverse in odb.all makes our gc -c fully on par with aws s3 ls + aws s3 rm.

For the same 1000 files setup above:

$ time dvc gc -f -w -c
real    0m7.169s

I think all the changes mentioned above make sense in general for any remote.

We need to properly discuss how to integrate them into fsspec (not calling _expand_path when not needed) and dvc-objects (_list_oids is just much faster than _list_oids_traverse); a rough sketch of the two listing strategies is below.
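
For reference, a rough sketch of the two listing strategies being compared, using generic fsspec calls; the prefix layout and job count are placeholders and this is not dvc-objects' actual implementation:

from concurrent.futures import ThreadPoolExecutor

def list_flat(fs, prefix):
    # One recursive find(); the backend pages through the results itself.
    return fs.find(prefix)

def list_traverse(fs, prefix, jobs=16):
    # One find() per two-character sub-prefix (00..ff), fanned out to threads.
    subprefixes = [f"{prefix}/{i:02x}" for i in range(256)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(fs.find, subprefixes)
    return [path for chunk in results for path in chunk]

On an object store like S3 the flat listing is already paginated (up to 1000 keys per LIST request), so splitting it into hundreds of per-prefix requests mostly adds request overhead plus thread/async contention, which may explain why the traversal ends up slower here.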

dberenbaum (Collaborator) commented Jan 3, 2023

@daavoo Do you know how performance is on other filesystems or clouds?

Edit: From a quick look, the base AsyncFileSystem in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py uses similar logic for _rm and _expand_path, so any other fs that inherits from it is likely to behave similarly. It seems _expand_path will call at least _find and _exists, which I would guess are both relatively expensive operations on cloud.
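
To make that cost concrete, here is a rough caricature (not fsspec's actual code) of the pattern described above: if the generic path-expansion step checks each input path remotely before deleting, every path pays at least one extra round trip even when the caller already knows the exact keys:

def expand_before_delete(fs, paths):
    # Each path triggers its own remote metadata call before any delete
    # happens, so expansion alone costs on the order of len(paths) requests.
    for p in paths:
        if not fs.exists(p):
            raise FileNotFoundError(p)
    return paths

For a gc run over 1000 objects, that is roughly 1000 extra metadata requests before the first delete is issued.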

daavoo (Contributor) commented Jan 4, 2023

@daavoo Do you know how performance is on other filesystems or clouds?

There are 3 changes:

  1. Use batch delete (gc: Pass list of paths to fs.remove, dvc-data#244).

This depends on whether the underlying filesystem implements "batch delete" or not.
For example, s3fs does implement it, so the improvements are significant.
For filesystems that just do a for loop, it won't make much of a difference.

  2. Avoid unnecessary calls to expand_path.

This is what you commented on in the edit.

It affects all filesystems following the official spec. The impact basically grows with the number of objects passed; it is not only about the call being expensive on cloud but also about it blocking the async thread.

  3. Avoid _list_oids_traverse (_list_oids_traverse is much slower than _list_oids, dvc-objects#178).

This affects all filesystems, as it's on our side.


Did a quick test for Azure, which doesn't implement bulk delete and just does a plain for loop.

For 1000 files:

  • Before
$ time dvc gc -f -w -c
real    9m48.269s
  • After
$ time dvc gc -f -w -c
real    5m02.686s

Given that there are no gains from "batch delete" here, all of the improvement comes from points 2 and 3.

dberenbaum (Collaborator) commented

@daavoo Could you compare those results to az cli?

daavoo (Contributor) commented Jan 10, 2023

@daavoo Could you compare those results to az cli?

  • AZ
# LS
$ az storage blob list -c gc-testing
real    0m2.078s
# RM
$ az storage blob delete-batch -s gc-testing
real    0m17.557s
  • DVC

This uses an unmerged patch I have sent upstream: fsspec/adlfs#383

$ dvc gc -f -w -c
real    0m3.909s

I think it's actually faster because I might be misusing the az CLI and/or the upstream patch used by DVC handles concurrency better.

dberenbaum (Collaborator) commented

So my takeaway is that bulk delete matters a lot 😄
