Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: loading dirs is exteremely slow because of PathInfo #3635

Closed
efiop opened this issue Apr 15, 2020 · 5 comments · Fixed by #3672
Closed

remote: loading dirs is exteremely slow because of PathInfo #3635

efiop opened this issue Apr 15, 2020 · 5 comments · Fixed by #3672
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks

Comments

@efiop
Copy link
Contributor

efiop commented Apr 15, 2020

This https://github.com/iterative/dvc/blob/0.93.0/dvc/remote/base.py#L290 takes ~50 seconds compared to ~0.4 for simple replace() that we used to have on our imagenet dataset.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Apr 15, 2020
@efiop efiop added bug Did we break something? p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks labels Apr 15, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Apr 15, 2020
@efiop efiop self-assigned this Apr 15, 2020
@efiop
Copy link
Contributor Author

efiop commented Apr 15, 2020

Actually, same with dumping. PathInfo is extremely slow when we need lots of it.

efiop added a commit to efiop/dvc that referenced this issue Apr 15, 2020
Using it to load dirs takes around 50 sec compared to 0.4 with simple
`replace()`. Path objects are great, but not when you need lots of them.

Related to iterative#3635
efiop added a commit that referenced this issue Apr 15, 2020
Using it to load dirs takes around 50 sec compared to 0.4 with simple
`replace()`. Path objects are great, but not when you need lots of them.

Related to #3635
@efiop efiop removed their assignment Apr 21, 2020
@efiop
Copy link
Contributor Author

efiop commented Apr 21, 2020

Another confirmation #3634 (comment)

@efiop efiop assigned efiop and pmrowla and unassigned efiop Apr 21, 2020
@pmrowla
Copy link
Contributor

pmrowla commented Apr 22, 2020

The issue is that in OutputBase._collect_used_dir_cache(), we add str(path_info) to our used_cache dict, rather than just adding path_info.

dvc/dvc/output/base.py

Lines 393 to 397 in 52369bd

for entry in self.dir_cache:
checksum = entry[self.remote.PARAM_CHECKSUM]
info = self.path_info / entry[self.remote.PARAM_RELPATH]
if not filter_info or info.isin_or_eq(filter_info):
cache.add(self.scheme, checksum, str(info))

PathInfo.__str__() is overridden to always compute the relative path from our current working dir, rather than than returning a string fspath.

dvc/dvc/path_info.py

Lines 45 to 47 in 52369bd

def __str__(self):
path = self.__fspath__()
return relpath(path)

For *nix systems, relpath() just wraps os.path.relpath, but in the event that we have a directory with millions of files, we end up making millions of os.path.relpath/os.path.abspath/os.getcwd calls here, which is very slow (see cpython posixpath.py).

@pmrowla
Copy link
Contributor

pmrowla commented Apr 22, 2020

Since we don't actually need the relative path until we are printing/logging status for modified/pushed/pulled files, one easy optimization is that we can just put path_info directly into used_cache, and delay calling str(path_info) until we actually use the relative path string. For the case where we only have a few modified files, we will only compute relpath for those modified files rather than for the entire directory.

But this does not cover the case where we eventually have to print/log (and compute relpath for) millions of modified files. To get better behavior in that case, we could implement our own versions of posixpath.abspath/posixpath.relpath that would take advantage of the fact that things like os.getcwd() should not change between abspath and relpath calls for our use case.

for S3 remote with 2M files, deferring the relpath call drops the Repo._cloud_status runtime from 938s to 374s with cProfile

@efiop thoughts?

@efiop
Copy link
Contributor Author

efiop commented Apr 22, 2020

@pmrowla Very interesting idea about relpath/abspath re-implementation! But I'm a bit unsure if it is really worth creating such a general hack for it, when we could do a local hack something like so:

        path = str(self.path_info)
        filter_path = str(filter_info) if filter_info else None
        for entry in self.dir_cache:
            # NOTE: not using PathInfo's here, as they become really slow when you need a lot of them
            checksum = entry[self.remote.PARAM_CHECKSUM]
            entry_path = os.path.join(path, entry[self.remote.PARAM_RELPATH])
            if not filter_path or entry_path == filter_path or entry_path.startswith(filter_path + os.sep):
                cache.add(self.scheme, checksum, entry_path)

At least it won't lose the hack context this way. Plus, I feel like path_info might bite us not only in relpath department but in others too, where we won't be able to provide a good general hack in the future, so it is better to keep it local for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants