Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collecting checksum for an external dir dependency takes too much time #2891

Closed
shcheklein opened this issue Dec 5, 2019 · 16 comments
Closed
Labels
performance improvement over resource / time consuming tasks

Comments

@shcheklein
Copy link
Member

shcheklein commented Dec 5, 2019

DVC version:

DVC version: 0.71.0+d42a04
Python version: 3.7.5
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce:

python -m cProfile -o s3-dep-images.prof -m dvc run -v -d s3://dvc-common/images -o out.txt "aws s3 ls s3://dvc-storage/get-started/fe/ >out.txt"

takes 3-4 hours to complete.

s3://dvc-common/images "Directory" has ~0.5M files.

Profile file:

https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0

96% of time goes to thread lock.

Output:

Mon Dec  2 23:11:43 2019    s3-dep-images.prof

         325633688 function calls (322461350 primitive calls) in 10724.500 seconds

   Ordered by: internal time
   List reduced from 7431 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   596400 10371.995    0.017 10371.995    0.017 {method 'acquire' of '_thread.lock' objects}
    23894  152.922    0.006  152.922    0.006 {method 'read' of '_ssl._SSLSocket' objects}
  5782566   12.714    0.000   21.243    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:83(get_token)
  4475479    9.278    0.000   10.816    0.000 {method 'write' of '_io.TextIOWrapper' objects}
   444810    5.790    0.000   52.679    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:669(_parse)
      459    5.261    0.011    5.261    0.011 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
 16273231    5.019    0.000    8.242    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:300(<genexpr>)
   440822    4.726    0.000    5.103    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:216(__init__)
   440796    4.490    0.000 10374.041    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:408(result)
   134568    4.240    0.000   28.491    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:309(format_meter)
1352110/1352109    4.128    0.000    5.655    0.000 {method 'format' of 'str' objects}
   889629    3.983    0.000   13.170    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:878(_parse_numeric_token)
   297700    3.369    0.000 10376.913    0.035 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:264(wait)
   269133    3.349    0.000    4.986    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:209(_is_ascii)
445259/463    3.343    0.000   83.457    0.180 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:362(_handle_structure)
 16138665    3.222    0.000    3.222    0.000 {built-in method unicodedata.east_asian_width}
   579377    2.974    0.000   12.928    0.000 {built-in method builtins.sum}
  1334439    2.732    0.000    2.732    0.000 {method 'replace' of 'datetime.datetime' objects}
  2203982    2.699    0.000    5.480    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:183(parts)
   440797    2.623    0.000 10418.901    0.024 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:1061(__iter__)
2673319/521    2.482    0.000   83.474    0.160 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:299(_parse_shape)
   440797    2.433    0.000 10376.699    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:591(result_iterator)
   441675    2.391    0.000    2.621    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:700(__str__)
  1334430    2.356    0.000    3.138    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:434(append)
  2807755    2.261    0.000    2.261    0.000 {method 'sub' of 're.Pattern' objects}
 12454653    2.224    0.000    3.369    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:214(isnum)
   881628    1.954    0.000    4.412    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:63(parse_parts)
   440797    1.850    0.000    8.336    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py:145(submit)
  2203982    1.828    0.000    7.840    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:209(__hash__)
  5782566    1.813    0.000   23.057    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:195(__next__)
 11596386    1.809    0.000    1.811    0.000 {built-in method builtins.getattr}
        1    1.726    1.726 10433.537 10433.537 /Users/ivan/Projects/dvc/dvc/remote/base.py:180(_calculate_checksums)

Artboard 1

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Dec 5, 2019
@shcheklein
Copy link
Member Author

@casperdcl could it be related to the locking issue we solved before?

@shcheklein shcheklein added the performance improvement over resource / time consuming tasks label Dec 5, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 5, 2019
@casperdcl
Copy link
Contributor

@shcheklein the callstack graph shows dvc.remote.base.RemoteBASE._calculate_checksums:__iter__:result_iterator:result:wait before that lock so it seems that it's the "waiting for checksum to finish calculating" rather than "waiting to print output" that's taking most of the time.

@shcheklein
Copy link
Member Author

@casperdcl right. 🤔 not have too much experience profiling Python multithreading tbh ... probably, that wait is fine - it's a main thread waiting for the workers to complete. Trying to understand is cPorfile captures the stuff workers do.

@casperdcl
Copy link
Contributor

I don't think it does - it's captured the main thread mostly waiting for the worker results.

@casperdcl
Copy link
Contributor

Also note there are 596,400 calls which must be roughly the number of files to calculate sums for.

@menshikh-iv
Copy link

@shcheklein cProfile isn't a good choice for multithread profiling (cProfile doesn't support it without "hacks"), for that case you can use

Also, you can try to use strace tool and get aggregated information about syscalls (esp. for locks), because this can be a GIL issue (i.e. program spend most of the time on GIL lock/release or waiting for GIL). This will be something like strace -c -fp <MAIN_PROCESS_PID>.

@shcheklein
Copy link
Member Author

@menshikh-iv right! thanks. I doubt that it's GIL bounded (it should be IO mostly - HEAD s3://X/file). I'll check yappi or something similar.

Also, was able to speed it up 8-10x by passing max_pool_connections to S3 client:

return session.client(
            "s3", endpoint_url=self.endpoint_url, use_ssl=self.use_ssl,
            config=botocore.config.Config(max_pool_connections=100)
        )

in remote/s3.py, otherwise it is bounded by 10 simultaneous connections.

I wonder what else is bounded by this limit - status collection?

@shcheklein
Copy link
Member Author

shcheklein commented Dec 7, 2019

Ok, more results on this stuff:

  • max_pool_connections not the reason. It might affect the performance but not 7x-8x on Linux and Mac. (So, no significant changes to status, pull, push, etc. Need to check if there are small improvements - was not focusing on that.)
  • when we do --jobs 100 (for push/pull/status) I don't see 100 ESTABLISHED connections - might mean that we indeed hit GIL - will need to run yappi or strace.
  • the actual bug is here remote/base.py:
        core = config.get(Config.SECTION_CORE, {})
        self.checksum_jobs = core.get(
            Config.SECTION_CORE_CHECKSUM_JOBS, self.CHECKSUM_JOBS
        )

config is a remote section remote_config = RemoteConfig(repo.config) not full config, so core section is always empty and checksum_jobs is always default - around 2 or 4. Which in case of collecting them from S3 ETags is way too small - it can be set to 100.

Also, it's obvious that requirements are so different here that we can't use the same setting and the same defaults - for calculating checksums locally (CPU) vs collecting them from S3 (IO).

@efiop
Copy link
Contributor

efiop commented Dec 8, 2019

@shcheklein Thanks for investigating! Created #2920 .

@Suor
Copy link
Contributor

Suor commented Dec 9, 2019

We should go away from --jobs really and use whatever resources we have scaling back when we get errors.

Related: #2473

@efiop
Copy link
Contributor

efiop commented Dec 9, 2019

@Suor taking up all the resources we have might not be ideal if you don't want to freeze your system 🙂 So --jobs in one shape or another is probably here to stay. We should optimize the default behavior though, I agree.

@isidentical
Copy link
Contributor

Is this bucket still accessible? I wonder how much it takes right now, and whether we could benefit from the same optimization that we have for the import-url;

dvc/dvc/objects/stage.py

Lines 103 to 111 in 3670bf4

if not upload and name in fs.DETAIL_FIELDS:
for details in fs.find(path_info, detail=True):
file_info = path_info.replace(path=details["name"])
hash_info = HashInfo(
name, details[name], size=details.get("size"),
)
yield file_info, HashFile(file_info, fs, hash_info)
return None

@shcheklein
Copy link
Member Author

@isidentical it was based on this data I think https://github.com/iterative/dataset-registry-private/blob/master/images.dvc ... (please check number of files in that directory though)

@abdellahrami
Copy link

Hi !
We're having the same problem as you can see in the profiles in the attached photo, where dvc pull spends a lot of time on Querying remote cache, and when looking at the CProfile it looks like the majority of that time is spent on a thread.lock function.
Our dataset is not that big, about a 1000 files of ~5 mb in average, but they are distributed in different folders, and also we're tracking most of them from the yaml stages.
Don't know if this issue is still unresolved, or if I'm doing something wrong.

image

@dberenbaum
Copy link
Collaborator

@abdellahrami Is it about external dependencies? From what I understand, you are doing dvc pull on ~1000 files across many folders that are outputs of dvc.yaml stages?

Re: external dependencies, I tried it with a dataset of 1,000,000 annotations files and the operation completed in <10 minutes. I didn't go all the way back to 0.71.0, but in 2.0.0 the same operation took 34 minutes. Most time was still spent in thread lock, but I think we could close this one if we have reduced the time that much.

@dberenbaum
Copy link
Collaborator

Closing as stale/resolved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

No branches or pull requests

8 participants