collecting checksum for an external dir dependency takes too much time #2891

shcheklein · 2019-12-05T00:53:15Z

DVC version:

DVC version: 0.71.0+d42a04
Python version: 3.7.5
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce:

python -m cProfile -o s3-dep-images.prof -m dvc run -v -d s3://dvc-common/images -o out.txt "aws s3 ls s3://dvc-storage/get-started/fe/ >out.txt"

takes 3-4 hours to complete.

s3://dvc-common/images "Directory" has ~0.5M files.

Profile file:

https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0

96% of time goes to thread lock.

Output:

Mon Dec  2 23:11:43 2019    s3-dep-images.prof

         325633688 function calls (322461350 primitive calls) in 10724.500 seconds

   Ordered by: internal time
   List reduced from 7431 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   596400 10371.995    0.017 10371.995    0.017 {method 'acquire' of '_thread.lock' objects}
    23894  152.922    0.006  152.922    0.006 {method 'read' of '_ssl._SSLSocket' objects}
  5782566   12.714    0.000   21.243    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:83(get_token)
  4475479    9.278    0.000   10.816    0.000 {method 'write' of '_io.TextIOWrapper' objects}
   444810    5.790    0.000   52.679    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:669(_parse)
      459    5.261    0.011    5.261    0.011 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
 16273231    5.019    0.000    8.242    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:300(<genexpr>)
   440822    4.726    0.000    5.103    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:216(__init__)
   440796    4.490    0.000 10374.041    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:408(result)
   134568    4.240    0.000   28.491    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:309(format_meter)
1352110/1352109    4.128    0.000    5.655    0.000 {method 'format' of 'str' objects}
   889629    3.983    0.000   13.170    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:878(_parse_numeric_token)
   297700    3.369    0.000 10376.913    0.035 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py:264(wait)
   269133    3.349    0.000    4.986    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/utils.py:209(_is_ascii)
445259/463    3.343    0.000   83.457    0.180 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:362(_handle_structure)
 16138665    3.222    0.000    3.222    0.000 {built-in method unicodedata.east_asian_width}
   579377    2.974    0.000   12.928    0.000 {built-in method builtins.sum}
  1334439    2.732    0.000    2.732    0.000 {method 'replace' of 'datetime.datetime' objects}
  2203982    2.699    0.000    5.480    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:183(parts)
   440797    2.623    0.000 10418.901    0.024 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/tqdm/std.py:1061(__iter__)
2673319/521    2.482    0.000   83.474    0.160 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/botocore/parsers.py:299(_parse_shape)
   440797    2.433    0.000 10376.699    0.024 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py:591(result_iterator)
   441675    2.391    0.000    2.621    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:700(__str__)
  1334430    2.356    0.000    3.138    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:434(append)
  2807755    2.261    0.000    2.261    0.000 {method 'sub' of 're.Pattern' objects}
 12454653    2.224    0.000    3.369    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:214(isnum)
   881628    1.954    0.000    4.412    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py:63(parse_parts)
   440797    1.850    0.000    8.336    0.000 /usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py:145(submit)
  2203982    1.828    0.000    7.840    0.000 /Users/ivan/Projects/dvc/dvc/path_info.py:209(__hash__)
  5782566    1.813    0.000   23.057    0.000 /Users/ivan/Projects/dvc/.env/lib/python3.7/site-packages/dateutil/parser/_parser.py:195(__next__)
 11596386    1.809    0.000    1.811    0.000 {built-in method builtins.getattr}
        1    1.726    1.726 10433.537 10433.537 /Users/ivan/Projects/dvc/dvc/remote/base.py:180(_calculate_checksums)

The text was updated successfully, but these errors were encountered:

shcheklein · 2019-12-05T00:54:16Z

@casperdcl could it be related to the locking issue we solved before?

casperdcl · 2019-12-05T01:23:14Z

@shcheklein the callstack graph shows dvc.remote.base.RemoteBASE._calculate_checksums:__iter__:result_iterator:result:wait before that lock so it seems that it's the "waiting for checksum to finish calculating" rather than "waiting to print output" that's taking most of the time.

shcheklein · 2019-12-05T01:51:34Z

@casperdcl right. 🤔 not have too much experience profiling Python multithreading tbh ... probably, that wait is fine - it's a main thread waiting for the workers to complete. Trying to understand is cPorfile captures the stuff workers do.

casperdcl · 2019-12-05T02:27:47Z

I don't think it does - it's captured the main thread mostly waiting for the worker results.

casperdcl · 2019-12-05T02:30:00Z

Also note there are 596,400 calls which must be roughly the number of files to calculate sums for.

menshikh-iv · 2019-12-05T03:29:40Z

@shcheklein cProfile isn't a good choice for multithread profiling (cProfile doesn't support it without "hacks"), for that case you can use

Also, you can try to use strace tool and get aggregated information about syscalls (esp. for locks), because this can be a GIL issue (i.e. program spend most of the time on GIL lock/release or waiting for GIL). This will be something like strace -c -fp <MAIN_PROCESS_PID>.

shcheklein · 2019-12-05T03:41:11Z

@menshikh-iv right! thanks. I doubt that it's GIL bounded (it should be IO mostly - HEAD s3://X/file). I'll check yappi or something similar.

Also, was able to speed it up 8-10x by passing max_pool_connections to S3 client:

return session.client(
            "s3", endpoint_url=self.endpoint_url, use_ssl=self.use_ssl,
            config=botocore.config.Config(max_pool_connections=100)
        )

in remote/s3.py, otherwise it is bounded by 10 simultaneous connections.

I wonder what else is bounded by this limit - status collection?

shcheklein · 2019-12-07T04:15:00Z

Ok, more results on this stuff:

max_pool_connections not the reason. It might affect the performance but not 7x-8x on Linux and Mac. (So, no significant changes to status, pull, push, etc. Need to check if there are small improvements - was not focusing on that.)
when we do --jobs 100 (for push/pull/status) I don't see 100 ESTABLISHED connections - might mean that we indeed hit GIL - will need to run yappi or strace.
the actual bug is here remote/base.py:

        core = config.get(Config.SECTION_CORE, {})
        self.checksum_jobs = core.get(
            Config.SECTION_CORE_CHECKSUM_JOBS, self.CHECKSUM_JOBS
        )

config is a remote section remote_config = RemoteConfig(repo.config) not full config, so core section is always empty and checksum_jobs is always default - around 2 or 4. Which in case of collecting them from S3 ETags is way too small - it can be set to 100.

Also, it's obvious that requirements are so different here that we can't use the same setting and the same defaults - for calculating checksums locally (CPU) vs collecting them from S3 (IO).

efiop · 2019-12-08T18:47:50Z

@shcheklein Thanks for investigating! Created #2920 .

Suor · 2019-12-09T08:38:23Z

We should go away from --jobs really and use whatever resources we have scaling back when we get errors.

Related: #2473

efiop · 2019-12-09T13:39:26Z

@Suor taking up all the resources we have might not be ideal if you don't want to freeze your system 🙂 So --jobs in one shape or another is probably here to stay. We should optimize the default behavior though, I agree.

isidentical · 2021-05-06T07:38:25Z

Is this bucket still accessible? I wonder how much it takes right now, and whether we could benefit from the same optimization that we have for the import-url;

dvc/dvc/objects/stage.py

Lines 103 to 111 in 3670bf4

    
           if not upload and name in fs.DETAIL_FIELDS: 
        
               for details in fs.find(path_info, detail=True): 
        
                   file_info = path_info.replace(path=details["name"]) 
        
                   hash_info = HashInfo( 
        
                       name, details[name], size=details.get("size"), 
        
                   ) 
        
                   yield file_info, HashFile(file_info, fs, hash_info) 
        
               return None

shcheklein · 2021-05-07T00:54:50Z

@isidentical it was based on this data I think https://github.com/iterative/dataset-registry-private/blob/master/images.dvc ... (please check number of files in that directory though)

abdellahrami · 2023-03-16T14:23:26Z

Hi !
We're having the same problem as you can see in the profiles in the attached photo, where dvc pull spends a lot of time on Querying remote cache, and when looking at the CProfile it looks like the majority of that time is spent on a thread.lock function.
Our dataset is not that big, about a 1000 files of ~5 mb in average, but they are distributed in different folders, and also we're tracking most of them from the yaml stages.
Don't know if this issue is still unresolved, or if I'm doing something wrong.

dberenbaum · 2023-04-10T19:22:56Z

@abdellahrami Is it about external dependencies? From what I understand, you are doing dvc pull on ~1000 files across many folders that are outputs of dvc.yaml stages?

Re: external dependencies, I tried it with a dataset of 1,000,000 annotations files and the operation completed in <10 minutes. I didn't go all the way back to 0.71.0, but in 2.0.0 the same operation took 34 minutes. Most time was still spent in thread lock, but I think we could close this one if we have reduced the time that much.

dberenbaum · 2023-04-21T21:12:08Z

Closing as stale/resolved

triage-new-issues bot added the triage Needs to be triaged label Dec 5, 2019

shcheklein added the performance improvement over resource / time consuming tasks label Dec 5, 2019

triage-new-issues bot removed the triage Needs to be triaged label Dec 5, 2019

weekly-digest bot mentioned this issue Dec 8, 2019

Weekly Digest (1 December, 2019 - 8 December, 2019) #2919

Closed

efiop mentioned this issue Dec 8, 2019

core.checksum_jobs doesn't work for remotes #2920

Closed

pared mentioned this issue Sep 16, 2021

commands: enhance our profiling capabilities #6627

Closed

dberenbaum closed this as completed Apr 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

collecting checksum for an external dir dependency takes too much time #2891

collecting checksum for an external dir dependency takes too much time #2891

shcheklein commented Dec 5, 2019 •

edited

Loading

shcheklein commented Dec 5, 2019

casperdcl commented Dec 5, 2019

shcheklein commented Dec 5, 2019

casperdcl commented Dec 5, 2019

casperdcl commented Dec 5, 2019

menshikh-iv commented Dec 5, 2019

shcheklein commented Dec 5, 2019

shcheklein commented Dec 7, 2019 •

edited

Loading

efiop commented Dec 8, 2019

Suor commented Dec 9, 2019

efiop commented Dec 9, 2019

isidentical commented May 6, 2021

shcheklein commented May 7, 2021

abdellahrami commented Mar 16, 2023

dberenbaum commented Apr 10, 2023

dberenbaum commented Apr 21, 2023

collecting checksum for an external dir dependency takes too much time #2891

collecting checksum for an external dir dependency takes too much time #2891

Comments

shcheklein commented Dec 5, 2019 • edited Loading

shcheklein commented Dec 5, 2019

casperdcl commented Dec 5, 2019

shcheklein commented Dec 5, 2019

casperdcl commented Dec 5, 2019

casperdcl commented Dec 5, 2019

menshikh-iv commented Dec 5, 2019

shcheklein commented Dec 5, 2019

shcheklein commented Dec 7, 2019 • edited Loading

efiop commented Dec 8, 2019

Suor commented Dec 9, 2019

efiop commented Dec 9, 2019

isidentical commented May 6, 2021

shcheklein commented May 7, 2021

abdellahrami commented Mar 16, 2023

dberenbaum commented Apr 10, 2023

dberenbaum commented Apr 21, 2023

shcheklein commented Dec 5, 2019 •

edited

Loading

shcheklein commented Dec 7, 2019 •

edited

Loading