collecting checksum for an external dir dependency takes too much time #2891
Comments
@casperdcl could it be related to the locking issue we solved before?
@shcheklein the callstack graph shows ...
@casperdcl right. 🤔 I don't have much experience profiling Python multithreading tbh ... probably that wait is fine - it's the main thread waiting for the workers to complete. Trying to understand if cProfile captures the stuff the workers do.
I don't think it does - it's captured the main thread mostly waiting for the worker results.
Also note there are 596,400 calls, which must be roughly the number of files to calculate sums for. @shcheklein Also, you can try to use ...
@menshikh-iv right! thanks. I doubt that it's GIL-bound (it should be mostly IO - HEAD s3://X/file). I'll check yappi or something similar. Also, I was able to speed it up 8-10x by passing the config below in ...

```python
return session.client(
    "s3", endpoint_url=self.endpoint_url, use_ssl=self.use_ssl,
    config=botocore.config.Config(max_pool_connections=100)
)
```

I wonder what else is bounded by this limit - status collection?
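For reference, a minimal sketch of what checking this with yappi could look like - unlike cProfile, yappi can collect wall-clock stats across worker threads. `fetch_checksum` here is a hypothetical stand-in for the per-file work, not DVC code:

```python
import concurrent.futures
import time

import yappi  # third-party: pip install yappi


def fetch_checksum(key):
    # Hypothetical stand-in for the IO-bound per-file call (HEAD s3://X/file).
    time.sleep(0.01)
    return key


yappi.set_clock_type("wall")  # wall time is what matters for IO-bound work
yappi.start()

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fetch_checksum, range(1000)))

yappi.stop()

# Unlike a cProfile dump of the main thread, this shows per-thread totals,
# not just the main thread blocking on worker results.
for t in yappi.get_thread_stats():
    print(t.name, t.id, t.ttot)
yappi.get_func_stats().print_all()
```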
Ok, more results on this stuff:

```python
core = config.get(Config.SECTION_CORE, {})
self.checksum_jobs = core.get(
    Config.SECTION_CORE_CHECKSUM_JOBS, self.CHECKSUM_JOBS
)
```

Also, it's obvious that the requirements are so different here that we can't use the same setting and the same defaults for calculating checksums locally (CPU-bound) vs collecting them from S3 (IO-bound).
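To illustrate that point, a sketch with made-up names and defaults (not DVC's actual settings): local hashing is CPU-bound and saturates at roughly the core count, while remote collection is latency-bound and benefits from a much larger pool:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical defaults - not DVC's real values:
LOCAL_CHECKSUM_JOBS = os.cpu_count() or 4  # md5 over local files: CPU-bound
REMOTE_CHECKSUM_JOBS = 64                  # HEAD requests to S3: latency-bound


def collect_checksums(paths, checksum_one, jobs):
    # checksum_one(path) -> checksum; jobs is chosen per workload type.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(checksum_one, paths)))
```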
@shcheklein Thanks for investigating! Created #2920.
We should go away from ... Related: #2473
@Suor taking up all the resources we have might not be ideal if you don't want to freeze your system 🙂 So ...
Is this bucket still accessible? I wonder how much it takes right now, and whether we could benefit from the same optimization that we have for the code at lines 103 to 111 in 3670bf4.
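A hedged sketch (not the referenced DVC code) of one way such an optimization can look: a paginated LIST over the prefix returns ETags for up to 1000 objects per request, instead of one HEAD round trip per file:

```python
import boto3


def etags_under_prefix(bucket, prefix):
    # One list_objects_v2 page covers up to 1000 keys, vs. one head_object()
    # call (a HEAD request) per key.
    s3 = boto3.client("s3")
    etags = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            etags[obj["Key"]] = obj["ETag"].strip('"')
    return etags
```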
@isidentical it was based on this data I think https://github.com/iterative/dataset-registry-private/blob/master/images.dvc ... (please check the number of files in that directory though)
Hi! ...
@abdellahrami Is it about external dependencies? From what I understand, you are doing ... Re: external dependencies, I tried it with a dataset of 1,000,000 annotation files and the operation completed in <10 minutes. I didn't go all the way back to 0.71.0, but in 2.0.0 the same operation took 34 minutes. Most of the time was still spent in thread lock, but I think we could close this one if we have reduced the time that much.
Closing as stale/resolved |
DVC version:
Reproduce:
... takes 3-4 hours to complete.
The s3://dvc-common/images "directory" has ~0.5M files.
Profile file:
https://www.dropbox.com/s/uqdjmm4pjkn0rj1/s3-dep-images.prof?dl=0
96% of time goes to thread lock.
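A minimal sketch for inspecting the linked dump with the standard library (the filename is assumed from the Dropbox link above):

```python
import pstats

stats = pstats.Stats("s3-dep-images.prof")
stats.sort_stats("cumulative").print_stats(20)
# The reported 96% shows up in entries like
# {method 'acquire' of '_thread.lock' objects}.
```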
Output: