perf: remove fs exists check in plots, parallel data collect #8777
@@ -39,13 +38,6 @@ def _collect_paths(
    for fs_path in fs_paths:
        if recursive and fs.isdir(fs_path):
            target_paths.extend(fs.find(fs_path))

        rel = fs.path.relpath(fs_path)
        if not fs.exists(fs_path):
This check also hits remote storage internally. Could the warnings be moved to the bottom instead? We need to make them more compact anyway, and in #7692 we will need to handle this better as well.
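To illustrate the idea behind dropping the separate existence check, here is a minimal sketch that uses only local files and the standard library. It attempts the read directly and collects missing paths so that warnings can be reported once at the end. The helper names `read_all` and `demo` are hypothetical and are not part of DVC; the real `_collect_paths` works against a filesystem abstraction rather than `open`.

```python
import os
import tempfile


def read_all(paths):
    """Read every path, collecting missing ones instead of pre-checking.

    Hypothetical helper for illustration; not DVC's `_collect_paths`.
    """
    contents, missing = {}, []
    for path in paths:
        try:
            # One filesystem call per path: the open itself reports whether
            # the file exists, so no separate exists() round trip is needed.
            with open(path, encoding="utf-8") as fobj:
                contents[path] = fobj.read()
        except FileNotFoundError:
            missing.append(path)
    return contents, missing


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        present = os.path.join(tmp, "metrics.json")
        absent = os.path.join(tmp, "nope.json")
        with open(present, "w", encoding="utf-8") as fobj:
            fobj.write("{}")
        contents, missing = read_all([present, absent])
        # Missing paths are gathered so they can be reported compactly, once.
        return list(contents.values()), [os.path.basename(p) for p in missing]


if __name__ == "__main__":
    print(demo())
```

On a remote filesystem, each avoided `exists` call would be one fewer network round trip per path, which is presumably where the saving comes from.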
Codecov Report
Base: 93.53% // Head: 93.57% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## main #8777 +/- ##
==========================================
+ Coverage 93.53% 93.57% +0.03%
==========================================
Files 457 457
Lines 36253 36251 -2
Branches 5261 5258 -3
==========================================
+ Hits 33910 33922 +12
+ Misses 1836 1824 -12
+ Partials 507 505 -2
I haven't seen the difference stated in the description above on my machine (Linux). For me, it's quite the opposite; it looks like there's an overhead of
@skshetry what exactly did you try to run? Could you share the commands and the repo?
I cloned
$ (cd ../../dvc && git checkout main)
$ rm -rf .dvc/cache
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.44s user 0.10s system 3% cpu 41.534 total
$ rm -rf .dvc/cache
$ time dvc plots show
0.66s user 0.06s system 5% cpu 13.611 total
# with cache
$ time dvc plots show
0.67s user 0.06s system 4% cpu 15.944 total
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.61s user 0.08s system 3% cpu 43.946 total
$ (cd ../../dvc && git checkout parallel-plots)
$ rm -rf .dvc/cache
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.50s user 0.11s system 4% cpu 33.071 total
$ rm -rf .dvc/cache
$ time dvc plots show
0.63s user 0.06s system 8% cpu 7.994 total
# with cache
$ time dvc plots show
0.65s user 0.05s system 11% cpu 6.156 total
$ time dvc plots diff 11295f0 e29c9be workspace -o .dvc/tmp/plots --split --json > /dev/null
1.51s user 0.11s system 8% cpu 18.143 total
I think it was my internet itself that was slow yesterday, so there's a clear speedup.
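For a quick read of those numbers, the following sketch computes the wall-clock ratio between the `main` and `parallel-plots` runs. The timings are copied from the output above; the labels and the `speedup` helper are only illustrative, and a single run per configuration on a variable network connection is not a rigorous benchmark.

```python
# Wall-clock totals in seconds, copied from the runs above:
# (main, parallel-plots)
timings = {
    "plots diff (no cache)": (41.534, 33.071),
    "plots show (no cache)": (13.611, 7.994),
    "plots show (with cache)": (15.944, 6.156),
    "plots diff (with cache)": (43.946, 18.143),
}


def speedup(before, after):
    """Ratio of wall-clock time before the change to time after it."""
    return before / after


speedups = {name: speedup(*pair) for name, pair in timings.items()}

if __name__ == "__main__":
    for name, ratio in speedups.items():
        print(f"{name}: {ratio:.2f}x")
```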
Thanks @skshetry, I'll migrate to a different ThreadPool and merge this tomorrow.
We can merge this and follow up as well.
dvc/repo/plots/__init__.py
Outdated
assert callable(data_source)
value.update(data_source())

if len(to_resolve) > 1:
@skshetry do you know what the overhead of creating a TP is? It feels like it can be up to a second. I put in some optimizations to avoid that ... I wonder if we can detect that the files are already in the cache, or that all of them are local, to avoid creating it at all
disregard this, I'm going to remove all these micro optimizations - I don't think it's worth doing this tbh
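The overhead question can be checked empirically. Below is a minimal sketch that times creating a standard-library `ThreadPoolExecutor`, running a few trivial tasks, and shutting it down. The function name `time_pool_roundtrip` and its defaults are hypothetical, DVC's own executor wrapper may behave differently, and the result depends on the machine, so no figure is claimed here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_pool_roundtrip(max_workers=4, n_tasks=4):
    """Time pool creation, a few trivial tasks, and shutdown.

    Returns (elapsed_seconds, results).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # The tasks do no real work, so the elapsed time is dominated by
        # pool setup, task dispatch, and teardown.
        results = list(executor.map(lambda x: x, range(n_tasks)))
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    elapsed, _ = time_pool_roundtrip()
    print(f"pool round trip: {elapsed * 1000:.2f} ms")
```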
dvc/utils/threadpool.py
Outdated
# Yield must be hidden in closure so that the futures are submitted
# before the first iterator value is required.
def result_iterator(tasks):
@skshetry had to apply a fix from the Executor.map for this, otherwise it was not working in my case
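The closure trick mentioned in the code comment can be shown in isolation. In this sketch, assuming the standard-library `ThreadPoolExecutor`, a generator function submits nothing until it is first iterated, whereas a plain function that returns an inner generator submits immediately. The names `lazy_map`, `eager_map`, and `demo` are hypothetical and are not DVC's API.

```python
from concurrent.futures import ThreadPoolExecutor


def lazy_map(executor, fn, items, log):
    # Generator function: none of this body runs until the first next().
    futures = [executor.submit(fn, item) for item in items]
    log.append("submitted")
    for future in futures:
        yield future.result()


def eager_map(executor, fn, items, log):
    # Plain function: futures are submitted as soon as it is called.
    futures = [executor.submit(fn, item) for item in items]
    log.append("submitted")

    # The yield lives in a closure, so calling eager_map() does not turn
    # the outer function into a generator.
    def result_iterator():
        for future in futures:
            yield future.result()

    return result_iterator()


def demo():
    lazy_log, eager_log = [], []
    with ThreadPoolExecutor(max_workers=2) as executor:
        lazy = lazy_map(executor, abs, [-1, -2], lazy_log)
        eager = eager_map(executor, abs, [-1, -2], eager_log)
        # Snapshot the logs before either iterator is consumed.
        before = (list(lazy_log), list(eager_log))
        results = (list(lazy), list(eager))
    return before, results


if __name__ == "__main__":
    print(demo())
```

Before iteration, only the eager variant has recorded a submission; both produce the same results once consumed.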
we should probably propagate this change to dvc_objects/dvc_data
@shcheklein, let's not change this. You can do something like the following:
with executor:
    list(executor.imap_unordered(resolve, to_resolve))
It is called imap_unordered, because it is lazy and does not guarantee ordering.
Also see multiprocessing.pool.Pool.imap and multiprocessing.pool.Pool.imap_unordered.
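The same pattern can be tried with the standard library alone. This sketch uses `multiprocessing.pool.ThreadPool.imap_unordered`, referenced above, and wraps it in `list()` to block until all tasks finish. Here `resolve` is a hypothetical stand-in for the real per-item work, and `resolve_all` is an illustrative name; DVC's executor is a different class with a similarly named method.

```python
from multiprocessing.pool import ThreadPool


def resolve(item):
    # Hypothetical stand-in for the real per-item work.
    return item * 2


def resolve_all(to_resolve, workers=4):
    with ThreadPool(workers) as pool:
        # imap_unordered returns an iterator that yields results as they
        # complete; list() drains it, blocking until every task is done.
        return list(pool.imap_unordered(resolve, to_resolve))


if __name__ == "__main__":
    # Result order is not guaranteed, so sort before displaying.
    print(sorted(resolve_all([1, 2, 3])))
```

Because ordering is not guaranteed, callers that need results matched to inputs would have to return the input alongside the result or use an ordered variant.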
I am not against a new blocking map API, but map carries an expectation of ordering because of the existing ThreadPoolExecutor.
Okay, agreed. Let's keep it. It still stays somewhat lazy, but I see that we want it to be strictly lazy
Let’s not change it. We want a lazier, stricter API, but one with weaker guarantees. We could add other APIs too, but that might add more maintenance burden, as we also have some backports here. We also need to keep this in sync with other projects.
It is simpler to use list() to make it block.
Related #8786
Before:
/Users/ivan/Projects/vscode-dvc-demo/.venv/bin/python -m dvc plots diff 11295f0 9aa0603 e29c9be workspace -o .dvc/tmp/plots --split --json - COMPLETED (53199ms)
After:
/Users/ivan/Projects/vscode-dvc-demo/.venv/bin/python -m dvc plots diff 11295f0 9aa0603 e29c9be workspace -o .dvc/tmp/plots --split --json - COMPLETED (9404ms)
... and, to my mind, not a huge change to maintain.
It is still very slow, though; I think the time now goes into the brancher, path collection, etc., which should also be parallelized, indexed, and cached.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.